Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 7
Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.
Midterm: March 22
A brief word on auto-commit
To commit your SQL updates is to write the changes to the permanent database; in this sense, it is like saving a file. So far we've assumed that a commit operation is performed after every insert or update; this situation is called auto-commit. Generally, auto-commit mode is an attribute of your database connection; if auto-commit is true, then a commit is performed after every SQL statement that potentially alters the database.
The alternative to auto-commit is to execute a group of updates, and then explicitly invoke the commit
operation at the end, to commit all the updates together. A group of
SQL statements between consecutive commits is then called a transaction; all the statements of the transaction are committed
together. Usually, though, we want a stronger assurance: that all the
statements of the transaction either succeed, or none of them do (this
is implicit if we know that commits always succeed, but this is not the case in the real world). This is known as the atomicity
requirement, the first part of the ACID test (atomicity, consistency,
isolation, durability). The idea is that a transaction should be atomic, that is, indivisible: the individual queries that make it up should be executed as a unit.
Instead of a commit, a user may also issue a rollback, which means to throw away all the actions back to the previous commit, thus discarding the transaction.
We'll stick with auto-commit for a while longer, but be aware of two things:
- This is why there is no "save" operation
- auto-commit is not universal; sometimes you need manual control
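As a concrete illustration, here is manual commit control in Python's sqlite3 module (a minimal sketch; the account table and its data are invented for this example):

```python
import sqlite3

# Invented example table; sqlite3 does NOT auto-commit DML by default,
# so the two updates below form one transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Two updates that must succeed or fail together: a transfer.
    conn.execute("UPDATE account SET balance = balance - 60 WHERE name='alice'")
    conn.execute("UPDATE account SET balance = balance + 60 WHERE name='bob'")
    conn.commit()        # both updates become permanent together
except sqlite3.Error:
    conn.rollback()      # discard everything back to the previous commit

print(dict(conn.execute("SELECT name, balance FROM account")))
# {'alice': 40, 'bob': 60}
```

If either UPDATE raises an error, the rollback discards both, which is exactly the atomicity guarantee described above.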
Ternary and other higher-degree relationships
Consider the
SUPPLY relationship on a supplier s, project j, and part p. The tuple
⟨s,j,p⟩ is included if s supplies part p for project j.
We might try to model this with three binary relationships,
SUPPLIES(s,j), CAN_SUPPLY(s,p), and USES(j,p). It is true that if
⟨s,j,p⟩ is in SUPPLY, then ⟨s,j⟩ is in SUPPLIES, ⟨s,p⟩ is in
CAN_SUPPLY, and ⟨j,p⟩ is in USES. But the converse is not true.
See Fig 7.17.
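A quick way to see why the converse fails is to rebuild SUPPLY from its three projections and watch a spurious tuple appear (a small Python sketch; the sample tuples are invented):

```python
# Invented data showing why the three binary projections of SUPPLY
# do not determine SUPPLY itself.
SUPPLY = {("s1", "j1", "p1"), ("s1", "j2", "p2"), ("s2", "j1", "p2")}

SUPPLIES   = {(s, j) for (s, j, p) in SUPPLY}
CAN_SUPPLY = {(s, p) for (s, j, p) in SUPPLY}
USES       = {(j, p) for (s, j, p) in SUPPLY}

# Rebuild every triple consistent with all three projections.
rebuilt = {(s, j, p)
           for (s, j) in SUPPLIES
           for (s2, p) in CAN_SUPPLY if s2 == s
           if (j, p) in USES}

print(rebuilt - SUPPLY)   # {('s1', 'j1', 'p2')}: a spurious tuple
```

Here s1 supplies p1 for j1 and p2 for j2, and j1 also uses p2 (from s2), so the reconstruction wrongly concludes s1 supplies p2 for j1.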
As with binary relationships, the key of a ternary relationship is a triple: one key from each participating entity.
Ternary relationships can be problematic, and so we often include corresponding binary relationships.
One approach is to model a ternary relationship as a weak entity, with
three identifying relationships (Fig 7.17(c)). Or we can give SUPPLY a
synthetic key, supply_id, and then relate it to SUPPLIER, PROJECT, and
PART by binary relationships.
The next example is OFFERS, for a school database; see Fig 7.18;
⟨i,s,c⟩ belongs to OFFERS if INSTRUCTOR i teaches COURSE c during
SEMESTER s. Again, the binary projections fail to adequately model the
ternary relationship. E&N suggest that if one of the binary
relationships is 1:1 (eg if the CAN_TEACH(i,c) relationship is 1:1),
then this does work, but that is seldom if ever the case.
Actually, in a real school database one would not use OFFERS, one would
use SECTION. The latter would likely be an entity, complete with
synthetic key section_id.
Ternary relationships can have cardinality tags, like binary
relationships, but they are not as straightforward. For the SUPPLY
relationship, suppose each project,part pair (j,p) can have only one
supplier. Then we might put a 1 on the SUPPLIER link. But it might also
be the case that each project j uses a unique supplier,part pair (s,p)
(that is, each supplier can supply only one part to each project). We
now have an unrelated 1 on the PROJECT link.
The (min,max) relationship is more straightforward: E------(m,M)-----R
means that for each e in E, there are at least m tuples involving e in
R, and at most M.
Translation from higher-degree relationships to SQL table definitions
is done the same way as for binary M:N relationships: we create a table
consisting of columns for keys for each of the participating entities,
and any relationship attributes. The entity keys form the primary key
for the new table, and each entity key has a foreign key constraint
referring back to its defining entity table.
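A sketch of this translation for the SUPPLY relationship, in SQLite syntax via Python (all table and column names here, including the quantity attribute, are illustrative):

```python
import sqlite3

# Illustrative schema: a ternary relationship becomes a table with one
# key column per participating entity, plus relationship attributes.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE supplier (sname   TEXT PRIMARY KEY);
CREATE TABLE project  (jname   TEXT PRIMARY KEY);
CREATE TABLE part     (partnum INT  PRIMARY KEY);
CREATE TABLE supply (
    sname    TEXT REFERENCES supplier(sname),
    jname    TEXT REFERENCES project(jname),
    partnum  INT  REFERENCES part(partnum),
    quantity INT,                        -- a relationship attribute
    PRIMARY KEY (sname, jname, partnum)  -- one key from each entity
);
""")
conn.execute("INSERT INTO supplier VALUES ('s1')")
conn.execute("INSERT INTO project VALUES ('j1')")
conn.execute("INSERT INTO part VALUES (7)")
conn.execute("INSERT INTO supply VALUES ('s1', 'j1', 7, 100)")
```

The three foreign-key constraints enforce that every SUPPLY tuple refers back to existing entity rows.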
The Enhanced ER (EER) model
In this model, we allow for some forms of inheritance.
Figure 8.1 is a starting point. Note that there are several kinds of employee; some kinds participate in relationships and some do not.
Ultimately, we may implement the diagram of Fig 8.1 with an EMPLOYEE
table, and also tables for SECRETARY, TECHNICIAN, ENGINEER, MANAGER,
and HOURLY_EMPLOYEE, each indexed by the ssn and having additional
columns for the subclass-specific attributes.
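The table-per-subclass layout just described can be sketched as follows (SQLite via Python; the subclass-specific columns typing_speed and tgrade stand in for the actual Fig 8.1 attributes):

```python
import sqlite3

# Each subclass table shares the superclass key ssn and carries only
# its subclass-specific attributes (names here are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employee (
    ssn  CHAR(9) PRIMARY KEY,
    name TEXT
);
CREATE TABLE secretary (
    ssn          CHAR(9) PRIMARY KEY REFERENCES employee(ssn),
    typing_speed INT
);
CREATE TABLE technician (
    ssn    CHAR(9) PRIMARY KEY REFERENCES employee(ssn),
    tgrade INT
);
""")
# To be in a subclass, you must also be in the superclass:
conn.execute("INSERT INTO employee VALUES ('123456789', 'Smith')")
conn.execute("INSERT INTO secretary VALUES ('123456789', 80)")
```

The foreign key from each subclass table back to employee enforces the rule that subclass membership implies superclass membership.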
To be in a subclass, you must also be in the superclass.
Note that some subclasses have subclass-specific attributes, and other
subclasses have subclass-specific participation in relationships.
The circled d in Fig 8.1 stands for "disjoint"; one cannot be a SECRETARY and a TECHNICIAN. However, one can
simultaneously be a SECRETARY, a MANAGER, and an HOURLY_EMPLOYEE. In
general, membership in multiple subclasses is to be allowed unless
explicitly forbidden with the (d) notation. The alternative to (d) is
(o), for overlapping.
Fig 8.2 example (showing disjointness)
Generalization is the process of realizing that two existing entities,
CAR and TRUCK, are really both instances of VEHICLE. See fig 8.3.
Sometimes subclass membership is determined by a boolean expression
involving the parent class (eg jobtype = engineer). In other examples,
subclass membership represents a form of new data; these are user-defined subclasses.
Besides disjoint/overlapping, subclasses may be described as total or
partial. Total means that every member of the base class must be in
some subclass (ie, in Java terminology, that the base class is abstract). Partial means that base-class-only objects may exist.
Multiple inheritance means that we may end up with a lattice of relationships: see Fig 8.6.
More on Foreign Keys
Here's the seven-step ER-to-relation algorithm again, slightly simplified:
- create a table for each regular entity
- create a table for each weak entity, adding the key field from the owner entity as a foreign key for the new entity.
- for binary 1:1 relationships between entities E1 and E2, pick one of
them (eg E1) and add to it a field containing the key of E2. Make this
a foreign key in E1.
- for binary 1:N relationships between E1 and E2,
E1---1---R---N---E2, add a column to E2 containing the key of E1. Make
this a foreign key in E2.
- For binary N:M relationships between E1 and E2, create a new
table R consisting of ⟨E1.key, E2.key, R.attributes⟩. Make E1.key and
E2.key foreign keys in R.
- For multivalued attributes of entity E, create a new relation R.
One column of R will be E.key; this should be a foreign key in R.
- ternary and higher-degree relationships: like step 5.
Joins arise in steps 2, 3, 4, 5, 6, and 7, for recovering the original relationships (or attribute sets for 6, or entities for 2). In every case, the join field is a key of one relation and a foreign key in the other.
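For instance, the N:M table from step 5 joins back to its entity table on exactly such a key/foreign-key pair (a Python/SQLite sketch; the sample rows are invented):

```python
import sqlite3

# Recovering the original relationship by joining works_on (step 5's
# table) back to employee on the key/foreign-key pair ssn.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (ssn TEXT PRIMARY KEY, ename TEXT);
CREATE TABLE works_on (ssn TEXT REFERENCES employee(ssn),
                       pnumber INT, hours REAL,
                       PRIMARY KEY (ssn, pnumber));
INSERT INTO employee VALUES ('123456789', 'Smith'), ('453453453', 'Joyce');
INSERT INTO works_on VALUES ('123456789', 1, 32.5), ('453453453', 1, 20.0);
""")
rows = conn.execute("""
    SELECT e.ename, w.pnumber, w.hours
    FROM employee e JOIN works_on w ON e.ssn = w.ssn
    ORDER BY e.ename
""").fetchall()
print(rows)   # [('Joyce', 1, 20.0), ('Smith', 1, 32.5)]
```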
Not all joins are about recovering relations from an ER diagram.
XML
XML is a markup language more structured than HTML. Its strict
hierarchical structure makes it suitable for textual encoding of
databases; see Fig 12.3 and below.
While the encoding of Fig 12.3 looks promising, note that the
<Worker> entries are more complex than they are in our relational
database; they contain more than just a key to the EMPLOYEE table and
the hours attribute. Arguably,
even the employee and hours information really should be in a separate
XML section <works_on>....</works_on>.
<?xml version="1.0" standalone="yes"?>
<Projects>
<Project>
<Name>ProductX</Name>
<Number>1</Number>
<Location>Bellaire</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Last_name>Smith</Last_name>
<Hours>32.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<First_name>Joyce</First_name>
<Hours>20.0</Hours>
</Worker>
</Project>
<Project>
<Name>ProductY</Name>
<Number>2</Number>
<Location>Sugarland</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Hours>7.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<Hours>20.0</Hours>
</Worker>
<Worker>
<Ssn>333445555</Ssn>
<Hours>10.0</Hours>
</Worker>
</Project>
</Projects>
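We can check mechanically that a document of this shape is well-formed and walk its hierarchy with the standard-library parser (the document is abbreviated to one Project here):

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the Projects document above, parsed with the
# stdlib ElementTree module.
doc = """<?xml version="1.0" standalone="yes"?>
<Projects>
 <Project>
  <Name>ProductX</Name>
  <Number>1</Number>
  <Location>Bellaire</Location>
  <Dept_no>5</Dept_no>
  <Worker><Ssn>123456789</Ssn><Last_name>Smith</Last_name><Hours>32.5</Hours></Worker>
  <Worker><Ssn>453453453</Ssn><First_name>Joyce</First_name><Hours>20.0</Hours></Worker>
 </Project>
</Projects>"""

root = ET.fromstring(doc)
for proj in root.findall("Project"):
    workers = proj.findall("Worker")
    print(proj.findtext("Name"), len(workers))   # ProductX 2
```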
One approach to imposing field structure on XML documents is illustrated by Fig 12.4,
which shows a small XML DTD (Document Type Definition) document. Note
that #PCDATA essentially means a string here. DTD allows us to specify
that
- Projects is a sequence of one or more Project objects
- A Project has a Name, Number, Location, Dept_no (optional), and Workers. It also has a ProjId attribute.
- Workers is a sequence of zero or more Worker objects.
- Worker objects have Ssn, Last_name (opt), First_name (opt), and Hours.
We can now verify that Fig 12.3 actually fits this form. If we wanted to enforce this, we might change the first line to
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
This allows us to enforce which fields can be present, but doesn't
really spell out the full schema involving other tables. For this we
move on to XML schema,
illustrated in Fig 12.5. While this format is more complex, it also
supports unordered data, and shares a more general XML syntax. See XMLschema1 and XMLschema2.
Note in the first of these the three entities department, employee, and
project. The locations table will later be listed as a multivalued
attribute of Department, and dependents will be a multivalued attribute
of Employee. WorksOn will be a multivalued attribute of Employee, and
Worker a multivalued attribute of Project.
Design Guidelines and Normalization
Normalization refers to a mathematical process for decomposing tables
into component tables; it is part of a broader area of database design.
Our first design point will be to make sure that relation and entity
attributes have clear semantics. That is, we can easily explain
attributes, and they are all related. E&N spells this out as
follows:
E&N Guideline 1: Design a relation schema so that it is easy to explain its meaning. Do not combine attributes from multiple entity types and relationship types into a single relation.
As examples, consider
EMP_DEP(Ename, Ssn, Bdate, Address, Dnumber, Dname, Dmgr_ssn)
This mixes employee information with department information.
Another example is
EMP_PROJ(Ssn, Pnumber, Hours, Ename, Pname, Plocation)
Both of these relations can lead to
- insertion anomalies: when adding an employee, we must assign them
to a department or else use NULLs. When adding a new department with no
employees, we have to use NULLs for the employee Ssn, which is supposed
to be the primary key!
- deletion anomalies: if we delete the last EMP_DEP record from a department, we have lost the department!
- update anomalies: what if we update some EMP_DEPs with the new Dmgr_ssn, but not others?
This leads to
E&N Guideline 2: design your
database so there are no insertion, deletion, or update anomalies. If
this is not possible, document any anomalies clearly so that the update
software can take them into account.
The third guideline is about reducing NULLs, at least for frequently
used data. Consider the example of a disjoint set of inherited entity
types, implemented as a single fat entity table. Secretaries,
Engineers, and Managers each have their own attributes; every record
has NULLs for two out of these three.
E&N Guideline 3: NULLs should be
used only for exceptional conditions. If there are many NULLs in a
column, consider a separate table.
The fourth guideline is about joins that give back spurious tuples. Consider
EMP_LOCS(Ename, Plocation)
EMP_PROJ1(Ssn, Pnumber, Hours, Pname, Plocation)
If we join these two tables on the field Plocation, we do not get what we want! (We would if we made EMP_LOCS have Ssn instead of Ename, and then joined the two on the Ssn column.)
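Here is the failure in miniature (Python, with invented sample data): joining on the non-key column Plocation pairs every employee at a location with every project at that location.

```python
# Invented sample data illustrating spurious tuples from joining on
# a non-key attribute.
EMP_LOCS  = {("Smith", "Bellaire"), ("Joyce", "Bellaire")}
EMP_PROJ1 = {  # (Ssn, Pnumber, Hours, Pname, Plocation)
    ("123456789", 1, 32.5, "ProductX", "Bellaire"),
    ("453453453", 2, 20.0, "ProductY", "Bellaire"),
}

joined = {(ename, pname)
          for (ename, loc) in EMP_LOCS
          for (ssn, pno, hrs, pname, ploc) in EMP_PROJ1
          if loc == ploc}
print(sorted(joined))
# Four pairs, though Smith works only on ProductX and Joyce only on ProductY.
```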
E&N Guideline 4: Design relational
schemas so that they can be joined with equality conditions on
attributes that are appropriately related ⟨primary key, foreign key⟩
pairs in a way that guarantees that no spurious tuples are generated.
Avoid relations that contain matching attributes that are not ⟨foreign
key, primary key⟩ combinations.
Functional Dependencies
A functional dependency is a kind of semantic constraint.
If X and Y are sets of attributes (column names) in a relation, a
functional dependency X⟶Y means that if two records have equal values
for X attributes, then they also have equal values for Y.
For example, if X is a set including the key attributes, then X⟶{all attributes}.
Like key constraints, FD constraints are not based on specific sets of
records. For example, in the US, we have {zipcode}⟶{city}, but we no
longer have {zipcode}⟶{areacode}.
In the earlier EMP_PROJ, we have
Ssn ⟶ Ename
Pnumber ⟶ Pname, Plocation
{Ssn, Pnumber} ⟶ Hours
Draw this with arrows. Same for EMP_DEPT.
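Since an FD is a semantic constraint, sample data can refute it but never prove it; still, it is easy to test whether an FD holds in a given set of rows. A Python sketch (the rows are invented for illustration):

```python
# Check a functional dependency X⟶Y against a table given as a list of
# dicts: rows agreeing on X must agree on Y.
def fd_holds(rows, X, Y):
    seen = {}
    for r in rows:
        xval = tuple(r[a] for a in X)
        yval = tuple(r[a] for a in Y)
        if seen.setdefault(xval, yval) != yval:
            return False
    return True

EMP_PROJ = [
    {"Ssn": "123456789", "Pnumber": 1, "Hours": 32.5, "Ename": "Smith"},
    {"Ssn": "123456789", "Pnumber": 2, "Hours": 7.5,  "Ename": "Smith"},
    {"Ssn": "453453453", "Pnumber": 2, "Hours": 20.0, "Ename": "Joyce"},
]
print(fd_holds(EMP_PROJ, ["Ssn"], ["Ename"]))             # True
print(fd_holds(EMP_PROJ, ["Ssn"], ["Hours"]))             # False
print(fd_holds(EMP_PROJ, ["Ssn", "Pnumber"], ["Hours"]))  # True
```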
Along the same lines, a superkey (or key superset) of a relation schema is a set of attributes S such that no two tuples of the relation can have the same values on S. A key is thus a minimal superkey:
it is a superkey with no extraneous attributes that can be removed. For
example, {Ssn, Dno} is a superkey for EMPLOYEE, but Dno doesn't matter
(and in fact contains little information); the key is {Ssn}.
Note that, as with FDs, superkeys are related to the semantics of the relation, not to particular data in the tables.
Relations can have multiple keys, in which case each is called a candidate key.
For example, in table DEPARTMENT, both {dnumber} and {dname} are
candidate keys. For largely arbitrary, performance-related reasons we designate
one of these the primary key; other candidate keys are known as secondary keys.
A prime attribute is an attribute (ie column name) that belongs to some candidate key. A nonprime attribute does not.
Normal Forms and Normalization
Normal Forms are rules for well-behaved relations. Normalization is the
process of converting poorly behaved relations to better behaved ones.
First Normal Form (1NF) means that a relation has no composite attributes or multivalued attributes. Note that dealing with the multi-valued location
attribute of DEPARTMENT meant that we had to create a new table
LOCATIONS. Composite attributes were handled by making each of their
components a separate attribute.
Alternative ways for dealing with the multivalued location attribute
would be making ⟨dnumber, location⟩ the primary key, or supplying a
fixed number of location columns loc1, loc2, loc3, loc4. For the latter
approach, we must know in advance how many locations we will need; this
method also introduces NULL values.
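The separate-table approach looks like this in SQLite (a sketch via Python; the sample data is invented):

```python
import sqlite3

# Multivalued location attribute moved to its own table, one row per
# department,location pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dnumber INT PRIMARY KEY, dname TEXT);
CREATE TABLE locations (
    dnumber  INT REFERENCES department(dnumber),
    location TEXT,
    PRIMARY KEY (dnumber, location)
);
INSERT INTO department VALUES (5, 'Research');
INSERT INTO locations VALUES (5, 'Bellaire'), (5, 'Sugarland'), (5, 'Houston');
""")
locs = [r[0] for r in conn.execute(
    "SELECT location FROM locations WHERE dnumber = 5 ORDER BY location")]
print(locs)   # ['Bellaire', 'Houston', 'Sugarland']
```

Unlike the fixed loc1..loc4 scheme, this handles any number of locations with no NULLs.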
Second Normal Form (2NF) means that, if K represents the set of attributes making up the primary key, every nonprime
attribute A (that is, an attribute not a member of any key) is functionally
dependent on K (ie K⟶A), but that this fails for any proper subset of K
(no proper subset of K functionally determines A). If K contains a
single attribute, this is automatic.
We say that X⟶Y is a full functional dependency if for every proper subset X' of X, X' does not functionally determine Y. Thus, 2NF means that for every nonprime attribute A, the dependency K⟶A is full.
In the earlier EMP_PROJ relationship, the primary key K is {Ssn,
Pnumber}. 2NF fails because {Ssn}⟶Ename, and {Pnumber}⟶Pname,
{Pnumber}⟶Plocation.
To put a table in 2NF, decompose it into sets of attributes which all
have a common full dependency on some subset K' of K. For EMP_PROJ,
this becomes:
⟨Ssn, Pnumber, Hours⟩
⟨Ssn, Ename⟩
⟨Pnumber, Pname, Plocation⟩
Note that Hours is the only attribute with a full FD on {Ssn,Pnumber}.
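We can check on sample data that this decomposition loses nothing: project EMP_PROJ onto the three schemas, then natural-join back (a Python sketch; the rows are invented):

```python
# Invented EMP_PROJ rows: (Ssn, Pnumber, Hours, Ename, Pname, Plocation).
EMP_PROJ = {
    ("123456789", 1, 32.5, "Smith", "ProductX", "Bellaire"),
    ("123456789", 2, 7.5,  "Smith", "ProductY", "Sugarland"),
    ("453453453", 2, 20.0, "Joyce", "ProductY", "Sugarland"),
}
# The three projections of the 2NF decomposition above.
WORKS = {(ssn, pno, hrs) for (ssn, pno, hrs, en, pn, pl) in EMP_PROJ}
EMP   = {(ssn, en)       for (ssn, pno, hrs, en, pn, pl) in EMP_PROJ}
PROJ  = {(pno, pn, pl)   for (ssn, pno, hrs, en, pn, pl) in EMP_PROJ}

# Natural-join the pieces back together on Ssn and Pnumber.
rejoined = {(ssn, pno, hrs, en, pn, pl)
            for (ssn, pno, hrs) in WORKS
            for (s2, en) in EMP if s2 == ssn
            for (p2, pn, pl) in PROJ if p2 == pno}
print(rejoined == EMP_PROJ)   # True: the decomposition is lossless here
```

The rejoin recovers exactly the original because the join columns Ssn and Pnumber are keys of the two smaller tables.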
Note that we might have a table ⟨K1, K2, K3, A1, A2, A3⟩, where
{K1,K2,K3}⟶A1 is full
{K1,K2}⟶A2 is full
{K2,K3}⟶A2 is full
{K1,K3}⟶A3 is full
{K2,K3}⟶A3 is full
The decomposition could be
⟨K1, K2, K3, A1⟩
⟨K1, K2, A2⟩
⟨K1, K3, A3⟩
or it could be
⟨K1, K2, K3, A1⟩
⟨K2, K3, A2, A3⟩
Remember, dependency constraints can be arbitrary!