Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 7
Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.
Midterm: March 22
A brief word on auto-commit
To commit your SQL updates is to write the changes to the permanent database; in this sense, it is like saving a file. So far we've assumed that a commit operation is performed after every insert or update; this situation is called auto-commit. Generally, auto-commit mode is an attribute of your database connection; if auto-commit is true, then a commit is performed after every SQL statement that potentially alters the database.
The alternative to auto-commit is to execute a group of updates, and then explicitly invoke the commit
operation at the end, to commit all the updates together. A group of
SQL statements between consecutive commits is then called a transaction; all the statements of the transaction are committed
together. Usually, though, we want a stronger assurance: that all the
statements of the transaction either succeed, or none of them do (this
is implicit if we know that commits always succeed, but this is not the case in the real world). This is known as the atomicity
requirement, the first part of the ACID test (atomicity, consistency,
isolation, durability). The idea is that a transaction should be atomic, that is, indivisible: the individual queries that make it up should be executed as a unit.
Instead of a commit, a user may also issue a rollback, which means to throw away all the actions back to the previous commit, thus discarding the transaction.
We'll stick with auto-commit for a while longer, but be aware of two things:
- This is why there is no "save" operation
- auto-commit is not universal; sometimes you need manual control
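As a concrete illustration, here is manual commit control in Python's sqlite3 module (a minimal sketch; the account table and its data are invented for this example):

```python
import sqlite3

# Invented example table; sqlite3 does NOT auto-commit DML by default,
# so the two updates below form one transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Two updates that must succeed or fail together: a transfer.
    conn.execute("UPDATE account SET balance = balance - 60 WHERE name='alice'")
    conn.execute("UPDATE account SET balance = balance + 60 WHERE name='bob'")
    conn.commit()        # both updates become permanent together
except sqlite3.Error:
    conn.rollback()      # discard everything back to the previous commit

print(dict(conn.execute("SELECT name, balance FROM account")))
# {'alice': 40, 'bob': 60}
```

If either UPDATE raises an error, the rollback discards both, which is exactly the atomicity guarantee described above.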
Ternary and other higher-degree relationships
Consider the
SUPPLY relationship on a supplier s, project j, and part p. The tuple
⟨s,j,p⟩ is included if s supplies part p for project j.
We might try to model this with three binary relationships,
SUPPLIES(s,j), CAN_SUPPLY(s,p), and USES(j,p). It is true that if
⟨s,j,p⟩ is in SUPPLY, then ⟨s,j⟩ is in SUPPLIES, ⟨s,p⟩ is in
CAN_SUPPLY, and ⟨j,p⟩ is in USES. But the converse is not true.
See Fig 7.17.
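A quick way to see why the converse fails is to rebuild SUPPLY from its three projections and watch a spurious tuple appear (a small Python sketch; the sample tuples are invented):

```python
# Invented data showing why the three binary projections of SUPPLY
# do not determine SUPPLY itself.
SUPPLY = {("s1", "j1", "p1"), ("s1", "j2", "p2"), ("s2", "j1", "p2")}

SUPPLIES   = {(s, j) for (s, j, p) in SUPPLY}
CAN_SUPPLY = {(s, p) for (s, j, p) in SUPPLY}
USES       = {(j, p) for (s, j, p) in SUPPLY}

# Rebuild every triple consistent with all three projections.
rebuilt = {(s, j, p)
           for (s, j) in SUPPLIES
           for (s2, p) in CAN_SUPPLY if s2 == s
           if (j, p) in USES}

print(rebuilt - SUPPLY)   # {('s1', 'j1', 'p2')}: a spurious tuple
```

Here s1 supplies p1 for j1 and p2 for j2, and j1 also uses p2 (from s2), so the reconstruction wrongly concludes s1 supplies p2 for j1.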
As with binary relationships, the key of a ternary relationship is a triple: one key from each participating entity.
Ternary relationships can be problematic, and so we often include corresponding binary relationships.
One approach is to model a ternary relationship as a weak entity, with
three identifying relationships (Fig 7.17(c)). Or we can give SUPPLY a
synthetic key, supply_id, and then relate it to SUPPLIER, PROJECT, and
PART by binary relationships.
The next example is OFFERS, for a school database; see Fig 7.18;
⟨i,s,c⟩ belongs to OFFERS if INSTRUCTOR i teaches COURSE c during
SEMESTER s. Again, the binary projections fail to adequately model the
ternary relationship. E&N suggest that if one of the binary
relationships is 1:1 (eg if the CAN_TEACH(i,c) relationship is 1:1),
then this does work, but that is seldom if ever the case.
Actually, in a real school database one would not use OFFERS, one would
use SECTION. The latter would likely be an entity, complete with
synthetic key section_id.
Ternary relationships can have cardinality tags, like binary
relationships, but they are not as straightforward. For the SUPPLY
relationship, suppose each project,part pair (j,p) can have only one
supplier. Then we might put a 1 on the SUPPLIER link. But it might also
be the case that each project j uses a unique supplier,part pair (s,p)
(that is, each supplier can supply only one part to each project). We
now have an unrelated 1 on the PROJECT link.
The (min,max) relationship is more straightforward: E------(m,M)-----R
means that for each e in E, there are at least m tuples involving e in
R, and at most M.
Translation from higher-degree relationships to SQL table definitions
is done the same way as for binary M:N relationships: we create a table
consisting of columns for keys for each of the participating entities,
and any relationship attributes. The entity keys form the primary key
for the new table, and each entity key has a foreign key constraint
referring back to its defining entity table.
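A sketch of this translation for the SUPPLY relationship, in SQLite syntax via Python (all table and column names here, including the quantity attribute, are illustrative):

```python
import sqlite3

# Illustrative schema: a ternary relationship becomes a table with one
# key column per participating entity, plus relationship attributes.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE supplier (sname   TEXT PRIMARY KEY);
CREATE TABLE project  (jname   TEXT PRIMARY KEY);
CREATE TABLE part     (partnum INT  PRIMARY KEY);
CREATE TABLE supply (
    sname    TEXT REFERENCES supplier(sname),
    jname    TEXT REFERENCES project(jname),
    partnum  INT  REFERENCES part(partnum),
    quantity INT,                        -- a relationship attribute
    PRIMARY KEY (sname, jname, partnum)  -- one key from each entity
);
""")
conn.execute("INSERT INTO supplier VALUES ('s1')")
conn.execute("INSERT INTO project VALUES ('j1')")
conn.execute("INSERT INTO part VALUES (7)")
conn.execute("INSERT INTO supply VALUES ('s1', 'j1', 7, 100)")
```

The three foreign-key constraints enforce that every SUPPLY tuple refers back to existing entity rows.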
The Enhanced ER (EER) model
In this model, we allow for some forms of inheritance.
Figure 8.1 is a starting point. Note that there are several kinds of employee; some kinds participate in relationships and some do not.
Ultimately, we may implement the diagram of Fig 8.1 with an EMPLOYEE
table, and also tables for SECRETARY, TECHNICIAN, ENGINEER, MANAGER,
and HOURLY_EMPLOYEE, each indexed by the ssn and having additional
columns for the subclass-specific attributes.
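The table-per-subclass layout just described can be sketched as follows (SQLite via Python; the subclass-specific columns typing_speed and tgrade stand in for the actual Fig 8.1 attributes):

```python
import sqlite3

# Each subclass table shares the superclass key ssn and carries only
# its subclass-specific attributes (names here are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employee (
    ssn  CHAR(9) PRIMARY KEY,
    name TEXT
);
CREATE TABLE secretary (
    ssn          CHAR(9) PRIMARY KEY REFERENCES employee(ssn),
    typing_speed INT
);
CREATE TABLE technician (
    ssn    CHAR(9) PRIMARY KEY REFERENCES employee(ssn),
    tgrade INT
);
""")
# To be in a subclass, you must also be in the superclass:
conn.execute("INSERT INTO employee VALUES ('123456789', 'Smith')")
conn.execute("INSERT INTO secretary VALUES ('123456789', 80)")
```

The foreign key from each subclass table back to employee enforces the rule that subclass membership implies superclass membership.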
To be in a subclass, you must also be in the superclass.
Note that some subclasses have subclass-specific attributes, and other
subclasses have subclass-specific participation in relationships.
The circled d in Fig 8.1 stands for "disjoint"; one cannot be a SECRETARY and a TECHNICIAN. However, one can
simultaneously be a SECRETARY, a MANAGER, and an HOURLY_EMPLOYEE. In
general, membership in multiple subclasses is to be allowed unless
explicitly forbidden with the (d) notation. The alternative to (d) is
(o), for overlapping.
Fig 8.2 example (showing disjointness)
Generalization is the process of realizing that two existing entities,
CAR and TRUCK, are really both instances of VEHICLE. See fig 8.3.
Sometimes subclass membership is determined by a boolean expression
involving the parent class (eg jobtype = engineer). In other examples,
subclass membership represents a form of new data; these are user-defined subclasses.
Besides disjoint/overlapping, subclasses may be described as total or
partial. Total means that every member of the base class must be in
some subclass (ie, in Java terminology, that the base class is abstract). Partial means that base-class-only objects may exist.
Multiple inheritance means that we may end up with a lattice of relationships: see Fig 8.6.
More on Foreign Keys
Here's the seven-step ER-to-relation algorithm again, slightly simplified:
- create a table for each regular entity
- create a table for each weak entity, adding the key field from the owner entity as a foreign key for the new entity.
- for binary 1:1 relationships between entities E1 and E2, pick one of
them (eg E1) and add to it a field containing the key of E2. Make this
a foreign key in E1.
- for binary 1:N relationships between E1 and E2,
E1---1---R---N---E2, add a column to E2 containing the key of E1. Make
this a foreign key in E2.
- For binary N:M relationships between E1 and E2, create a new
table R consisting of ⟨E1.key, E2.key, R.attributes⟩. Make E1.key and
E2.key foreign keys in R.
- For multivalued attributes of entity E, create a new relation R.
One column of R will be E.key; this should be a foreign key in R.
- ternary and higher-degree relationships: like step 5.
Joins arise in steps 2, 3, 4, 5, 6, and 7, for recovering the original relationships (or attribute sets for 6, or entities for 2). In every case, the join field is a key of one relation and a foreign key in the other.
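For instance, the N:M table from step 5 joins back to its entity table on exactly such a key/foreign-key pair (a Python/SQLite sketch; the sample rows are invented):

```python
import sqlite3

# Recovering the original relationship by joining works_on (step 5's
# table) back to employee on the key/foreign-key pair ssn.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (ssn TEXT PRIMARY KEY, ename TEXT);
CREATE TABLE works_on (ssn TEXT REFERENCES employee(ssn),
                       pnumber INT, hours REAL,
                       PRIMARY KEY (ssn, pnumber));
INSERT INTO employee VALUES ('123456789', 'Smith'), ('453453453', 'Joyce');
INSERT INTO works_on VALUES ('123456789', 1, 32.5), ('453453453', 1, 20.0);
""")
rows = conn.execute("""
    SELECT e.ename, w.pnumber, w.hours
    FROM employee e JOIN works_on w ON e.ssn = w.ssn
    ORDER BY e.ename
""").fetchall()
print(rows)   # [('Joyce', 1, 20.0), ('Smith', 1, 32.5)]
```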
Not all joins are about recovering relations from an ER diagram.
XML
XML is a markup language more structured than HTML. Its strict
hierarchical structure makes it suitable for textual encoding of
databases; see Fig 12.3 and below.
While the encoding of Fig 12.3 looks promising, note that the
<Worker> entries are more complex than they are in our relational
database; they contain more than just a key to the EMPLOYEE table and
the hours attribute. Arguably,
even the employee and hours information really should be in a separate
XML section <works_on>....</works_on>.
<?xml version="1.0" standalone="yes"?>
<Projects>
<Project>
<Name>ProductX</Name>
<Number>1</Number>
<Location>Bellaire</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Last_name>Smith</Last_name>
<Hours>32.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<First_name>Joyce</First_name>
<Hours>20.0</Hours>
</Worker>
</Project>
<Project>
<Name>ProductY</Name>
<Number>2</Number>
<Location>Sugarland</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Hours>7.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<Hours>20.0</Hours>
</Worker>
<Worker>
<Ssn>333445555</Ssn>
<Hours>10.0</Hours>
</Worker>
</Project>
</Projects>
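We can check mechanically that a document of this shape is well-formed and walk its hierarchy with the standard-library parser (the document is abbreviated to one Project here):

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the Projects document above, parsed with the
# stdlib ElementTree module.
doc = """<?xml version="1.0" standalone="yes"?>
<Projects>
 <Project>
  <Name>ProductX</Name>
  <Number>1</Number>
  <Location>Bellaire</Location>
  <Dept_no>5</Dept_no>
  <Worker><Ssn>123456789</Ssn><Last_name>Smith</Last_name><Hours>32.5</Hours></Worker>
  <Worker><Ssn>453453453</Ssn><First_name>Joyce</First_name><Hours>20.0</Hours></Worker>
 </Project>
</Projects>"""

root = ET.fromstring(doc)
for proj in root.findall("Project"):
    workers = proj.findall("Worker")
    print(proj.findtext("Name"), len(workers))   # ProductX 2
```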
One approach to imposing field structure on XML documents is illustrated by Fig 12.4,
which shows a small XML DTD (Document Type Definition) document. Note
that #PCDATA essentially means a string here. DTD allows us to specify
that
- Projects is a sequence of one or more Project objects
- A Project has a Name, Number, Location, Dept_no (optional), and Workers. It also has a ProjId attribute.
- Workers is a sequence of zero or more Worker objects.
- Worker objects have Ssn, Last_name (opt), First_name (opt), and Hours.
We can now verify that Fig 12.3 actually fits this form. If we wanted to enforce this, we might change the first line to
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
This allows us to enforce which fields can be present, but doesn't
really spell out the full schema involving other tables. For this we
move on to XML schema,
illustrated in Fig 12.5. While this format is more complex, it also
supports unordered data, and shares a more general XML syntax. See XMLschema1 and XMLschema2.
Note in the first of these the three entities department, employee, and
project. The locations table will later be listed as a multivalued
attribute of Department, and dependents will be a multivalued attribute
of Employee. WorksOn will be a multivalued attribute of Employee, and
Worker a multivalued attribute of Project.
Design Guidelines and Normalization
Normalization refers to a mathematical process for decomposing tables
into component tables; it is part of a broader area of database design.
Our first design point will be to make sure that relation and entity
attributes have clear semantics. That is, we can easily explain
attributes, and they are all related. E&N spells this out as
follows:
E&N Guideline 1: Design a relation schema so that it is easy to explain its meaning. Do not combine attributes from multiple entity types and relationship types into a single relation.
As examples, consider
EMP_DEP(Ename, Ssn, Bdate, Address, Dnumber, Dname, Dmgr_ssn)
This mixes employee information with department information.
Another example is
EMP_PROJ(Ssn, Pnumber, Hours, Ename, Pname, Plocation)
Both of these relations can lead to
- insertion anomalies: when adding an employee, we must assign them
to a department or else use NULLs. When adding a new department with no
employees, we have to use NULLs for the employee Ssn, which is supposed
to be the primary key!
- deletion anomalies: if we delete the last EMP_DEP record from a department, we have lost the department!
- update anomalies: what if we update some EMP_DEPs with the new Dmgr_ssn, but not others?
This leads to
E&N Guideline 2: design your
database so there are no insertion, deletion, or update anomalies. If
this is not possible, document any anomalies clearly so that the update
software can take them into account.
The third guideline is about reducing NULLs, at least for frequently
used data. Consider the example of a disjoint set of inherited entity
types, implemented as a single fat entity table. Secretaries,
Engineers, and Managers each have their own attributes; every record
has NULLs for two out of these three.
E&N Guideline 3: NULLs should be
used only for exceptional conditions. If there are many NULLs in a
column, consider a separate table.
The fourth guideline is about joins that give back spurious tuples. Consider
EMP_LOCS(Ename, Plocation)
EMP_PROJ1(Ssn, Pnumber, Hours, Pname, Plocation)
If we join these two tables on the field Plocation, we do not get what we want! (We would if we made EMP_LOCS have Ssn instead of Ename, and then joined the two on the Ssn column.)
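Here is the failure in miniature (Python, with invented sample data): joining on the non-key column Plocation pairs every employee at a location with every project at that location.

```python
# Invented sample data illustrating spurious tuples from joining on
# a non-key attribute.
EMP_LOCS  = {("Smith", "Bellaire"), ("Joyce", "Bellaire")}
EMP_PROJ1 = {  # (Ssn, Pnumber, Hours, Pname, Plocation)
    ("123456789", 1, 32.5, "ProductX", "Bellaire"),
    ("453453453", 2, 20.0, "ProductY", "Bellaire"),
}

joined = {(ename, pname)
          for (ename, loc) in EMP_LOCS
          for (ssn, pno, hrs, pname, ploc) in EMP_PROJ1
          if loc == ploc}
print(sorted(joined))
# Four pairs, though Smith works only on ProductX and Joyce only on ProductY.
```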
E&N Guideline 4: Design relational
schemas so that they can be joined with equality conditions on
attributes that are appropriately related ⟨primary key, foreign key⟩
pairs in a way that guarantees that no spurious tuples are generated.
Avoid relations that contain matching attributes that are not ⟨foreign
key, primary key⟩ combinations.
Functional Dependencies
A functional dependency is a kind of semantic constraint.
If X and Y are sets of attributes (column names) in a relation, a
functional dependency X⟶Y means that if two records have equal values
for X attributes, then they also have equal values for Y.
For example, if X is a set including the key attributes, then X⟶{all attributes}.
Like key constraints, FD constraints are not based on specific sets of
records. For example, in the US, we have {zipcode}⟶{city}, but we no
longer have {zipcode}⟶{areacode}.
In the earlier EMP_PROJ, we have
Ssn ⟶ Ename
Pnumber ⟶ Pname, Plocation
{Ssn, Pnumber} ⟶ Hours
Draw this with arrows. Same for EMP_DEPT.
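Since an FD is a semantic constraint, sample data can refute it but never prove it; still, it is easy to test whether an FD holds in a given set of rows. A Python sketch (the rows are invented for illustration):

```python
# Check a functional dependency X⟶Y against a table given as a list of
# dicts: rows agreeing on X must agree on Y.
def fd_holds(rows, X, Y):
    seen = {}
    for r in rows:
        xval = tuple(r[a] for a in X)
        yval = tuple(r[a] for a in Y)
        if seen.setdefault(xval, yval) != yval:
            return False
    return True

EMP_PROJ = [
    {"Ssn": "123456789", "Pnumber": 1, "Hours": 32.5, "Ename": "Smith"},
    {"Ssn": "123456789", "Pnumber": 2, "Hours": 7.5,  "Ename": "Smith"},
    {"Ssn": "453453453", "Pnumber": 2, "Hours": 20.0, "Ename": "Joyce"},
]
print(fd_holds(EMP_PROJ, ["Ssn"], ["Ename"]))             # True
print(fd_holds(EMP_PROJ, ["Ssn"], ["Hours"]))             # False
print(fd_holds(EMP_PROJ, ["Ssn", "Pnumber"], ["Hours"]))  # True
```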
Along the same lines, a superkey (or key superset) of a relation schema is a set of attributes S such that no two tuples of the relation can have the same values on S. A key is thus a minimal superkey:
it is a superkey with no extraneous attributes that can be removed. For
example, {Ssn, Dno} is a superkey for EMPLOYEE, but Dno doesn't matter
(and in fact contains little information); the key is {Ssn}.
Note that, as with FDs, superkeys are related to the semantics of the relation, not to particular data in the tables.
Relations can have multiple keys, in which case each is called a candidate key.
For example, in table DEPARTMENT, both {dnumber} and {dname} are
candidate keys. For largely arbitrary, performance-related reasons we designate
one of these the primary key; other candidate keys are known as secondary keys.
A prime attribute is an attribute (ie column name) that belongs to some candidate key. A nonprime attribute does not.
Normal Forms and Normalization
Normal Forms are rules for well-behaved relations. Normalization is the
process of converting poorly behaved relations to better behaved ones.
First Normal Form (1NF) means that a relation has no composite attributes or multivalued attributes. Note that dealing with the multi-valued location
attribute of DEPARTMENT meant that we had to create a new table
LOCATIONS. Composite attributes were handled by making each of their
components a separate attribute.
Alternative ways for dealing with the multivalued location attribute
would be making ⟨dnumber, location⟩ the primary key, or supplying a
fixed number of location columns loc1, loc2, loc3, loc4. For the latter
approach, we must know in advance how many locations we will need; this
method also introduces NULL values.
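The separate-table approach looks like this in SQLite (a sketch via Python; the sample data is invented):

```python
import sqlite3

# Multivalued location attribute moved to its own table, one row per
# department,location pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dnumber INT PRIMARY KEY, dname TEXT);
CREATE TABLE locations (
    dnumber  INT REFERENCES department(dnumber),
    location TEXT,
    PRIMARY KEY (dnumber, location)
);
INSERT INTO department VALUES (5, 'Research');
INSERT INTO locations VALUES (5, 'Bellaire'), (5, 'Sugarland'), (5, 'Houston');
""")
locs = [r[0] for r in conn.execute(
    "SELECT location FROM locations WHERE dnumber = 5 ORDER BY location")]
print(locs)   # ['Bellaire', 'Houston', 'Sugarland']
```

Unlike the fixed loc1..loc4 scheme, this handles any number of locations with no NULLs.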
Second Normal Form (2NF) means that, if K represents the set of attributes making up the primary key, every nonprime
attribute A (that is, an attribute not a member of any key) is functionally
dependent on K (ie K⟶A), but that this fails for any proper subset of K
(no proper subset of K functionally determines A). If K contains a
single attribute, this is automatic.
We say that X⟶Y is a full functional dependency if for every proper subset X' of X, X' does not functionally determine Y. Thus, 2NF means that for every nonprime attribute A, the dependency K⟶A is full.
In the earlier EMP_PROJ relationship, the primary key K is {Ssn,
Pnumber}. 2NF fails because {Ssn}⟶Ename, and {Pnumber}⟶Pname,
{Pnumber}⟶Plocation.
To put a table in 2NF, decompose it into sets of attributes which all
have a common full dependency on some subset K' of K. For EMP_PROJ,
this becomes:
⟨Ssn, Pnumber, Hours⟩
⟨Ssn, Ename⟩
⟨Pnumber, Pname, Plocation⟩
Note that Hours is the only attribute with a full FD on {Ssn,Pnumber}.
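We can check on sample data that this decomposition loses nothing: project EMP_PROJ onto the three schemas, then natural-join back (a Python sketch; the rows are invented):

```python
# Invented EMP_PROJ rows: (Ssn, Pnumber, Hours, Ename, Pname, Plocation).
EMP_PROJ = {
    ("123456789", 1, 32.5, "Smith", "ProductX", "Bellaire"),
    ("123456789", 2, 7.5,  "Smith", "ProductY", "Sugarland"),
    ("453453453", 2, 20.0, "Joyce", "ProductY", "Sugarland"),
}
# The three projections of the 2NF decomposition above.
WORKS = {(ssn, pno, hrs) for (ssn, pno, hrs, en, pn, pl) in EMP_PROJ}
EMP   = {(ssn, en)       for (ssn, pno, hrs, en, pn, pl) in EMP_PROJ}
PROJ  = {(pno, pn, pl)   for (ssn, pno, hrs, en, pn, pl) in EMP_PROJ}

# Natural-join the pieces back together on Ssn and Pnumber.
rejoined = {(ssn, pno, hrs, en, pn, pl)
            for (ssn, pno, hrs) in WORKS
            for (s2, en) in EMP if s2 == ssn
            for (p2, pn, pl) in PROJ if p2 == pno}
print(rejoined == EMP_PROJ)   # True: the decomposition is lossless here
```

The rejoin recovers exactly the original because the join columns Ssn and Pnumber are keys of the two smaller tables.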
Note that we might have a table ⟨K1, K2, K3, A1, A2, A3⟩, where
{K1,K2,K3}⟶A1 is full
{K1,K2}⟶A2 is full
{K2,K3}⟶A2 is full
{K1,K3}⟶A3 is full
{K2,K3}⟶A3 is full
The decomposition could be
⟨K1, K2, K3, A1⟩
⟨K1, K2, A2⟩
⟨K1, K3, A3⟩
or it could be
⟨K1, K2, K3, A1⟩
⟨K2, K3, A2, A3⟩
Remember, dependency constraints can be arbitrary!