Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 10
Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.
A superkey (or key superset) of a relation schema is a set of attributes S so that no two tuples of the relationship can have the same values on S. A key is thus a minimal superkey:
it is a superkey with no extraneous attributes that can be removed. For
example, {Ssn, Dno} is a superkey for EMPLOYEE, but Dno doesn't matter
(and in fact contains little information); the key is {Ssn}.
Note that, as with FDs, superkeys are related to the sematics of the relationships, not to particular data in the tables.
Relations can have multiple keys, in which case each is called a candidate key.
For example, in table DEPARTMENT, both {dnumber} and {dname} are
candidate keys. For arbitrary performance-related reasons we designated
one of these the primary key; other candidate keys are known as secondary keys.
A prime attribute is an attribute (ie column name) that belongs to some candidate key. A nonprime attribute is not part of any key.
A dependency X⟶A is full if the dependency fails for every proper subset X' of X; the dependency is partial if not, ie if there is a proper subset X' of X such that X'⟶A.
Third Normal Form
Third Normal Form (3NF) means that the relation is in 2NF and also
there is no dependency X⟶A for nonprime attribute A and for attribute
set X that does not contain
a candidate key (ie X is not a superkey). In other words, if X⟶A holds
for some nonprime A, then X must be a superkey. (For comparison, 2NF
says that if X⟶A for nonprime A, then X cannot be a proper subset of
any key, but X can still overlap with a key or be disjoint from a key.)
If X is
a superkey, then X⟶A is automatic for all A. The remaining case is
where X may contain some (but not all) key attributes, and also some
nonkey attributes. An example might be a relation with attributes ⟨K1,
K2, A, B⟩, where K1,K2 is the key. If we have a dependency K1,A⟶B,
then this violates 3NF. A dependency A⟶B would also violate 3NF.
Either of these can be fixed by factoring out:
if X⟶A is a functional dependency, then the result of factoring out by
this dependency is to remove column A from the original table, and to
create a new table ⟨X,A⟩. For example, if the ⟨K1,
K2, A, B⟩ has dependency K1,A⟶B, we create
two new tables ⟨K1,
K2, A⟩ and ⟨K1, A, B⟩. If we were factoring out A⟶B, we would create new tables ⟨K1,
K2, A⟩ and ⟨A,B⟩. Both the resultant tables are projections of the original; in the second case, we also have to remove duplicates.
One question that comes up when we factor is whether it satisfies the nonadditive join property (or lossless join property):
if we join the two resultant tables on the "factor" column, are we
guaranteed that we will recover the original table exactly. Consider
the decomposition of R = ⟨K1,
K2, A, B⟩ above on the dependency K1,A⟶B into R1 = ⟨K1,
K2, A⟩ and R2 = ⟨K1, A,
B⟩, and then we form the join R1⋈R2 on the columns K1,A. If ⟨k1,k2,a,b⟩
is a record in R, then ⟨k1,k2,b⟩ is in R1 and ⟨k1,a,b⟩ is in R2 and so
⟨k1,k2,a,b⟩ is in R1⋈R2; this is the easy direction and does not
require any hypotheses about constraints.
The harder question is making sure R1⋈R2 does not contain added
records. If ⟨k1,k2,a,b⟩ is in R1⋈R2, we know that it came from
⟨k1,k2,a⟩ in R1 and ⟨k1,a,b⟩ in R2. Each of these partial records came
from the decomposition, so there must be b' so ⟨k1,k2,a,b'⟩ is in R,
and there must be k2' so ⟨k1,k2',a,b⟩ is in R, but in the most general
case we need not have b=b' or k2=k2'. Here we use the key constraint,
though: if ⟨k1,k2,a,b⟩ is in R, and ⟨k1,k2,a,b'⟩ is in R, and k1,k2 is
the key, then b=b'. Alternatively we could have used the dependency K1,A⟶B: if this dependency holds, then it means that if R contains ⟨k1,k2,a,b⟩ and ⟨k1,k2,a,b'⟩, then b=b'.
This worked for either of two reasons: R1 contained the original key,
and R2's new key was the lefthand side of a functional dependency that
held in R.
In general, if we factor a relation R=⟨A,B,C⟩ into R1=⟨A,B⟩ and
R2=⟨A,C⟩, by projection, then the join R1⋈R2 on column A will be much
larger. As a simple example, consider Works_on = ⟨essn,pno,hours⟩; if
we factor into ⟨essn,pno⟩ and ⟨pno,hours⟩ and then rejoin, then
essn,pno will no longer even be a key. Using two records of the
original data,
123456789
|
1
|
32.5
|
453453453
|
1
|
20
|
we see that the factored tables would contain ⟨123456789,1⟩ and ⟨1,20⟩,
and so the join would contain ⟨123456789,1,20⟩ violating the key
constraint.
Multiple factoring outcomes
Consider a relation ⟨K1, K2, A, B, C⟩ where K1,K2 is the key and we have dependencies K1⟶B and B⟶C. If we try to put into 2NF first,
by "factoring out" K1⟶B, we get tables ⟨K1,K2,A,C⟩ and ⟨K1,B⟩; the
dependency B⟶C is no longer expressible in terms of the tables. But if
we start by factoring out B⟶C, we get ⟨K1,K2,A,B⟩ and ⟨B,C⟩; we can now
factor out K1⟶B which yields relations ⟨K1,K2,A⟩, ⟨K1,B⟩ and ⟨B,C⟩; all
functional dependencies have now been transformed into key constraints. Factoring can lose dependencies, or, more accurately, make them no longer expressible except in terms of the re-joined tables.
The relationship EMP_DEPT of E&N fig 15.11 (p 524) is not 3NF, because of the dependency dnumber ⟶ dname,dmgr_ssn.
The LOTS1 relation generated last week (E&N fig 15.12, p 527) is not 3NF, because of Area ⟶ Price. So we
factor on Area ⟶ Price, dividing into LOTS1A(property_ID, county,lot_num,area) and
LOTS1B(area,price). Another approach would be to drop price entirely,
if it is in fact proportional to area.
Boyce-Codd Normal Form
BCNF requires that whenever there is a nontrivial functional dependency
X⟶A, then X is a superkey. It differs from 3NF in that 3NF requires either that X be a superkey or that A be prime (a member of some key). To put it another way, BCNF bans all nontrivial nonsuperkey dependencies X⟶A; 3NF makes an exception if A is prime.
As for 3NF, we can use factoring to put a set of tables into BCNF. However, there is now a serious problem: by factoring out a prime attribute A, we can destroy an existing key constraint! This is undesireable.
The canonical example of a relation in 3NF but not BCNF is ⟨A, B, C⟩ where we also have C⟶B. Factoring as above leads to ⟨A, C⟩ and ⟨C, B⟩. We have lost the key A,B! However, this isn't quite all it appears, because from C⟶B we can conclude A,C⟶B, and thus that A,C is also a key, and might be a better choice of key than A,B.
LOTS1A from last week was 3NF and BCNF. But now let us suppose that DeKalb county lots
have sizes <= 1.0 acres, while Fulton county lots have sizes >1.0
acres; this means we now have an additional dependency FD5: area⟶county. This
violates BCNF, but not 3NF as county is a prime attribute.
If we fix LOTS1A as in Fig 15.13, dividing into LOTS1AX(property_ID,area,lot_num) and LOTS1AY(area,county), then we lose the functional dependency FD2: (county,lot_num)⟶property_ID. Where has it gone? This was more than just a random FD; it was a candidate key for LOTS1A.
All databases enforce primary-key constraints. One coud use a CHECK
statement to enforce the lost FD2 statement, but it is often a lost
cause.
CHECK (not exists (select ay.county, ax.lot_num, ax.property_ID, ax2.property_ID
from LOTS1AX ax, LOTS1AX ax2, LOTS1AY ay
where ax.area = ay.area and ax2.area = ay.area // join condition
and ax.lot_num = ax2.lot_num
and ax.property_ID <> ax2.property_ID))
We might be better off ignoring FD5 here, and just allowing for the
possibility that area does not determine county, or determines it only
"by accident".
PHP and MySQL
For connections to MySQL, we will use the PEAR library, known as MDB2.
Now let's look at some examples where we actually interact with the
database. First, a handy utility. Data from a query will come back as a
resultSet, essentially a sequence of rows, ie a result table. We want
to print this table as an html table, with column headers and divider lines. This basically means
- beginning and ending the whole thing with <table> ... </table>
- beginning and ending each row with <tr> ... </tr>
- beginning and ending each data item with <td> ... </td>
Here is the code. We retrieve column names with getColumnNames(), and
use a foreach loop with "key" mode. We then retrieve data rows with
fetchRow(), in ORDERED (versus ASSOC) mode. Rows themselves are
int-indexed arrays. To fetch columns of a row, we use an ordinary php
while loop; if $row is the row, then its length is N=count($row) and
its components are $row[0] through $row[N-1]. Note that we define
$nullval to be the string to be printed whenever NULL data values are encountered.
function table_format($queryres) {
$nullval="null"; // can use "" as desired
print "<table border=\"1\">";
$colnames = $queryres -> getColumnNames(); // $colnames is an array here
print "<tr>";
foreach ($colnames as $colname => $colnum) { // binds $colname to key, $colnum to value
print "<th>";
print $colname;
print "</th>";
}
print "</tr>\n";
while ($row = $queryres -> fetchRow(MDB2_FETCHMODE_ORDERED)) {
print "<tr>";
$j=0;
while ($j < count($row)) {
print "<td>";
if ($row[$j] == NULL) print $nullval; else print $row[$j];
$j++;
print "</td>";
}
print "</tr>\n";
$i++;
}
print "</table>";
}
Next, we need to connect to the database. That's here:
$user='pld';
$password='blue';
$database="office";
$added=0;$fail=0;
$db=MDB2::connect("mysqli://$user:$password@localhost/$database");
if (MDB2::isError($db)) {
die("cannot connect: " . $db->getMessage());
}
What happens if the password or database name is wrong?
Next, here's a simple query (empquery.php)
$query = "select e.fname, e.lname, e.ssn, e.bdate, e.salary, d.dname
from employee e, department d where e.dno=d.dnumber and e.dno=? and e.lname=?";
$types = array('integer', 'text');
$qstmt = $db->prepare($query, $types, MDB2_PREPARE_RESULT);
$queryargs = array(5, "O'Hara");
$qres = $qstmt->execute($queryargs);
if (MDB2::isError($qres)) {
print("query not successful: " . $qres->getMessage());
$fail=1;
}
print "data table:<p>";
table_format($qres);
All this does is retrieves the results of the given (parameterized)
query, and prints it as a table. There is no "form-building" per se.
But notice the prepared statement, and the use of the array $types to
indicate the parameter types.
Let's merge it with the previous example to get a query that starts
with a box asking for a last name, and then prints the employee table
for that last name in department 5 (empquery2.php). The main part is
if ($_POST['lname']) {
print("Here is the data");
runquery($_POST['lname']);
} else {
// create the form
$action=$_SERVER['PHP_SELF'];
print <<<_HTML_
<FORM method="post" action="$action">
Enter the employee last name: <input type="text" name="lname">
<BR>
<INPUT type="submit" value="Submit name">
</FORM>
_HTML_;
}
Then runquery() does this:
function runquery($lname) {
$user='pld';
$password='blue';
$database="office";
$added=0;
$fail=0;
$db=MDB2::connect("mysqli://$user:$password@localhost/$database");
if (MDB2::isError($db)) {die("cannot connect: " . $db->getMessage());}
$query = "select e.fname, e.lname, e.ssn, e.bdate, e.salary, d.dname
from employee e,department d where e.dno=d.dnumber and e.dno=? and e.lname=?";
$types = array('integer', 'text');
$qstmt = $db->prepare($query, $types, MDB2_PREPARE_RESULT);
$queryargs = array(5, $lname);
$qres = $qstmt->execute($queryargs); // supply parameters
if (MDB2::isError($qres)) {
print("query not successful: " . $qres->getMessage());
$fail=1;
}
print "data table:<p>";
table_format($qres);
}
Notice the array $types, specifying the type of each of the ? placeholders. Notice also how we use $queryargs.
PHP errors and debugging
Let's create the following kinds of errors, and see what happens to the web page. The first group are syntax errors:
-
missing semicolons
-
missing require 'MDB2.php' statement
-
unbalanced '(' or '{'
- malformed heredoc (eg trailing tag not at left margin)
The second group are, as far as PHP is concerned, semantic errors:
-
bad password, eg 'teal' instead of 'blue'
-
query missing a FROM clause
-
other malformed query
Note that we can run php in the command line (at least under linux):
php empquery.php
But I don't know how to pass post variables (except I believe you can do this with additional command-line parameters, perhaps involving &).
Now let's look at employees.php, the first of four pages that make up your assignment.
Notice also the practical importance of the "heredoc" sections, in which we can write multi-line HTML code interspersed with occasional php variables. Compare this with generating HTML with a multitude of print statements.
Note that despite the arrangement of HTML lines, you still need
<p> or <br> tags to create new lines for the browser client.
Features of employees.php:
- connect()
- main program (kind of buried), in which we distinguish between
- submit button
- update button
- no button (initial page load)
- update button, which refreshes the database view (for when I add or delete records through some other MySQL access)
- get_employees(), which displays the table of employees
- submit_employees()
- OUTER JOIN in get_employees
- input attributes type, name, and value
- heredoc syntax