Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 9
Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.
Functional Dependencies and Normalization
(See the material on Design Guidelines and Normalization from week 7.)
A functional dependency is a kind of semantic constraint.
If X and Y are sets of attributes (column names) in a relation, a
functional dependency X⟶Y means that if two records have equal values
for X attributes, then they also have equal values for Y.
For example, if X is a set including the key attributes, then X⟶{all attributes}.
Like key constraints, FD constraints are not based on specific sets of
records. For example, in the US, we have {zipcode}⟶{city}, but we no
longer have {zipcode}⟶{areacode}.
In the earlier EMP_PROJ, we have
Ssn ⟶ Ename
Pnumber ⟶ Pname, Plocation
{Ssn, Pnumber} ⟶ Hours
In EMP_DEPT we had dependencies
Ssn ⟶ Ename, Bdate, Address, Dnumber
Dnumber ⟶ Dname, Dmgr_ssn

A superkey (or key superset) of a relation schema is a set of attributes S so that no two tuples of the relationship can have the same values on S. A key is thus a minimal superkey:
it is a superkey with no extraneous attributes that can be removed. For
example, {Ssn, Dno} is a superkey for EMPLOYEE, but Dno doesn't matter
(and in fact contains little information); the key is {Ssn}.
Note that, as with FDs, superkeys are related to the sematics of the relationships, not to particular data in the tables.
Relations can have multiple keys, in which case each is called a candidate key.
For example, in table DEPARTMENT, both {dnumber} and {dname} are
candidate keys. For arbitrary performance-related reasons we designated
one of these the primary key; other candidate keys are known as secondary keys.
A prime attribute is an attribute (ie column name) that belongs to some candidate key. A nonprime attribute is not part of any key.
A dependency X⟶A is full if the dependency fails for every proper subset X' of X; the dependency is partial if not, ie if there is a proper subset X' of X such that X'⟶A.
Normal Forms and Normalization
Normal Forms are rules for well-behaved relations. Normalization is the
process of converting poorly behaved relations to better behaved ones.
First Normal Form
First Normal Form (1NF) means that a relation has no composite
attributes or multivalued attributes. Note that dealing with the
multi-valued location
attribute of DEPARTMENT meant that we had to create a new table
LOCATIONS. Composite attributes were handled by making each of their
components a separate attribute.
Alternative ways for dealing with the multivalued location attribute
would be making ⟨dnumber, location⟩ the primary key, or supplying a
fixed number of location columns loc1, loc2, loc3, loc4. For the latter
approach, we must know in advance how many locations we will need; this
method also introduces NULL values.
Second Normal Form
Second Normal Form (2NF) means that, if K represents the set of attributes making up the primary key, every nonprime
attribute A (that is an attribute not a member of any key) is functionally
dependent on K (ie K⟶A), but that this fails for any proper subset of K
(no proper subset of K functionally determines A).
Note that if a relation has a single-attribute primary key, as does
EMP_DEPT, then 2NF is automatic. (Actually, the general definition of
2NF requires this for every candidate key; a relation with a
single-attribute primary key but with some multiple-attribute other key
would still have to be checked for 2NF.)
We say that X⟶Y is a full functional dependency if for every proper subset X' of X, X' does not functionally determine Y. Thus, 2NF means that for every nonprime attribute A, the dependency K⟶A is full: no nonprime attribute depends on less than the full key.
In the earlier EMP_PROJ relationship, the primary key K is {Ssn,
Pnumber}. 2NF fails because {Ssn}⟶Ename, and {Pnumber}⟶Pname,
{Pnumber}⟶Plocation.
To put a table in 2NF, decompose it into sets of attributes which all
have a common full dependency on some subset K' of K. For EMP_PROJ,
this becomes:
⟨Ssn, Pnumber, Hours⟩
⟨Ssn, Ename⟩
⟨Pnumber, Pname, Plocation⟩
Note that Hours is the only attribute with a full FD on {Ssn,Pnumber}.
The table EMP_DEPT is in 2NF.
Note that we might have a table ⟨K1, K2, K3, A1, A2, A3⟩, where
{K1,K2,K3}⟶A1 is full
{K1,K2}⟶A2 is full (neither K1 nor K2 alone determines A2)
{K2,K3}⟶A2 is full
{K1,K3}⟶A3 is full
{K2,K3}⟶A3 is full
The decomposition could be
⟨K1, K2, K3, A1⟩
⟨K1, K2, A2⟩
⟨K1, K3, A3⟩
or it could be
⟨K1, K2, K3, A1⟩
⟨K2, K3, A2, A3⟩
Remember, dependency constraints can be arbitrary! Dependency
constraints are often best thought of as "externally imposed rules";
they come out of theuser-input-and-requirements phase of the DB process.
(However, trying to pretend that there is not a dependency constraint is sometimes a bad idea.
Consider the LOTS exampe of Fig 15.12. Attributes are
- property_ID
- county
- lot_num
- area
- price
- tax_rate
The primary key is property_ID, and ⟨county,lot_num⟩ is also a key.
These are functional dependencies FD1 and FD2 respectively. We also
have
FD3: county ⟶ tax_rate
FD4: area ⟶ price
(For farmland, FD4 is not completely unreasonable, at least if price
refers to the price for tax purposes. In Illinois, the formula is price
= area * factor_determined_by_soil_type).
2NF fails because of the dependency county⟶tax_rate. E&N suggest
the decomposition into LOTS1(property_ID, county, lot_num, area, price)
and LOTS2(county, tax_rate).
We can algorithmically use a 2NF-violating FD to define a decomposition
into new tables. If X⟶A is the FD, we remove A from table R, and
construct a new table with attributes those of X plus A.
Before going further, perhaps two points should be made about
decomposing too far. The first is that all the functional dependencies
should still appear in the set of decomposed tables; the second is that
reassembling the decomposed tables with the "obvious" join should give
us back the original table, that is, the join should be lossless.
A lossless join means no information is lost; typically, if the join is
not lossless then we get back all the original records and then some.
In Fig 15.5 there was a proposed decomposition into EMP_LOCS(ename,
plocation) and EMP_PROJ1(ssn,pnumber,hours,pname,plocation). We can
create each of these as a view:
create view emp_locs as select e.lname, p.plocation from employee e, works_on w, project p
where e.ssn=w.essn and w.pno=p.pnumber;
create view emp_proj1 as select e.ssn, p.pnumber, w.hours, p.pname, p.plocation
from employee e, works_on w, project p where e.ssn=w.essn and w.pno = p.pnumber;
Now let us join these on plocation:
select * from emp_locs el, emp_proj1 ep where el.plocation = ep.plocation;
Oops.
Third Normal Form
Third Normal Form (3NF) means that the relation is in 2NF and also
there is no dependency X⟶A for nonprime attribute A and for attribute
set X that does not contain a candidate key (ie X is not a superkey). In other words, if X⟶A holds for some nonprime A, then X must be a superkey.
If X is a proper subset of a key, then we've ruled out X⟶A for nonprime A in the 2NF step. If X is
a superkey, then X⟶A is automatic for all A. The remaining case is
where X may contain some (but not all) key attributes, and also some
nonkey attributes. An example might be a relation with attributes K1,
K2, A, and B, where K1,K2 is the key. If we have a dependency K1,A⟶B,
then this violates 3NF. A dependency A⟶B would also violate 3NF.
(How would we fix either of these?)
The relationship EMP_DEPT is not 3NF, because of the dependency dnumber ⟶ dname.
The LOTS1 relation above is not 3NF, because of Area ⟶ Price. So we
divide into LOTS1A(property_ID, county,lot_num,area) and
LOTS1B(area,price). Another approach would be to drop price entirely,
if it is in fact proportional to area.
Prepared Statements
Last week we looked at Java's PreparedStatement objects:
PreparedStatement p = connection.prepareStatement(
"select
e.fname, e.lname, e.ssn, e.bdate, e.salary, d.dname from employee e,
department d "
+ "where e.dno=d.dnumber and e.dno=? and e.lname=?"
);
Prepared statements, or parameterized statements/queries, can be either client-side or server-side.
Client-side means that the client replaced the prepared-statement '?'
placeholders with appropriately escaped strings, and then passed the
entire query to the database. The client must be sure to handle the
escaping of embedded quotes correctly or risk a SQL injection attack.
Server-side prepared statements mean that the query with embedded '?'s
is sent to the database server, and then later the values to be plugged
in are sent. Ideally, the value strings are sent as an array of raw
string objects, so there is no ambiguity as to how escaping of
quotation marks is to work.
To see some of this in action, turn on the general_log_file in the
MySQL configuration file (/etc/mysql/my.cnf for me), and enable query
logging
general_log_file = /var/log/mysql/mysql.log
general_log = 1
After doing this, let's run employee2.java again, with e.lname of "O'Hara". We get something like
52 Query SHOW COLLATION
52 Query SET NAMES latin1
52 Query SET character_set_results = NULL
52 Query SET autocommit=1
52 Query SET sql_mode='STRICT_TRANS_TABLES'
52 Query select e.fname, e.lname, e.ssn, e.bdate,
e.salary, d.dname from employee e, department d where e.dno=d.dnumber
and e.dno=5 and e.lname='O\'Hara'
Note that the escaped value for e.lname does show up here, 'O\'Hara',
but that was done by the client, that is, by the JDBC connection. The
last line above is also what appears when we print the
preparedStatement object p after the final p.setString(), suggesting
that at this point (before query execution) java has already replaced
the '?' with its corresponding argument.
By comparison, here's the same sort of thing after using PHP preparedStatements, which use server-side prepares (I've omitted a few unnecessary items)
56 Query PREPARE
MDB2_STATEMENT 'select e.fname, e.lname, e.ssn, e.bdate, e.salary,
d.dname from employee e, department d where e.dno=d.dnumber and e.dno=?
and e.lname=?'
56 Query SET @0 = 5
56 Query SET @1 = 'O\'Hara'
56 Query EXECUTE MDB2_STATEMENT USING @0, @1
This time it is the server doing the escaping where 'O\'Hara' is displayed.
The PHP code for the above is
$query = "select e.fname, e.lname,
e.ssn, e.bdate, e.salary, d.dname from employee e, department d where
e.dno=d.dnumber and e.dno=? and e.lname=?";
$types = array('integer', 'text'); // provide argument types
$qstmt = $db->prepare($query, $types, MDB2_PREPARE_RESULT);
$queryargs = array(5, "O'Hara");
$qres = $qstmt->execute($queryargs);
If instead of $qstmt->execute($queryargs) we split it up as
$qstmt->bindValueArray($argArray);
print_r($qstmt);
$qstmt->execute();
we see that, after binding but before execution, the '?' placeholders have not been expanded. That won't happen until everything reaches the server side.
When prepared statements are used, it helps if something is known about
the types involved. For example, when replacing a '?' with
ssn=888776666, this is a string and so should be quoted. However, when
replacing ? with dno=5, this is an integer and so should not be quoted.
Also, unspecified values at the application side need to be replaced
with NULLs somewhere.
Server-side preparation can infer the type of any column value from the
appropriate table definition. Client-side preparation must do a little
more work; this is the origin of the $types array in the PHP code above
(ironically, not used here for client-side type inference).
PHP basics
The main purpose of PHP is actually to create and process forms, using
server-side scripting. Databases are just central examples of this.
Today's demo is on my laptop: I have apache, php, and the PEAR db
library installed. I have php files in /var/www; when I access these through the browser they execute.
The first example is info.php:
This produces quite a lot of output; note in particuar the magic_quotes_gpc value.
Here is the greeting demo from E&N (greeting.php):
<?php
// greeting demo from Elmasri & Navathe 14.1
// this prints a welcome message if the user enters their name
if ($_POST['user_name']) {
print("Welcome, ");
print($_POST['user_name']);
} else {
// create the form
$action=$_SERVER['PHP_SELF'];
print <<<_HTML_
<FORM method="post" action="$action">
Enter your name: <input type="text" name="user_name">
<BR>
<INPUT type="submit" value="Submit name">
</FORM>
_HTML_;
}
?>
The associative-array variable $_POST contains data sent as the result
of an HTML POST method. The value for key "submit" (the name of the
submit button) would definitely be set. In this case, we print the
username, $_POST['user_name'].
Before the submit button is clicked the $_POST structure is empty, and so we initially execute the else
part above. $action is set to the name of the current script, which is
also the name of the script to process our input. Everything between
the print <_HTML_ and the following _HTML_
(which must begin in column 1) is treated as static text (a so-called
"here doc"), and is printed "as is", except that anything beginning
with $ is treated as a PHP variable and its value is printed instead.
The only variable is $action. We could also have used ordinary print statements here.
My version is slightly different from the book's version, which had
<FORM method="post" action="$_SERVER['PHP_SELF']">
The part in bold here is a "superglobal" variable, and my installation
of php does not allow superglobals in "here docs". I can only assume
this is a security issue; note the workaround above.
It can be confusing to figure out at what point we "start over", and
forget the existing $_POST. Doing a browser reload resends the POST
data and so we stay on the second page; going to the location bar and
hitting return reloads without reposting and so we go back to the first page.
Note we simply ignore <html><body>, etc. I really
should put these in. Note also that printing output works as desired,
except your output is formatted according to where it is in the html.
Next is selection.php, which replaces the input box for a name with a selection for department:
<FORM method="post" action="$action">
Select your department:
<select name="department">
<option value="administration">Administration</option>
<option value="headquarters">Headquarters</option>
<option value="research" selected="selected">Research</option>
</select>
<BR>
<INPUT type="submit" value="Submit">
</FORM>
Before going further, let's introduce some syntax errors into the php file, and see what happens.
As our final forms-only example, here is emp.php, which creates a form with multiple fields, and two buttons. Here is the form-creation part (as a here doc):
print <<<FORMEND
Enter new employee:<p>
<form method="post" action="">
<input type="text" name="fname" value="ralph">
<input type="text" name="minit" size="1" value="j">
<input type="text" name="lname" value="wiggums"> required
<p>
<input type="text" name="ssn" value="abcdefghi"> required
<br><input type="text" name="sex" value="M"> M/F
<br><input type="text" name="bdate" value="1980-07-04"> YYYY-MM-DD
<br><input type="text" name="salary" value="9999"> (annual salary)
<br><input type="text" name="super_ssn" value="999887777"> supervisor (by SSN)
<br><input type="text" name="address" value="no fixed abode"> address
<br><input type="text" name="dno" value="4"> department number
<p><input type="submit" name="submit" value="submit data">
<p><input type="submit" name="update" value="update">
</form>
FORMEND;
Default values have been provided. When the user clicks "submit", we
simply print this. Note the button "update", not used here. Both
buttons are of type "submit", but only the first is named "submit", meaning that only the first will trigger the if ($_POST['submit']).
Last week we looked briefly at the problem of SQL injection:
allowing unchecked user-supplied strings to be used as SQL fields in
dynamically executed strings. Prepared statements pretty much make this
impossible. There is, however, also the concern about HTML injection, more commonly known as cross-site scripting or XSS.
We should be more careful about echoing back our HTML. When we're the
only viewer, all we can do is feed ourselves our own script, but if we
post something on a social networking site like this:
I agree! <script>do_bad_things()</script>
then we've, well, done bad things. In this class we are not going to attempt to sanitize our HTML (though we really should).