Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 11
Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.
Fourth Normal Form
Suppose we have tables ⟨X,Y⟩ and ⟨X,Z⟩. If we join on X, we get
⟨X,Y,Z⟩. Now choose a particular value of x, say x0, and consider all
tuples ⟨x0,y,z⟩. If we just look at the y,z part, we get a cross product
Y0×Z0, where Y0={y in Y | ⟨x0,y⟩ is in ⟨X,Y⟩} and Z0={z in Z
| ⟨x0,z⟩ is in ⟨X,Z⟩}. As an example, consider tables
EMP_DEPENDENTS = ⟨ename,depname⟩ and EMP_PROJECTS = ⟨ename,projname⟩:
EMP_DEPENDENTS

ename | depname
------+--------
Smith | John
Smith | Anna

EMP_PROJECTS

ename | projname
------+---------
Smith | projX
Smith | projY
Joining gives

ename | depname | projname
------+---------+---------
Smith | John    | projX
Smith | John    | projY
Smith | Anna    | projX
Smith | Anna    | projY
Fourth normal form attempts to recognize this in reverse, and undo it.
The point is that we have a table ⟨X,Y,Z⟩ (where X, Y, or Z may be a set of attributes), and it turns out to be possible to decompose it into ⟨X,Y⟩ and ⟨X,Z⟩ so that the join is lossless. Furthermore, neither Y nor Z functionally depends on X, unlike our 3NF/BCNF decompositions, which were driven by such functional dependencies.
Specifically, for the "cross product phenomenon" above to occur, we
need to know that if t1 = ⟨x,y1,z1⟩ and t2 = ⟨x,y2,z2⟩ are in ⟨X,Y,Z⟩,
then so are t3 = ⟨x,y1,z2⟩ and t4 = ⟨x,y2,z1⟩. (Note that this is the
same condition as in E&N, 15.6.1, p 533, but stated differently.)
If this is the case, then X is said to multidetermine Y (and Z). More to the point, it means that if we decompose into ⟨X,Y⟩ and ⟨X,Z⟩ then the join will be lossless.
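The lossless-join property is easy to check concretely. A minimal Python sketch using the tables from the example above (a nested-loop join standing in for SQL):

```python
# Decomposed tables from the example: EMP_DEPENDENTS and EMP_PROJECTS.
emp_dependents = {("Smith", "John"), ("Smith", "Anna")}
emp_projects = {("Smith", "projX"), ("Smith", "projY")}

def natural_join(r, s):
    """Join two binary relations on their first attribute (ename)."""
    return {(e, dep, proj) for (e, dep) in r for (e2, proj) in s if e == e2}

joined = natural_join(emp_dependents, emp_projects)

# For ename = Smith, the (depname, projname) pairs form a full cross product:
assert joined == {("Smith", d, p) for d in ("John", "Anna")
                                  for p in ("projX", "projY")}

# The decomposition is lossless: projecting the join back onto
# (ename, depname) and (ename, projname) recovers the original tables.
assert {(e, d) for (e, d, p) in joined} == emp_dependents
assert {(e, p) for (e, d, p) in joined} == emp_projects
```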
Are you really even supposed to look for things like this? Probably not.
Fifth Normal Form
5NF is basically about noticing any other kind of lossless-join
decomposition, and decomposing. But noticing such examples is not easy.
16.4 and the problems of NULLs
"There is no fully satisfactory relational design theory as yet that includes NULL values"
[When joining], "particular care must be devoted to watching for potential NULL values in foreign keys"
Ch 17: basics of disks
Databases are often too big to fit everything into memory, even today,
and disks work differently from memory. Disks are composed of blocks. At the hardware level the basic unit is the sector,
typically 512 bytes, but the operating system clusters these
together into blocks of size 5K-10K. In the Elder Days applications
would specify their block size, but that is now very rare.
A disk looks to the OS like an array of blocks, and any block can be
accessed independently. To access a block, though, we must first move
the read head to the correct track
(seek time) and then wait for the correct block to come around under
the head (rotational latency). Typical mean seek times are ~3-4ms
(roughly proportional to how far the read head has to move; a seek of
one track over is more like a tenth of that). For rotational latency we
must wait on average one-half revolution; at 6000 rpm that is 5 ms. 6000
rpm is low nowadays; a more typical drive again gives 3-4 ms, so
accessing a random block can take 6-8 ms.
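The numbers above combine as a simple sum; a quick back-of-the-envelope check (figures taken from the text, transfer time ignored):

```python
# Rotational latency: on average we wait half a revolution.
rpm = 6000
ms_per_rev = 60_000 / rpm           # 10 ms per revolution at 6000 rpm
avg_rotational_ms = ms_per_rev / 2  # 5 ms
assert avg_rotational_ms == 5.0

# Random block access time = seek + rotational latency.
seek_ms = 3.5                       # typical mean seek, per the text
faster_rotational_ms = 3.5          # a more modern, faster-spinning drive
print(seek_ms + avg_rotational_ms)     # 8.5 ms with the slow drive
print(seek_ms + faster_rotational_ms)  # 7.0 ms, in the 6-8 ms range
```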
Managing disk data structures is all about arranging things so as to minimize the number of disk-block fetches.
When processing a file linearly, a common technique is read-ahead,
or double-buffering. As the CPU begins to process block N, the IO
subsystem requests block N+1. Hopefully, by the time the CPU finishes
block N, block N+1 will be available. Reading ahead by more than 1
block is also possible (and in fact is common). (Unix algorithm?)
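Double-buffering can be sketched with a background thread that prefetches the next block while the current one is processed. This is a toy model (the hypothetical read_block stands in for a slow disk read; in reality the OS or drive does the prefetching):

```python
import threading
import queue

def read_block(n):
    # Stand-in for a slow disk read of block n.
    return f"data-{n}"

def read_ahead(num_blocks, depth=1):
    """Generator yielding blocks while a background thread reads ahead."""
    q = queue.Queue(maxsize=depth)   # bounded: at most `depth` blocks ahead
    def prefetcher():
        for n in range(num_blocks):
            q.put(read_block(n))     # blocks when the buffer is full
        q.put(None)                  # end-of-file marker
    threading.Thread(target=prefetcher, daemon=True).start()
    while (block := q.get()) is not None:
        yield block                  # CPU processes block N; N+1 is in flight

print(list(read_ahead(3)))   # ['data-0', 'data-1', 'data-2']
```

Raising depth reads further ahead, matching the note that prefetching more than one block is common.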
When handling requests from multiple processes, disks usually do not retrieve blocks in FIFO order. Instead, typically the elevator algorithm
is used: the disk arm works from the low-numbered track upwards; at
each track, it handles all requests received by that time for blocks on
that track. When there are no requests for higher-numbered tracks, the
head moves back down.
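One sweep of the elevator algorithm can be sketched as follows (track numbers only; a real scheduler also folds in requests that arrive mid-sweep):

```python
def elevator_order(head_track, requests):
    """Serve pending track requests: sweep up from the head, then back down."""
    up = sorted(t for t in requests if t >= head_track)
    down = sorted((t for t in requests if t < head_track), reverse=True)
    return up + down

# Head at track 50; pending requests scattered across the disk.
print(elevator_order(50, [95, 180, 34, 119, 11, 123, 62, 64]))
# sweeps up through 62, 64, 95, 119, 123, 180, then back down: 34, 11
```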
Records can take up variable space in a block; this is annoying, as it
makes finding the kth record on the block more tedious. Still, once the
block is in memory, accessing all the records is quick. It is rare for
blocks to contain more than a hundred records.
BLOBs (binary large objects) are usually not stored within records; the records instead include a pointer to the BLOB.
File organizations
The simplest file is the heap
file, in which records are stored in order of addition. Insertion is
efficient; search takes linear time. Deletion is also slow, so
sometimes we just mark records as deleted and reclaim the space later.
Another format is to keep the records ordered by some field, the ordering field.
This is not necessarily a key; for example, we could keep file Employee
ordered by Dno. If the ordering field is a key, we call it the ordering key. Ordered access is now fast, and search takes log(N) time (where N is the length of the file in blocks
and we are counting only block accesses). Note that the actual
algorithm for binary search is slightly different from the classic
array version: if we have blocks lo and hi,
and know that the desired value X must, if present, lie between these
two blocks, then we retrieve the block approximately in between, mid. We then check which of these cases holds:
- X < key of 1st record on block mid
- X > key of last record on block mid
- X is between the keys, inclusive, and so either the record is on mid or is not found
See Fig 17.7
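The block-oriented binary search can be sketched as follows (each "block" is a sorted list of keys, and only whole-block retrievals are counted):

```python
def block_search(blocks, x):
    """Binary search over sorted blocks.
    Returns (block_index, record_index) or None if x is absent."""
    lo, hi = 0, len(blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        b = blocks[mid]              # one disk-block fetch
        if x < b[0]:                 # x below the first key on block mid
            hi = mid - 1
        elif x > b[-1]:              # x above the last key on block mid
            lo = mid + 1
        else:                        # between first and last keys, inclusive:
            # either the record is on mid or it is not in the file
            return (mid, b.index(x)) if x in b else None
    return None

blocks = [[2, 5, 9], [12, 14, 20], [23, 31, 40]]
print(block_search(blocks, 14))   # (1, 1)
print(block_search(blocks, 15))   # None: would be on block 1, but absent
```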
Insertion and deletion are expensive. We can improve insertion by
keeping some unused space at the end of each block for new records (or
the moved-back other records of the block). We can improve deletion by
leaving space at the end (or, sometimes, right in the middle).
Another approach is to maintain a transaction file:
a sequential file consisting of all additions and deletions.
Periodically we update the master file with all the transactions, in
one pass.
Hashing
Review of internal bucket hashing, chain hashing
For disk files, we typically use full blocks as buckets. However, these
will often be larger than needed. We also don't use block addresses
directly, but provide a single-block map of hash values to block
addresses. See Fig 17.10, which also shows some overflow buckets. When
a single bucket approaches two blocks, it can be given its own overflow
block.
Extendible Hashing
This technique manages buckets more efficiently. We hash on the first d bits of the hash values; d is called the global depth. We keep a directory of all 2^d possible values for these d bits, each with a pointer to a bucket block. Sometimes two neighboring directory entries share a bucket;
for example, if d=3, hash prefixes 010 and 011
might point to the same block. That block then has a reduced local depth of 2.
As we fill blocks, we need to create new ones. If a block with local
depth less than d overflows, we split it into two blocks of local depth
one greater (still ≤ d). If a block with local depth d overflows, we
need to increment d by 1, double the directory, and give every
existing block two directory entries, except for the one causing the
overflow, which is split.
See Fig 17.11.
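A minimal in-memory sketch of extendible hashing, assuming 8-bit hash values and two records per bucket (a real implementation would store buckets in disk blocks and hash the key rather than using it directly):

```python
BITS = 8          # pretend hash values are 8-bit integers
BUCKET_SIZE = 2   # records per bucket block

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # indexed by first d bits

    def _prefix(self, key, d):
        return key >> (BITS - d)     # first d bits of the hash value

    def insert(self, key):
        b = self.directory[self._prefix(key, self.global_depth)]
        if len(b.keys) < BUCKET_SIZE:
            b.keys.append(key)
            return
        if b.local_depth == self.global_depth:
            # Overflow at full depth: increment d and double the directory;
            # every existing bucket now has two directory entries.
            self.global_depth += 1
            self.directory = [bk for bk in self.directory for _ in (0, 1)]
        # Split the overflowing bucket into two with local depth one greater.
        b.local_depth += 1
        new = Bucket(b.local_depth)
        for i in range(len(self.directory)):
            # Entries for b whose next prefix bit is 1 now point to the new block.
            if self.directory[i] is b and (i >> (self.global_depth - b.local_depth)) & 1:
                self.directory[i] = new
        pending, b.keys = b.keys + [key], []
        for k in pending:            # redistribute; may cascade another split
            self.insert(k)

h = ExtendibleHash()
for k in [0b00000000, 0b01000000, 0b10000000, 0b00100000]:
    h.insert(k)
print(h.global_depth)                    # 2: the directory doubled once
print(h.directory[2] is h.directory[3])  # True: prefixes 10 and 11 share a
                                         # bucket with reduced local depth 1
```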
Extendible hashing grew out of dynamic hashing,
in which we keep a tree structure of hashcode bits, stopping when we
have few enough values that they will all fit into one block.
Ch 18: indexing
It is common for databases to provide indexes for files. An index can be on either a key field or a non-key field; an index on a non-key field by which the file is ordered is called a clustering index. The index can either be on a field by which the file is sorted or not. An index can have an entry for every record, in which case it is called dense; if not, it is called sparse.
Primary Index
A primary index is an index on the primary key of a sorted file. The
index consists of an ordered list of pairs <k,p>, where k is the
first key value to appear in the block pointed to by p (this first
record of each block is sometimes called the anchor record).
To find a value k in the file, we find consecutive <k1,p> and
<k2,p+1> where k1≤k<k2; in that case, the record with key k
must be on block p. This is an example of a sparse index. A primary index is usually much smaller than the file itself. See Fig 18.1.
Example 1 on E&N p 635: 30,000 records, 10 per block, for 3000
blocks. Direct binary search takes 12 block accesses. The index entries
are 9+6 bytes long, so 68 fit per 1024-byte block. The index has 45
blocks; binary search requires 6 block accesses, plus one more for the
actual data block itself.
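The arithmetic in Example 1 can be checked mechanically:

```python
from math import ceil, log2

records, per_block = 30_000, 10
blocks = records // per_block                    # 3000 data blocks
print(ceil(log2(blocks)))                        # 12 accesses: direct binary search

entry_size, block_size = 9 + 6, 1024             # key (9 bytes) + pointer (6 bytes)
entries_per_block = block_size // entry_size     # 68 index entries per block
index_blocks = ceil(blocks / entries_per_block)  # sparse: one entry per data block
print(index_blocks)                              # 45
print(ceil(log2(index_blocks)) + 1)              # 6 index accesses + 1 data block = 7
```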
Clustering index
We can also imagine the file is ordered on a nonkey field (think Employee.dno). In this case we create a clustering index.
The index structure is the same as before, except now the block pointer
points to the first block that contains any records with that value;
see Fig 18.2. Clustering indexes are of necessity sparse.
Another approach to clustering indexes may be seen in Fig 18.3, in
which blocks do not contain records with different cluster values.
Secondary Indexes
Now suppose we want to create an index for Employee by (fname, lname),
assumed for the moment to be a secondary key. The file itself is
ordered by Ssn. An index on a secondary key
will necessarily be dense; we cannot use block anchors. A common
arrangement is simply to have the index list <key,block> pairs
for every key value appearing; if there are N records in the file then
there will be N entries in the index, and the only savings is that the index
records are smaller. See Fig 18.4. If B is the number of blocks in the
original file, and BI is the number of blocks in the index, then BI ≤B,
but not by much, and log(BI) ≃ log(B), the search time. But note that
unindexed search is linear now, because the file is not ordered on the secondary key.
Example 2, E&N6, p 640: 30,000 records, 10 per block. Without an
index, searching takes 1500 blocks on average. Blocks in the index hold
68 records, as before, so the index needs 30,000/68 = 442 blocks;
log(442) ≃ 9.
Secondary indexes can also be created on nonkey fields. We can create
dense indexes as above, but now with multiple key entries, one for each
record. Another approach is to have the index point to blocks of record
pointers, as in Fig 18.5.
Multilevel indexes
We're still doing binary search on the ordered indexes above; we'd like
to do multi-way search (eg binary = 2-way). Consider again the primary
index, with an index entry for the anchor record of each block. Call
this the base level, or first level, of the index. We now create a second level index containing an index entry for the anchor record of each block of the first-level index. See Fig 18.6. This is called an indexed sequential file, or an ISAM file.
This technique works as well on secondary keys, except that the first level is now much larger.
Example 3: 30,000 records, first-level index is 442 blocks,
second-level index is 442/68 = 7 blocks; third-level index is 1 block.
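Examples 2 and 3 follow the same pattern: each index level is a factor of 68 smaller than the one below it, until a single block remains:

```python
from math import ceil, log2

records, entries_per_block = 30_000, 68
first = ceil(records / entries_per_block)   # dense first level: 442 blocks
print(first, ceil(log2(first)))             # 442 blocks; binary search: 9 accesses

# Multilevel: keep indexing the previous level until it fits in one block.
levels, size = [first], first
while size > 1:
    size = ceil(size / entries_per_block)
    levels.append(size)
print(levels)   # [442, 7, 1]: one access per level (3) + 1 for the data block
```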
Next time: B-trees (Bayer trees)
PHP programming
I had a change of heart on htmlspecialchars(); demo of employee.php v employee2.php
I also split out connect() and table_format() into a separate file.
revised table_format()
new lib353.php;
include 'lib353.php';
employees.php has been split into employees2.php and lib353.php.
I fixed the <html><title> ... </title><body> ... </body></html> tags
Newlines in html output: these come from <p> or <br>
What if you have more prepared ("?") parameters than arguments?
testing for errors in prepare()
running from command line
lamp.cslabs.luc.edu
logic for Employee_Update
buttons in home form for:
- fetching a record, given that something has been pasted into the Ssn box, and filling in all the form fields
- updating a record in the database, using fields supplied by the form
php development: you want
display_errors = On
You don't really care about
display_startup_errors = On
You want:
error_reporting = E_ALL & ~E_NOTICE
This is too strict:
error_reporting = E_ALL & ~E_NOTICE | E_STRICT
lamp.cslabs.luc.edu demo
Now let's look again at employees2.php
Notice also the practical importance of the "heredoc" sections, in which we can write multi-line HTML code interspersed with occasional php variables. Compare this with generating HTML with a multitude of print statements.
Note that despite the arrangement of HTML lines, you still need
<p> or <br> tags to create new lines for the browser client.
- connect()
- main program (kind of buried), in which we distinguish between
- submit button
- update button
- no button (initial page load)
- update button, which refreshes the database view (for when I add or delete records through some other MySQL access)
- get_employees(), which displays the table of employees
- submit_employees()
- OUTER JOIN in get_employees; joining on columns that can be null is often problematic.
- input attributes type, name, and value
- heredoc syntax
empupdate.php
Basics of forms
A form is enclosed between <form method="post" action="program">
and </form> tags. The usual value for action is "", meaning to
re-invoke the same program (in our case, a php program), but now with posted data values.
Inside the form you can have input items, of the form <input type="input-type" ... >, along with the related select and textarea controls (which use their own tags). The controls include:
- text (one-line text boxes)
- button (clickable buttons)
- radio (radio buttons)
- checkbox (checkboxes)
- select: menus with options
- select ... multiple: allows selecting several items simultaneously
- textarea (scrolling text boxes)
- submit (submit buttons, semantically different from ordinary buttons)
- more
Here's a big form. You can see it in action at form1.html.
<form action="form1.cgi" method="get" enctype="multipart/form-data">
<input type = "button" value="click me!" name="b1">
<input type = "button" value="me too!" name="b2">
<p> Are we having fun yet?
<center>
<input type="radio" name="fun" value = "yes">Yep!
<input type="radio" name="fun" value = "no">Nope!
<input type="radio" name="fun" value = "maybe" CHECKED>I dunno!
</center>
<p><b>Why are you doing this?</b>
<input type = "checkbox" name="cbox" value="credit" CHECKED>For course credit
<input type = "checkbox" name="cbox" value="money" CHECKED>For the money!
<br>
<p><b>Who are you?</b>
<input type = "text" name="name1" value="I dunno" size=20>
<p>Now for a menu:
<p>
<select size = 1 name="menu1">
<option> first item
<option> second item
<option> number 3!
<option> fourth
<option> last
</select>
<p> now with size=4</p>
<select size = 4 name="menu2" multiple>
<option> first item
<option> second item
<option> number 3!
<option> fourth
<option> last
</select>
<p>Here's a text box</p>
<textarea name="mybox" rows=5 cols = 30>
This message will be in the box. It's plain text, <b>not</b> html!
I'm going to make it long enough that a scrollbar appears,
through the miracle of cut-and-paste.
I'm going to make it long enough that a scrollbar appears,
through the miracle of cut-and-paste.
</textarea>
<p>
<input type="submit">
</form>
Understand the form action, and the idea that one page can have multiple <form ...> ... </form> objects.
invoice:
why we might initially want to avoid SQL
how to do the math in PHP
why we don't really want to take the PHP approach