Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays

Week 11

Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.



Fourth Normal Form

Suppose we have tables ⟨X,Y⟩ and ⟨X,Z⟩. If we join on X, we get ⟨X,Y,Z⟩. Now choose a particular value of x, say x0, and consider all tuples ⟨x0,y,z⟩. If we just look at the y,z part, we get a cross product Y0×Z0, where Y0={y in Y |  ⟨x0,y⟩ is in ⟨X,Y⟩} and Z0={z in Z |  ⟨x0,z⟩ is in ⟨X,Z⟩}. As an example, consider tables EMP_DEPENDENTS = ⟨ename,depname⟩ and EMP_PROJECTS = ⟨ename,projname⟩:

EMP_DEPENDENTS
ename    depname
Smith    John
Smith    Anna

EMP_PROJECTS
ename    projname
Smith    projX
Smith    projY

Joining gives

ename    depname    projname
Smith    John       projX
Smith    John       projY
Smith    Anna       projX
Smith    Anna       projY

Fourth normal form attempts to recognize this in reverse, and undo it. The point is that we have a table ⟨X,Y,Z⟩ (where X, Y, and Z may each be sets of attributes), and it turns out to be possible to decompose it into ⟨X,Y⟩ and ⟨X,Z⟩ so that the join is lossless. Furthermore, neither Y nor Z is functionally dependent on X, unlike the case with our 3NF/BCNF decompositions.

Specifically, for the "cross product phenomenon" above to occur, we need to know that if t1 = ⟨x,y1,z1⟩ and t2 = ⟨x,y2,z2⟩ are in ⟨X,Y,Z⟩, then so are t3 = ⟨x,y1,z2⟩ and t4 = ⟨x,y2,z1⟩. (Note that this is the same condition as in E&N, 15.6.1, p 533, but stated differently.)

If this is the case, then X is said to multidetermine Y (and Z), written X ↠ Y. More to the point, it means that if we decompose into ⟨X,Y⟩ and ⟨X,Z⟩, then the join will be lossless.

Are you really supposed even to look for things like this? Probably not.

Fifth Normal Form

5NF is basically about noticing any other kind of lossless-join decomposition, and decomposing. But noticing such examples is not easy.



16.4 and the problems of NULLs

"There is no fully satisfactory relational design theory as yet that includes NULL values"

[When joining], "particular care must be devoted to watching for potential NULL values in foreign keys"



Ch 17: basics of disks

Databases are often too big to fit everything into memory, even today, and disks work differently from memory. Disks are composed of blocks. At the hardware level the basic unit is the sector, typically 512 bytes; the operating system clusters these together into blocks of a few kilobytes (typically 4K-8K). In the Elder Days applications would specify their blocksize, but that is now very rare.

A disk looks to the OS like an array of blocks, and any block can be accessed independently. To access a block, though, we must first move the read head to the correct track (seek time) and then wait for the correct block to come around under the head (rotational latency). Typical mean seek times are ~3-4 ms (roughly proportional to how far the read head has to move, so a seek of one track over is more like a tenth of that). For rotational latency we must wait on average one-half revolution; at 6000 rpm that is 5 ms. But 6000 rpm is low nowadays; a reasonable value for rotational latency is again 3-4 ms, so accessing a random block can take 6-8 ms in all.
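To sanity-check the arithmetic, here is a tiny PHP calculation (the seek and rpm figures are just the ones quoted above):

<?php
// Expected cost of reading a random block: seek + average rotational latency.
function access_ms($seek_ms, $rpm) {
    $rev_ms = 60000.0 / $rpm;         // time for one full revolution, in ms
    return $seek_ms + $rev_ms / 2;    // on average we wait half a revolution
}
echo access_ms(3.5, 6000), " ms\n";   // 3.5 + 5.0  = 8.5 ms
echo access_ms(3.5, 7200), " ms\n";   // 3.5 + 4.17 ≈ 7.7 ms
?>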

Managing disk data structures is all about arranging things so as to minimize the number of disk-block fetches.

When processing a file linearly, a common technique is read-ahead, or double-buffering. As the CPU begins to process block N, the IO subsystem requests block N+1. Hopefully, by the time the CPU finishes block N, block N+1 will be available. Reading ahead by more than 1 block is also possible (and in fact is common). (Unix algorithm?)

When handling requests from multiple processes, disks usually do not retrieve blocks in FIFO order. Instead, typically the elevator algorithm is used: the disk arm works from the low-numbered track upwards; at each track, it handles all requests received by that time for blocks on that track. When there are no requests for higher-numbered tracks, the head moves back down.
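As a sketch of the resulting service order, assuming each pending request is just a track number and the head is currently sweeping upward:

<?php
// Elevator algorithm sketch: service requests at or above the current head
// position in increasing track order, then sweep back down for the rest.
function elevator_order($head, $requests) {
    $up = array(); $down = array();
    foreach ($requests as $track) {
        if ($track >= $head) $up[] = $track;
        else                 $down[] = $track;
    }
    sort($up);       // upward sweep: increasing track numbers
    rsort($down);    // downward sweep: decreasing track numbers
    return array_merge($up, $down);
}
print_r(elevator_order(50, array(10, 95, 52, 30, 70)));  // 52, 70, 95, 30, 10
?>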

Records can take up variable space in a block; this is annoying, as it makes finding the kth record on the block more tedious. Still, once the block is in memory, accessing all the records is quick. It is rare for blocks to contain more than a hundred records.

BLOBs (binary large objects) are usually not stored within records; the records instead include a pointer to the BLOB.

File organizations

The simplest file is the heap file, in which records are stored in order of addition. Insertion is efficient; search takes linear time. Deletion is also slow, so sometimes we just mark records as deleted and reclaim the space later.

Another format is to keep the records ordered by some field, the ordering field. This is not necessarily a key; for example, we could keep file Employee ordered by Dno. If the ordering field is a key, we call it the ordering key. Ordered access is now fast, and search takes log(N) time (where N is the length of the file in blocks and we are counting only block accesses). Note that the actual algorithm for binary search is slightly different from the classic array version: if we have blocks lo and hi, and know that the desired value X must, if present, lie between these two blocks, then we retrieve the block approximately in between, mid. We then check for one of these cases:

    X is less than the first (anchor) record of mid: repeat with blocks lo..mid-1
    X is greater than the last record of mid: repeat with blocks mid+1..hi
    otherwise X, if present at all, must be in block mid: search that block in memory

See Fig 17.7
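Here is a sketch of the block-level search in PHP, assuming a hypothetical read_block($file, $i) that fetches block $i and returns its records (sorted, each with a 'key' field):

<?php
// Block-level binary search; each loop iteration costs one disk access.
function block_search($file, $nblocks, $x) {
    $lo = 0; $hi = $nblocks - 1;
    while ($lo <= $hi) {
        $mid  = (int)(($lo + $hi) / 2);
        $recs = read_block($file, $mid);          // hypothetical block fetch
        if ($x < $recs[0]['key']) {
            $hi = $mid - 1;                       // X precedes block mid
        } elseif ($x > $recs[count($recs)-1]['key']) {
            $lo = $mid + 1;                       // X follows block mid
        } else {
            foreach ($recs as $r)                 // X must be here, if anywhere
                if ($r['key'] == $x) return $r;
            return null;                          // falls between records
        }
    }
    return null;                                  // not found
}
?>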

Insertion and deletion are expensive. We can improve insertion by keeping some unused space at the end of each block, for new records or for records of the block that must be moved back. We can improve deletion by simply leaving the vacated space unused, at the end of the block (or, sometimes, right in the middle).

Another approach is to maintain a transaction file: a sequential file consisting of all additions and deletions. Periodically we update the master file with all the transactions, in one pass.
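A sketch of the one-pass update, assuming both files are sorted by 'key', every transaction is marked 'A' (add) or 'D' (delete), and every deletion matches a master record:

<?php
// Merge a sorted master file with a sorted transaction file in one pass.
function apply_transactions($master, $trans) {
    $out = array(); $i = 0; $j = 0;
    while ($i < count($master) || $j < count($trans)) {
        if ($j >= count($trans) ||
            ($i < count($master) && $master[$i]['key'] < $trans[$j]['key'])) {
            $out[] = $master[$i++];      // master record unaffected: copy it
        } elseif ($trans[$j]['op'] == 'A') {
            $out[] = $trans[$j++];       // addition: insert the new record
        } else {
            $i++; $j++;                  // deletion: skip the master record
        }
    }
    return $out;
}
?>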

Hashing

Review of internal bucket hashing, chain hashing

For disk files, we typically use full blocks as buckets. However, these will often be larger than needed. We also don't use block addresses directly, but provide a single-block map from hash values to block addresses. See Fig 17.10, which also shows some overflow buckets. When a single bucket grows to nearly two blocks' worth of records, it can be given its own overflow block.
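As a sketch of the indirection (the map array here is hypothetical, loaded from the single map block):

<?php
// Hashed-file lookup step: the map translates hash values to bucket
// (block) addresses, so buckets can be moved without rehashing keys.
function bucket_block($key, $map) {
    $M = count($map);                        // one map entry per hash value
    $slot = (crc32($key) & 0x7fffffff) % $M; // hash the key to 0..M-1
    return $map[$slot];                      // block address of the bucket
}
?>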

Extendible Hashing

This technique manages buckets more efficiently. We hash on the first d bits of the hash values; d is called the global depth. We keep a directory of all 2^d possible values for these d bits, each with a pointer to a bucket block. Sometimes two neighboring buckets are consolidated in the directory; for example, if d=3, hash-prefix values 010 and 011 might point to the same block. That block thus has a reduced local depth of 2.

As we fill blocks, we need to create new ones. If a block with a reduced local depth overflows, we split it into two blocks with a greater local depth (still ≤ d). If a block with local depth d overflows, we must increment d by 1 and double the directory size; each existing block is then pointed to by two directory entries, except that the block causing the overflow is split into two.

See Fig 17.11.
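A sketch of the lookup step, assuming 32-bit hash values and a directory array with 2^d entries:

<?php
// Extendible-hashing lookup: the top d bits of the hash value index the
// directory, which holds one bucket-block pointer per entry.
function directory_slot($key, $d) {
    $h = crc32($key);            // 32-bit hash (non-negative on 64-bit PHP)
    return $h >> (32 - $d);      // top d bits = directory index, 0..2^d - 1
}
// With global depth d=3 the directory has 8 entries (000..111); slots 010
// and 011 may share one block, which then has a reduced local depth of 2.
echo directory_slot("Smith", 3), "\n";
?>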

Extendible hashing grew out of dynamic hashing, in which we keep a tree structure of hashcode bits, stopping when we have few enough values that they will all fit into one block.


Ch 18: indexing

It is common for databases to provide indexes for files. An index can be on either a key field or a non-key field; an index on a non-key field by which the file is ordered is called a clustering index. More generally, the index can either be on a field by which the file is sorted or not. An index can have an entry for every record, in which case it is called dense; if not, it is called sparse.

Primary Index

A primary index is an index on the primary key of a sorted file. The index consists of an ordered list of pairs <k,p>, where k is the first key value to appear in the block pointed to by p (this first record of each block is sometimes called the anchor record). To find a value k in the file, we find consecutive <k1,p> and <k2,p+1> where k1≤k<k2; in that case, the record with key k must be on block p. This is an example of a sparse index. A primary index is usually much smaller than the file itself. See Fig 18.1.

Example 1 on E&N p 635: 30,000 records, 10 per block, for 3000 blocks. Direct binary search takes 12 block accesses. The index entries are 9+6 bytes long, so 68 fit per 1024-byte block. The index has 45 blocks; binary search requires 6 block accesses, plus one more for the actual data block itself.
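The arithmetic behind those figures, for checking:

<?php
// Example 1: binary search on the data file vs. search via the primary index.
$blocks = 30000 / 10;                    // 3000 data blocks
echo ceil(log($blocks, 2)), "\n";        // 12 accesses searching the data file
$per_iblock = (int)(1024 / (9 + 6));     // 68 index entries per 1024-byte block
$iblocks = ceil($blocks / $per_iblock);  // 45 index blocks
echo ceil(log($iblocks, 2)) + 1, "\n";   // 6 index accesses + 1 data block = 7
?>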

Clustering index

We can also imagine the file is ordered on a nonkey field (think Employee.dno). In this case we create a clustering index. The index structure is the same as before, except now the block pointer points to the first block that contains any records with that value; see Fig 18.2. Clustering indexes are of necessity sparse.

Another approach to clustering indexes may be seen in Fig 18.3, in which blocks do not contain records with different cluster values.

Secondary Indexes

Now suppose we want to create an index for Employee by (fname, lname), assumed for the moment to be a secondary key. The file itself is ordered by Ssn. An index on a secondary key will necessarily be dense; we cannot use block anchors. A common arrangement is simply to have the index list <key,block> pairs for every key value appearing; if there are N records in the file then there will be N entries in the index, and the only savings is that the index records are smaller. See Fig 18.4. If B is the number of blocks in the original file, and BI is the number of blocks in the index, then BI ≤ B, but not by much, and log(BI) ≃ log(B), the search time. But note that unindexed search is now linear, because the file is not ordered on the secondary key.

Example 2, E&N6, p 640: 30,000 records, 10 per block. Without an index, searching takes 1500 block reads on average. Blocks in the index hold 68 entries, as before, so the index needs 30,000/68 = 442 blocks; log2(442) ≃ 9, plus one more access to retrieve the record itself.

Secondary indexes can also be created on nonkey fields. We can create dense indexes as above, but now with multiple key entries, one for each record. Another approach is to have the index point to blocks of record pointers, as in Fig 18.5.

Multilevel indexes

We're still doing binary search on the ordered indexes above; we'd like to do multi-way search (eg binary = 2-way). Consider again the primary index, with an index entry for the anchor record of each block. Call this the base level, or first level, of the index. We now create a second level index containing an index entry for the anchor record of each block of the first-level index. See Fig 18.6. This is called an indexed sequential file, or an ISAM file.

This technique works as well on secondary keys, except that the first level is now much larger.

Example 3: 30,000 records, first-level index is 442 blocks, second-level index is 442/68 = 7 blocks; third-level index is 1 block.
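The same computation as a loop (fan-out 68, as in the earlier examples):

<?php
// Multilevel index: each level is a primary index on the level below it.
$fanout = 68;       // index entries per block
$blocks = 442;      // first-level index blocks, from Example 2
$levels = 1;
while ($blocks > 1) {
    $blocks = ceil($blocks / $fanout);   // 7 blocks, then 1 block
    $levels++;
}
echo "$levels levels\n";   // 3; a search costs 3 index accesses + 1 data access
?>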

Next time: B-trees (Bayer trees)





PHP programming

I had a change of heart on htmlspecialchars(); demo of employee.php vs. employee2.php

I also split out connect() and table_format() into a separate file.

revised table_format()
new lib353.php, used via:
    include 'lib353.php';

employees.php has been split into employees2.php and lib353.php.



I fixed the <html><head><title> ... </title></head><body> ... </body></html> tags
Newlines in html output: these come from <p> or <br>
What if you have more prepared ("?") parameters than arguments?
testing for errors in prepare() (see the sketch below)
running from command line
lamp.cslabs.luc.edu
logic for Employee_Update
   buttons in home form for:
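For the prepare() item above, a minimal mysqli sketch (the connection $db, the table, and the variables are hypothetical):

<?php
// Testing for errors in prepare(): it returns false if the SQL is bad
// (syntax error, unknown table or column, and so on).
$stmt = $db->prepare("UPDATE employee SET salary = ? WHERE ssn = ?");
if ($stmt === false) {
    die("prepare failed: " . $db->error);   // report the reason and stop
}
$stmt->bind_param("ds", $salary, $ssn);     // one type letter per "?" marker
$stmt->execute();
?>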





php development: you want
    display_errors = On

You don't really care about
    display_startup_errors = On

You want:
    error_reporting = E_ALL & ~E_NOTICE
This is too strict:
    error_reporting = E_ALL & ~E_NOTICE | E_STRICT

lamp.cslabs.luc.edu demo




Now let's look again at employees2.php

Notice also the practical importance of the "heredoc" sections, in which we can write multi-line HTML code interspersed with occasional php variables. Compare this with generating HTML with a multitude of print statements.
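A minimal illustration of the heredoc style (the variables are made up for the example):

<?php
$lname = "Smith";
$dno   = 5;
// Everything up to the closing identifier is one string, with $variables
// interpolated directly, so the HTML keeps its natural multi-line layout.
print <<<EOT
<p>Employee <b>$lname</b> works in department $dno.</p>
<p><a href="empupdate.php">Update this employee</a></p>
EOT;
?>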

Note that despite the arrangement of HTML lines, you still need <p> or <br> tags to create new lines for the browser client.
empupdate.php


Basics of forms


A form is enclosed between <form method="post" action="program"> and </form> tags. The usual value for action is "", meaning to re-invoke the same program (in our case, a php program), but now with posted data values.

Inside the form you can have input items, of the form <input type="input-type" ... >. Input items include buttons, radio buttons, checkboxes, text fields, selection menus, textareas, and submit buttons; the big form below demonstrates most of these.


Here's a big form. You can see it in action at form1.html.

<form action="form1.cgi" method="get">
  <input type = "button" value="click me!" name="b1">
  <input type = "button" value="me too!"   name="b2">
  <p> Are we having fun yet?
  <center>
  <input type="radio" name="fun" value = "yes">Yep!
  <input type="radio" name="fun" value = "no">Nope!
  <input type="radio" name="fun" value = "maybe" CHECKED>I dunno!
  </center>
  <p><b>Why are you doing this?</b>
  <input type = "checkbox" name="cbox" value="credit" CHECKED>For course credit
  <input type = "checkbox" name="cbox" value="money" CHECKED>For the money!
  <br>
  <p><b>Who are you?</b>
  <input type = "text" name="name1" value="I dunno" size=20>
  <p>Now for a menu:
  <p>
  <select size = 1 name="menu1">
  <option> first item
  <option> second item
  <option> number 3!
  <option> fourth
  <option> last
  </select>
  <p> now with size=4</p>
  <select size = 4 name="menu2" multiple>
  <option> first item
  <option> second item
  <option> number 3!
  <option> fourth
  <option> last
  </select>
  <p>Here's a text box</p>
  <textarea name="mybox" rows=5 cols = 30>
This message will be in the box. It's plain text, <b>not</b> html!
I'm going to make it long enough that a scrollbar appears,
through the miracle of cut-and-paste.
I'm going to make it long enough that a scrollbar appears,
through the miracle of cut-and-paste.
  </textarea>
  <p>
  <input type="submit">
</form>


Understand the form action, and the idea that one page can have multiple <form ...> ... </form> objects.



invoice:
    why we might initially want to avoid SQL
    how to do the math in PHP
    why we don't really want to take the PHP approach