Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 12
Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.
Hashing
Hashing can be used to create a form of index. We can
use hashing with equality comparisons, but not order comparisons, which
is sto say hashing can help us find a record with ssn=123456789, but
not records with salary between 40000 and 50000. Hash indexes are, in
theory, a good index for joins. Consider the join
select e.lname, w.pno, w.hours from employee e, works_on w where e.ssn = w.essn;
we might choose each record e from employee, and want to find all
records in works_on with e.ssn = w.essn. A hash index for works_on, on
the field essn, can be useful here. Note this can work even though
works_on.essn is not a key field.
There is one problem: we are likely to be retrieiving blocks of
works_on in semi-random order, which means one disk access for each
record. However, this is still faster than some alternatives.
Ch 18: indexing
ISAM index
To create the ISAM index, we start with the primary
index, with an index entry for the anchor record of each block, or a secondary index, with an entry for each record. Call
this the base level, or first level, of the index. We now create a second level index containing an index entry for the anchor record of each block of the first-level index. See Fig 18.6. This is called an indexed sequential file, or an ISAM file.
What is involved in inserting
new records into an ISAM structure? The first-level index has to be
"pushed down"; unless we have left space ahead of time, most blocks
will need to be restructured. Then the second and higher-level indexes
will also need to be rebuilt, as anchor records have changed.
Example: we have two records per block, and the first level is
1 2 3
5 7 9 20
1 3 7 20
What happens when we insert 8? 4? 6?
What happens when we "push up a level", that is, add an entry that forces us to have one higher level of index?
B-trees (Bayer trees)
Consider a binary search tree for the moment. We decide which of the two leaf nodes to pursue at each point based on comparison.
We can just as easily build N-ary search trees, with N leaf nodes and N-1 values stored in the node. Consider the ternary example for a moment.
Next, note that we can have a different N at each node!
Bayer trees: let's have a max of 4 data items per node.
1 4 8 9 20 22 25 28
How can we add so as to remain balanced? We also want to minimize partial blocks. A B-tree of order p means that each block has at most p tree pointers, and p-1 key values. In addition, all but the top node has at least (p-1)/2 data values.
"push-up" algorithm: we add a new value to a leaf block. If there is room, we are done. If not, we split the block and push up the middle value to the parent block.
The parent block may now also have to be split.
B+ trees: slight technical improvement where we replicate all the key values on the bottom level.
Databases all support some mechanism of creating indexes, of specific types. For example, MySQL allows
CREATE INDEX indexname ON employee(ssn) USING BTREE; // or USING HASH
Example query (E&N6, p 660):
select e.fname, e.lname from employee e where e.dno = 4 and e.age=59;
If we have an index on dno, then access that cluster using the index and search for age.
If we have an index on age, we can use that.
If we have an index on (dno, age), we have a single lookup! But we also have expensive indexes to maintain.
Indexes can be ordered or not; ordered indexes help us find employees
with e.age >= 59. Ordered and nonordered indexes both support
lookup; ordered indexes also in effect support return of an iterator yielding all DB records starting at the specified point. BTREE and ISAM indexes are ordered; HASH indexes are not.
Query processing
First, note it is relatively easy to sort
records. Typically we will use merge-sort, where we subdivide into
blocks small enough to fit into memory. We sort those with, say,
quicksort. We then make passes until each pair of consecutive blocks is
merged.
Select
Some examples
- select * from employee e where e.ssn = '123456789'
- select * from department where dnumber >=5
- select * from empoyee where dno = 5
- select * from employee where dno = 5 and salary > 30000 and sex = 'F'
- select * from works_on where essn='123456789' and pno=10
Methods of implementation:
Linear search: we read each disk block once.
Binary search: this is an
option if the selection condition is an equality test on a key
attribute, and the file is ordered by that attribute.
Primary index: same requirement as above, plus we need an index
Hash search: equality comparison on any attribute for which we have a hash index (need not be a key attribute)
Primary index and ordered comparison: use the index to find all departments with dnumber > 5
Clustered index (includes hash indexes)
B-tree index
If we are searching for "dno=5 AND salary=4000" we can use an index for either attribute, or a composite index if one exists.
In general, we'd like to choose the condition with the fewest results, and then continue with the remaining conditions.
For OR conditions, we just have to search for each.
Join
Mostly we focus on equijoins:
- select * from employee e, department d where e.dno = d.dnumber
- select * from department d, employee e where d.mgr_ssn = e.ssn
Methods of impementation:
1. Nested-loop join: this is where we think of a for-loop:
for (e in employee) {
for (d in department) {
if (e.dno = d.dnumber) print_it(e,d);
}
}
This is quadratic.
2. index method: if we have an index on one of the attributes, we can use it:
for (e in employee) {
d = lookup(department, e.ssn);
if (d != null) print_it(e,d);
}
Note that any index will do, but that this may involve retrieving several disk blocks for each e. Still, it's better than Method 1.
3. Sort-merge join: we sort both files on the attribute in question,
and then do a join-merge. This takes a single linear pass. This is most
efficient if the files are already sorted, but note that it's still faster than 1 (and possibly faster than 2) if we have to sort the files.
4. Partition-hash join: Let the relations (record sets) be R and S. We
partition both files into Ri = {r in R | hash(r) = i}. Now we note that
the join R ⋈ S is simply the disjoint union of the Ri ⋈ Si. In most
cases, either Ri or Si will be small enough to fit in memory.
A good hash function helps; we want the buckets to be relatively uniform. We won't get that, but we should
hope for Poisson distribution. With N things in K buckets, we expect an
average of a=N/K per bucket. The probability of i things is aie-a/i! and the expected number of i-sized buckets is about N*aie-a/i!.
The join selection factor is the fraction of records that will participate in the join. In the example
select * from department d, employee e where d.mgr_ssn = e.ssn
all departments will be involved in the join, but almost no employees.
So we'd rather go through the departments, looking up managing
employees, rather than the other way around. Book example, EN6 p 693:
we have indexes on both fields. The employee.ssn index takes 4
accesses, and the dept.mgr_ssn index takes two. We have 50 departments
(10 blocks) and 6000 employees (2000 blocks). To do the join by
employee, we have 2000 accesses for the employees, plus, for each
employee, 3 block accesses to find the dept manager. That's 2000 +
6000*3 = 20,000 accesses. In the other order, it's 10 blocks for all
the departments plus for each dept 5 block accesses to find the
manager, for a total of 10 + 5*50 = 260.
Query optimization
As with compilers, the word "optimization" is used loosely.
A major rule is to apply select/project before joins, and in general do
operations first that have greatest potential for reducing the size of
the number of records.
Query tree: See fig 19.4.
Select lname from employee e, works_on w ,project p
where e.ssn = w.essn and w.pno = p.pnumber and p.pname = 'Aquarius';
See fig 19.5, but note that doing cross-products was never on the agenda. However, it still does help to do the second join first.
Some heuristic rules:
- Move the more-restrictive SELECT operations down towards the leaf nodes; that will reduce number of records.
- Same with PROJECT; try to do joins last
- Do the joins in the right order, using the join selection factor estimate
- Do the smaller joins first (or, more generally, do the joins that
will result in an absolutely smaller set of records, whether this is
because the relations were small to start with or because one condition
was highly selective).
PHP programming
lamp.cslabs.luc.edu: MySQL is case-sensitive
regarding table names! But now it's fixed. (Remember this, though, if
you're having trouble with LAMP on another linux system.)
I changed connect() to try both mysql and mysqli. See the code. It also is slightly more
revised table_format()
new lib353.php;
include 'lib353.php';
employees.php has been split into employees.php and lib353.php.
makeSelectMenu(): use as a model for creating selection menu boxes.
makePageButtons(): a function in lib353.php that creates buttons to
the other pages. Edit as necessary. Note that four buttons are embedded
in a table, which is why they are horizontal rather than vertical. Note
also that each button has its own form.
You can choose your own error reporting, with a line like the following (in lib353.php)
error_reporting (E_ALL & ~E_NOTICE);
To get it to eliminate MDB2.php errors, put it before the require 'MDB2.php'; line. You might also want to try
error_reporting (E_ALL & ~E_NOTICE & ~E_DEPRECATED);
testing
for errors in prepare(): after you do $stmt = $db -> prepare($query,
...), you can check if $stmt is an error object with
if (MDB2::isError($stmt)) { ... }
This way you can catch errors not just from execute(), but from prepare().
logic for Employee_Update
You will have two forms. Form1 is for initial arrival, with a button with name submit_ssn (also a text box for the ssn itself). Form2 is for doing the updates; it takes the ssn from form1, displays it readonly, and provides boxes for the existing employee fields, ready for the user to update. It also provides a button with name 'update'.
The main logic is as follows:
if ($_POST['submit_ssn']) { // came from form1
$ssn=$_POST['ssn'];
// get ssn supplied by user in
the ssn box
printform2($db, $ssn);
// get employee info and
display preloaded in form
} else if ($_POST['update']) { // came from form2
$ssn=$_POST['ssn'];
// get ssn from the readonly box
on form2, thus guaranteed to be unchanged
update_employee($db, $ssn);
printform1($db); // back to form 1
} else {
// first arrival
printform1($db);
}
Note that when form2 displays the ssn, it must display it as a
readonly text box, because we otherwise have no way to retrieve the
original ssn when it is time to do the update. If you want to allow the
user to update the ssn itself, you must provide a second textbox for that, with a different name.
A readonly textbox is created with
<input type="text" name="ssn" value="$ssn" readonly>
For form2, you will need to retrieve the employee record corresponding
to the supplied ssn, and then offer a bunch of input text boxes that
allow the user to edit these fields. The text boxes themselves come
from my employees.php; to fill them in, you set the value attribute:
<input type="text" name="fname" value="$fname"> first name<p>
<input type="text" name="minit" value="$minit"> middle initial<p>
<input type="text" name="lname" value="$lname"> last name<p>
(The above is from a heredoc
format where quotes in the body are preserved (ie printed as part of
the output), but the variables are expanded to their values.)
Form2 must:
- execute the query "select * from employee where ssn=?", using the supplied ssn for ?
- retrieve the columns into php variables $fname, $lname, etc
- use these php variables to create the form, as above
- provide an update button. Not only might the value be "update", but the name must be update. The name is what is used as the associative-array index to $_POST.
By the way, we've been using if ($_POST['submit_ssn'])
to test if the associative array $_POST has a key value 'submit_ssn'
(meaning that there was a button on the invoking form that had name="submit_ssn"). But that is E_DEPRECATED; the PHPolice think you shoud use
if (array_key_exists('submit_ssn', $_POST)) {...}
I've made this change to my employees.php that is available to you.
lamp.cslabs.luc.edu demo
error about no method MDB2::error->execute()
empupdate.php
Variants: empupdate1 (above): preliminary version
employee_update: working version, with checkbox for optionally viewing the full record after the update
empupdate2: has radio button for selecting employee
Issues with works_on:
form1: show the works_on table, maybe with employee and project names too, and provide boxes for essn, pno, hours?
form2 might allow you to look up the hours for a given essn and pno,
but you can also just figure that the user will copy the hours. Or else
provide two buttons, one for new assignment and one for update hours. This doesn't work as well for the employee table.
You can do all this with just form2.
More on forms
A form is enclosed between <form method="post" action="program">
and </form> tags. The usual value for action is "", meaning to
re-invoke the same program (in our case, a php program), but now with posted data values.
Basics
Inside the form you can have input items, of the form <input type="input-type" ... >. Input items can be
- text (one-line text boxes)
- button (clickable buttons)
- radio (radio buttons)
- checkbox (checkboxes)
- select: menus with options
- select ... multiple: allows selecting several items simultaneously
- textarea (scrolling text boxes)
- submit (submit buttons, semantically different from ordinary buttons)
- more
Here are some parts of last week's big form, in action at form1.html.
For radio buttons, note that they all have the same name. The value is what will be returned by $_POST['fun'] here. One radio button can have the checked attribute.
<input type="radio" name="fun" value = "yes">Yep!
<input type="radio" name="fun" value = "no">Nope!
<input type="radio" name="fun" value = "maybe" CHECKED>I dunno!
For checkboxes, the value is again what will be returned. I gave
several checkboxes the same name, so with some cgi interfaces an array
of checked values would be returned. For our purposes, it would have
been better to give each checkbox a unique name. Also, we don't care
about the values; to check that a box was checked we can check
$_POST['cbox'] (though here we wouldn't know which box unless we used
unique names).
<input type = "checkbox" name="cbox" value="credit" CHECKED>
<input type = "checkbox" name="cbox" value="money" CHECKED>
Understand the form action, and the idea tha tone page can have multiple <form ...> ... </form> objects.