Database week 2

Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays

Week 2

Read:

Chapter 1
Chapter 2 Section 1 on DB schemas
Chapter 3
Chapter 4

Exercises (due Feb 1)
Chapter 4, p 112: 4.5, 4.10, 4.12

More on keys

A KEY is any set of columns that is guaranteed to uniquely determine a row. Notice that this is really a constraint about row-insertion.

A table may have several keys. One of them is designated the primary key; this is the key that the DBMS most likely uses for internal indexing. Frequently it is a synthesized key; ie a unique serial number assigned by the DBMS.

Other keys are know as secondary keys or candidate keys.

Any key (primary or not) may be a composite key: that is, consisting of multiple columns. Typically, tables representing some kind of "transaction" or relationship between other tables will have a composite primary key; the same is often true of tables representing many-one relationships. In the Office example (the book calls it the Company example), the tables Works_on, Dept_locations, and Dependents all had a composite primary key. The Works_on table represents "transactions" involving employees and projects; the Dependents table key is composite because each employee may have several dependents (and we do not necessarily know the dependent SSNs), and the dept_locations key is composite because one department may have several locations.

Secondary Keys: other column combinations that are keys, but not the one intended

If you have a mysql window, and want to load up a file of SQL statements, use
source filename

If you have a file mydata.text consisting of tab-separated fields, with \N for NULL entries, you can load it as follows:

LOAD DATA LOCAL INFILE '/home/pld/453/mydata.text' INTO TABLE mytable;

You might also need LINES TERMINATED BY '\r\n'

Note that using "null" or "NULL" in place of \N does NOT have the desired effect. Also, '\n' is newline, not the same as '\N'

SQL (E&N chapter 4)

Review the idea of join:
The "extension" case: table2 in some sense extends information that could have been put in table1 except for redundancies. For each row in table1, find the unique matching row in table2. The join column in this case is likely to be a key in table2, and declared as a foreign key in table1.

Example:

    select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

The "relationship" case: table2 defines some relationship; it has a dual-column key and the join column is a key in table1 and one of the key columns in table2. Example:

    select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn;

Actually, this case is exactly the same as the first, with table1 and table2 reversed. Except that we tend to think of the employee table as representing things (people), and the works_on table as representing relationships (who works on what).

We can also have multiple matches in table2, or zero matches. Multiple matches occur in the second example (technically, multiple rows in table1 in the first example match the row in table2 with dnumber=5).

create table: office example, with and without changes to allow foreign keys

create table names as select fname,lname,ssn from employee;

SQL constraints: Section 4.2

constraint specification
constraints can be given names; there is some debate as to whether this is helpful.

Four constraints:

column type
not null
primary key
foreign key

Foreign key constraints: what happens if we insert the dname,dnumber for a department, then add employees, then do one of the following:

delete the dept record
update it, changing the dnumber from 4 to 40

From E&N p 95:

create table department(
    dname varchar(20),
    dnumber int,
    mgr_ssn char(9) not null default = '888665555',
    mgr_start date,
    constraint dept_primary_key primary key (dnumber),
    constraint dept_secondary_key unique (dname),
    constraint dept_mgr_foreign_key
       foreign key (mgr_ssn) references employee(ssn)
       on delete set default
       on update cascade
);

Named keys mean that you will be told what constraint is violated. This is less helpful than it seems. Also, constraints can be deleted by name.

But consider the on delete / on update. What these mean is that if you delete an employee e from the employee db who is a dept manager (ie that employee's ssn is used as department.mgr_ssn for some row), then the mgr_ssn is set to the default value of 888665555. Furthermore, if you update an employee entry to correct a department manager's ssn, then that corrected value is cascaded into the appropriate row(s) of the department table.

The SQL select-from-where statement: 4.3

select columns from tables where boolean condition selecting rows

The tricky part is that boolean conditions can involve joins.

Also, note that SQL is a "nonimperative" language: it has (essentially) no assignment operator.

Insert

insert into tablename values (f1val, f2val, ... , fnval);
You can also name columns:
insert into employee(fname, lname, ssn, dno) values ('john', 'smith', 345678912, 4);

Finally, instead of values you can provide a select query that returns a set of rows from some existing table, with matching "type signature" (matching column types).

Inserts can violate all four constraints.

insert into employee(fname,lname,ssn,dno)
values ('robert', 'Hatcher', '456789123', 2);

insert into employee(fname,lname,dno)
values ('robert', 'Hatcher', 5);

Update
Basic form:
    update employee
    set salary = salary*1.10
    where dno = 5;

Updates can also violate all four constraints, though the not null constraint can be violated only by setting a column to null, and the primary key constraint can only be violated by updating the primary key. Foreign key constraints are relatively easy to violate, though.

Delete
delete from tablename where boolean condition selecting rows

Here, we can only violate foreign key constraints, eg by deleting from table department the department of some existing worker.

SQL data types

INT (etc)
VARCHAR(n)
Boolean
DATE (& TIME, etc)
blobs

SQL examples

Some queries are modified to fit the Office database, from those of Ramakrishnan & Gehrke 2002; the others are from E&N.

Find all employees with salary >= 30000

select * from employee where salary >= 30000;

select e.fname, e.lname, e.salary from employee e where e.salary >=30000;

Note my use of the e name. I recommend this style for readabililty.

Query 2 of E&N, as written:

select pnumber, dnum, lname, address, bdate
from project, department, employee
where dnum = dnumber and mgr_ssn = ssn and plocation = 'Stafford';

Compare with

select p.pnumber, dnum, e.lname, e.address, e.bdate
from project p, department d, employee e
where p.dnum = d.dnumber and d.mgr_ssn = e.ssn and p.plocation = 'Stafford';

[RG] Find the list of supervisor ssns (that is, omit duplicates)

select distinct e.super_ssn from employee e;

adding "where e.super_ssn is not null"

Find all employees who have worked on project 2:

(Note that in our works_on table, no employee works on the same project more than once. Compare the solution to obtaining a list of employees who have worked on project 2, where the table is works_on_by_week: ⟨essn, pno, week, hours⟩

select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2;

Note this is a join example (not our first). We could also have written it:

select e.fname, e.lname from employee e join works_on w on e.ssn = w.essn where w.pno = 2;

Back to E&N Query 2: a dual-join example

Employees who have worked >20 hours on a project?

select e.fname, e.lname, w.pno, w.hours from employee e, works_on w
where e.ssn = w.essn and w.hours >20;

E&N Query 4: List all project numbers ofr projects involving employee 'Smith', either as worker or manager

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = 'Smith')
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = 'Smith');

It works better for Wong:

set @NAME = 'Wong';

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = @NAME)
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = @NAME);

E&N Query 12, §4.3.5

Find all employees with address in Houston, TX
substrings
Note the uncertainty; if someone put two spaces in, we're in trouble. E&N did use commas; I did not.

Query 12A: find all employees born in the 1950's.

E&N way: too clever by half. For one thing, Oracle has a different standard date format than MySQL.

Better:

select e.fname, e.lname
from employee e
where e.bdate between '1950-01-01' and '1959-12-31';

E&N Query 14
BETWEEN and salary

E&N Query 15: order by (also a triple-join)

select d.dname, e.lname, e.fname, p.pname
from department d, employee e, works_on w, project p
where d.dnumber = e.dno and e.ssn = w.essn and w.pno = p.pnumber
order by d.dname, e.lname, e.fname;

Names of employees who have worked on project 2 or project 3:

select e.fname, e.lname , w.pno from employee e, works_on w
where e.ssn = w.essn and (w.pno = 2 or w.pno =3);

Alternatively:

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
union
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

Alas, MySQL does not support the INTERSECT operator.

Attempt 2:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn=w1.essn and w1.pno = 2 and e.ssn=w2.essn and w2.pno = 3;

This works.

Names of employees who have worked on more than one project

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn = w1.essn and e.ssn = w2.essn and w1.pno <> w2.pno;

This could use a DISTINCT.

It could also be done by creating a COUNT of the number of projects worked on by each employee, and then printing those employees for which COUNT >= 2.

What joins do

Consider our earlier join example:

select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

You can think of e and d as cursors, representing rows in the respective tables employee and department. Linear search through the employee table , as in the SQL example

select e.lname, e.salary from employee e where e.salary >= 30000;

might be represented by the following java-like loop, showing a linear search through the table:

for (e in employee) {
    if (e.salary >= 30000) {
       print(e.lname, e.salary)
    }
}

Now let's write this for the join; here we get a nested loop and quadratic runtime (specifically, O(nm), where n=employee.size() and m=department.size())

for (e in employee) {
    for (d in department) {
       if (e.dno == d.dnumber) {print (e.fname, e.lname, d.dname);}
    }
}

Clearly, for triple joins over large databases, performance is an issue. This is addressed two ways: query optimization, to eliminate "obvious" inefficiencies, and internal indexes. The primary key defines a candidate for indexing, though not all indexing is on key fields.

Self-join

Suppose we want to print a list of all employees and their supervisors. Here's the approach we might take:

select e.fname, e.lname, s.fname, s.lname
from employee e, employee s
where e.super_ssn = s.ssn;

The idea here, in terms of cursors, is for e to traverse linearly the employee table. For each employee, we get e.super_ssn, and then use s to traverse the same table, looking for a row with s.ssn = e.super_ssn.

This particular example is much more readable if we relabel the supervisor columns:

select e.fname, e.lname, s.fname AS super_fname, s.lname AS super_lname
from employee e, employee s
where e.super_ssn = s.ssn;

Note that the AS here is not optional, and plays a rather different role than the AS that can be used in the from section (from employee AS e, employee AS s).