Database week 3

Comp 305/488: Database Administration, Corboy 208, 4:15 Tuesdays

Week 3

Read in Elmasri & Navathe (EN)

Chapter 1, Databases and Database Users
Chapter 2, Database System Concepts an Architecture, Section 1 on Data Models, Schemas and Instances
Chapter 3, The Relational Data Model ...., Sections 1 & 2

A tricky foreign-key constraint

Let's try to create a foreign key constraint to require that any project.plocation must appear as a dept_locations.dlocation:

alter table project ADD constraint
        FK_project_dept_locations foreign key (plocation) references
        dept_locations(dlocation);

        alter table project drop foreign key

FK_project_dept_locations;   
        -- to undo

But this fails, because dlocation is not a key of dept_locations (actually, this is a slight oversimplification; see below). Foreign-key constraints are already a performance hit; non-key lookups are in principle quite a bit less efficient. I get:

ERROR 1215 (HY000): Cannot add foreign key constraint

The error message is a bit unhelpful.

The real issue with trying to create a FK constraint with child project and parent dept_locations is not quite that dept_locations.dlocation is not a key; it is that dept_locations does not have an index on the dlocation column. The important thing is for MySQL to be able to verify a given FK constraint quickly, by using an index on the parent table. If the FK reference is to the parent table's primary key, then the index is automatic as MySQL creates an index on the primary key for every table.

More at http://dev.mysql.com/doc/refman/5.0/en/create-table-foreign-keys.html

To create an index (simplified version), use

create index indexname on tablename(indexcolumn);

create index dloc_index on dept_locations(dlocation);
drop index dloc_index on dept_locations;

At this point, the above "alter table ..." command should work.

There is also one other workaround:

create table legal_locations (
location varchar(15) primary key
);

insert into legal_locations (select distinct dlocation from dept_locations); -- no "values"

alter table project ADD foreign key (plocation) references legal_locations(location);
alter table dept_locations ADD foreign key (dlocation) references legal_locations(location);

To remove these:

show create table project;
alter table project DROP foreign key project_ibfk_2;

show create table dept_locations;
alter table dept_locations DROP foreign key dept_locations_ibfk_1;

E&N Query 4: List all project numbers of projects involving employee 'Smith', either as manager or worker

Note that projects are managed by the manager of the project's controlling department. One approach is to use the union keyword:

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = 'Smith')
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = 'Smith');

Smith, however, hasn't managed any projects. It works better for Wong:

set @NAME = 'Wong'; -- MySQL notation

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = @NAME)
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = @NAME);

Here is a version using a conventional join. Sort of:

select distinct p.pnumber
from project p, department d, employee e, works_on w
WHERE e.lname = @NAME AND (
    (d.dnumber = p.dnum and d.mgr_ssn = e.ssn)
    OR
    (p.pnumber = w.pno and w.essn = e.ssn))

There is in fact something very unconventional about this join: we can't decide what tables are being joined! The first of the OR clauses in the where section joins tables project, department and employee; the second OR clause joins tables project, works_on and employee. In fact, we cannot do this as a three-table join using the alternative table1 join table2 on... syntax.

E&N Query 12, Find all employees with address in Houston, TX

To do this, we have to use substrings and the LIKE operator:

select e.fname, e.lname from employee e
where e.address LIKE '%Houston TX%';

Note the uncertainty; if someone put two spaces in, we're in trouble. E&N did use commas in addresses; I did not.

Query 12A: find all employees born in the 1950's.

E&N way: too clever by half:

select e.fname, e.lname from employee e where e.bdate LIKE '195_-__-__';

For one thing, Oracle has a different standard date format than MySQL. (I added dashes to my pattern, hopefully for clarity.)

Better:

select e.fname, e.lname from employee e
where e.bdate between '1950-01-01' and '1959-12-31';

Note that dates are entered as strings, but they are indeed parsed by MySQL and are stored numerically.

E&N Query 14: Find employees with salary between $30,000 and $40,000

select e.fname, e.lname from employee e
where e.salary between 30000 and 40000;

E&N Query 15: Introduces order by (also a triple-join)

Give a list of employees and the projects they are working on, ordered by department, and, within departments, ordered alphabetically by lname, fname.

select d.dname, e.lname, e.fname, p.pname
from department d, employee e, works_on w, project p
where d.dnumber = e.dno and e.ssn = w.essn and w.pno = p.pnumber
order by d.dname, e.lname, e.fname;

Names of employees who have worked on project 2 or project 3:

select e.fname, e.lname , w.pno from employee e, works_on w
where e.ssn = w.essn and (w.pno = 2 or w.pno =3);

Alternatively:

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
union
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

Alas, MySQL 5.6 does not support the INTERSECT operator. (Also, we should get rid of the w.pno column in the output! Here is the revised query, which does run under postgres (below)):

(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

Attempt 2:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn=w1.essn and w1.pno = 2 and e.ssn=w2.essn and w2.pno = 3;

This works.

As a related example, suppose we want names of employees who have worked on more than one project:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn = w1.essn and e.ssn = w2.essn and w1.pno <> w2.pno;

This could use a DISTINCT.

It could also be done by creating a COUNT of the number of projects worked on by each employee, and then printing those employees for which COUNT >= 2.

What joins do

Consider our earlier join example:

select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

As mentioned last week, one can think of e and d as "cursors", or row variables, representing rows in the respective tables employee and department. Linear search through the employee table , as in the SQL example

select e.lname, e.salary from employee e where e.salary >= 30000;

might be represented by the following java-like loop, showing a linear search through the table:

for (e in employee) {
    if (e.salary >= 30000) {
        print(e.lname, e.salary)
    }
}

Now let's write this for the above join; here we get a nested loop and quadratic runtime (specifically, O(nm), where n=employee.size() and m=department.size())

for (e in employee) {
for (d in department) {
      if (e.dno == d.dnumber) {
            print(e.fname, e.lname, d.dname)
    }
}

Clearly, for triple joins over large databases, performance is an issue. This is addressed two ways: query optimization, to eliminate "obvious" inefficiencies, and internal indexes. The primary key defines a candidate for indexing, though not all indexing is on key fields.

Without a join, all you can look at is one row at a time of one table.

Self-join

Without a join, all you can look at is one row at a time of one table. When we join EMPLOYEE and DEPARTMENT, eg

select e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

we are looking at one row at a time of each table, but two tables. The DEPARTMENT table is being used here to "extend" the EMPLOYEE table: each employee record now has additional attributes from DEPARTMENT.

Suppose we want to print a list of all employees and their supervisors. Here's the approach we might take:

select e.fname, e.lname, s.fname, s.lname
from employee e, employee s
where e.super_ssn = s.ssn;

The idea here, in terms of cursors, is for e to traverse linearly the employee table. For each employee, we get e.super_ssn, and then use s to traverse the same table, looking for a row with s.ssn = e.super_ssn. Each record in the result is a combination of two rows of table EMPLOYEE, so a join is necessary.

In terms of table extension, the EMPLOYEE table is being "extended" to include additional attributes about supervisors besides just super_ssn.

This particular example is much more readable if we relabel the supervisor columns in the output

select e.fname, e.lname, s.fname AS super_fname, s.lname AS super_lname
from employee e, employee s
where e.super_ssn = s.ssn;

Note that the AS here is not optional, and plays a rather different role than the AS that can be used in the FROM section (from employee AS e, employee AS s).

WHERE conditions

In many cases, the where conditions are either join conditions or are simple Boolean expressions about a single row. Sometimes, however, the question being asked relates to other rows, as in "List employees who worked on more than one project"; if we form the join table of employees and their projects we still cannot answer this row-by-row. Another query might be to list employees who are part of some set of employees whose salaries add up to exactly 100,000.

The answer last week to the employees-who-worked-on-multiple-projects query was a "triple" join with inequality as one of the join conditions:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn = w1.essn and e.ssn = w2.essn and w1.pno <> w2.pno;

Most joins (but not this one!) are equijoins; that is, joins involving an equality relationship. Inequalities can be harder to understand. Recall that the above version introduces many duplicates, one for each way of choosing two unequal projects. "Select distinct" is better.

Actually, this particular example is easier done using a subquery and the count operator (which we'll cover below):

select e.fname, e.lname from employee e
where (select count(w.pno) from works_on w where w.essn = e.ssn) > 1

In general, WHERE conditions that just select rows of a table satisfying a given Boolean condition are fundamentally trivial. Joins in WHERE conditions are a little less trivial, and WHERE conditions that select a record (or part of a record) depending on other records in the table tend to be the hardest. Example: list all students who got straight A's; this lists the student_id values from grade_report where all of the matching rows have an A in the grade column. This last category often involves grouping, or subqueries, or some other advanced feature.

Why is SQL hard?

Basically, it's hard because it is a non-imperative language: you don't issue a sequence of commands. Instead, you write a single big expression. It's hard to visualize big expressions, and hard to test them incrementally. Nested queries do seem to help in this regard.

(Actually, the data types of SQL are tables, and you can in fact save a table in a "table variable". But this is rare.)

Perhaps because of this non-imperative nature of the language, it is relatively easy to read several SQL examples and still have trouble knowing where to start when trying to write one from scratch.

Examples from above revisited

Find all employees who have worked on project 2

This one involves a single join between employees (to get the names) and the works_on table. From the works_on table we can easily get the SSNs of the employees who worked on project 2; we need to join with table employee to convert SSNs to names.

If we change the query to

Find all employees who have worked on a project in Stafford

we can do the following. Using the project table, we can get the project number from the location. Using the works_on table, we can find the employees (by essn) who have worked on that project. And using the employee table, we can convert to employee names:

select e.fname, e.lname
from project p, works_on w, employee e
where p.plocation = 'stafford' and p.pnumber = w.pno and w.essn = e.ssn;

Note that only data from table employee is actually part of the select statement.

Query 2: For every project located in Stafford, list the project number, the controlling department number, and the department manager's lname, address, bdate.

From the project table we can get the project location, number and department:

select p.pnumber, p.dnum from project p where p.plocation='stafford';

To get the department-manager information (all from one table) we needed two joins: first a join between the project and department tables (on dnum/dnumber) to get the department mgr_ssn, and then between department and employee (on mgr_ssn and ssn).

select p.pnumber, p.dnum, e.lname, e.address, e.bdate
from project p, department d, employee e
where p.dnum = d.dnumber and d.mgr_ssn = e.ssn and p.plocation = 'Stafford';

Note that no department fields are included in the query results. The works_on table is not involved.

Query 4: Here we combine both the above: we want all the projects in which Wong has participated, either as employee or as manager. For the employee case, we have a join between works_on and employee. For the manager, we have a join between projects and departments, to get the dept manager of the controlling department, and then with the employee table.

We did the same thing with the self-join example above: given an employee e, we need to look up e.mgr_ssn in the employee table.

Joins again

Whenever you list multiple tables in the from line, you have a join. Without a join condition between two tables, you will have N×M records in the result; this is seldom what you want. With more than two tables, you don't need a join between every pair; you just need a "chain" of joins uniting everything. In general, for N tables you need N−1 join conditions.

Join conditions in the where clause serve to join different tables; they can be identified by the format table1.attribute1 = table2.attribute2 (sometimes other relational operations than = are involved, but those ar rare). Every other condition in the where clause has to do with selecting a subset of the total number of rows, thus discarding some rows from consideration. It is possible to see join conditions this way, too, discarding rows from the cross product of the two tables. But this is not terribly helpful in practice.

All this is one reason some prefer the newer syntax "from table1 join table2 on join-condition".

In all the examples presented above, one of the join attributes is the primary key of the table it appears in. Here are most of those examples again, where for each join's equality condition the attribute that is the primary key of its table is in bold.

select e.lname, d.dname from employee e, department d
where e.dno = d.dnumber;

select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = 'Smith'

select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = 'Smith'

select d.dname, e.lname, e.fname, p.pname
from department d, employee e, works_on w, project p
where d.dnumber = e.dno and e.ssn = w.essn and w.pno = p.pnumber

select e.fname, e.lname , w.pno from employee e, works_on w
where e.ssn = w.essn and (w.pno = 2 or w.pno =3);

select e.fname, e.lname
from project p, works_on w, employee e
where p.plocation = 'stafford' and p.pnumber = w.pno and w.essn = e.ssn;

When this is done with two tables, the other table (with join attribute that is not a key) can be thought of as being extended. For each record in the other table, we use the join attribute to look up a single record from the first table (using the primary key).

Note that in most cases only one join-condition attribute is a primary key; the other though is a foreign key.

Are there any cases where neither of the join's attributes is a key of its relation? Mathematically this can certainly happen; in fact, a join condition does not even have to be based on an equality comparison. But coming up with a meaningful example is a different thing. Here's the best I could do. Suppose we want a list of all employees who live in the same town as some department's office. We might want such a list, for example, for emergency preparedness. We can match the employee city field with the dept_locations dlocation field; neither of these is a key. Actually, in the Company database we have to use the employee address field, and do substring matching. This is what I got:

select distinct e.ssn, e.lname, e.address from employee e, dept_locations dl
where e.address LIKE concat('%', dl.dlocation, '%');

I used distinct because an employee in Houston would otherwise be listed for both departments 1 and 5 (both of which have offices in Houston).

Postgres

In the 1970's, the Ingres database project began at Berkeley. Ingres originally stood for Interactive graphics retrieval system, though shortly thereafter the "graphics" part became much less important.

After Ingres (which does still exist) came Postgres, for "post-ingres".

Later, to emphasize its support for SQL, Postgres became PostgreSQL.

Note that it is virtually impossible to pronounce PostgreSQL using the "sequel" form for SQL; if you want a short form, use "Postgres".

A decade ago, MySQL was the "fast one" and Postgres was the one that accurately implemented the SQL standard. That's less true now, though MySQL is still missing intersection (and a few other things).

Postgres can be found at http://postgresql.org.

After you install the software, you still have to create appropriate accounts; the postgres account is the only one installed by default. Here's what I had to do:

1. Log in as user "postgres", and run the psql command. The "bash>" prompt below is the shell prompt (as user postgres); the "post=> " prompt is the psql SQL prompt.

    bash> sudo -u postgres psql
    post=> create user 'pld';

2. From the shell, still as user postgres

    bash> createdb pld

Then, as user pld, the command "psql" works. If you created several databases above, eg "pldcompany", then the command to connect to postgres using the pldcompany database would be psql pldcompany. I did not actually figure out how to give myself the right to create new databases.

Some postgres commands

Your psql prompt should show you the current database.

SHOW ALL -- shows status variables only

\d same as mysql show tables;

\d employee same as mysql describe employee;

Here's another way to list your tables:

select table_name from information_schema.tables where table_schema='public' and table_type='BASE TABLE';

Intersect example

This now actually works, under postgres, but not in the following form:

old:

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

What is the problem?

new:

(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

Above, we did this as follows:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn=w1.essn and w1.pno = 2 and e.ssn=w2.essn and w2.pno = 3;

The intersect operator itself would have been trivial to implement in MySQL, but that is only part of the story. It is trivial to implement the intersect operator in order to get the correct resuts. It is nontrivial to implement it in a way that allows the system as a whole to do reasonable-quality query optimization.

Generally, intersections are easy to implement as joins, and the join approach may be more efficient. As a general rule, an intersection of the form

(select a.col1, a.col2 from TableA a)
intersect
(select b.col1, b.col2 from TableB b)

is equivalent to the following (which may be a self-join if TableA and TableB are the same table):

select a.col1, a.col2 from TableA a join TableB b on a.col1 = b.col1 and a.col2 = b.col2

In some cases only one of the two join conditions may be necessary; if col1 is a primary key, for example, then you can drop the col2 comparison.

The Company query above, where we find employees who have worked on project 2 and on project 3, is of this general form, though we only needed the employee table once in the join. TableA corresponds to employee join works_on w1, and TableB corresponds to employee join works_on w2.

Demo of Oracle

To start (on the virtual machine)

sqlplus / as sysdba
sqlplus pld

To view tables:

select table_name from user_tables;
select owner, table_name from all_tables;
select * from user_objects where object_type = 'TABLE';

A schema is a collection of objects (eg tables) under common ownership. Thus, I could use different owners (and so different schemata) to represent what in MySQL corresponds to different "databases". But in MySQL, different databases don't have to have different owners. (Actually, MySQL tables don't really have "owners"; instead, users are GRANTed rights to the tables.)

One can in Oracle also create multiple tablespaces. These aren't really like MySQL databases either; they are there to assign storage space to particular sets of tables.

Question: what Oracle permissions will allow me to "factor" a table into two tables? In MySQL, if I own the database, I can do whatever I want with the tables in that database.

To load the company tables, I had to make a few changes. First, the default date format in MySQL is, eg, 27-01-2014; the default format in Oracle is 27-jan-14. There's a way to get Oracle to read in dates in a different format, but I didn't bother.

Second, the insert into tablename values command of Oracle takes only a single row; I had to edit this part of the load file.

Also, I named all the constraints. The default Oracle names are things like sys_c0033725; it is not clear if such things are Foreign-Key constraints or something else.

To see the constraints on a table:

Constraints have the following one-letter codes: P = primary key, R = foreign key C = check (eg not null)

select constraint_name,constraint_type from user_constraints where table_name = 'WORKS_ON';

select constraint_name, table_name from user_constraints where constraint_type='R';

To see the indexes on a table:

select index_name, index_type, table_name from user_indexes;

The index_name values are cryptic, and the index_type values are all 'NORMAL'.

Here are a few queries from above to run under Oracle:

select e.fname, e.lname , w.pno from employee e, works_on w
where e.ssn = w.essn and (w.pno = 2 or w.pno =3);

select e.fname, e.lname, s.fname, s.lname
from employee e, employee s where e.super_ssn = s.ssn;

(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 3);