Database week 3

Comp 305-001/488-305: Database Administration

Spring 2016: Tuesdays, 4:15-6:45, Corboy L08

Week 3

Read in Elmasri & Navathe (EN). Chapters are labeled (7th edition)/(6th edition)

Chapter 1/1, Databases and Database Users (skim)
Chapter 2/2, Database System Concepts an Architecture, Section 1 on Data Models, Schemas and Instances
Chapter 5/3, The Relational Data Model and SQL Sections 1 & 2
Chapter 6/4, Basic SQL

Homework 1

EN7 Chapter 6, p 203 / EN6 chapter 4, page 112: exercises 10abc and 12abcd

These exercises appear in full on Sakai, in the Resources folder.

Due: Thursday, Feb 11, on Sakai

Rules:

Your queries should work even if additional data is added to the tables.

Every table referred to in the FROM clause should be named with a "table alias", that is:
select e.lname, d.dname from employee e, department d ...

All answers should be in the form of a single query; do not retrieve a value with one query and then manually plug that value into a second query. Similarly, data appearing in the query should be from the exercise itself, and not from a "visual" lookup.

All SQL should be entered in a format that I can copy and paste directly into a command window. In particular, make sure:

All "prompt" characters such as "->" have been stripped out.
You use regular quotation marks, not special unicode quotation marks. Unicode-quoted ‛foo’ is not the same as 'foo'. (A sure way to convert inadvertently to unicode is to paste your work into a Word document; try installing Notepad++.)
There are no leading tabs or spaces

From last week:

1. Figuring out joins: often an ER diagram helps, in which each table's attributes are listed. One can then draw a line between the joined attributes.

2. table aliases vs column aliases: very different things

3. Review other queries

4. count(*) and other aggregation functions

This works (well, ran without error) under MySQL:

select lname, max(salary) from employee;

It does not work under postgres, and should not!!

The rule: any non-aggregate attributes (eg lname above) must appear in a GROUP BY clause. We'll get to that.

4. Review other queries

A few more basic concepts

Database tables are usually "self-describing", in that the table description and/or relationships to other tables is often embedded in the table description. At a minimum, columns have names and types.

The table definitions, column names and their types are generally known, collectively, as the database schema.

Database tables can be changed without the need to recompile programs using that table: new columns can be added, or entire new tables. Essentially this is because DBMS queries are interpreted, and table columns are identified by name rather than offset. (C programs that access record fields do need recompilation if fields are changed.)

Databases often support different views, perhaps for users with different privilege levels. A view may be a subset of the original set of columns, or it may contain some computed columns in lieu of the original columns. Excluding columns such as social_security_num or salary is relatively common.

Transactions are sets of related updates, eg removing money from one account and adding it to another, or perhaps dropping one class and adding another (sadly, remarkably many student DBMSs lack the latter transaction operation). EN defines a transaction to be the result of an executing program rather than a set of related updates; this latter definition might include adding multiple new rows to one table.

Transaction processing must satisfy the isolation property, that transactions appear to execute in isolation from one another, and the atomicity property, which says that each transaction is completely executed or not executed at all. (This is sometimes described as that transactions must meet the ACID test: Atomicity, Consistency, Isolation, Durability. However, consistency and durability are relevant to all database updates; they are not particular to transactions.)

Here is a simple failure of isolation: two transactions are transferring money to other accounts.

Transaction 1 is transferring $100 from A to B
Transaction 2 is transferring $50 from A to C

The total amount of money should always remain $1000.

transaction 1	transaction 2	Acct A	Acct B	Acct C
		1000	0	0
get value of A: $1000		1000	0	0
add $100 to B		1000	100	0
	Debit A by $50	950	100	0
Store $1000 - $100 in A		900	100	0
	Credit C by $50	900	100	50

Multi-user DBMSs need concurrency control.

Concurrency control leads to locks, the other great performance bottleneck of RDBMSs.

Section 1.6 of EN talks about some of the advantages of having a central DBMS rather than individual DBMSs maintained by each administrative group. This is pretty much a settled issue now, though it does mean that the "natural owners" of data in an organization (eg registration and records at Loyola, for student registration data) will not in fact own that data.

Relational DBs (EN7 chapter 5 / EN6 chapter 3)

A relation is any set of tuples
The set of all possible tuples is the CROSS PRODUCT of some domains
        col1 × col2 × col3 × ... × colN
Example: A = {1,2,3}, B = {x,y} C = {1,2}
        A × B
        A × C
        < relation in A × C
        <= relation in A × C

DB relations are not defined by rule, but by tabulation!

Given attribute sets A₁, A₂, ..., A_n, a relation is a subset of the cartesian product A₁×A₂×...×A_n; that is, a set of tuples ⟨a₁,a₂,...,a_n⟩ where each a_i∈A_i. These tuples may also be called records.

Relations in a DB are represented as tables. EN also uses the term relation state to refer to a specific set of records in a table.

STUDENT table, EN p 63

Name	SSn	Home_phone	Address	Office_phone	Age	GPA
Benjamine Bayer	305-61-2435	817-373-1616	2918 bluebonnet Lane	NULL	19	3.21
Chung-cha Kim	381-62-1245	817-375-4409	125 Kirby Road	NULL	18	2.89
Dick Davidson	422-11-2320	NULL	3452 Elgin Road	817-749-1253	25	3.53
Rohan Panchal	489-22-1100	817-376-9821	265 Lark Lane	817-749-6492	28	3.93
Barbara Benson	533-69-1238	817-839-8461	7384 Fontana Lane	NULL	19	3.25

Note the

attributes
tuples

Also note that some entries are NULL. This means undefined or not available or not known; unfortunately, these three options are not synonymous or interchangeable. NULL values are essential, but they do introduce some complications. The first is that records with NULL entries are not in fact elements of A₁×A₂×...×A_n; they are elements of
(A₁ ∪ {NULL}) × (A₂ ∪ {NULL}) × ... × (A_n ∪ {NULL})

EN also gives an alternative definition of a relation, as a set of maps from the attribute set to the set of attribute values, where the attribute set is essentially the set of names of columns. With this approach, a null entry is represented by a partial map, undefined for some attributes.

Note that we must be careful when comparing null values: if two people have NULL as their Office_phone, it does not mean they have the same phone! Worse, we simply do not know if the NULL means we don't know their phone, or if they simply do not have one, or if they have no office at all and so the "office_phone" is irrelevant.

Joins

As we saw briefly last week, the join is the operation of creating all records merged from two (or more) tables, where one attribute of one table is required to match a corresponding attribute of another. Usually, but not always, the column-matching is based on equality of corresponding attributes.

Examples:
University:

Listing all students in Section 112 (my data)
Printing all of each student's grades, by joining the Student_number fields of STUDENT and GRADE_REPORT
Printing all sections including Course_name, joining COURSE and SECTION on the Course_number field

The first example we did last week as follows:

select s.name from student s, grade_report gr
where s.student_number = gr.student_number and gr.section_identifier = 112;

Company:

Printing the name and address of all employees who work in the 'Research' dept (Query1 on EN p 100; uses employee and department tables)
Printing the project number, dept number, and the dept manager's name, for all projects located in 'Stafford' (Query 2 on EN p 100; uses project and department tables)
Printing each employee's name and his or her supervisor's name (Query 8, EN p 101; uses employee table joined to itself)

Demos of these

A full Cartesian product would be denoted in SQL by, eg,

select * from employee, department;

where there is no WHERE clause establishing a relation between the two tables.

The join is conceptually somewhat inefficient. Lots of behind-the-scenes optimization makes it fast.

More on keys

A KEY is any set of columns that is guaranteed to uniquely determine a row.
Primary Key: the key the database developer thinks is most important; usually a single attribute if there is one
Composite Key: multiple columns (eg the GRADE_REPORT table). Note that there is no single-column key here. Can also be the primary key.

Secondary Keys: other column combinations that are keys, but not the one "intended".

Note that keys are not properties of particular tables, but rather of the "table schema". They represent design constraints.

SQL examples

Some queries are from Ramakrishnan & Gehrke 2002 but are modified to be appropriate for E&N's Office database; the others are from E&N directly.

The E&N example numbering is peculiar because the book was radically restructured with the 6th Edition, and examples kept their numbers from previous editions.

Find all employees with salary >= 30000

select * from employee where salary >= 30000;

select e.fname, e.lname, e.salary from employee e where e.salary >=30000;

Note my use of the e table alias. I recommend this style for readabililty. You can think of e as a variable that ranges over all the rows, though it looks syntactically more like it represents a table.

Query 2 of E&N

For every project located in Stafford, list the project number, the controlling department number, and the department manager's lname, address, bdate.

Solution as written:

select pnumber, dnum, lname, address, bdate
from project, department, employee
where dnum = dnumber and mgr_ssn = ssn and plocation = 'Stafford';

Note that this is a dual-join example.

Compare with the version with names:

select p.pnumber, dnum, e.lname, e.address, e.bdate
from project p, department d, employee e
where p.dnum = d.dnumber and d.mgr_ssn = e.ssn and p.plocation = 'Stafford';

Find the list of supervisor ssns (that is, omit duplicates)

select distinct e.super_ssn from employee e;

Looking at the output, it might be good to add "where e.super_ssn is not null"

Find all employees who have worked on project 2

(Note that in our works_on table, no employee works on the same project more than once. Compare the solution to obtaining a list of employees who have worked on project 2, where the table is works_on_by_week: ⟨essn, pno, week, hours⟩

select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2;

Note this is a join example (not our first). We could also have written it:

select e.fname, e.lname from employee e join works_on w on e.ssn = w.essn where w.pno = 2;

Find all employees who have worked >20 hours on a project

select e.fname, e.lname, w.pno, w.hours from employee e, works_on w
where e.ssn = w.essn and w.hours >20;

A tricky foreign-key constraint

Let's try to create a foreign key constraint to require that any project.plocation must appear as a dept_locations.dlocation:

alter table project ADD constraint
        FK_project_dept_locations foreign key (plocation) references
        dept_locations(dlocation);

        alter table project drop foreign key

FK_project_dept_locations;   
        -- to undo

But this fails, because dlocation is not a key of dept_locations (actually, this is a slight oversimplification; see below). Foreign-key constraints are already a performance hit; non-key lookups are in principle quite a bit less efficient. I get:

MySQL: ERROR 1215 (HY000): Cannot add foreign key constraint [not very helpful!]
Postgres: ERROR: there is no unique constraint matching given keys for referenced table "dept_locations"

What the Postgres error message means is that dept_locations.dlocation -- the "referenced table" -- is not a key. But the real issue is that table dept_locations does not have an index on the dlocation column. The important thing is for the RDBMS to be able to verify a given FK constraint quickly, by using an index on the parent table. If the FK reference is to the parent table's primary key, then the index is automatic as MySQL creates an index on the primary key for every table.

More at http://dev.mysql.com/doc/refman/5.0/en/create-table-foreign-keys.html

To create an index (simplified version), use

create index indexname on tablename(indexcolumn);

create index dloc_index on dept_locations(dlocation);
drop index dloc_index on dept_locations;

At this point, the above "alter table ..." command works under MySQL, though it still does not work under Postgres. Here is the Postgres rule:

A foreign key must reference columns that either are a primary key or form a unique constraint. This means that the referenced columns always have an index (the one underlying the primary key or unique constraint); so checks on whether a referencing row has a match will be efficient.

That is, the actual requirement is that the referenced attribute (or set of attributes) be a key, though the reason given for this is so that there is an index (which we could in principle add separately).

There is also one other workaround:

create table legal_locations (
location varchar(15) primary key
);

insert into legal_locations (select distinct dlocation from dept_locations); -- no "values"

alter table project ADD foreign key (plocation) references legal_locations(location);
alter table dept_locations ADD foreign key (dlocation) references legal_locations(location);

To remove these:

show create table project;
alter table project DROP foreign key project_ibfk_2;

show create table dept_locations;
alter table dept_locations DROP foreign key dept_locations_ibfk_1;

E&N Query 4: List all project numbers of projects involving employee 'Smith', either as manager or worker

Note that projects are managed by the manager of the project's controlling department. One approach is to use the union keyword:

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = 'Smith')
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = 'Smith');

Smith, however, hasn't managed any projects. It works better for Wong:

set @NAME = 'Wong'; -- MySQL notation

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = @NAME)
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = @NAME);

Here is a version using a conventional join. Sort of:

select distinct p.pnumber
from project p, department d, employee e, works_on w
WHERE e.lname = @NAME AND (
    (d.dnumber = p.dnum and d.mgr_ssn = e.ssn)
    OR
    (p.pnumber = w.pno and w.essn = e.ssn))

There is in fact something very unconventional about this join: we can't decide what tables are being joined! The first of the OR clauses in the where section joins tables project, department and employee; the second OR clause joins tables project, works_on and employee. In fact, we cannot do this as a three-table join using the alternative table1 join table2 on... syntax.

E&N Query 12, Find all employees with address in Houston, TX

To do this, we have to use substrings and the LIKE operator:

select e.fname, e.lname from employee e
where e.address LIKE '%Houston TX%';

Note the uncertainty; if someone put two spaces in, we're in trouble. E&N did use commas in addresses; I did not.

Query 12A: find all employees born in the 1950's.

E&N way: too clever by half:

select e.fname, e.lname from employee e where e.bdate LIKE '195_-__-__';

For one thing, Oracle has a different standard date format than MySQL. (I added dashes to my pattern, hopefully for clarity.)

Better:

select e.fname, e.lname from employee e
where e.bdate between '1950-01-01' and '1959-12-31';

Note that dates are entered as strings, but they are indeed parsed by MySQL and are stored numerically.

E&N Query 14: Find employees with salary between $30,000 and $40,000

select e.fname, e.lname from employee e
where e.salary between 30000 and 40000;

E&N Query 15: Introduces order by (also a triple-join)

Give a list of employees and the projects they are working on, ordered by department, and, within departments, ordered alphabetically by lname, fname.

select d.dname, e.lname, e.fname, p.pname
from department d, employee e, works_on w, project p
where d.dnumber = e.dno and e.ssn = w.essn and w.pno = p.pnumber
order by d.dname, e.lname, e.fname;

Names of employees who have worked on project 2 or project 3:

select e.fname, e.lname , w.pno from employee e, works_on w
where e.ssn = w.essn and (w.pno = 2 or w.pno =3);

Alternatively:

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
union
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

Alas, MySQL 5.6 does not support the INTERSECT operator. (Also, we should get rid of the w.pno column in the output! Here is the revised query, which does run under postgres (below)):

(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

Attempt 2:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn=w1.essn and w1.pno = 2 and e.ssn=w2.essn and w2.pno = 3;

This works.

As a related example, suppose we want names of employees who have worked on more than one project:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn = w1.essn and e.ssn = w2.essn and w1.pno <> w2.pno;

This could use a DISTINCT.

It could also be done by creating a COUNT of the number of projects worked on by each employee, and then printing those employees for which COUNT >= 2.

What joins do

Consider our earlier join example:

select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

As mentioned last week, one can think of e and d as "cursors", or row variables, representing rows in the respective tables employee and department. Linear search through the employee table , as in the SQL example

select e.lname, e.salary from employee e where e.salary >= 30000;

might be represented by the following java-like loop, showing a linear search through the table:

for (e in employee) {
    if (e.salary >= 30000) {
        print(e.lname, e.salary)
    }
}

Now let's write this for the above join; here we get a nested loop and quadratic runtime (specifically, O(nm), where n=employee.size() and m=department.size())

for (e in employee) {
for (d in department) {
      if (e.dno == d.dnumber) {
            print(e.fname, e.lname, d.dname)
    }
}

Clearly, for triple joins over large databases, performance is an issue. This is addressed two ways: query optimization, to eliminate "obvious" inefficiencies, and internal indexes. The primary key defines a candidate for indexing, though not all indexing is on key fields.

Without a join, all you can look at is one row at a time of one table.

Self-join

Without a join, all you can look at is one row at a time of one table. When we join EMPLOYEE and DEPARTMENT, eg

select e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

we are looking at one row at a time of each table, but two tables. The DEPARTMENT table is being used here to "extend" the EMPLOYEE table: each employee record now has additional attributes from DEPARTMENT.

Suppose we want to print a list of all employees and their supervisors. Here's the approach we might take:

select e.fname, e.lname, s.fname, s.lname
from employee e, employee s
where e.super_ssn = s.ssn;

The idea here, in terms of cursors, is for e to traverse linearly the employee table. For each employee, we get e.super_ssn, and then use s to traverse the same table, looking for a row with s.ssn = e.super_ssn. Each record in the result is a combination of two rows of table EMPLOYEE, so a join is necessary.

In terms of table extension, the EMPLOYEE table is being "extended" to include additional attributes about supervisors besides just super_ssn.

This particular example is much more readable if we relabel the supervisor columns in the output

select e.fname, e.lname, s.fname AS super_fname, s.lname AS super_lname
from employee e, employee s
where e.super_ssn = s.ssn;

Note that the AS here is not optional, and plays a rather different role than the AS that can be used in the FROM section (from employee AS e, employee AS s).

WHERE conditions

In many cases, the where conditions are either join conditions or are simple Boolean expressions about a single row. Sometimes, however, the question being asked relates to other rows, as in "List employees who worked on more than one project"; if we form the join table of employees and their projects we still cannot answer this row-by-row. Another query might be to list employees who are part of some set of employees whose salaries add up to exactly 100,000.

The answer last week to the employees-who-worked-on-multiple-projects query was a "triple" join with inequality as one of the join conditions:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn = w1.essn and e.ssn = w2.essn and w1.pno <> w2.pno;

Most joins (but not this one!) are equijoins; that is, joins involving an equality relationship. Inequalities can be harder to understand. Recall that the above version introduces many duplicates, one for each way of choosing two unequal projects. "Select distinct" is better.

Actually, this particular example is easier done using a subquery and the count operator (which we'll cover below):

select e.fname, e.lname from employee e
where (select count(w.pno) from works_on w where w.essn = e.ssn) > 1

In general, WHERE conditions that just select rows of a table satisfying a given Boolean condition are fundamentally trivial. Joins in WHERE conditions are a little less trivial, and WHERE conditions that select a record (or part of a record) depending on other records in the table tend to be the hardest. Example: list all students who got straight A's; this lists the student_id values from grade_report where all of the matching rows have an A in the grade column. This last category often involves grouping, or subqueries, or some other advanced feature.

Why is SQL hard?

Basically, it's hard because it is a non-imperative language: you don't issue a sequence of commands. Instead, you write a single big expression. It's hard to visualize big expressions, and hard to test them incrementally. Nested queries do seem to help in this regard.

(Actually, the data types of SQL are tables, and you can in fact save a table in a "table variable". But this is rare.)

Perhaps because of this non-imperative nature of the language, it is relatively easy to read several SQL examples and still have trouble knowing where to start when trying to write one from scratch.

Examples from above revisited

Find all employees who have worked on project 2

This one involves a single join between employees (to get the names) and the works_on table. From the works_on table we can easily get the SSNs of the employees who worked on project 2; we need to join with table employee to convert SSNs to names.

If we change the query to

Find all employees who have worked on a project in Stafford

we can do the following. Using the project table, we can get the project number from the location. Using the works_on table, we can find the employees (by essn) who have worked on that project. And using the employee table, we can convert to employee names:

select e.fname, e.lname
from project p, works_on w, employee e
where p.plocation = 'stafford' and p.pnumber = w.pno and w.essn = e.ssn;

Note that only data from table employee is actually part of the select statement.

Query 2: For every project located in Stafford, list the project number, the controlling department number, and the department manager's lname, address, bdate.

From the project table we can get the project location, number and department:

select p.pnumber, p.dnum from project p where p.plocation='stafford';

To get the department-manager information (all from one table) we needed two joins: first a join between the project and department tables (on dnum/dnumber) to get the department mgr_ssn, and then between department and employee (on mgr_ssn and ssn).

select p.pnumber, p.dnum, e.lname, e.address, e.bdate
from project p, department d, employee e
where p.dnum = d.dnumber and d.mgr_ssn = e.ssn and p.plocation = 'Stafford';

Note that no department fields are included in the query results. The works_on table is not involved.

Query 4: Here we combine both the above: we want all the projects in which Wong has participated, either as employee or as manager. For the employee case, we have a join between works_on and employee. For the manager, we have a join between projects and departments, to get the dept manager of the controlling department, and then with the employee table.

We did the same thing with the self-join example above: given an employee e, we need to look up e.mgr_ssn in the employee table.

Joins again

Whenever you list multiple tables in the from line, you have a join. Without a join condition between two tables, you will have N×M records in the result; this is seldom what you want. With more than two tables, you don't need a join between every pair; you just need a "chain" of joins uniting everything. In general, for N tables you need N−1 join conditions.

Join conditions in the where clause serve to join different tables; they can be identified by the format table1.attribute1 = table2.attribute2 (sometimes other relational operations than = are involved, but those ar rare). Every other condition in the where clause has to do with selecting a subset of the total number of rows, thus discarding some rows from consideration. It is possible to see join conditions this way, too, discarding rows from the cross product of the two tables. But this is not terribly helpful in practice.

All this is one reason some prefer the newer syntax "from table1 join table2 on join-condition".

In all the examples presented above, one of the join attributes is the primary key of the table it appears in. Here are most of those examples again, where for each join's equality condition the attribute that is the primary key of its table is in bold.

select e.lname, d.dname from employee e, department d
where e.dno = d.dnumber;

select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = 'Smith'

select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = 'Smith'

select d.dname, e.lname, e.fname, p.pname
from department d, employee e, works_on w, project p
where d.dnumber = e.dno and e.ssn = w.essn and w.pno = p.pnumber

select e.fname, e.lname , w.pno from employee e, works_on w
where e.ssn = w.essn and (w.pno = 2 or w.pno =3);

select e.fname, e.lname
from project p, works_on w, employee e
where p.plocation = 'stafford' and p.pnumber = w.pno and w.essn = e.ssn;

When this is done with two tables, the other table (with join attribute that is not a key) can be thought of as being extended. For each record in the other table, we use the join attribute to look up a single record from the first table (using the primary key).

Note that in most cases only one join-condition attribute is a primary key; the other though is a foreign key.

Are there any cases where neither of the join's attributes is a key of its relation? Mathematically this can certainly happen; in fact, a join condition does not even have to be based on an equality comparison. But coming up with a meaningful example is a different thing. Here's the best I could do. Suppose we want a list of all employees who live in the same town as some department's office. We might want such a list, for example, for emergency preparedness. We can match the employee city field with the dept_locations dlocation field; neither of these is a key. Actually, in the Company database we have to use the employee address field, and do substring matching, as follows:

select distinct e.ssn, e.lname, e.address from employee e, dept_locations dl
where e.address LIKE concat('%', dl.dlocation, '%');

I used distinct because an employee in Houston would otherwise be listed for both departments 1 and 5 (both of which have offices in Houston).