Database week 3

Comp 353/453: Database Programming, Corboy 523, 7:00 Thursdays

Week 3, Feb 2

Read in Elmasri & Navathe (EN)

Chapter 3, The Relational Data Model ...., section 3: Update Operations
Chapter 4: Basic SQL
Chapter 5: 5.1.1, 5.1.2, 5.1.6, 5.1.7, 5.1.8

Homework 1: be wary of unicode characters in your SQL; unicode ' is not the same as '.

Also, 4.12(c) cannot be done using the methods of Chapter 4.

MySQL under linux

The linux filesystem is case sensitive. If you want table names to be case-insensitive, edit /etc/mysql/my.cnf to add the following in the mysqld section:

lower_case_table_names=1

Then restart your mysql server (eg with /etc/init.d/mysql restart).

What joins do

Consider our earlier join example:

select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

You can think of e and d as cursors, representing rows in the respective tables employee and department. Linear search through the employee table , as in the SQL example

select e.lname, e.salary from employee e where e.salary >= 30000;

might be represented by the following java-like loop, showing a linear search through the table:

for (e in employee) {
    if (e.salary >= 30000) {
       print(e.lname, e.salary)
    }
}

Now let's write this for the join; here we get a nested loop and quadratic runtime (specifically, O(nm), where n=employee.size() and m=department.size())

for (e in employee) {
    for (d in department) {
       if (e.dno == d.dnumber) {print (e.fname, e.lname, d.dname);}
    }
}

Clearly, for triple joins over large databases, performance is an issue. This is addressed two ways: query optimization, to eliminate "obvious" inefficiencies, and internal indexes. The primary key defines a candidate for indexing, though not all indexing is on key fields.

Self-join

Suppose we want to print a list of all employees and their supervisors. Here's the approach we might take:

select e.fname, e.lname, s.fname, s.lname
from employee e, employee s
where e.super_ssn = s.ssn;

The idea here, in terms of cursors, is for e to traverse linearly the employee table. For each employee, we get e.super_ssn, and then use s to traverse the same table, looking for a row with s.ssn = e.super_ssn.

This particular example is much more readable if we relabel the supervisor columns:

select e.fname, e.lname, s.fname AS super_fname, s.lname AS super_lname
from employee e, employee s
where e.super_ssn = s.ssn;

Note that the AS here is not optional, and plays a rather different role than the AS that can be used in the from section (from employee AS e, employee AS s).

WHERE conditions

In many cases, the where conditions are either join conditions or are simple Boolean expressions about a single row. Sometimes, however, the question being aske relates to other rows, as in "List employees who worked on more than one project"; if we form the join table of employees and their projects we still cannot answer this row-by-row. Another query might be to list employees who are part of some set of employees whose salaries add up to exactly 100,000.

The answer last week to the employees-who-worked-on-multiple-projects query was a "triple" join with inequality as one of the join conditions:

select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn = w1.essn and e.ssn = w2.essn and w1.pno <> w2.pno;

Most joins (but not this one!) are equijoins; that is, joins involving an equality relationship. Inequalities can be harder to understand. Recall that the above version introduces many duplicates, one for each way of choosing two unequal projects. "Select distinct" is better.

Actually, this particular example is easier done using a subquery and the count operator (which we'll cover below):

select e.fname, e.lname from employee e
where (select count(w.pno) from works_on w where w.essn = e.ssn) > 1

In general, WHERE conditions that just select rows of a table satisfying a given Boolean condition are fundamentally trivial. Joins in WHERE conditions are a little less trivial, and WHERE conditions that select a record (or part of a record) depending on other records in the table tend to be the hardest. Example: list all students who got straight A's; this lists the student_id values from grade_report where all of the matching rows have an A in the grade column. This last category often involves grouping, or subqueries, or some other advanced feature.

Why is SQL hard?

Basically, it's hard because it is a non-imperative language: you don't issue a sequence of commands. Instead, you write a single big expression. It's hard to visualize big expressions, and hard to test them incrementally. Nested queries do seem to help in this regard.

(Actually, the data types of SQL are tables, and you can in fact save a table in a "table variable". But this is rare.)

Perhaps because of this non-imperative nature of the language, it is relatively easy to read several SQL examples and still have trouble knowing where to start when trying to write one from scratch.

Examples from week 2:

Find all employees who have worked on project 2:

This one involves a single join between employees (to get the names) and the works_on table. From the works_on table we can easily get the SSNs of the employees who worked on project 2; we need to join with table employee to convert SSNs to names.

If we change the query to

Find all employees who have worked on a project in Stafford

we can do the following. Using the project table, we can get the project number from the location. Using the works_on table, we can find the employees (by essn) who have worked on that project. And using the employee table, we can convert to employee names:

select e.fname, e.lname
from project p, works_on w, employee e
where p.plocation = 'stafford' and p.pnumber = w.pno and w.essn = e.ssn;

Note that only data from table employee is actually part of the select statement.

Query 2: For every project located in Stafford, list the project number, the controlling department number, and the department manager's lname, address, bdate.

From the project table we can get the project location, number and department:
select p.pnumber, p.dnum from project p where p.plocation='stafford';
To get the department-manager information (all from one table) we needed two joins: first a join between the project and department tables (on dnum/dnumber) to get the department mgr_ssn, and then between department and employee (on mgr_ssn and ssn).

select p.pnumber, p.dnum, e.lname, e.address, e.bdate
from project p, department d, employee e
where p.dnum = d.dnumber and d.mgr_ssn = e.ssn and p.plocation = 'Stafford';

Note that no department fields are included in the query results. The works_on table is not involved.

Query 4: Here we combine both the above: we want all the projects in which Wong has participated, either as employee or as manager. For the employee case, we have a join between works_on and employee. For the manager, we have a join between projects and departments, to get the dept manager of the controlling department, and then with the employee table.

We did the same thing with the self-join example above: given an employee e, we need to look up e.mgr_ssn in the employee table.

Joins again

Whenever you list multiple tables in the from line, you have a join. Without a join condition between two tables, you will have N×M records in the result; this is seldom what you want. With more than two tables, you don't need a join between every pair; you just need a "chain" of joins uniting everything. In general, for N tables you need N−1 join conditions.

Join conditions in the where clause serve to join different tables; they can be identified by the format table1.attribute1 = table2.attribute2 (sometimes other relational operations than = are involved, but those ar rare). Every other condition in the where clause has to do with selecting a subset of the total number of rows, thus discarding some rows from consideration. It is possible to see join conditions this way, too, discarding rows from the cross product of the two tables. But this is not terribly helpful in practice.

All this is one reason some prefer the newer "from table1 join table2 on join-condition".

Postgres

In the 1970's, the Ingres database project began at Berkeley. Ingres originally stood for Interactive graphics retrieval system, though shortly thereafter the "graphics" part became much less important.

After Ingres (which does still exist) came Postgres, for "post-ingres".

Later, to emphasize its support for SQL, Postgres became PostgreSQL.

Note that it is virtually impossible to pronounce PostgreSQL using the "sequel" form for SQL; if you want a short form, use "Postgres".

A decade ago, MySQL was the "fast one" and Postgres was the one that accurately implemented the SQL standard. That's less true now, though MySQL is still missing intersection (and a few other things).

Postgres can be found at http://postgresql.org.

After you install the software, you still have to create appropriate accounts; the postgres account is the only one installed by default. Here's what I had to do:

1. Log in as user "postgres", and run the psql command. The "bash>" prompt below is the shell prompt (as user postgres); the "post=> " prompt is the psql SQL prompt.

    bash> sudo -u postgres psql
    post=> create user 'pld';

2. From the shell, still as user postgres

    bash> createdb pld

Then, as user pld, the command "psql" works. If you created several databases above, eg "pldoffice", then the command to connect to postgres using the pldoffice database would be psql pldoffice. Note that I did not actually figure out how to give myself the right to create new databases.

Some postgres commands

Your psql prompt should show you the current database.

SHOW ALL -- shows status variables only

\d same as mysql show tables;

\d employee same as mysql describe employee;

Here's another way to list your tables:

select table_name from information_schema.tables where table_schema='public' and table_type='BASE TABLE';

Intersect example:

This now actually works, under postgres, but I made a mistake:

old:

(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

What is the problem?

new:

(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

See below for another way to do this.

I made the remark that this would have been trivial to implement in MySQL. That's only partly true. It is trivial to implement the intersect operator in order to get the correct resuts. It is nontrivial to implement it in a way that does reasonable-quality query optimization.

Outer joins

What if we want to list all employees and, if they have a qualification, that too? See the table qualification:

+-----------+-----------+
| ssn       | qual      |
+-----------+-----------+
| 123456789 | bachelors |
| 333445555 | masters   |
| 987654321 | masters   |
| 666884444 | doctorate |
| 987987987 | ccie      |
| 888665555 | doctorate |
+-----------+-----------+

This is created with

    create table qualification (
       ssn       char(9) primary key not null,
       qual     varchar(20) not null,
       foreign key (ssn) references employee(ssn)
    );

Standard join syntax:
    select e.lname, q.qual from employee e, qualification q where e.ssn = q.ssn;

Alternative join syntax:
    select e.lname, q.qual from employee e join qualification q on e.ssn = q.ssn;

"Left [outer] join" syntax
    select e.lname, q.qual from employee e left join qualification q on e.ssn = q.ssn;

What is happening here? The point is that with the left join, then all rows of the left table are included, even when there is no match in the right table.

Let's add an extra line to the qualification table:
    insert into qualification values ('345456567', 'hacker');       // no such existing employee!

Now let's do the left and right joins (replace left with right, or reverse the order).

select e.lname, q.qual from employee e right join qualification q on e.ssn = q.ssn;
select e.lname, q.qual from qualification q left join employee e on e.ssn = q.ssn;

What changed?

Finally, note that we can do this to make the output prettier:

select e.lname, case when q.qual is null then '' else q.qual end as emp_qual
from employee e left join qualification q on e.ssn = q.ssn;

What's going on here is that we're using an expression in the select part, rather than a field name, and using the
case when boolean_expression then expression1 else expression2 end
as our expression. We're also giving the column a name, "emp_qual".

While expressions in select columns are often important, note that some kinds of output formatting are better done in the front end rather than in SQL itself.

To delete the row,
delete from qualification where qual='hacker';

Null

As we have discussed, NULL can represent any of:

A value we just do not know at the moment
A value that is more-or-less-permanently unavailable (eg NULL for a middle initial)
A value that is not applicable, given some other attributes of the row

SQL supports the operators ____ IS NULL and ____ IS NOT NULL, for testing whether a value is null. However, the = and <> operators do not work. Similarly, an equality comparison between two null fields of two records (eg as part of a join) will fail. The reasoning is that null is not to be thought of as a specific value, but rather as meaning "unknown". If two people have unknown phone numbers (both NULL), that does not mean they have the same number!

Compare

select lname from employee where super_ssn is null;
select lname from employee where super_ssn is not null;
select lname from employee where super_ssn = null;
select lname from employee where not super_ssn = null;
select lname from employee where super_ssn <> null;

The latter returns no records, even if you change it to "super_ssn <> null" or "not (super_ssn = null)". This is because any comparison involving NULL behaves as if it returns a third truth value, "unknown". This "unknown" value propagates upwards in any boolean expression it is part of, according to appropriate truth-table rules; in particular, "true and unknown" is "unknown", and "not unknown" is "unknown". The only times the end result is not also "unknown" are "true or unknown", which evaluates to "true", and "false and unknown", which evaluates to "false". Consider

select * from employee where dno = 5 or super_ssn = null;

This returns the employees for which dno=5; in this case the Boolean expression becomes "true or unknown", which is "true".

Nested queries

Queries can be nested; this is how we will construct complex queries where inclusion of a record in the result depends on other records in the database.

When queries are nested, we use the term outer query for the enclosing query, and inner query (queries) for the enclosed query/queries.

Single-row nested queries

The simplest case is perhaps when only a single row and column is retrieved; that value can be used in a boolean or arithmetic expression. (We should really call these single-value queries.)

For example, we can find the ssn of the big boss (ie the one with no supervisor) with

select e.ssn from employee e where e.super_ssn is null;

Now let's find everyone who works for the boss:

select e.fname, e.lname from employee e where e.super_ssn =
(select e.ssn from employee e where e.super_ssn is null);

Note that we reused the letter e above; the two uses (inner and outer) don't interact at all.

Exercise: how can we do this as a "traditional" query?

Now let's find everyone who makes at least half of what the boss makes:

select e.fname, e.lname from employee e where 2*e.salary >
(select e.salary from employee e where e.super_ssn is null);

How about:

select e.fname, e.lname from employee e where e.salary >
(select e.salary from employee e where e.super_ssn is null)/2;

For any given employee with ssn X, we can find the ssn of the employee's manager with

select e1.super_ssn from employee e1 where e1.ssn = X;

We can find the supervisor's salary with

select e1.salary from employee e1 where e1.ssn = X's super_ssn

Here's a query with two inner single-row queries. Note that Wong and Wallace are the two managers reporting directly to Mr Borg. My main point here is that the results of the inner selects can be used inside arithmetic expressions.

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.6*(
    (select e.salary from employee e where e.lname = 'Wong')
     +
     (select e.salary from employee e where e.lname = 'Wallace')
);

Now let's list all employees who make at least 70% of what their manager earns.

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.7* (select e1.salary from employee e1 where e1.ssn = e.super_ssn);

The query above is correlated: the inner query refers to the outer one. In the earlier examples, we could do query evaluation from the inside out; here we can not. Instead, our evaluation strategy turns out to be more like an inner loop.

Example of failure if multiple values are returned:

select e.fname, e.lname from employee e where e.salary >
(select e.salary from employee e where e.lname LIKE 'W%');

Finally, we can use the COUNT() function. First, an example of its standalone use, to get the number of employees in department 5; the count(*) means to count all records returned.

select count(*) from employee e where e.dno = 5;

Next, here's an example where the inner query contains count(). This lists all departments with at least 3 members:

select d.dname from department d
where (select count(*) from employee e where e.dno = d.dnumber) >=3;

Finally, here's an example where we use the count in the select part, to list all departments and their number of employees:

select d.dname, (select count(*) from employee e where e.dno = d.dnumber) AS dept_size
from department d;

Note the use of AS here, to name the computed column; it is very different from the (usually omitted) use of AS in the from section.

This is not the preferred way to use count(), though; more on the COUNT() function follows.

Grouping functions

Sometimes the inner query returns multiple rows (usually numeric), but we reduce those rows to a single value using one of the aggregate functions (E&N §5.1.7):

count
sum
avg
max
min

For example, here is how we'd find all employees who make at least 25% of their department total:

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.25 *(select sum(e1.salary) from employee e1 where e1.dno = e.dno);

Notice we can not do:

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.25 * sum(select e1.salary from employee e1 where e1.dno = e.dno);

Let's take a look at E&N 4.12(c):

For each section taught by Professor King, retrieve the course number, semester, year, and number of students who took the section

The easiest way to do this is to include in the select part of the outer query a computed column of the form

(select count(*) from grade_report g where g.section_identifier = s.section_identifier) AS NUM_STUDENTS

You can do this query a few other ways, too. First, note the following peculiar query:

select s.course_number, s.semester, s.year
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';

It is peculiar because the involvement of grade_report is unclear; all it seems to do (all it does do) is to generate a separate copy of the fields ⟨s.course_number,s.semester,s.year⟩ for each separate student (each record in grade_report for that particuar section_identifier).

One way to use COUNT(*) which is technically wrong but which works in MySQL is:

// wrong
select s.course_number, s.semester, s.year, COUNT(*) as NUM_STUDENTS
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';

The right way to do this is to use the GROUP BY option. We have multiple records for each ⟨s.course_number,s.semester,s.year⟩; we group these common fields together (by s.section_identifier) for counting:

select s.course_number, COUNT(*) as NUM_STUDENTS
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King'
GROUP BY s.course_number;

We'll look into the GROUP BY in more detail later; for now, note that if aggregation is used in some of the select columns, then all the non-aggregated ("ordinary") columns must be included in the GROUP BY clause.