Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays

Week 3

Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.

Homework 1:

Postgres

In the 1970's, the Ingres database project began at Berkeley. Ingres originally stood for Interactive graphics retrieval system, though shortly thereafter the "graphics" part became much less important.

After Ingres (which does still exist) came Postgres, for "post-ingres".

Later, to emphasize its support for SQL, Postgres became PostgreSQL.

Note that it is virtually impossible to pronounce PostgreSQL using the "sequel" form for SQL; if you want a short form, use "Postgres".

A decade ago, MySQL was the "fast one" and Postgres was the one that accurately implemented the SQL standard. That's less true now, though MySQL is still missing intersection (and a few other things).

Postgres can be found at http://postgresql.org.

After you install the software, you still have to create appropriate accounts; the postgres account is the only one installed by default. Here's what I had to do:

1. Log in as user "postgres", and run the psql command. The "bash>" prompt below is the shell prompt (as user postgres); the "post=> " prompt is the psql SQL prompt.

    bash> sudo -u postgres psql
    post=> create user 'pld';   

2. From the shell, still as user postgres

    bash> createdb pld

Then, as user pld, the command "psql" works. If you created several databases above, eg "pldoffice", then the command to connect to postgres using the pldoffice database would be psql pldoffice. Note that I did not actually figure out how to give myself the right to create new databases.


Some postgres commands


Your psql prompt should show you the current database.

SHOW ALL -- shows status variables only

To list your tables:

select table_name from information_schema.tables where table_schema='public' and table_type='BASE TABLE';


Intersect example:

This now actually works, under postgres, but I made a mistake:

old:
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

What is the problem?

new:
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 3);

See below for another way to do this.


What joins do


Consider our earlier join example:

    select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

You can think of e and d as cursors, representing rows in the respective tables employee and department. Linear search through the employee table , as in the SQL example

    select e.lname, e.salary from employee e where e.salary >= 30000;

might be represented by the following java-like loop, showing a linear search through the table:

for (e in employee) {
    if (e.salary >= 30000) {
       print(e.lname, e.salary)
    }
}

Now let's write this for the join; here we get a nested loop and quadratic runtime (specifically, O(nm), where n=employee.size() and m=department.size())

for (e in employee) {
    for (d in department) {
       if (e.dno == d.dnumber) {print (e.fname, e.lname, d.dname);}
    }
}

Clearly, for triple joins over large databases, performance is an issue. This is addressed two ways: query optimization, to eliminate "obvious" inefficiencies, and internal indexes. The primary key defines a candidate for indexing, though not all indexing is on key fields.


Self-join

Suppose we want to print a list of all employees and their supervisors. Here's the approach we might take:

select e.fname, e.lname, s.fname, s.lname
from employee e, employee s
where e.super_ssn = s.ssn;

The idea here, in terms of cursors, is for e to traverse linearly the employee table. For each employee, we get e.super_ssn, and then use s to traverse the same table, looking for a row with s.ssn = e.super_ssn.

This particular example is much more readable if we relabel the supervisor columns:

select e.fname, e.lname, s.fname AS super_fname, s.lname AS super_lname
from employee e, employee s
where e.super_ssn = s.ssn;

Note that the AS here is not optional, and plays a rather different role than the AS that can be used in the from section (from employee AS e, employee AS s).



Outer joins

What if we want to list all employees and, if they have a qualification, that too? See the table qualifications:

+-----------+-----------+
| ssn       | qual      |
+-----------+-----------+
| 123456789 | bachelors |
| 333445555 | masters   |
| 987654321 | masters   |
| 666884444 | doctorate |
| 987987987 | ccie      |
| 888665555 | doctorate |
+-----------+-----------+

Standard join syntax:
    select e.lname, q.qual from employee e, qualifications q where e.ssn = q.ssn;

Alternative join syntax:
    select e.lname, q.qual from employee e join qualifications q on e.ssn = q.ssn;

"Left [outer] join" syntax
    select e.lname, q.qual from employee e left join qualifications q on e.ssn = q.ssn;

What is happening here? The point is that with the left join, then all rows of the left table are included, even when there is no match in the right table.

Let's add an extra line to the qualifications table:
    insert into qualifications values ('345456567', 'hacker');

Now let's do the left and right joins. What changed?

To delete the row,
    delete from qualifications where qual='hacker';


Null

As we have discussed, NULL can represent any of:
SQL supports the operators ____ IS NULL and ____ IS NOT NULL, for testing whether a value is null. However, the = and <> operators do not work. Similarly, an equality comparison between two null fields of two records (eg as part of a join) will fail. The reasoning is that null is not to be thought of as a specific value, but rather as meaning "unknown". If two people have unknown phone numbers (both NULL), that does not mean they have the same number!

Compare


Nested queries

Queries can be nested. When this is done, we use the term outer query for the enclosing query, and inner query (queries) for the enclosed query/queries.

Single-row nested queries

The simplest case is perhaps when only a single row and column is retrieved; that value can be used in a boolean or arithmetic expression. (We should really call these single-value queries.)

For example, we can find the ssn of the big boss (ie the one with no supervisor) with

select e.ssn from employee e where e.super_ssn is null;

Now let's find everyone who works for the boss:

select e.fname, e.lname from employee e where e.super_ssn =
    (select e.ssn from employee e where e.super_ssn is null);

Note that we reused the letter e above; the two uses (inner and outer) don't interact at all.

Exercise: how can we do this as a "traditional" query?

Now let's find everyone who makes at least half of what the boss makes:

select e.fname, e.lname from employee e where 2*e.salary >
    (select e.salary from employee e where e.super_ssn is null);

How about:

select e.fname, e.lname from employee e where e.salary >
    (select e.salary from employee e where e.super_ssn is null)/2;

For any given employee with ssn X, we can find the ssn of the employee's manager with
   
select e1.super_ssn from employee e1 where e1.ssn = X;

We can find the supervisor's salary with
   
select e1.salary from employee e1 where e1.ssn = X's super_ssn

Here's a query with two inner single-row queries. Note that Wong and Wallace are the two managers reporting directly to Mr Borg. My main point here is that the results of the inner selects can be used inside arithmetic expressions.

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.6*(
    (select e.salary from employee e where e.lname = 'Wong')
     +
     (select e.salary from employee e where e.lname = 'Wallace')
);


Now let's list all employees who make at least 70% of what their manager earns.

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.7* (select e1.salary from employee e1 where e1.ssn = e.super_ssn);

The query above is correlated: the inner query refers to the outer one. In the earlier examples, we could do query evaluation from the inside out; here we can not. Instead, our evaluation strategy turns out to be more like an inner loop.

Example of failure if multiple values are returned:

select e.fname, e.lname from employee e where e.salary >
    (select e.salary from employee e where e.lname LIKE 'W%');

Grouping functions

Sometimes the inner query returns multiple rows (usually numeric), but we reduce those rows to a single value using one of the aggregate functions (E&N §5.1.7):

count
sum
avg
max
min

For example, here is how we'd find all employees who make at least 25% of their department total:

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.25 *(select sum(e1.salary) from employee e1 where e1.dno = e.dno);

Notice we can not do:

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.25 * sum(select e1.salary from employee e1 where e1.dno = e.dno);

Here's how to do E&N 4.12(c):

For each section taught by Professor King, retrieve the course number, semester, year, and number of students who took the section

select course_number, semester, year,
    (select count(*) from grade_report g where g.section_identifier = s.section_identifier) AS NUM_STUDENTS
from section s where s.instructor='King';

Note the use of AS here; it is very different from the (usually omitted) use of AS in the from section.

You can do this query a few other ways, too; one is

select s.course_number, s.semester, s.year, COUNT(*) as NUM_STUDENTS
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';

Note carefully what is being COUNTed. Nominally, no value from table g appears in the select portion!

What does this do:

select s.course_number, s.semester, s.year
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';

multi-value nested queries

In these queries, the inner query returns (or potentially returns) a set of rows, or rows with multiple columns. In this case, the following operations are allowed:
Suppose we want to find the names of employees who have worked on project 10. We can do this as a traditional join, but we can also do:

select e.fname, e.lname from employee e where e.ssn in
    (select w.essn from works_on w where w.pno = 10);


E&N Q4A, p 118: List all project numbers for pojects that involve an employee whose lastname is 'Smith', either as worker or manager. We did this before with a standard join; here is a nested version:
select distinct p.pnumber
from project p
where p.pnumber IN
(select p1.pnumber from project p1, department d, employee e
where p1.dnum = d.dnumber and d.mgr_ssn = e.ssn and e.lname = 'Smith')
or
p.pnumber IN
(select w.pno from works_on w, employee e
where w.essn = e.ssn and e.lname = 'Smith');

This one is not correlated.

Tuple example, middle of p 118. Just what question does this answer?

select distinct w.essn
from works_on w
where (w.pno, w.hours) IN
(select w1.pno, w1.hours from works_on w1 where w1.essn = '123456789');


Query 16: an example of a nested correlated query:

select e.fname, e.lname from employee e
where e.ssn in
(select d.essn from dependent d where e.fname = d.dependent_name and e.sex = d.sex);

Note that queries using = or in to connect the inner and outer queries can always be done without nesting, using an additional join. Is this more or less efficient?

Adding a row to make this nonempty:
insert into dependent values ('123456789', 'John', 'M', '1990-02-24', 'son');
delete from dependent where bdate='1990-02-24';