Database week 4

Comp 305/488: Database Administration, Corboy 208, 4:15 Tuesdays

Week 4

Read in Elmasri & Navathe (EN)

Chapter 3, The Relational Data Model ...., Sections 1 & 2
Chapter 4, all, especially 4.3 on queries
Chapter 5, sections 5.1 (advanced SQL), 5.3, (views), 5.4 (schema change)

Section 5.2 is on constraints; you may omit that for now.

Homework 2: The following exercises from E&N (full text available on Sakai), due Friday Feb 13:

4.12c
5.5
5.6
5.7
5.8

Demo of mysql -h sql1.cs.luc.edu -u pld -p

crypto

Outer joins

Suppose we introduce the table qualification, holding a (single) qualification for each employee:

+-----------+-----------+
| ssn       | qual      |
+-----------+-----------+
| 123456789 | bachelors |
| 333445555 | masters   |
| 987654321 | masters   |
| 666884444 | doctorate |
| 987987987 | ccie      |
| 888665555 | doctorate |
+-----------+-----------+

This is created with

    create table qualification (
       ssn       char(9) primary key not null,
       qual     varchar(20) not null,
       foreign key (ssn) references employee(ssn)
    );

What if we want to list all employees and, if they have a qualification, that too? Here is the standard join syntax:

    select e.lname, q.qual from employee e, qualification q where e.ssn = q.ssn;

Alternative join syntax:

select e.lname, q.qual from employee e join qualification q on e.ssn = q.ssn;

But employees with no qualification are not listed at all! How can we get everyone listed, whether or not they have a qualification?

"Left [outer] join" syntax

select e.lname, q.qual from employee e left join qualification q on e.ssn = q.ssn;

What is happening here? The point is that with the left join, then all rows of the left table are included, even when there is no match in the right table.

Let's add an extra line to the qualification table:

insert into qualification values ('345456567', 'hacker'); // no such existing employee!

Now let's do the left and right joins (replace left with right, or reverse the order).

select e.lname, q.qual from employee e right join qualification q on e.ssn = q.ssn;
select e.lname, q.qual from qualification q left join employee e on e.ssn = q.ssn;

What changed?

To delete the row,

delete from qualification where qual='hacker';

We can do this to make the output prettier:

select e.lname, case when q.qual is null then '' else q.qual end as emp_qual
from employee e left join qualification q on e.ssn = q.ssn;

What's going on here is that we're using an expression in the select part, rather than a field name, and using the

case when boolean_expression then expression1 else expression2 end

as our expression. We're also giving the column a name, "emp_qual".

While expressions in select columns are often important, note that some kinds of output formatting are better done in the front end rather than in SQL itself.

Here is an alternative use of select expressions that does not require outer joins at all.

select e.fname, e.lname, (select q.qual from qualification q where e.ssn = q.ssn)
from employee e;

We might be happier giving a name to the third column:

select e.fname, e.lname, (select q.qual from qualification q where e.ssn = q.ssn) AS qual
from employee e;

Note that this does not work:

select e.lname, q.qual from EMPLOYEE e, QUALIFICATION q
where (e.ssn = q.ssn) OR (not exists (select q.qual from QUALIFICATION q where e.ssn=q.ssn));

What goes wrong? Look at the output. If someone has no qualifications, the second half of the OR is always true. So that employee appears with all qualifications.

Finally, the following query may at first seem strange:

select e.lname, q.qual from employee e left join qualification q on e.ssn = q.ssn where q.ssn is null;

How can q.ssn ever be null? But the equijoin condition e.ssn = q.ssn does not always hold; it holds only when there actually is a matching QUALIFICATION record. When there is no q to match e, the record still enters the join, but this time with all nulls in the QUALIFICATION fields.

Null

As we have discussed, NULL can represent any of:

A value we just do not know at the moment (eg an unknown middle initial or phone number)
A value that is more-or-less-permanently unavailable (eg NULL for a middle initial of someone with no middle name)
A value that is not applicable, perhaps given some other attributes of the row (Borg's super_ssn might be an example,

SQL supports the operators ____ IS NULL and ____ IS NOT NULL, for testing whether a value is null. However, the = and <> operators do not work. Similarly, an equality comparison between two null fields of two records (eg as part of a join) will fail. The reasoning is that null is not to be thought of as a specific value, but rather as meaning "unknown". If two people have unknown phone numbers (both NULL), that does not mean they have the same number!

Compare

select lname from employee where super_ssn is null;
select lname from employee where super_ssn is not null;
select lname from employee where super_ssn = null;
select lname from employee where not super_ssn = null;
select lname from employee where super_ssn <> null;

The last three return no records! This is because any comparison involving NULL behaves as if it returns a third truth value, "unknown". This "unknown" value propagates upwards in any boolean expression it is part of, according to appropriate truth-table rules; in particular, "true AND unknown" is "unknown", and "NOT unknown" is "unknown". The only times the end result is not also "unknown" are "true OR unknown", which evaluates to "true", and "false AND unknown", which evaluates to "false". Consider

select * from employee where dno = 5 or super_ssn = null; -- WRONG

This returns all employees for which dno=5; in this case the Boolean expression becomes "true or unknown", which is "true". Note that the correct formulation of the query above is

select * from employee where dno = 5 or super_ssn is null;

Nested queries

Queries can be nested; this is how we can construct complex queries where inclusion of a record in the result depends on other records in the database. Be aware, however, that nested queries are often relatively inefficient; the nesting structure may enforce a quadratic runtime speed. Often a query that has a straightforward solution involving nesting also has a non-nested solution involving joins; the latter may be much more efficient.

When queries are nested, we use the term outer query for the enclosing query, and inner query (queries) for the enclosed query/queries.

Single-row nested queries

The simplest case is perhaps when only a single row and column is retrieved. SQL automatically converts a query result that is a 1×1 table to the value that 1×1 table contains; this value can be used in a Boolean or arithmetic expression. We should really call these single-value queries.

Note that the number of columns in a query result is determined syntactically, by the form of the select clause. The number of rows, however, usually depends on the data.

As an example, we can find the ssn of the big boss (ie the one with no supervisor) with

select e.ssn from employee e where e.super_ssn is null;

Now let's find everyone who works for the boss:

select e.fname, e.lname from employee e where e.super_ssn =
(select e.ssn from employee e where e.super_ssn is null);

Note that we reused the letter e above; the two uses (inner and outer) don't interact at all. They are like in Java a global variable e and a local variable e.

How can we do this as a "traditional" query? We want employees e where there is a supervising employee s with a null super_ssn. We can do this as a self-join, just like the table of employees and their supervisors:

select e.fname, e.lname from employee e, employee s
where e.super_ssn = s.ssn and s.super_ssn is null;

Which is easier?

Now let's find everyone who makes at least half of what the boss makes:

select e.fname, e.lname from employee e where 2*e.salary >
(select e.salary from employee e where e.super_ssn is null);

How about:

select e.fname, e.lname from employee e where e.salary >
(select e.salary from employee e where e.super_ssn is null)/2;

Hopefully, the DBMS can realize in these cases that the inner query needs to be executed just once.

For any given employee with ssn X, we can find the ssn of the employee's manager with

select e1.super_ssn from employee e1 where e1.ssn = X;

We can find the supervisor's salary with

select e1.salary from employee e1 where e1.ssn = X's super_ssn

Here is a query with two inner single-row queries. Note that Wong and Wallace are the two managers reporting directly to Mr Borg. My main point here is that the results of the inner selects can be used inside arithmetic expressions.

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.6*(
    (select e.salary from employee e where e.lname = 'Wong')
     +
     (select e.salary from employee e where e.lname = 'Wallace')
);

Now let's list all employees who make at least 70% of what their own manager earns.

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.7* (select e1.salary from employee e1 where e1.ssn = e.super_ssn);

The query above is correlated: the inner query refers to the outer one. In the earlier examples, we could do query evaluation from the inside out; here we can not. Instead, our evaluation strategy turns out to be more like an inner loop.

Example of failure if multiple values are returned:

select e.fname, e.lname from employee e where e.salary >
(select e.salary from employee e where e.lname LIKE 'W%');

Finally, we can use the COUNT() function. First, here is an example of its standalone use, to get the number of employees in department 5; the count(*) means to count all records returned.

select count(*) from employee e where e.dno = 5;

Next, here is an example where the inner query contains count(). This lists all departments with at least 3 members:

select d.dname from department d
where (select count(*) from employee e where e.dno = d.dnumber) >=3;

Finally, here's an example where we use the count in the select part as a select expression, to list all departments and their number of employees:

select d.dname, (select count(*) from employee e where e.dno = d.dnumber) AS dept_size
from department d;

Note the use of AS here, to name the computed column; it is very different from the (usually omitted) use of AS in the from section.

This is not the most common way to use count(), though; more on the COUNT() function follows.

What we can not do with single-value queries is to save the value in a numeric variable and then use it in other expressions.

Grouping functions

Sometimes the inner query returns multiple rows (usually numeric), but we reduce those rows to a single value using one of the aggregate functions (E&N §5.1.7):

count
sum
avg
max
min

For example, here is how we would find all employees who make at least 25% of their department total:

select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.25 *(select sum(e1.salary) from employee e1 where e1.dno = e.dno);

Notice we can not do the following. Aggregation functions can only be used within the select clause, not on the query result as a whole.

select e.fname, e.lname, e.salary from employee e -- WRONG!
where e.salary >= 0.25 * sum(select e1.salary from employee e1 where e1.dno = e.dno);

Let's take a look at E&N 4.12(c) (from the homework)

For each section taught by Professor King, retrieve the course number, semester, year, and number of students who took the section

The easiest way to do this is to use nested queries, where you include in the select part of the outer query a computed inner-query column of the form

(select count(*) from grade_report g where g.section_identifier = s.section_identifier) AS NUM_STUDENTS

Note that in the query above, the table variable s is a free variable, to be supplied by context in the outer query.

You can do this query a few other ways, too. First, note the following peculiar query:

select s.course_number, s.semester, s.year
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';

It is peculiar because the involvement of grade_report is unclear; all it seems to do (all it does do) is to generate a separate copy of the fields ⟨s.course_number,s.semester,s.year⟩ for each separate student (each record in grade_report for that particular section_identifier).

One way to use COUNT(*) which is technically wrong but which works in MySQL is:

// wrong
select s.course_number, s.semester, s.year, COUNT(*) as NUM_STUDENTS
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';

The right way to do this is to use the GROUP BY option. We have multiple records for each ⟨s.course_number,s.semester,s.year⟩; we group these common fields together (by s.section_identifier) for counting:

select s.course_number, COUNT(*) as NUM_STUDENTS
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King'
GROUP BY s.course_number;

We'll look into the GROUP BY in more detail later; for now, note that if aggregation is used in some of the select columns, then all the non-aggregated ("ordinary") columns must be included in the GROUP BY clause.

GROUP BY

Suppose we want to count the employees in each department. We could run the following (sorting if preferred) and count each department by hand:

select e.dno from EMPLOYEE e;

But we really want SQL to do this; that is, we want a table of⟨dept_no, count⟩. Here's how:

select e.dno, count(*)
from EMPLOYEE e
group by dno;

What is going on here is that we select the e.dno records, and then group them by dno, and then select a single row for each group. The selection values can either be conventional attributes that are known to be constant for the group, or else one of the aggregate functions count, sum, max,min, avg applied to a non-constant attribute. For count(), we can use the form count(*) to count all records; for the others, we must apply them to specific numeric attributes.

Conventional attributes are known to be constant for the group if they appear in the group by clause. In other words, every non-aggregate attribute in the select clause must also be in the group by clause. Usually all attributes in the group by clause are used in the select as well.

Here's the department information with average salaries added (and also the birthdate of the oldest employee) (Query 24)

select e.dno, count(*), avg(e.salary), min(bdate)
from EMPLOYEE e
group by dno;

What is going on conceptually is that, after the rows are selected(there is no where clause in the example above, so this row-selection is trivial), the remaining rows are then partitioned according to the group by condition. Then, the select results are given for each group: they will be either constant for the group, or else an aggregate function that will now be computed for the group. Recall again that any non-aggregate entry in the select clause must be listed in the group by clause; this is what guarantees that non-aggregate entries are constant for each group.

If we want to add department names to the above we can do a simplejoin. Note that we must (except in MySQL) add dname to the group by clause.

select d.dname, e.dno, count(*), avg(e.salary), min(bdate)
from EMPLOYEE e, DEPARTMENT d
where e.dno = d.dnumber
group by dno, d.dname;

Here's another example: Query 25: for each project, give the project number, the project name, and the number of employees who worked on it. Note that it is usually appropriate to add an as title for count(*) columns to identify what is being counted.

select p.pnumber, p.pname, count(*) as employees
from PROJECT p, WORKS_ON w
where p.pnumber = w.pno
group by p.pnumber, p.pname

The output is:

pnumber	pname	employees
1	ProductX	2
2	ProductY	4
3	ProductZ	2
10	Computerization	3
20	Reorganization	3
30	Newbenefits	3

Demo: this fails if we leave either attribute out of the group by clause (at least in postgres). (Note that, nonetheless, it is clear that if we leave out p.name then this attribute is still constant for p.pnumber groups;p.pnumber is after all the key for the PROJECT table.)

Also note that this might be a candidate for leaving p.pnumber off the select clause, if p.pname is sufficient.

Here's a tabular representation of the full join table here, with all attributes from works_on; it represents Fig 5.1(b) of E&N. The group by option means we take each group, identified in the virtual final column; count(*) refers to the number of rows in that group.

p.pname	p.pnumber	w.essn	w.pno	w.hours
ProductX	1	123456789	1	32.5	group 1
ProductX	1	453453453	1	20	group 1
ProductY	2	453453453	2	20	group 2
ProductY	2	333445555	2	10
ProductY	2	987654321	2	7.5
ProductY	2	123456789	2	7.5
ProductZ	3	333445555	3	10	group 3
ProductZ	3	666884444	3	40	group 3
Computerization	10	987987987	10	35	group 10
Computerization	10	333445555	10	10
Computerization	10	999887777	10	10
Reorganization	20	333445555	20	10	group 20
Reorganization	20	987654321	20	15
Reorganization	20	888665555	20	0
Newbenefits	30	999887777	30	30	group 30
Newbenefits	30	987987987	30	5
Newbenefits	30	987654321	30	20

All the aggregation functions (count, sum, min, max, avg) can be applied in select clauses to ungrouped queries, but then if any aggregation functions appear, then everything in the select clause must be an aggregation function. What we get are various "statistics" for the entire table.Example:

select count(*), min(e.bdate), sum(e.salary), avg(e.dno) from employee e;

This is a relatively limited situation; it is more common for the aggregation functions to be combined with group by.

Remember E&N 4.12(c)?

For each section taught by Professor King, retrieve the course number, semester, year, and number ofstudents who took the section

Here is a solution that works in MySQL, but is not standard SQL. Oracle and postgres do not allow this:

select s.course_number, s.semester, s.year, COUNT(*) as NUM_STUDENTS
from SECTION s, GRADE_REPORT g
where g.section_identifier = s.section_identifier and s.instructor='King';

The COUNT(*) counts all records, and when it is present, everything else in the select clause should be "constant" (either there should be nothing else, or else the query should use group by for all the non-aggregated attributes.

Here's a set of examples that use count(*) appropriately (and work under postgres):

select student_number from GRADE_REPORT; -- no aggregation
select count(*) from GRADE_REPORT; -- no non-aggregated attributes
select count(student_number) from GRADE_REPORT;
select count(distinct student_number) from GRADE_REPORT;
select count(*) from (select distinct student_number, section_identifier from grade_report) as dummy;

Here are some examples that are wrong, though, as usual, some work in MySQL

select student_number, count(*) from GRADE_REPORT; -- needs GROUP BY
select count(distinct student_number, section_identifier) from GRADE_REPORT
select count(distinct *) from GRADE_REPORT; /* fails in mysql*/

select count(distinct ____) ... requires a single column attribute!

Multi-value nested queries

In these queries, the inner query returns (or potentially returns) a set of rows, or rows with multiple columns. In this case, the following operations are allowed:

IN
EXISTS / NOT EXISTS
UNIQUE
= ANY / = SOME
op ALL (op is typically a comparison like > or >=)

Suppose we want to find the names of employees who have worked on project 10. We can do this as a traditional join, but we can also do:

select e.fname, e.lname from employee e where e.ssn in
(select w.essn from works_on w where w.pno = 10);

E&N Q4A, p 118: List all project numbers for projects that involve an employee whose last name is 'Smith', either as worker or manager.

We did this before with a standard join; here is a nested version:

select distinct p.pnumber
from project p
where p.pnumber IN

(select p1.pnumber from project p1, department d, employee e
where p1.dnum = d.dnumber and d.mgr_ssn = e.ssn and e.lname = 'Smith')

or
p.pnumber IN

(select w.pno from works_on w, employee e
where w.essn = e.ssn and e.lname = 'Smith');

This one is not correlated; the inner queries can be evaluated stand-alone.

In the middle of p 118, E&N6 states

SQL allows the use of tuples of values in comparisons by placing them within parentheses. To illustrate this, consider the following query [which I edited slightly -- pld]

select distinct w.essn
from works_on w
where (w.pno, w.hours) IN
(select w1.pno, w1.hours from works_on w1 where w1.essn = '123456789');

Just what question does this answer?

Query 16: an example of a nested correlated query:
Retrieve the name of each employee who has a dependent with the same first name and is the same sex as the employee:

Here is E&N6's solution:

select e.fname, e.lname from employee e
where e.ssn in
(select d.essn from dependent d where e.fname = d.dependent_name and e.sex = d.sex);

Note that, in the inner query, e is a "free variable", referring to the employee e declared in the outer query.

Let's add a row to make this nonempty:

insert into dependent values ('123456789', 'John', 'M', '1990-02-24', 'son');
delete from dependent where bdate='1990-02-24'; -- to delete the record above

Note that this query only returns rows where e.ssn = d.essn, although this is not stated as an equality check (if we don't have e.ssn = d.essn, then e.ssn will not be in the set of d.essn). But note that if we add the above son John to another employee, '987654321', then the inner query for employee John Smith, '123456789', would amount to

select d.essn from dependent d where 'John' = d.dependent_name and 'M' = d.sex;

and would return '987654321'. But the in clause would then not match (which is the correct behavior).

It can be confusing for an inner query to return such "irrelevant" results, even if the end result is correct.

Another, possibly better, way to write this query might be:

select e.fname, e.lname from employee e join dependent d on e.ssn = d.essn
where e.fname = d.dependent_name and e.sex = d.sex;

Note that queries using = or in to connect the inner and outer queries can always be done without nesting, using an additional join. Is this more or less efficient?

Nested queries with Exists and Unique

Here's the above example again of a query to find all employees with a dependent of the same name and sex, this time done with exists instead of in:

select e.fname, e.lname from EMPLOYEE e
where exists
(select * from DEPENDENT d where e.ssn = d.essn and e.fname = d.dependent_name and e.sex = d.sex);

(Note here that the inner query will return no records if the result is empty, unlike the in version.)

What about the names of employees with no dependents (Query 6 of E&N)? We can do a query with COUNT(*) to check if an employee e qualifies:

select count(*) from DEPENDENT d where d.essn = e.ssn; /* e is a "free variable" here */

The full query is:

select e.fname, e.lname from EMPLOYEE e
where (select count(*) from DEPENDENT d where d.essn = e.ssn) = 0;

But here it is with not exists:

select e.fname, e.lname from EMPLOYEE e
where not exists
(select * from DEPENDENT d where e.ssn = d.essn);

Query 7: list the names of managers with at least one dependent:

select e.fname, e.lname from EMPLOYEE e
where
exists (select * from DEPENDENT d where d.essn = e.ssn)
and
exists( select * from DEPARTMENT d where e.ssn = d.mgr_ssn);

Or perhaps

select e.fname, e.lname from employee e, department d
where
e.ssn = d.mgr_ssn
and
exists (select * from dependent dd where dd.essn = e.ssn);

Generally speaking, joins are slightly preferred, simply because it is likely they are better optimized.

Suppose we want to find all employees who have worked on project 3. We can do this with a straightforward join:

select e.fname, e.lname from employee e, works_on w
where e.ssn = w.essn and w.pno = 3;

or we can write:

select e.fname, e.lname from EMPLOYEE e
where exists (select * from WORKS_ON w where w.essn = e.ssn and w.pno = 3);

But now suppose we want all employees who have not worked on project 3. If this can be done with a straightforward join, it is beyond me. But it is easy with the exists pattern above:

select e.fname, e.lname from EMPLOYEE e
where not exists (select * from WORKS_ON w where w.essn = e.ssn and w.pno = 3);

Application of not exists to the results of a subquery is a powerful technique.

How about all the employees who have worked more hours on some project than anyone in department 4?

select e.fname, e.lname from EMPLOYEE e, works_on w
where e.ssn = w.essn and w.hours >all (
select w.hours from WORKS_ON w, EMPLOYEE e2 where w.essn= e2.ssn and e2.dno = 4
);

We could also write that

where e.ssn = w.essn and w.hours > (select max(w.hours) from WORKS_ON w, EMPLOYEE e2 where w.essn = e2.ssn and e2.dno = 4)

Here's yet another way to do the intersect example from last week, without using intersect. Recall that the point was to list all employees who worked on project 2 and project 3.

select e.fname, e.lname
from EMPLOYEE e
where

e.ssn in (select e.ssn from EMPLOYEE e, WORKS_ON w where e.ssn = w.essn and w.pno = 2)
and
e.ssn in (select e.ssn from EMPLOYEE e, WORKS_ON w where e.ssn = w.essn and w.pno = 3);

What about the name of the oldest employee? This nested query isn't even correlated.

select e.fname, e.lname from EMPLOYEE e
where e.bdate = (select min(e2.bdate) from EMPLOYEE e2);

And if we want the names of all employees older than everyone in department 5:

select e.fname, e.lname from EMPLOYEE e
where e.bdate < all (select e.bdate from EMPLOYEE e where e.dno = 5);

What about all employees who have worked on a dept 5 project and also a dept 4? This can be done as a join and also as:

select e.fname, e.lname from EMPLOYEE e
where
exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = 4)
and
exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = 5);

(Wong has worked on projects 2 & 3 in dept 5 and 10 in dept 4, and Wallace has worked on projects 30 in dept 4 and the following adds her to project 2 in dept 5)

insert into works_on values ('987654321', 2, 3); -- project 2 is in dept 5; the 3 is the hours/week worked
delete from works_on where essn='987654321' and pno=2;

How about employees who have worked on a project in a department other than their "home" department? Again, this can be done as a join but also as:

select e.fname, e.lname from EMPLOYEE e
where exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and e.dno <> p.dnum);

That last <> isn't really a join condition; we already have enough of those.

Queries with ALL (Universal Quantification)

Here's a harder one: list the employees who have worked for ALL departments, or, in more detail, list all employees e such that for every department d, e worked for a project of d.

A first approach to this might be by listing employees for which the set of departments they have worked for equals the set of all departments:

select e.lname from employee e
where
(select distinct p.dnum from project p, works_on w
where w.pno = p.pnumber and w.essn = e.ssn order by p.dnum) -- projects e has worked on
=
(select d.dnumber from department d);

However, SQL does not allow arbitrary set comparisons: MySQL yields the error "subquery returns more than 1 row". Postgres complains "ERROR: more than one row returned by a subquery used as an expression".

The second department set above, (select d.dnumber from department d), is the set of all departments. Let us call this D2. The first department set is the set of departments that employee e has worked for; let us call this D1. It is automatic that D1 ⊆ D2; what we really want to know is whether there is anything in D2 that is not in D1. The set-theoretic MINUS operator works as follows:

    A MINUS B = {x∈A | x∉B}

So what we really need to know is whether D2 MINUS D1 is nonempty. We can actually do this; in Oracle we'd use "MINUS" but in postgres the set-minus operator is called "EXCEPT":

select e.lname from employee e
where NOT EXISTS (
    (select d.dnumber from department d)
    EXCEPT
    (select distinct p.dnum from project p, works_on w
     where w.pno = p.pnumber and w.essn = e.ssn order by p.dnum)       -- projects e has worked on
)

Alas, there is no MINUS / EXCEPT operator in MySQL.

A way to approach this problem that MySQL can handle is by is asking for all employees such that there is not a department that the employee has not worked on any of that department's projects, or list all employees e such that it is there is no department d such that e did not work for any project of d.

The following two ways of describing a set are the same, where D is the domain of d. The notation is general, but it may help to think of e as representing employees, d as representing departments D, and foo(e,d) meaning "e works on a project managed by d" (we could also use "e works for d", but in that case for each e there is exactly one d).

{ e | for all d ∈ D, foo(e,d) }
{ e | there does not exist d ∈ D for which ¬foo(x,y) }

That is, "for every d, foo(d) is true" is to say that "there is no d such that foo(d) is false". We might call the latter the double-negative formulation.

We cannot phrase the first bulleted form directly in SQL, but we can express the second form, using at least one and usually two not exist clauses.

Back to the example "list all employees e such that for every department d, e worked for a project of d". By the above, this is the same as "list all employees e such that there does not exist a department d for which e did not work on any project of d". To list the projects that e worked on in department d, we could use the following (free variables e and d highlighted):

select * from works_on w, project p
where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = d.dnumber

We can say "e did not work on any project of d" by saying the above list is empty:

not exists (select * from works_on w, project p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = d.dnumber)

So here is the final query: there does not exist a department e did not work for

select e.fname, e.lname from EMPLOYEE e
where not exists (
select d.dnumber from department d where
not exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = d.dnumber)
);

Queries involving ALL in the qualification (as opposed to "list ALL senior students") tend to be hard to follow, at least until the "double negative" reformulation becomes more natural.

As another example, suppose we want the names of each employee who works on all the projects controlled by department 4 (Query Q3). (Those projects are pnumbers 10 and 30 using the E&N original data.) To do this one, we need to select the department-4 projects.

Here is E&N's version of this query, Q3B (p 122, 6th ed), slightly rearranged

select e.fname, e.lname from EMPLOYEE e
where
not exists (

select * from WORKS_ON w1 where
w1.pno in (select p.pnumber from PROJECT p where p.dnum = 4)
and
not exists (select * from WORKS_ON w2 where w2.essn = e.ssn and w2.pno = w1.pno)

But here is an arguably more "natural" version, or at least more directly using the double-negative style:

select e.fname, e.lname from EMPLOYEE e
where
not exists (

select * from project p where p.dnum = 4
and
not exists (select * from WORKS_ON w where w.essn = e.ssn and w.pno = p.pnumber) /* e did not work on p */

);

The idea here is that someone has worked on all department-4 projects if there is no department-4 project that the employee did not work on.

Now suppose we want employees who have worked on at least two projects in department 4. This is easier.

select e.fname, e.lname from EMPLOYEE e
where
(select count(*) from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = 4) >= 2;

One of the homework-2 exercises asks you to list all students who have received all A's. This is another example of this pattern: it is equivalent to asking for all students where there does not exist a grade that is not A. Here, though, the inner query would be the set of grades received by the student that are ≠ A; there would not be a second not exist clause.

Except / Minus

Like for intersect, there is a general way to take a query that uses the except (or minus) operator, and reimplement it as a join. In this case, however, an outer join is needed. Suppose we have

(select a.col1, a.col2 from TableA a)
except
(select b.col1, b.col2 from TableB b)

We can convert this as follows using NOT IN:

select distinct a.col1, a.col2 from TableA a
where (a.col1, a.col2) NOT IN
(select b.col1, b.col2 from TableB b);

We can also do this as a left outer join:

select distinct a.col1, a.col2 from TableA a left join TableB b on a.col1=b.col1 and a.col2 = b.col2
where b.col1 is null

Recall that the left outer join will include in the join above all records that represent "real" matches, and also all records in a that have no match in b.

As a demo, let us first create a smaller version of the employee table:

create view employee2 as
select * from employee e where e.dno <> 5;

Now the join:

select distinct e.lname, e.ssn, e.dno from employee e left join employee2 e2 on e.ssn = e2.ssn
where e2.ssn is null;

GROUP BY with HAVING

Recall Query 25 from last week: for each project, give the project number, the project name, and the number of employees who worked on it. Our solution was the following:

select p.pnumber, p.pname, count(*) as employees
from PROJECT p, WORKS_ON w
where p.pnumber = w.pno
group by p.pnumber, p.pname

In this query, we listed the project number, project name and employee count for each project. Suppose we want (Query 26) this information for each project with more than two employees. We introduce the having clause, which works like the where clause but is used to select entire groups.

select p.pnumber, p.pname, count(*)
from PROJECT p, WORKS_ON w
where p.pnumber = w.pno
group by p.pnumber, p.pname
having count(*) > 2;

Note that we first apply the where clause to build the "ungrouped" table, and then apply the group by clause to divide the rows into groups, and finally restrict the output to only certain groups by using the having clause. Any aggregation functions in the having clause work just as they would in the select clause: they represent aggregation over the individual groups. In the case above, count(*) represents the count of each group.

Here's another example (query 27): for each project, retrieve the project number & name, and the number of dept-5 employees who work on it. E&N's solution is here:

select p.pnumber, p.pname, count(*) -- WRONG
from PROJECT p, WORKS_ON w, EMPLOYEE e
where p.pnumber = w.pno and w.essn = e.ssn and e.dno = 5
group by p.pnumber, p.pname;

This is actually wrong, if you interpret it as wanting all projects even if there are zero dept-5 employees at work on them. With the default data, project 30 (Newbenefits) is in this category. The problem is similar to that solved by outer joins: if there are no dept-5 employees on the project, then no join row is created at all.

Problem: GROUP BY will never include empty groups. If you want to include groups with count(*) = 0, use an inner query

This is an important point. Let's leave out the count(*) and the group by (and add an order by p.pnumber); the result of the query is:

+---------+-----------------+
| pnumber | pname           |
+---------+-----------------+
|       1 | ProductX        |
|       1 | ProductX        |
|       2 | ProductY        |
|       2 | ProductY        |
|       2 | ProductY        |
|       3 | ProductZ        |
|       3 | ProductZ        |
|      10 | Computerization |
|      20 | Reorganization  |
+---------+-----------------+

Doing the grouping manually, we see that we have groups of sizes 2 (pnumber=1), 3 (pnumber=2), 2, 1 and 1, as expected. But there are no groups of size zero shown!

Here is a working version:

select p.pnumber, p.pname,

(select count(*) from WORKS_ON w, EMPLOYEE e
where p.pnumber = w.pno and w.essn = e.ssn and e.dno = 5) as dept5_count

from PROJECT p;

This technique can often be used to write a query involving group by as one without.

(Why did I introduce the column name dept5_count here?)

Another group-by/having issue is illustrated in Query 28: give the number of employees in each department whose salaries exceed $40,000, but only for departments with more than five employees. In order to get a nonempty result, let's change this to $29999 and >2 employees. The book gives the following wrong version first (correctly identifying it as wrong!):

select d.dname, count(*) as highpaid -- WRONG
from department d, employee e
where d.dnumber = e.dno and e.salary > 29999
group by d.dname
having count(*) > 2;

Note that the Administration department has three employees, with the following salaries; the department has >2 employees and there is one with a salary > 29999:

+----------+---------+----------+
| Alicia   | Zelaya  | 25000.00 |
| Jennifer | Wallace | 43000.00 |
| Ahmad    | Jabbar  | 25000.00 |
+----------+---------+----------+

What shows up in the output of the query above? Recall that the where clause is evaluated at the beginning; if we put the salary constraint there, then we ignore employees making less than that, and thus potentially undercount some departments.

Problem: count(*) always counts the same thing, in a query, whether it is in the select clause or the having clause. If, as here, you want count(*) to mean one thing in one place (count of dept employees with high salary) and another thing in another place (count of all dept employees), you need an inner query.

Here is a working version (actually, the book's version of this is syntactically broken, as they left out the "e.dno in"; I think this is a typo rather than a design error).

select   d.dnumber, d.dname, count(*) as highpaid
from    department d, employee e
where d.dnumber = e.dno and e.salary > 29999 and
e.dno in (select e.dno from    employee e group by e.dno having count(*) > 2)
group by d.dnumber, d.dname;

VIEWS

Here's a simple view, that is, a sort of "virtual table". The view is determined by the select statement within. The point, though, is not that we set works_on1 here to be the output of the select, but that as future changes are made to the underlying tables, the view is automatically updated.

create view works_on1 AS
select e.fname, e.lname, p.pname, w.hours
from employee e, project p, works_on w
where e.ssn = w.essn and w.pno= p.pnumber;

We can now do:

select * from works_on1;

select w.lname, sum(w.hours)
from works_on1 w
group by w.lname;

Now let's add an employee:

insert into employee values ('Ralph', null, 'Wiggums', '000000001', '1967-03-15', null, 'M', 22000, '333445555', 5);
insert into works_on values ('000000001', 1, 9.8);

Ralph is there (working on 'ProductX'). To delete:

delete from works_on where essn='000000001';
delete from employee where ssn='000000001';

Here's a typical use of views for a single table: hiding some columns (I've hidden salary for privacy reasons, and address for space):

create view employee1 AS
select fname, minit, lname, ssn, bdate, sex, super_ssn, dno from employee;

Finally, here's another example from the book, demonstrating column renaming and use of aggregation columns:

create view deptinfo (dept_name, employee_count, total_salary) AS
select d.dname, count(*), sum(e.salary)
from department d, EMPLOYEE e
where d.dnumber = e.dno
group by dname;

There are two primary implementation strategies for views:

query rewriting
materialization

It is typically a problem to update views, though we can do this for the employee1 view so long as the missing fields do not have not null constraints (and also that we include the underlying employee table's primary key in the view).

In creating mailing lists for Microsoft Access, a given list is based not on a query (which would make the most sense), but on a named view.

Triggers

Chapter 4 had two uses of the CHECK statement (neither of which we did at the time): attribute checks and record checks. Two further checks are assertions and triggers. These sometimes have efficiency consequences, however.

1. Attribute checks

example: for the department table,
dnumber INT not null check (dnumber >0 and dnumber < 100)
These are applied whenever an individual attribute is set.

Here's another important attribute check (MySQL syntax)

create table employee2 (
    name varchar(50) not null,
    ssn    char(9) primary key not null,
    dno   INT not null check (dno > 0 and dno < 100),
    check (ssn RLIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]')
) engine innodb;

We can populate this from table employee:

insert into employee2
select e.lname, e.ssn, e.dno from employee e;

Some additional insertions that should fail:

insert into employee2 values ('anonymous', 'anonymous', 37);
insert into employee2 values ('joe hacker', '12345678O', 37); // that is the letter 'O'
insert into employee2 values ('jack hack', '12345689', 25); // not enough digits

They do not fail! MySQL ignores checks (even on dno). Actually, MySQL does not even check that there are parentheses around the body of the check condition.

How about postgres?

create table employee2 (
    name varchar(50) not null,
    ssn    char(9) primary key not null,
    dno   INT not null check (dno > 0 and dno < 100),
    check (ssn SIMILAR TO '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]')
);

This should work.

You can also do this in postgres:

create table employee3 (
    name varchar(50) not null,
    ssn    char(9) primary key not null,
    dno   INT not null check (dno > 0 and dno < 100),
    check (char_length(ssn) = 9)
);

2. Record (tuple) checks

create table DEPARTMENT (
       dname ....
       dnumber ...
       mgr_ssn ...
       mgr_start   DATE   not null,
       dept_creation DATE   not null,
       check (dept_creation <= mgr_start)
);

These are applied whenever a particular row is updated or added to that table.

3. Assertions

Assertions are general requirements on the entire database, to be maintained at all times.

create assertion salary_constraint
check ( not exists

(select * from employee e, employee s, department d
where e.salary > s.salary and e.dno = d.dnumber and d.mgr_ssn = s.ssn)

);

The problem with this (aside from the fact that this is not always a good business idea) is when and where it is applied. After every database update? But only some updates even potentially result in violations: changes to employee salaries, and changes of manager.

From the manual:
PostgreSQL does not implement assertions at present.

4. Triggers

Due to the "expense" of assertions, some databases (eg Oracle) support triggers: assertions that are only checked at user-specified times (eg update or insert into a particular table).

create trigger salary_violation
before insert or update of salary, super_ssn on employee
for each row when (new.salary >

(select salary from employee e
where e.ssn = new.super_ssn))

inform_supervisor (New.supervisor_ssn, new.ssn) -- some prepared external procedure
);

In postgres, we'd finish off with
for each row execute procedure inform_supervisor);
That is, we can't have the when clause.

Writing the external procedures is beyond the scope of SQL.

Oracle Data Dictionary

The following structures in Oracle 11g are part of the Data Dictionary:

user_catalog
user_tables
user_tab_columns
user_views
user_constraints
user_objects
user_users
...

The user_catalog (or just cat) is a list of the current user's tables; additional information about these (mostly storage-related) is found in user_tables. The user_views table contains information about views (below), which are a sort of pseudo-table.

Last week we looked at some information from user_constraints. I had earlier used this information to find the existing foreign-key constraints and re-create them with sensible names:

select table_name, constraint_name from user_constraints where constraint_type = 'R';

If we run the following query:

select index_name, table_name from user_indexes;

in Oracle, we find that the tables project and department have two indexes. Why?

User_objects contains tables and indexes, and also (if defined) views, functions, procedures, synonyms, triggers.

These user_... "tables" are actually views, as discussed below. They list the things "owned" by the current user, or, more precisely, belonging to the current user's "schema".

If a user replaces the prefix user_ with all_, the tables (views) show all objects accessible by the current user. If one has sufficient privileges there is also a prefix dba_, which includes everything.

The data dictionary is the repository for all metadata. It is normally understood to be read-only. Some privileged accounts can update the data dictionary directly; this is often unwise.

A number of "scripts" are available for accessing information in the data dictionary. There are also some built-in commands, such as describe tablename. In MySQL, most metadata is accessed not from tables/views but via commands, eg show tables, show create table tablename, etc.

The table user_users lists the users visible by the current user (eg the current user). The all_users table may be more relevant here.

See http://docs.oracle.com/cd/B10500_01/server.920/a96524/c05dicti.htm.

Oracle Schemas

A schema is owned by a user and represents, to a first approximation, the set of tables and other objects owned by that user.

Different users cannot necessarily "see" the objects in each others' schemas; they need to be granted permission.

Each schema has its own "namespace". In other words, users pld and bob can each have their own table "employee", in which case the tables pld.employee and bob.employee are not related.

If user bob is granted permission to view pld's table employee, then bob will access it as pld.employee. Such permission would be granted by pld as follows:

grant select on employee to bob;

For production use, the schema owner is usually an abstract entity rather than an individual, eg "admin_dept". Rights to create, view and modify tables can then be granted by the DBA to individual users.

In MySQL, one would give different users their own "database". A similar dotted notation, dbname.tablename, would be used to access tables in another database.