Comp 353/453: Database Programming, Corboy
523, 7:00 Thursdays
Week 4, Feb 9
Read in Elmasri & Navathe (EN)
- Chapter 3, The Relational Data Model ...., section 3: Update
Operations
- Chapter 4: Basic SQL
- Chapter 5: 5.1 all
Homework 2: E&N exercises
5.5, 5.6, 5.7, 5.8 (actual text on Blackboard), due Friday, Feb 24
Blackboard dropboxes have been created for homeworks 1 and 2.
Outer Joins and alternatives
Last week ago we considered how to list employees and their
qualifications, where some had no qualifications. We found we had to do
this:
select e.lname, q.qual from employee e left outer join qualification q on
e.ssn = q.ssn;
However, here is an alternative without using outer joins that does
work:
select e.fname, e.lname, (select q.qual
from qualification q where e.ssn = q.ssn)
from employee e;
Note that this does not work:
select e.lname, q.qual from EMPLOYEE e,
QUALIFICATION q
where (e.ssn = q.ssn) OR (not exists (select q.qual from QUALIFICATION
q where e.ssn=q.ssn));
What goes wrong? Look at the output. If someone has no
qualifications, the second half of the OR is always true. So that
employee appears with all
qualifications.
Multi-value nested queries
In these queries, the inner query returns (or potentially returns) a
set of rows, or rows with multiple columns. In this case, the following
operations are allowed:
- IN
- EXISTS / NOT EXISTS
- UNIQUE
- = ANY / = SOME
- op ALL (op is typically a comparison like > or >=)
Suppose we want to find the names of employees who have worked on
project 10. We can do this as a traditional join, but we can also do:
select e.fname, e.lname from employee e
where e.ssn in
(select w.essn from works_on w where w.pno = 10);
E&N Q4A, p 118: List all project
numbers for projects that
involve an employee whose lastname is 'Smith', either as worker or
manager.
We did this before with a standard join; here is a nested
version:
select distinct p.pnumber
from project p
where p.pnumber IN
(select p1.pnumber from project p1,
department d, employee e
where p1.dnum = d.dnumber and d.mgr_ssn = e.ssn and e.lname = 'Smith')
or
p.pnumber IN
(select w.pno from works_on w, employee
e
where w.essn = e.ssn and e.lname = 'Smith');
This one is not correlated;
the inner queries can be evaluated stand-alone.
In the middle of p 118, E&N states
SQL allows the use of tuples
of values in comparisons by placing them within parentheses. To
illustrate this, consider the following query [which I edited slightly
-- pld]
select distinct w.essn
from works_on w
where (w.pno, w.hours) IN
(select w1.pno, w1.hours from works_on w1 where w1.essn = '123456789');
Just what question does this answer?
Query 16: an example of a nested correlated query:
Retrieve the name of each employee who
has a dependent with the same first name and is the same sex as the
employee:
Here is E&N's solution:
'
select e.fname, e.lname from employee e
where e.ssn in
(select d.essn from dependent d where e.fname = d.dependent_name and
e.sex = d.sex);
Let's add a row to make this nonempty:
insert into dependent values ('123456789', 'John', 'M', '1990-02-24',
'son');
delete
from dependent where
bdate='1990-02-24';
// to delete the record above
Note that this query only returns rows where e.ssn = d.essn, athough
this is not stated as an equality check (if we don't have e.ssn =
d.essn, then e.ssn will not be in
the set of d.essn). But note that if we add the above son
John to another employee, '987654321', then the inner query for
employee John Smith,
'123456789', would amount to
select d.essn from dependent d where 'John' =
d.dependent_name and 'M' = d.sex;
and would return '987654321'. But the in
clause would then not match (as desired).
It can be confusing for an inner query to return such "irrelevant"
results, even if the end result is correct.
Another way to write this query might be:
select e.fname, e.lname from employee e join dependent d on e.ssn =
d.essn
where e.fname = d.dependent_name and e.sex = d.sex;
Note that queries using = or in
to connect the inner and outer queries can always be done without nesting,
using an additional join. Is this more or less efficient?
GROUP BY
Suppose we want to count the employees in each department. We could run
the following (sorting if preferred) and count each department by hand:
select e.dno from EMPLOYEE e;
But we really want SQL to do this; that is, we want a table of
⟨dept_no, count⟩. Here's how:
select e.dno, count(*)
from EMPLOYEE e
group by dno;
What is going on here is that we select the e.dno records, and then group them by dno, and then select
a single row for each group.
The selection values can either be conventional attributes that are
known to be constant
for the group, or else one of the aggregate functions count, sum,
max,
min, avg applied to a non-constant attribute. For count(), we can use
the form count(*) to count all records; for the others, we must apply
them to specific numeric attributes.
Conventional attributes are known to be constant for the group if they
appear in the group by clause.
In other words, every non-aggregate attribute in the select clause must also be in the group by clause. Usually all attributes in the group by clause are used in the select as well.
Here's the department information with average salaries added (and also
the birthdate of the oldest employee) (Query
24)
select e.dno, count(*), avg(e.salary),
min(bdate)
from EMPLOYEE e
group by dno;
What is going on conceptually is that, after the rows are selected
(there is no where clause in
the example above, so this row-selection is trivial), the remaining
rows are then partitioned according to the group by condition. Then, the select results are given for each group:
they will be either constant for the group, or else an aggregate
function that will now be computed for the group. Recall again that any
non-aggregate entry in the select
clause must be listed in the group by
clause; this is what guarantees that non-aggregate entries are constant
for each group.
If we want to add department names to the above we can do a simple
join. Note that we must
(except in MySQL) add dname to
the group by clause.
select d.dname, e.dno, count(*),
avg(e.salary), min(bdate)
from EMPLOYEE e, DEPARTMENT d
where e.dno = d.dnumber
group by dno, d.dname;
Here's another example: Query 25:
for each project, give the project number, the
project name, and the number of employees who worked on it. Note that
it is usually appropriate to add an as
title for count(*) columns to identify what is being counted.
select p.pnumber, p.pname, count(*) as employees
from PROJECT p, WORKS_ON w
where p.pnumber = w.pno
group by p.pnumber, p.pname
The output is:
pnumber
|
pname
|
employees
|
1
|
ProductX
|
2
|
2
|
ProductY
|
4
|
3
|
ProductZ
|
2
|
10
|
Computerization
|
3
|
20
|
Reorganization
|
3
|
30
|
Newbenefits
|
3
|
Demo: this fails if we leave either attribute out of the group by
clause (at least in postgres). (Note that, nonetheless, it is clear
that if we leave out
p.name then this attribute is still constant for p.pnumber groups;
p.pnumber is after all the key
for the PROJECT table.)
Also note that this might be a candidate for leaving p.pnumber off the select clause, if p.pname is
sufficient.
Here's a tabular representation of the full join table here, with all
attributes from works_on; it
represents Fig 5.1(b) of E&N. The group
by option means we take each group,
identified in the virtual final column; count(*) refers to the number
of rows in that group.
p.pname
|
p.pnumber
|
w.essn
|
w.pno
|
w.hours
|
|
ProductX
|
1 |
123456789 |
1 |
32.5 |
group 1
|
ProductX |
1 |
453453453 |
1 |
20 |
ProductY |
2 |
453453453 |
2 |
20 |
group 2
|
ProductY |
2 |
333445555 |
2 |
10 |
ProductY |
2 |
987654321 |
2 |
7.5 |
ProductY |
2 |
123456789 |
2 |
7.5 |
ProductZ |
3 |
333445555 |
3 |
10 |
group 3
|
ProductZ |
3 |
666884444 |
3 |
40 |
Computerization |
10 |
987987987 |
10 |
35 |
group 10
|
Computerization |
10 |
333445555 |
10 |
10 |
Computerization |
10 |
999887777 |
10 |
10 |
Reorganization |
20 |
333445555 |
20 |
10 |
group 20
|
Reorganization |
20 |
987654321 |
20 |
15 |
Reorganization |
20 |
888665555 |
20 |
0 |
Newbenefits |
30 |
999887777 |
30 |
30 |
group 30
|
Newbenefits |
30 |
987987987 |
30 |
5 |
Newbenefits |
30 |
987654321 |
30 |
20 |
All the aggregation functions (count, sum, min, max, avg) can be
applied in select clauses to
ungrouped queries, but then if any aggregation functions appear, then everything in the select clause must be an aggregation
function. What we get are various "statistics" for the entire table.
Example:
select count(*), min(e.bdate),
sum(e.salary), avg(e.dno) from employee e;
This is a relatively limited situation; it is more common for the
aggregation functions to be combined with group by.
Remember
E&N 4.12(c)?
For each section taught by Professor
King, retrieve the course number, semester, year, and number of
students who took the section
Here is a solution that works in MySQL, but is not standard SQL. Oracle and
postgres do not allow this:
select s.course_number, s.semester,
s.year, COUNT(*) as NUM_STUDENTS
from SECTION s, GRADE_REPORT g
where g.section_identifier = s.section_identifier and
s.instructor='King';
The COUNT(*) counts all records, and when it is present, everything else in the select clause should be "constant"
(either there should be
nothing else, or else the query should use group by for all the non-aggregated
attributes.
Here's a set of examples that use count(*) appropriately (and work
under postgres):
- select student_number from
GRADE_REPORT;
// no aggregation
- select count(*) from
GRADE_REPORT;
// no non-aggregated attributes
- select count(student_number) from GRADE_REPORT;
- select count(distinct student_number) from GRADE_REPORT;
- select count(*) from (select distinct student_number,
section_identifier from grade_report) as dummy;
Here are some examples that are wrong,
though, as usual, some work in MySQL
- select student_number, count(*) from
GRADE_REPORT;
// needs GROUP BY
- select count(distinct student_number, section_identifier) from
GRADE_REPORT
- select count(distinct *) from GRADE_REPORT;
/* fails in mysq*/
select count(distinct ____) ... requires a single column attribute!
Nested queries with Exists and Unique
Here's the above example again of a query to find all employees with a
dependent of the same name and sex, this time done with exists instead of in:
select e.fname, e.lname from EMPLOYEE e
where exists
(select * from DEPENDENT d where e.ssn = d.essn and e.fname =
d.dependent_name and e.sex = d.sex);
(Note here that the inner query will return no records if the result is
empty, unlike the in version.)
What about the names of employees with no
dependents (Query 6 of
E&N)? We can do a query with COUNT(*) to check if an employee e
qualifies:
select count(*) from DEPENDENT d where d.essn = e.ssn;
/* e is a "free variable" here */
The full query is:
select e.fname, e.lname from EMPLOYEE e
where (select count(*) from DEPENDENT d where d.essn = e.ssn) = 0;
But here it is with not exists:
select e.fname, e.lname from EMPLOYEE e
where not exists
(select * from DEPENDENT d where e.ssn = d.essn);
Query 7: list the names of managers
with at least one dependent:
select e.fname, e.lname from EMPLOYEE e
where
exists (select * from DEPENDENT d where d.essn = e.ssn)
and
exists( select * from DEPARTMENT d where e.ssn = d.mgr_ssn);
Or perhaps
select e.fname, e.lname from employee
e, department d
where
e.ssn = d.mgr_ssn
and
exists (select * from dependent dd where dd.essn = e.ssn);
Generally speaking, joins are slightly preferred, simply because it is
likely they are better optimized.
Suppose we want to find all employees
who have worked on project 3. We can do this with a straightforward
join:
select e.fname, e.lname from employee
e, works_on w
where e.ssn = w.essn and w.pno = 3;
or we can write:
select e.fname, e.lname from EMPLOYEE e
where exists (select * from
WORKS_ON w where w.essn = e.ssn and w.pno = 3);
But now suppose we want all employees who have not worked on project 3. If this
can be done with a straightforward join, it is beyond me. But it is
easy with the exists pattern
above:
select e.fname, e.lname from EMPLOYEE e
where not exists (select *
from WORKS_ON w where w.essn = e.ssn and w.pno = 3);
How about all the employees who have worked more hours on some project
than anyone in department 4?
select e.fname, e.lname from EMPLOYEE
e, works_on w
where e.ssn = w.essn and w.hours >all (
select w.hours from WORKS_ON w, EMPLOYEE e2 where
w.essn= e2.ssn and e2.dno = 4
);
We could also write that
where e.ssn = w.essn and w.hours >
(select max(w.hours) from WORKS_ON w, EMPLOYEE e2 where w.essn = e2.ssn
and e2.dno = 4)
Here's a way to do the intersect
example from earlier, without using intersect.
Recall that the point was to list all employees who worked on project 2
and project 3.
select e.fname, e.lname
from EMPLOYEE e
where
e.ssn in (select e.ssn from EMPLOYEE e,
WORKS_ON w where e.ssn = w.essn and w.pno = 2)
and
e.ssn in (select e.ssn from EMPLOYEE e, WORKS_ON w where e.ssn = w.essn
and w.pno = 3);
What about the name of the oldest employee? This nested query isn't
even correlated.
select e.fname, e.lname from EMPLOYEE e
where e.bdate = (select min(e2.bdate) from EMPLOYEE e2);
And if we want the names of all employees older than everyone in
department 5:
select e.fname, e.lname from EMPLOYEE e
where e.bdate < all (select
e.bdate from EMPLOYEE e where e.dno = 5);
What about all employees who have worked on a dept 5 project and also a
dept 4? This can be done as a join and also as:
select e.fname, e.lname from EMPLOYEE e
where
exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and
w.pno = p.pnumber and p.dnum = 4)
and
exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and
w.pno = p.pnumber and p.dnum = 5);
(Wong has worked on projects 2 & 3 in dept 5 and 10 in dept 4, and
Wallace has worked on projects 30 in dept 4 and the following adds her
to project
2 in dept 5)
insert into works_on values
('987654321', 2,
3);
// project 2 is in dept 5; the 3 is the hours/week worked
delete from works_on where essn='987654321' and pno=2;
How about employees who have worked on a project in a department
other than their "home" department? Again, this can be done as a join
but also as:
select e.fname, e.lname from EMPLOYEE e
where exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn
and w.pno = p.pnumber and e.dno <> p.dnum);
That last <> isn't really a join condition; we already have
enough of those.
Queries with ALL (Universal
Quantification)
Here's a harder one: list the employees who have worked for ALL
departments, or, in more detail, list
all employees e such that for every
department d, e worked for a project of d.
You might try to approach this by listing employees for which the set
of departments they have worked for equals the set of all departments:
select e.lname from employee e
where
(select distinct p.dnum from project p, works_on w
where w.pno = p.pnumber and w.essn = e.ssn order by
p.dnum) // projects e has worked on
=
(select d.dnumber from department d);
However, SQL does not allow arbitrary set comparisons: MySQL yields the
error "subquery returns more than 1 row".
A way to think of this that SQL can
handle is by is asking for
all employees such that there is not
a department that the employee has not worked on any of that
department's projects, or list all employees e such that it is there
is no department d such that e
did not work for any project
of d.
The following two ways of describing a set are the same, where D is the
domain of y:
- { x | for all y ∈ D, foo(x,y) }
- { x | there does not exist y ∈ D for which ¬foo(x,y) }
That is, "for every d, foo(d) is true" is to say that "there is
no d such that foo(d) is false". We might call the latter the double-negative formulation.
We cannot phrase the first bulleted form directly in SQL, but we can express the second form, using
at least one and usually two not exist
clauses.
Back to the example "list all employees e such that for every
department d, e worked for a project of d". By the above, this is the
same as "list all employees e such that there does not exist a
department d for which e did not work on any project of d". To
list the projects that e worked on in department d, we could use the
following (free variables e and d highlighted):
select * from works_on w, project p where e.ssn = w.essn and w.pno = p.pnumber
and p.dnum = d.dnumber
We can say "e did not work on any project of d" by saying the above
list is empty:
not exists (select * from works_on w, project p
where e.ssn = w.essn and w.pno
= p.pnumber and p.dnum = d.dnumber)
So here is the final query: there does not exist a department e did not
work for
select e.fname, e.lname from EMPLOYEE e
where not exists (
select d.dnumber from department d where
not exists (select * from
WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and
p.dnum = d.dnumber)
);
Queries involving ALL in the qualification (as opposed to "list ALL
senior students") tend to be hard to follow, at least until the "double
negative" reformulation becomes more natural.
As another example, suppose we
want the names of each employee who works on all
the projects controlled by department 4 (Query Q3). (Those projects are
pnumbers
10 and 30 using the E&N original data.) To do this one, we need to
select the department-4 projects.
Here is E&N's version of this query, Q3B (p 122, 6th ed), slightly
rearranged
select e.fname, e.lname from EMPLOYEE e
where
not exists (
select * from WORKS_ON w1 where
w1.pno in (select p.pnumber from
PROJECT p where p.dnum = 4)
and
not
exists (select * from WORKS_ON w2 where w2.essn = e.ssn and
w2.pno
= w1.pno)
But here is an arguably more "natural" version, or at least more directly using the double-negative style:
select e.fname, e.lname from EMPLOYEE e
where
not exists (
select * from project p where p.dnum = 4
and
not exists (select * from WORKS_ON w where w.essn = e.ssn and w.pno =
p.pnumber) /* e did not work on p */
);
The idea here is that someone has worked on all department-4 projects if there
is no department-4 project
that the employee did not
work on.
Now suppose we want employees who have worked on at least two projects in department
4. This is easier.
select e.fname, e.lname from EMPLOYEE e
where
(select count(*) from WORKS_ON w, PROJECT p where e.ssn = w.essn and
w.pno = p.pnumber and p.dnum = 4) >= 2;
One of the homework-2 exercises asks you to list all students who have
received all A's. This is another example of this pattern: it is
equivalent to asking for all students where there does not exist a grade that is not A. Here, though, the inner query
would be the set of grades received by the student that are ≠ A; there
would not be a second not exist
clause.
GROUP BY and HAVING
In Query 25, above, we listed the project number, project name and
employee count for each project. Suppose we want (Query 26) this information for each project with more than two
employees. We introduce the having
clause, which works like the where
clause but is used to select entire
groups.
select p.pnumber, p.pname,
count(*)
from PROJECT p, WORKS_ON w
where p.pnumber = w.pno
group by p.pnumber, p.pname
having count(*) > 2;
Note that we first apply the where
clause to build the "ungrouped" table, and then apply the group by clause to divide the rows
into groups, and finally restrict the output to only certain groups by
using the having clause.
Here's another example (query 27):
for each project, retrieve the
project number & name, and the number of dept-5 employees who work
on it. E&N's solution is here:
select p.pnumber, p.pname,
count(*)
from PROJECT p, WORKS_ON w, EMPLOYEE e
where p.pnumber = w.pno and w.essn = e.ssn and e.dno = 5
group by pnumber, pname;
This is actually wrong, if you interpret it as wanting all projects
even if there are zero
dept-5 employees at work on them. With the default data, project 30
(Newbenefits) is
in this category. The problem is similar to that solved by outer joins:
if there are no dept-5 employees on the project, then no join row is
created at all.
This is an important point. Let's leave out the count(*) and the group by (and add an order by p.pnumber); the result of
the query is:
+---------+-----------------+
| pnumber | pname |
+---------+-----------------+
| 1 | ProductX |
| 1 | ProductX |
| 2 | ProductY |
| 2 | ProductY |
| 2 | ProductY |
| 3 | ProductZ |
| 3 | ProductZ |
| 10 | Computerization |
| 20 | Reorganization |
+---------+-----------------+
Doing the grouping manually, we see that we have groups of sizes 2, 3,
2, 1 and 1, as expected. But there
are no groups of size zero!
Here is a working version:
select p.pnumber, p.pname,
(select count(*) from WORKS_ON w,
EMPLOYEE e
where p.pnumber = w.pno and w.essn = e.ssn and e.dno = 5) as dept5_count
from PROJECT p
;
This technique can often be used to write a query involving group by as one without.
(Why did I introduce the column name dept5_count here?)
Another group-by/having issue
is illustrated in Query 28:
give the number of
employees in each department whose salaries exceed $40,000, but only
for departments with more than five employees. In order to get a
nonempty result, let's change this to $29999 and >2 employees. The
book gives the following wrong
version first (correctly identifying it as wrong!):
select d.dname, count(*) as
highpaid
from department d, employee e
where d.dnumber = e.dno and e.salary > 29999
group by d.dname
having count(*) > 2;
Note that the Administration department has three employees, with
the following salaries; the department has >2 employees and there is
one with a salary > 29999:
+----------+---------+----------+
| Alicia | Zelaya | 25000.00 |
| Jennifer | Wallace | 43000.00 |
| Ahmad | Jabbar | 25000.00 |
+----------+---------+----------+
What shows up in the output of the query above? Recall that the where
clause is evaluated at the beginning; if we put the salary constraint
there, then we ignore employees making less than that, and thus
potentially undercount some departments.
Here is a working version (actually, the book's version of this is
syntactically broken, as they left out the "e.dno in"; I think this is
a typo rather than a design error).
select d.dnumber, d.dname,
count(*) as highpaid
from department d, employee e
where d.dnumber = e.dno and e.salary > 29999 and
e.dno in (select e.dno from employee
e group by e.dno having count(*) > 2)
group by d.dnumber, d.dname;