Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 4
Tutoring: Faraz Khan, Wed 5:00-7:00 pm, fkhan10@luc.edu.
Homework 2
Joins
Whenever you list multiple tables in the from
line, you have a join. Without a join condition between two tables, you
will have N×M records in the result; this is seldom what you want. With
more than two tables, you don't need a join between every pair; you
just need a "chain" of joins uniting everything. In general, for N
tables you need N−1 join conditions.
Join conditions in the where
clause serve to join different tables; they can be identified by the
format table1.attribute1 = table2.attribute2 (sometimes other
relational operations than = are involved, but those ar rare). Every
other condition in the where clause has to do with selecting a subset
of the total number of rows, thus discarding some rows from
consideration. It is possible to see join conditions this way, too,
discarding rows from the cross product of the two tables. But this is not terribly helpful in practice.
All this is one reason some prefer the newer "from table1 join table2 on join-condition".
Outer Joins and alternatives
Two weeks ago we considered how to list employees and their
qualifications, where some had no qualifications. We found we had to do
this:
select e.lname, q.qual from employee e left outer join qualifications q on e.ssn = q.ssn;
However, here is an alternative without using outer joins that does work:
select e.fname, e.lname, (select q.qual from qualifications q where e.ssn = q.ssn)
from employee e;
Note that this does not work:
select e.lname, q.qual from EMPLOYEE e, QUALIFICATIONS q
where (e.ssn = q.ssn) OR (not exists (select q.qual from QUALIFICATIONS q where e.ssn=q.ssn));
What goes wrong?
MySQL under linux
The linux filesystem is case sensitive. If you want table names to be case-insensitive, edit /etc/mysql/my.cnf to add the following in the mysqld section:
lower_case_table_names=1
Then restart your mysql server (eg with /etc/init.d/mysql restart).
In the future, my plan is to capitalize table names, as is fairly standard practice.
Why is SQL hard?
Basically, it's hard because it is a non-imperative language: you don't issue a sequence of commands. Instead, you write a single big expression. It's hard to visualize big expressions, and hard to test them incrementally. Nested queries do seem to help in this regard.
(Actually, the data types of SQL are tables, and you can in fact save a table in a "table variable". But this is rare.)
GROUP BY
Suppose we want to count the employees in each department. We could run
the following (sorting if preferred) and count each department by hand:
select e.dno from EMPLOYEE e;
But we really want SQL to do this; that is, we want a table of ⟨dept_no, count⟩. Here's how:
select e.dno, count(*)
from EMPLOYEE e
group by dno;
What is going on here is that we select the e.dno records, and then group them by dno, and then select a single row for each group. The selection values can either be conventional attributes that are known to be constant
for the group, or else one of the aggregate functions count, sum, max,
min, avg applied to a non-constant attribute. For count(), we can use
the form count(*) to count all records; for the others, we must apply
them to specific numeric attributes.
Conventional attributes are known to be constant for the group if they appear in the group by clause. In other words, every non-aggregate attribute in the select clause must also be in the group by clause. Usually all attributes in the group by clause are used in the select as well.
Here's the department information with average salaries added (and also the birthdate of the oldest employee)
select e.dno, count(*), avg(e.salary), min(bdate)
from EMPLOYEE e
group by dno;
What is going on conceptually is that, after the rows are selected (there is no where clause in the example above, so this row-selection is trivial), the remaining rows are then partitioned according to the group by condition. Then, the select results are given for each group: they will be either constant for the group, or else an aggregate function that will now be computed for the group.
If we want to add department names to the above we can do a simple join. Note that we must (except in MySQL) add dname to the group by clause.
select d.dname, e.dno, count(*), avg(e.salary), min(bdate)
from EMPLOYEE e, DEPARTMENT d
where e.dno = d.dnumber
group by dno, d.dname;
Here's another example: for each project, give the project number, the
project name, and the number of employees who worked on it. Note that
it is usually appropriate to add an as title for count(*) columns to identify what is being counted.
select p.pnumber, p.pname, count(*) as employees
from PROJECT p, WORKS_ON w
where p.pnumber = w.pno
group by p.pnumber, p.pname
The output is:
1
|
ProductX
|
2
|
2
|
ProductY
|
4
|
3
|
ProductZ
|
2
|
10
|
Computerization
|
3
|
20
|
Reorganization
|
3
|
30
|
Newbenefits
|
3
|
Demo: this fails if we leave either attribute out of the group by
clause. (Note that, nonetheless, it is clear that if we leave out
p.name then this attribute is still constant for p.pnumber groups;
p.pnumber is after all the key for the PROJECT table.)
Also note that this might be a candidate for leaving p.pnumber off the select clause, if p.pname is sufficient.
Here's a tabular representation of the full join table, with all attributes from works_on; it represents Fig 5.1(b) of E&N. The group by option means we take each group, identified in the virtual final column; count(*) refers to the number of rows in that group.
p.pname
|
p.pnumber
|
w.essn
|
w.pno
|
w.hours
|
|
ProductX |
1 |
123456789 |
1 |
32.5 | group 1
|
ProductX |
1 |
453453453 |
1 |
20 |
ProductY |
2 |
453453453 |
2 |
20 | group 2
|
ProductY |
2 |
333445555 |
2 |
10 |
ProductY |
2 |
987654321 |
2 |
7.5 |
ProductY |
2 |
123456789 |
2 |
7.5 |
ProductZ |
3 |
333445555 |
3 |
10 | group 3
|
ProductZ |
3 |
666884444 |
3 |
40 |
Computerization |
10 |
987987987 |
10 |
35 | group 10
|
Computerization |
10 |
333445555 |
10 |
10 |
Computerization |
10 |
999887777 |
10 |
10 |
Reorganization |
20 |
333445555 |
20 |
10 | group 20
|
Reorganization |
20 |
987654321 |
20 |
15 |
Reorganization |
20 |
888665555 |
20 |
0 |
Newbenefits |
30 |
999887777 |
30 |
30 | group 30
|
Newbenefits |
30 |
987987987 |
30 |
5 |
Newbenefits |
30 |
987654321 |
30 |
20 |
All the aggregation functions (count, sum, min, max, avg) can be applied in select clauses to ungrouped queries, but then if any aggregation functions appear, then everything in the select clause must be an aggregation function. What we get are various "statistics" for the entire table. Example:
select count(*), min(e.bdate), sum(e.salary), avg(e.dno) from employee e;
This is a relatively limited situation; it is more common for the aggregation functions to be combined with group by.
Remember E&N 4.12(c)?
For each section taught by Professor
King, retrieve the course number, semester, year, and number of
students who took the section
Here's the official answer:
select course_number, semester, year,
(select count(*) from GRADE_REPORT g where g.section_identifier = s.section_identifier) AS NUM_STUDENTS
from SECTION s where s.instructor='King';
Many of you did it this way, though:
select s.course_number, s.semester, s.year, COUNT(*) as NUM_STUDENTS
from SECTION s, GRADE_REPORT g
where g.section_identifier = s.section_identifier and s.instructor='King';
But this is wrong!
Well, it works in MySQL. But it is not considered correct. The COUNT(*) counts all records, and when it is present, everything else in the select clause should be "constant" (either there should be nothing else, or else the query should use group by for all the other attributes.
A set of MySQL examples; all should be invalid (except the first) becaue they use count(*) inappropriately:
select student_number from GRADE_REPORT;
select student_number, count(*) from GRADE_REPORT;
select count(*) from GRADE_REPORT;
select count(distinct *) from GRADE_REPORT; /* FAILS */
select count(student_number) from GRADE_REPORT;
select count(distinct student_number) from GRADE_REPORT;
Nested queries with Exists and Unique
Here's last week's example again, of a query to find all employees with a dependent of the same name and sex:
select e.fname, e.lname from EMPLOYEE e
where exists
(select * from DEPENDENT d where e.ssn = d.essn and e.fname = d.dependent_name and e.sex = d.sex);
What about the names of employees with no dependents? We can do a query with COUNT(*), verifying this is zero:
select count(*) from DEPENDENT d where d.essn = e.ssn; /* e is a "free variable" here */
The full query is:
select e.fname, e.lname from EMPLOYEE e
where (select count(*) from DEPENDENT d where d.essn = e.ssn) = 0;
But here it is with not exists:
select e.fname, e.lname from EMPLOYEE e
where not exists
(select * from DEPENDENT d where e.ssn = d.essn);
Query 7: list the names of managers with at least one dependent:
select e.fname, e.lname from EMPLOYEE e
where
exists (select * from DEPENDENT d where d.essn = e.ssn)
and
exists( select * from DEPARTMENT d where e.ssn = d.mgr_ssn);
Suppose we want to find all employees who have worked on project 3. We can do this with a straightforward join, or we can write:
select e.fname, e.lname from EMPLOYEE e
where exists (select * from WORKS_ON w where w.essn = e.ssn and w.pno = 3);
But now suppose we want all employees who have not worked on project 3. If this can be done with a straightforward join, it's beyond me. But it's easy with the pattern above:
select e.fname, e.lname from EMPLOYEE e
where not exists (select * from WORKS_ON w where w.essn = e.ssn and w.pno = 3);
How about all the employees who have worked more hours on some project than anyone in department 4?
select e.fname, e.lname from EMPLOYEE e, works_on w
where e.ssn = w.essn and w.hours >all (
select w.hours from WORKS_ON w, EMPLOYEE e2 where w.essn= e2.ssn and e2.dno = 4
);
We could also write that
where e.ssn = w.essn and w.hours > (select max(w.hours) from WORKS_ON w, EMPLOYEE e2 where w.essn = e2.ssn and e2.dno = 4)
Here's a way to do the intersect example from the top of week 3, without using intersect. Recall that the point was to list all employees who worked on project 2 and project 3.
select e.fname, e.lname
from EMPLOYEE e
where
e.ssn in (select e.ssn from EMPLOYEE e, WORKS_ON w where e.ssn = w.essn and w.pno = 2)
and
e.ssn in (select e.ssn from EMPLOYEE e, WORKS_ON w where e.ssn = w.essn and w.pno = 3);
What about the name of the oldest employee? This query isn't even correlated.
select e.fname, e.lname from EMPLOYEE e
where e.bdate = (select min(e2.bdate) from EMPLOYEE e2);
And if we want the names of all employees older than everyone in department 5:
select e.fname, e.lname from EMPLOYEE e
where e.bdate < all (select e.bdate from EMPLOYEE e where e.dno = 5);
What about all employees who have worked on a dept 5 project and also a dept 4? This can be done as a join and also as:
select e.fname, e.lname from EMPLOYEE e
where
exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = 4)
and
exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = 5);
(Wong has worked on projects 2 & 3 in dept 5 and 10 in dept 4, and
Wallace has worked on projects 30 in dept 4 and I added her to project
2 in dept 5).
How about employees who have worked on a project in a departent
other than their "home" department? Again, this can be done as a join
but also as:
select e.fname, e.lname from EMPLOYEE e
where exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and e.dno <> p.dnum);
That last <> isn't really a join condition; we already have enough of those.
Here's a harder one: list the employees who have worked for ALL departments. One way to think of this is asking for
all employees such that there is not a project they have not worked on:
select e.fname, e.lname from EMPLOYEE e
where not exists (
select d.dnumber from department d where
not exists (select * from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = d.dnumber)
);
Queries involving ALL in the qualification (as opposed to "list ALL
senior students) tend to be complex. As another example, suppose we
want the names of each employee who works on all
the projects controlled by department 4. (Those projects are pnumbers
10 and 30 using the E&N original data.) To do this one, we need to
select the department-4 projects.
select e.fname, e.lname from EMPLOYEE e
where
not exists (
select * from WORKS_ON w1 where
w1.pno in (select p.pnumber from PROJECT p where p.dnum = 4)
and
not exists (select * from WORKS_ON w2 where w2.essn = e.ssn and w2.pno
= w1.pno)
);
The idea here is that someone has worked on all department-4 projects if there is no department-4 project that the employee did not work on.
Now suppose we want employees who have worked on at least two projects in department 4. This is easier.
select e.fname, e.lname from EMPLOYEE e
where
(select count(*) from WORKS_ON w, PROJECT p where e.ssn = w.essn and w.pno = p.pnumber and p.dnum = 4) >= 2;
GROUP BY and HAVING
Suppose we want (as in Query 26) the project number, name, and employee
count for each project with more than two employees. We introduce the having clause, which works like the where clause but is used to select entire groups.
select p.pnumber, p.pname, count(*)
from PROJECT p, WORKS_ON w
where p.pnumber = w.pno
group by p.pnumber, p.pname
having count(*) > 2;
Note that we first apply the where clause to build the "ungrouped" table, and then apply the group by clause to divide the rows into groups, and finally restrict the output to only certain groups by using the having clause.
Here's another example (query 27): for each project, retrieve the
project number & name, and the number of dept-5 employees who work
on it.
select p.pnumber, p.pname, count(*)
from PROJECT p, WORKS_ON w, EMPLOYEE e
where p.pnumber = w.pno and w.essn = e.ssn and e.dno = 5
group by pnumber, pname;
This is actually wrong in the book, if you interpret it as wanting all projects even if there are zero
dept-5 employees at work on them. With the default data, project 30 is
in this category. The problem is similar to that solved by outer joins:
if there are no dept-5 employees on the project, then no join row is
created at all. Here's a working version:
select p.pnumber, p.pname,
(select count(*) from WORKS_ON w, EMPLOYEE e
where p.pnumber = w.pno and w.essn = e.ssn and e.dno = 5)
from PROJECT p
;
This technique can often be used to write a query involving group by as one without.
The book's own example of this is Query 28: give the number of
employees in each department whose salaries exceed $40,000, but only
for departments with more than five employees. In order to get a
nonempty result, let's change this to $29999 and >2 employees. The
book gives the following wrong version first:
select d.dname, count(*)
from department d, employee e
where d.dnumber = e.dno and e.salary > 29999
group by d.dname
having count(*) > 2;
and then the working version (actually, the book's version of this is broken)
select d.dnumber, d.dname, count(*)
from department d, EMPLOYEE e
where d.dnumber = e.dno and e.salary > 29999 and
e.dno in (select e.dno from employee e group by e.dno having count(*) > 2)
group by d.dnumber, d.dname;
The Administration department has more than two employees, but only one
of them maks over $29999, so in the first version, this department does
not get counted. Recall that the where
clause is evaluated at the beginning; if we put the salary constraint
there, then we ignore employees making less than that, and thus
potentially undercount some departments.
Triggers
Chapter 4 had two uses of the CHECK statement:
Attribute checks
example: for the department table,
dnumber INT not null check (dnumber >0 and dnumber < 100)
These are applied whenever an individual attribute is set.
Here's another important attribute check (MySQL syntax)
create table employee2 (
name varchar(50) not null,
ssn char(9) primary key not null,
dno INT not null check (dno > 0 and dno < 100),
check (ssn RLIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]')
) engine innodb;
Some that should fail:
insert into employee2 values ('joe hacker', '12345678O', 37);
insert into employee2 values ('jack hack', '12345689', 25);
They do not fail! MySQL ignores checks (even on dno). Actually, MySQL
does not even check that there are parentheses around the body of the
check condition.
How about postgres?
create table employee2 (
name varchar(50) not null,
ssn char(9) primary key not null,
dno INT not null check (dno > 0 and dno < 100),
check (ssn SIMILAR TO '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]')
);
This should work.
You can also do this in postgres:
create table employee3 (
name varchar(50) not null,
ssn char(9) primary key not null,
dno INT not null check (dno > 0 and dno < 100),
check (char_length(ssn) = 9)
);
record (tuple) checks
create table DEPARTMENT (
dname ....
dnumber ...
mgr_ssn ...
mgr_start DATE not null,
dept_creation DATE not null,
check (dept_creation <= mgr_start)
);
These are applied whenever a particular row is updated or added.
You can do more.