Comp 353/453: Database Programming, Corboy 523, 7:00 Thursdays
Week 3, Feb 2
Read in Elmasri & Navathe (EN)
- Chapter 3, The Relational Data Model ...., section 3: Update Operations
- Chapter 4: Basic SQL
- Chapter 5: 5.1.1, 5.1.2, 5.1.6, 5.1.7, 5.1.8
Homework 1: be wary of unicode characters in your SQL; unicode ' is not the same as '.
Also, 4.12(c) cannot be done using the methods of Chapter 4.
MySQL under linux
The linux filesystem is case sensitive. If you want table names to be case-insensitive, edit /etc/mysql/my.cnf to add the following in the mysqld section:
lower_case_table_names=1
Then restart your mysql server (eg with /etc/init.d/mysql restart).
What joins do
Consider our earlier join example:
select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;
You can think of e and d as cursors,
representing rows in the respective tables employee and department.
Linear search through the employee table , as in the SQL example
select e.lname, e.salary from employee e where e.salary >= 30000;
might be represented by the following java-like loop, showing a linear search through the table:
for (e in employee) {
if (e.salary >= 30000) {
print(e.lname, e.salary)
}
}
Now let's write this for the join; here we get a nested loop and
quadratic runtime (specifically, O(nm), where n=employee.size() and
m=department.size())
for (e in employee) {
for (d in department) {
if (e.dno == d.dnumber) {print (e.fname, e.lname, d.dname);}
}
}
Clearly, for triple joins over large databases, performance is an issue. This is addressed two ways: query optimization, to eliminate "obvious" inefficiencies, and internal indexes. The primary key defines a candidate for indexing, though not all indexing is on key fields.
Self-join
Suppose we want to print a list of all employees and their supervisors. Here's the approach we might take:
select e.fname, e.lname, s.fname, s.lname
from employee e, employee s
where e.super_ssn = s.ssn;
The idea here, in terms of cursors, is for e to traverse linearly the
employee table. For each employee, we get e.super_ssn, and then use s
to traverse the same table, looking for a row with s.ssn = e.super_ssn.
This particular example is much more readable if we relabel the supervisor columns:
select e.fname, e.lname, s.fname AS super_fname, s.lname AS super_lname
from employee e, employee s
where e.super_ssn = s.ssn;
Note that the AS here is not optional, and plays a rather different
role than the AS that can be used in the from section (from employee AS
e, employee AS s).
WHERE conditions
In many cases, the where
conditions are either join conditions or are simple Boolean expressions
about a single row. Sometimes, however, the question being aske relates
to other rows, as in "List employees who worked on more than one
project"; if we form the join table of employees and their projects we
still cannot answer this row-by-row. Another query might be to list
employees who are part of some set of employees whose salaries add up
to exactly 100,000.
The answer last week to the employees-who-worked-on-multiple-projects query was a "triple" join with inequality as one of the join conditions:
select e.fname, e.lname from employee e, works_on w1, works_on w2
where e.ssn = w1.essn and e.ssn = w2.essn and w1.pno <> w2.pno;
Most joins (but not this one!) are equijoins;
that is, joins involving an equality relationship. Inequalities can be
harder to understand. Recall that the above version introduces many duplicates, one for each way of choosing two unequal projects. "Select distinct" is better.
Actually, this particular example is easier done using a subquery and the count operator (which we'll cover below):
select e.fname, e.lname from employee e
where (select count(w.pno) from works_on w where w.essn = e.ssn) > 1
In general, WHERE conditions that just select rows of a table
satisfying a given Boolean condition are fundamentally trivial. Joins
in WHERE conditions are a little less trivial, and WHERE conditions
that select a record (or part of a record) depending on other records
in the table tend to be the hardest. Example: list all students who got
straight A's; this lists the student_id values from grade_report where
all of the matching rows have an A in the grade column. This last
category often involves grouping, or subqueries, or some other advanced
feature.
Why is SQL hard?
Basically, it's hard because it is a non-imperative language: you don't issue a sequence of commands. Instead, you write a single big expression. It's hard to visualize big expressions, and hard to test them incrementally. Nested queries do seem to help in this regard.
(Actually, the data types of SQL are tables, and you can in fact save a table in a "table variable". But this is rare.)
Perhaps because of this non-imperative nature of the language, it is
relatively easy to read several SQL examples and still have trouble
knowing where to start when trying to write one from scratch.
Examples from week 2:
Find all employees who have worked on project 2:
This one involves a single join between employees (to get the names)
and the works_on table. From the works_on table we can easily get the
SSNs of the employees who worked on project 2; we need to join with
table employee to convert SSNs to names.
If we change the query to
Find all employees who have worked on a project in Stafford
we can do the following. Using the project table, we can get the
project number from the location. Using the works_on table, we can find
the employees (by essn) who have worked on that project. And using the
employee table, we can convert to employee names:
select e.fname, e.lname
from project p, works_on w, employee e
where p.plocation = 'stafford' and p.pnumber = w.pno and w.essn = e.ssn;
Note that only data from table employee is actually part of the select statement.
Query 2: For every project located in Stafford, list the project number, the
controlling department number, and the department manager's lname,
address, bdate.
From the project table we can get the project location, number and department:
select p.pnumber, p.dnum from project p where p.plocation='stafford';
To get the department-manager information (all from one table) we needed two joins: first a join between the project and department tables (on
dnum/dnumber) to get the department mgr_ssn, and then between department and employee (on mgr_ssn and ssn).
select p.pnumber, p.dnum, e.lname, e.address, e.bdate
from project p, department d, employee e
where p.dnum = d.dnumber and d.mgr_ssn = e.ssn and p.plocation = 'Stafford';
Note that no department fields are included in the query results. The works_on table is not involved.
Query 4: Here we combine both the above: we want all the projects
in which Wong has participated, either as employee or as manager. For
the employee case, we have a join between works_on and employee. For
the manager, we have a join between projects and departments, to get
the dept manager of the controlling department, and then with the
employee table.
We did the same thing with the self-join example above: given an employee e, we need to look up e.mgr_ssn in the employee table.
Joins again
Whenever you list multiple tables in the from
line, you have a join. Without a join condition between two tables, you
will have N×M records in the result; this is seldom what you want. With
more than two tables, you don't need a join between every pair; you
just need a "chain" of joins uniting everything. In general, for N
tables you need N−1 join conditions.
Join conditions in the where
clause serve to join different tables; they can be identified by the
format table1.attribute1 = table2.attribute2 (sometimes other
relational operations than = are involved, but those ar rare). Every
other condition in the where clause has to do with selecting a subset
of the total number of rows, thus discarding some rows from
consideration. It is possible to see join conditions this way, too,
discarding rows from the cross product of the two tables. But this is not terribly helpful in practice.
All this is one reason some prefer the newer "from table1 join table2 on join-condition".
Postgres
In the 1970's, the Ingres database project began at Berkeley. Ingres originally stood for Interactive graphics retrieval system, though shortly thereafter the "graphics" part became much less important.
After Ingres (which does still exist) came Postgres, for "post-ingres".
Later, to emphasize its support for SQL, Postgres became PostgreSQL.
Note that it is virtually impossible to pronounce PostgreSQL using the
"sequel" form for SQL; if you want a short form, use "Postgres".
A decade ago, MySQL was the "fast one" and Postgres was the one that
accurately implemented the SQL standard. That's less true now, though
MySQL is still missing intersection (and a few other things).
Postgres can be found at http://postgresql.org.
After you install the software, you still have to create appropriate accounts; the postgres account is the only one installed by default. Here's what I had to do:
1. Log in as user "postgres", and run the psql command. The "bash>" prompt below is the shell prompt (as user postgres); the "post=> " prompt is the psql SQL prompt.
bash> sudo -u postgres psql
post=> create user 'pld';
2. From the shell, still as user postgres
bash> createdb pld
Then, as user pld, the command "psql" works. If you created several
databases above, eg "pldoffice", then the command to connect to
postgres using the pldoffice database would be psql pldoffice. Note that I did not actually figure out how to give myself the right to create new databases.
Some postgres commands
Your psql prompt should show you the current database.
SHOW ALL -- shows status variables only
\d same as mysql show tables;
\d employee same as mysql describe employee;
Here's another way to list your tables:
select table_name from information_schema.tables where table_schema='public' and table_type='BASE TABLE';
Intersect example:
This now actually works, under postgres, but I made a mistake:
old:
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn and w.pno = 3);
What is the problem?
new:
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2)
intersect
(select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 3);
See below for another way to do this.
I made the remark that this would have been trivial to implement in MySQL. That's only partly true. It is trivial to implement the intersect operator in order to get the correct resuts. It is nontrivial to implement it in a way that does reasonable-quality query optimization.
Outer joins
What if we want to list all employees and, if they have a qualification, that too? See the table qualification:
+-----------+-----------+
| ssn | qual |
+-----------+-----------+
| 123456789 | bachelors |
| 333445555 | masters |
| 987654321 | masters |
| 666884444 | doctorate |
| 987987987 | ccie |
| 888665555 | doctorate |
+-----------+-----------+
This is created with
create table qualification (
ssn char(9) primary key not null,
qual varchar(20) not null,
foreign key (ssn) references employee(ssn)
);
Standard join syntax:
select e.lname, q.qual from employee e, qualification q where e.ssn = q.ssn;
Alternative join syntax:
select e.lname, q.qual from employee e join qualification q on e.ssn = q.ssn;
"Left [outer] join" syntax
select e.lname, q.qual from employee e left join qualification q on e.ssn = q.ssn;
What is happening here? The point is that with the left join, then all rows of the left table are included, even when there is no match in the right table.
Let's add an extra line to the qualification table:
insert into qualification values ('345456567',
'hacker'); // no such existing employee!
Now let's do the left and right joins (replace left with right, or reverse the order).
select e.lname, q.qual from employee e right join qualification q on e.ssn = q.ssn;
select e.lname, q.qual from qualification q left join employee e on e.ssn = q.ssn;
What changed?
Finally, note that we can do this to make the output prettier:
select e.lname, case when q.qual is null then '' else q.qual end as emp_qual
from employee e left join qualification q on e.ssn = q.ssn;
What's going on here is that we're using an expression in the select part, rather than a field name, and using the
case when boolean_expression then expression1 else expression2 end
as our expression. We're also giving the column a name, "emp_qual".
While expressions in select columns are often important, note that some kinds of output formatting are better done in the front end rather than in SQL itself.
To delete the row,
delete from qualification where qual='hacker';
Null
As we have discussed, NULL can represent any of:
- A value we just do not know at the moment
- A value that is more-or-less-permanently unavailable (eg NULL for a middle initial)
- A value that is not applicable, given some other attributes of the row
SQL supports the operators ____ IS NULL and ____ IS NOT NULL, for
testing whether a value is null. However, the = and <> operators
do not work. Similarly, an
equality comparison between two null fields of two records (eg as part
of a join) will fail. The reasoning is that null is not to be thought
of as a specific value, but rather as meaning "unknown". If two people
have unknown phone numbers (both NULL), that does not mean they have the same number!
Compare
- select lname from employee where super_ssn is null;
- select lname from employee where super_ssn is not null;
- select lname from employee where super_ssn = null;
- select lname from employee where not super_ssn = null;
- select lname from employee where super_ssn <> null;
The latter returns no records, even if you change it to "super_ssn
<> null" or "not (super_ssn = null)". This is because any
comparison involving NULL behaves as if it returns a third truth value,
"unknown".
This "unknown" value propagates upwards in any boolean expression it
is part of, according to appropriate truth-table rules; in particular,
"true and unknown" is "unknown", and "not unknown" is "unknown". The
only times
the end result is not also "unknown" are "true or unknown", which
evaluates to "true", and "false and unknown", which evaluates to
"false". Consider
select * from employee where dno = 5 or super_ssn = null;
This returns the employees for which dno=5; in this case the Boolean expression becomes "true or unknown", which is "true".
Nested queries
Queries can be nested; this is how we will construct complex queries
where inclusion of a record in the result depends on other records in
the database.
When queries are nested, we use the term outer query for the enclosing query, and inner query (queries) for the enclosed query/queries.
Single-row nested queries
The simplest case is perhaps when only a single row and column is
retrieved; that value can be used in a boolean or arithmetic expression. (We should really call these single-value queries.)
For example, we can find the ssn of the big boss (ie the one with no supervisor) with
select e.ssn from employee e where e.super_ssn is null;
Now let's find everyone who works for the boss:
select e.fname, e.lname from employee e where e.super_ssn =
(select e.ssn from employee e where e.super_ssn is null);
Note that we reused the letter e above; the two uses (inner and outer) don't interact at all.
Exercise: how can we do this as a "traditional" query?
Now let's find everyone who makes at least half of what the boss makes:
select e.fname, e.lname from employee e where 2*e.salary >
(select e.salary from employee e where e.super_ssn is null);
How about:
select e.fname, e.lname from employee e where e.salary >
(select e.salary from employee e where e.super_ssn is null)/2;
For any given employee with ssn X, we can find the ssn of the employee's manager with
select e1.super_ssn from employee e1 where e1.ssn = X;
We can find the supervisor's salary with
select e1.salary from employee e1 where e1.ssn = X's super_ssn
Here's a query with two inner single-row queries. Note that Wong
and Wallace are the two managers reporting directly to Mr Borg. My main
point here is that the results of the inner selects can be used inside
arithmetic expressions.
select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.6*(
(select e.salary from employee e where e.lname = 'Wong')
+
(select e.salary from employee e where e.lname = 'Wallace')
);
Now let's list all employees who make at least 70% of what their manager earns.
select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.7* (select e1.salary from employee e1 where e1.ssn = e.super_ssn);
The query above is correlated: the inner query refers to the outer one. In the earlier examples, we could do query evaluation from the inside out; here we can not. Instead, our evaluation strategy turns out to be more like an inner loop.
Example of failure if multiple values are returned:
select e.fname, e.lname from employee e where e.salary >
(select e.salary from employee e where e.lname LIKE 'W%');
Finally, we can use the COUNT() function. First, an example of its
standalone use, to get the number of employees in department 5; the count(*) means to count all records returned.
select count(*) from employee e where e.dno = 5;
Next, here's an example where the inner query contains count(). This lists all departments with at least 3 members:
select d.dname from department d
where (select count(*) from employee e where e.dno = d.dnumber) >=3;
Finally, here's an example where we use the count in the select part, to list all departments and their number of employees:
select d.dname, (select count(*) from employee e where e.dno = d.dnumber) AS dept_size
from department d;
Note the use of AS here, to name the computed column; it is very different from the (usually omitted) use of AS in the from section.
This is not the preferred way to use count(), though; more on the COUNT() function follows.
Grouping functions
Sometimes the inner query returns multiple rows (usually numeric), but we reduce those rows to a single value using one of the aggregate functions (E&N §5.1.7):
count
sum
avg
max
min
For example, here is how we'd find all employees who make at least 25% of their department total:
select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.25 *(select sum(e1.salary) from employee e1 where e1.dno = e.dno);
Notice we can not do:
select e.fname, e.lname, e.salary from employee e
where e.salary >= 0.25 * sum(select e1.salary from employee e1 where e1.dno = e.dno);
Let's take a look at E&N 4.12(c):
For each section taught by Professor
King, retrieve the course number, semester, year, and number of
students who took the section
The easiest way to do this is to include in the select part of the outer query a computed column of the form
(select count(*) from grade_report g where g.section_identifier = s.section_identifier) AS NUM_STUDENTS
You can do this query a few other ways, too. First, note the following peculiar query:
select s.course_number, s.semester, s.year
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';
It is peculiar because the involvement of grade_report is unclear; all it seems to do (all it does
do) is to generate a separate copy of the fields
⟨s.course_number,s.semester,s.year⟩ for each separate student (each
record in grade_report for that particuar section_identifier).
One way to use COUNT(*) which is technically wrong but which works in MySQL is:
// wrong
select s.course_number, s.semester, s.year, COUNT(*) as NUM_STUDENTS
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King';
The right way to do this is to use the GROUP BY option. We have
multiple records for each ⟨s.course_number,s.semester,s.year⟩; we group
these common fields together (by s.section_identifier) for counting:
select s.course_number, COUNT(*) as NUM_STUDENTS
from section s, grade_report g
where g.section_identifier = s.section_identifier and s.instructor='King'
GROUP BY s.course_number;
We'll look into the GROUP BY in more detail later; for now, note that if aggregation is used in some of the select columns, then all the non-aggregated ("ordinary") columns must be included in the GROUP BY clause.