Database week 2

Comp 353/453: Database Programming, Corboy L08, 4:15 Mondays

Week 2

Read in Elmasri & Navathe (EN)

Chapter 1, Databases and Database Users
Chapter 2, Database System Concepts an Architecture, Section 1 on Data Models, Schemas and Instances
Chapter 3, The Relational Data Model ...., now including section 3
Chapter 4: Basic SQL

Homework 1

Due Fri, Feb 15

E&N chapter 4, page 112 exercises 4.7, 4.10abc and 4.12abcd.

For 4.7, note that there are seven referential-integrity (foreign-key) constraints listed:

1.	BOOK.Publisher_name	⟶	PUBLISHER.Name
2.	BOOK_AUTHORS.Book_id	⟶	BOOK.Book_id
3.	BOOK_COPIES.Book_id	⟶	BOOK.Book_id
4.	BOOK_COPIES.Branch_id	⟶	LIBRARY_BRANCH.Branch_id
5.	BOOK_LOANS.Book_id	⟶	BOOK.Book_id
6.	BOOK_LOANS.Branch_id	⟶	LIBRARY_BRANCH.Branch_id
7.	BOOK_LOANS.Card_no	⟶	BORROWER.Card_no

For 4.10 and 4.12, note that your queries should work even if additional data is added to the tables.

Also, your answers should be in the form of a single query; do not retrieve a value with one query and then manually plug that value into a second query.

Every table referred to in the FROM clause should be named, that is:
select e.lname, d.dname from employee e, department d ...

All SQL should be entered in a format that I can copy and paste directly into a mysql window. In particular, make sure there are no leading tabs or spaces, and that all "prompt" characters such as "->" have been stripped out. Make sure you use regular quotation marks, not special unicode quotation marks. Unicode-quoted ‛foo’ is not the same as 'foo'.

Also, 4.12(c) cannot be done using the methods of Chapter 4.

Here is the 4.12(c) question:

For each section taught by Professor King, retrieve the course number, semester, year and number of students who took the section.

To do this, we need to use the count(*) function. You may just drop the number of students from your answer, or (better) give a record for each student. The latter approach yields

+---------------+----------+------+----------------+
| course_number | semester | year | student_number |
+---------------+----------+------+----------------+
| MATH2410 | Fall | 2007 | 8 |
| MATH2410 | Fall | 2007 | 9 |
+---------------+----------+------+----------------+

A First Look At Constraints

Databases involve several kinds of constraints:

0. Fundamental rules enforced by the table structure

1. Type constraints on column values

2. Key constraints: in each table, any given declared key can occur in only one row. This is not a property of a table at a particular moment, but rather a rule that says that a second record with a duplicate key can never be added.

3. Referential-integrity constraints: Consider again the university database on p 8 of EN. In the SECTION table, the key is Section_identifier. A typical constraint would be that we are not allowed to have a record that has a Course_number value that is not found in the COURSE table.

Similarly, in the GRADE_REPORT table, the key is ⟨Student_number, Section_identifier⟩. We do not allow rows for which Student_number does not refer to a valid entry in the STUDENT table, or for which Section_identifier does not refer to a valid entry in the SECTION table.

Note the implicit constraint of having the STUDENT table, with key Student_number. Suppose we add an address column. Because the key for the table is Student_number, the key constraint means we cannot have the same student with two addresses.

Now suppose instead we redesigned the database to include the student name and address directly into the GRADE_REPORT table, along with the Student_number. This is slightly wasteful of space, but that is a minor concern. The more serious problem is that this now allows inconsistency: we can have student 17 have two different addresses in two different records, or even two different names.

This is the data-consistency problem that the relational model was so successful at solving. If tables are designed appropriately, the potential for duplicate entries is simply eliminated.

Here's another view of inconsistency. Suppose we have a table with records like the following::

    Purchase, CustomerName, CustomerAddr

This allows us to have two records for two different purchases, one with
        purchase1, Peter, LakeShore
and one with
        purchase2, Peter, WaterTower
Oops! Peter is now getting duplicate mdailings (one forwarded).

So the relationships are "factored" (more on this later) into multiple tables so as to prevent this. In this case, we would want a table of CustomerName and CustomerAddr (perhaps also with CustomerID), and a second table with columns Purchase and CustomerID. Now it is not possible to have one customer with two addresses.

Core relational-database concept:
Divide data into multiple Tables (mathematically, RELATIONS)
in such a way that there is ENOUGH division to enforce consistency
and NOT TOO MUCH division to cause problems with reassembly

Major concepts for this course (references are to Parts of EN)

The basic relational-db model, and SQL (part 2)
DB modeling (part 3)
DB programming, JDBC (part 5)
Normalization & other design theory (part 6)
internal DB structures (part 7)
optimization & tuning (part 8)
concurrency, locking, and transactions (part 9)

Relational DBs (chapter 3)

A relation is any set of tuples
The set of all possible tuples is the CROSS PRODUCT of some domains
        col1 × col2 × col3 × ... × colN
Example: A = {1,2,3}, B = {x,y} C = {1,2}
        A × B
        A × C
        < relation in A × C
        <= relation in A × C

DB relations are not defined by rule, but by tabulation!

Given attribute sets A₁, A₂, ..., A_n, a relation is a subset of the cartesian product A₁×A₂×...×A_n; that is, a set of tuples ⟨a₁,a₂,...,a_n⟩ where each a_i∈A_i. These tuples may also be called records.

Relations in a DB are represented as tables. EN also uses the term relation state to refer to a specific set of records in a table.

STUDENT table, EN p 63

Name	SSn	Home_phone	Address	Office_phone	Age	GPA
Benjamine Bayer	305-61-2435	817-373-1616	2918 bluebonnet Lane	NULL	19	3.21
Chung-cha Kim	381-62-1245	817-375-4409	125 Kirby Road	NULL	18	2.89
Dick Davidson	422-11-2320	NULL	3452 Elgin Road	817-749-1253	25	3.53
Rohan Panchal	489-22-1100	817-376-9821	265 Lark Lane	817-749-6492	28	3.93
Barbara Benson	533-69-1238	817-839-8461	7384 Fontana Lane	NULL	19	3.25

Note the

attributes
tuples

Also note that some entries are NULL. This means undefined or not available or not known; unfortunately, these three options are not synonymous or interchangeable. NULL values are essential, but they do introduce some complications. The first is that records with NULL entries are not in fact elements of A₁×A₂×...×A_n; they are elements of
(A₁ ∪ {NULL}) × (A₂ ∪ {NULL}) × ... × (A_n ∪ {NULL})
EN also gives an alternative definition of a relation, as a set of maps from the attribute set to the set of attribute values, where the attribute set is essentially the set of names of columns. With this approach, a null entry is represented by a partial map, undefined for some attributes.

Note that we must be careful when comparing null values: if two people have NULL as their Office_phone, it does not mean they have the same phone! Worse, we simply do not know if the NULL means we don't know their phone, or if they simply do not have one.

Section 3.2: DB Constraints

inherent constraints
schema-based (explicit) constraints

Domain constraints

NOT NULL
primary key
foreign key
CHECK
Triggers

semantic constraints (business logic)

The COMPANY database

Schema: EN p 71
Data: EN p 72
basic table definitions
table definitions plus data, with ALTER
Spreadsheet
zip file

The tables are (with primary key in bold):
    employee:       name, ssn, bdate, address, sex, salary, super_ssn, dno
    department:     dname, dnumber, mgr_ssn, mgr_start
    dept_locations: dnumber, dlocation
    project:             pname, pnumber, plocation, dnum
    works_on:        essn, pno, hours
    dependent:       essn, dependent_name, sex, bdate, relationship

As we saw briefly last week, the join is the operation of creating all records merged from two (or more) tables, where one attribute of one table is required to match a corresponding attribute of another. Usually, but not always, the column-matching is based on equality of corresponding attributes.

Examples:
University:

Listing all students in Section 112
Printing all of each student's grades, by joining the Student_number fields of STUDENT and GRADE_REPORT
Printing all sections including Course_name, joining COURSE and SECTION on the Course_number field

The first example we did last week as follows:

select s.name from student s, grade_report gr
where s.student_number = gr.student_number and gr.section_identifier = 112;

Company:

Printing the name and address of all employees who work in the 'Research' dept (Query1 on EN p 100; uses employee and department tables)
Printing the project number, dept number, and the dept manager's name, for all projects located in 'Stafford' (Query 2 on EN p 100; uses project and department tables)
Printing each employee's name and his or her supervisor's name (Query 8, EN p 101; uses employee table joined to itself)

Demos of these

A full Cartesian product would be denoted in SQL by, eg,
select * from employee, department;
where there is no WHERE clause establishing a relation between the two tables.

Note on join: it is conceptually somewhat inefficient.
Lots of behind-the-scenes optimization makes it fast.

Loading the database

If you have a command-line window, and want to load up a file of SQL statements,

use "cd" in the shell window to move to the directory where your files are located
Start mysql
From within mysql, type source filename;

Alternatively, you can paste the entire file into a mysql command window (you will probably need the menu paste command, as CNTL-V is likely to mean something else). It helps if there are no tab characters in the file.

If you have a file mydata.text consisting of tab-separated fields, with \N for NULL entries, you can load it as follows:

LOAD DATA LOCAL INFILE '/home/pld/453/mydata.text' INTO TABLE mytable;

You might also need LINES TERMINATED BY '\r\n'

Note that using "null" or "NULL" in place of \N does NOT have the desired effect. Also, '\n' is newline, not the same as '\N'

More on keys

A KEY is any set of columns that is guaranteed to uniquely determine a row.
Primary Key: the key the database developer thinks is most important; usually a single attribute if there is one
Composite Key: multiple columns (eg the GRADE_REPORT table). Note that there is no single-column key here.

Secondary Keys: other column combinations that are keys, but not the one intended

Note that keys are not properties of particular tables, but rather of the "table schema". They represent design constraints.

Foreign Keys

Key constraints are one kind of constraint. What about the use of dno in table Employees? It should be clear that we probably want all dnos to refer to real departments, that is, to match an existing dnumber in table Department. This is done through a foreign key constraint: we declare in table Employee that attribute dno is a foreign key: a reference to a key of another table. The declaration looks like

foreign key (dno) references department(dnumber)

We can also give this constraint a name:
constraint FK_department_employee foreign key (dno) references department(dnumber)

Note that the constraint here applies to adding (or updating) records in Employee, and also to deleting records in Department.

Foreign keys are notorious for introducing circularity problems. What happens if we enforce foreign keys and try to load the database as originally written? With all tables empty, we can't add any employee because no dno value we might use would appear in the empty Department table, and we cannot add a department because the mgr_ssn is a foreign key referencing Employee.

In principle, there is no reason to require that the foreign key actually be a key in the other table. In practice, it almost always is; in database schemas generated through so-called Entity-Relationship diagrams it always is.

Life can be quite frustrating if you forget the circularity problem. Once two tables with a "foreign-key embrace" (each uses the other as a foreign key) are created, they can be difficult to remove. Sometimes one has to resort to dropping the entire database. If I load my file company.brokenalter.text, these all fail:

drop table employee;
drop table department;
alter table employee drop foreign key dno;

The last one above, however, fails simply because it is wrong; I shouldn't have used the column name (dno), but rather the constraint name (in this case, department_ibfk_1). Some people like to name foreign-key constraints for this reason.

To drop table T, you must first drop all foreign key constraints from other tables to T.

The command

alter table department drop foreign key department_ibfk_1

does work. The constraint name can be determined from show create table department.

Another thing that does work is this:

set foreign_key_checks=0;

Other constraints

Examples might be that the employee salary is in a given range, or is less than the supervisor's salary, etc. These non-key constraints can sometimes be addressed at the DB level, but are often easier to address at the level of the user interface to the DB; that is, the web interface can contain the necessary business logic.

Insert, Delete and Update

Examples:

insert into department values ('Sales', 6, '888665555', '2012-01-19');
update department set mgr_ssn = '333445555' where dnumber = 6;
delete from department where dnumber = 6;

See operations in section 3.3:

Inserts: must not violate key constraints or fkey constraints

Deletes: can never violate key constraints, only fkey constraints in another table.

Updates:
    of key value; can violate key constraints, or fkey constraints in another table (eg changing dnumber)
    of fkey value: can violate fkey constraints
    non-key/fkey operations are safe

Demos

(There are more demos below in the Chapter 4 material)

insert into employee values ('ralph', 'j', 'wiggums', '123456798', NULL, NULL, 'M', 9999, '333445555', 7);
delete from employee where ssn = '123456798';

update employee set dno = 107 where ssn = '333445555'; // originally dno = 5 here
update employee set dno = 5 where ssn = '333445555';

Lists in Relational DBs

Suppose I want a table of ⟨instructor, list-of-classes⟩ and want to enter ⟨pld, [317,343,353]⟩
If you look through your oracle manual (let alone your MySQL manual), you won't find it.

Lists are non-atomic and are a problem.

The Relational way:
Create table CLASSES_TAUGHT: ⟨instructor, class⟩
Add entries
    ⟨pld, 317⟩
    ⟨pld, 343⟩
    ⟨pld, 353⟩
    ⟨sam, 101⟩
    ⟨sam, 202⟩
    ⟨sam, 303⟩
etc

What is the key here?

Lists are sometimes said to be multiple-valued entries; that is, in the first case the value of the Courses column for row "pld" consists of the multiple values 317, 343, and 353.

This doesn't make them any less problematic for relational databases.

Elimination of lists is basically the process of putting a table into FIRST NORMAL FORM (1NF).

A few complaints about "personal" DBs, like MS Access:

no transaction-processing across network; instead, the whole DB is copied
no table locking; allows for inconsistent updates
client failure does not result in unlocking
no log to ensure atomicity in a bunch of transactions (debit from account1, credit to account2)

Some of these may have been addressed by now.

Objects

By now, everyone does object-oriented programming. The basic relational model doesn't include that. Do we need it?

Some people think we do, and so there are "object-oriented" databases.

But note that method calls are tricky; SQL isn't about that.

Furthermore, we can usually simulate the data extension portion of an object by adding a table. Suppose we have a table PERSONS, and want to create a subobject STUDENTS, with additional student-specific fields MAJOR and ENROLL_DATE. We can accomplish this by creating a table STUDENTS, and filling it with records of the form

⟨person_id, major, enroll_date⟩

where the person_id field is a reference to the remaining data about the student in the PERSONS table. Mathematically, this STUDENTS table looks like PERSONS × MAJORS × DATE, but with only a key to the PERSONS table.

SQL (E&N chapter 4)

Scenarios where we use the join:

1. The "extension" case: table2 in some sense extends information that could have been put in table1 except for redundancies. For each row in table1, find the unique matching row in table2. The join column in this case is likely to be a key in table2, and declared as a foreign key in table1. Example:

select e.fname, e.lname, d.dname from employee e, department d where e.dno = d.dnumber;

2. The "relationship" case: table2 defines some relationship; it has a dual-column key and the join column is a key in table1 and one of the key columns in table2. Example:

select e.fname, e.lname, w.pno from employee e, works_on w where e.ssn = w.essn;

Actually, this case is exactly the same as the first, with table1 and table2 reversed. Except that we tend to think of the employee table as representing things (people), and the works_on table as representing relationships (who works on what).

We can also have multiple matches in table2, or zero matches. Multiple matches occur in the second example (technically, multiple rows in table1 in the first example match the row in table2 with dnumber=5).

create table: office example, with and without changes to allow foreign keys

We can also create a new table from an old one (or ones):

create table emp_names as select fname,lname,ssn from employee;
select * from emp_names;
drop table emp_names;

SQL constraints: Section 4.2

constraint specification
constraints can be given names; there is some debate as to whether this is helpful.

Four constraints:

column type
not null
primary key
foreign key

Foreign key constraints: what happens if we insert the dname,dnumber for a department, then add employees, then do one of the following:

delete the dept record
update it, changing the dnumber from 4 to 40

From E&N p 95, but with differently named constraints

create table department(
    dname varchar(20),
    dnumber int,
    mgr_ssn char(9) not null default = '888665555',
    mgr_start date,
    constraint dept_primary_key primary key (dnumber),
    constraint dept_secondary_key unique (dname),
    constraint dept_mgr_foreign_key
       foreign key (mgr_ssn) references employee(ssn)
       on delete set default
       on update cascade
);

Named constraints mean that you will be told what constraint is violated. This is less helpful than it seems. Named constraints also means that constraints can be deleted by name.

But consider the on_delete / on_update clauses. What these mean is that if you delete an employee e from the employee db who is a dept manager (ie that employee's ssn is used as department.mgr_ssn for some row), then the mgr_ssn is set to the default value of 888665555. Furthermore, if you update an employee entry to correct a department manager's ssn, then that corrected value is cascaded into the appropriate row(s) of the department table.

The SQL select-from-where statement: 4.3

select columns from tables where boolean condition selecting rows

The tricky part is that boolean conditions can involve joins.

Also, note that SQL is a "nonimperative" language: it has (essentially) no assignment operator. (Ok, you can save tables as intermediate results, but you should not do that.)

Insert

insert into tablename values (f1val, f2val, ... , fnval);
You can also name columns:
insert into employee(fname, lname, ssn, dno) values ('john', 'smith', 345678912, 4);

Finally, instead of values you can provide a select query that returns a set of rows from some existing table, with matching "type signature" (matching column types).

Inserts can violate all four constraints.

insert into employee(fname,lname,ssn,dno)
values ('robert', 'Hatcher', '456789123', 2);

insert into employee(fname,lname,dno)
values ('robert', 'Hatcher', 5);

Update
Basic form:
    update employee
    set salary = salary*1.10
    where dno = 5;

Updates can also violate all four constraints, though the not null constraint can be violated only by setting a column to null, and the primary key constraint can only be violated by updating the primary key. Foreign key constraints are relatively easy to violate, though.

Delete
delete from tablename where boolean condition selecting rows

Here, we can only violate foreign key constraints, eg by deleting from table department the department of some existing worker.

SQL data types

INT (etc)
VARCHAR(n)
Boolean
DATE (& TIME, etc)
blobs

The DATE format is yyyy-mm-dd, eg 2013-01-27. This is actually standardized by ISO 8601. The United States usage, mm-dd-yyyy, was apparently used in England as well until the 20th century, when England switched to dd-mm-yyyy for greater consistency with the rest of Europe. See http://www.antimoon.com/forum/t1952.htm.

Dates in MySQL are entered as if they were strings, but they are most certainly not stored that way. Try inserting a record with date '07-04-1980'; MySQL will give up and set the date (silently!) to zero.

SQL examples

Some queries are from Ramakrishnan & Gehrke 2002 but are modified to be appropriate for E&N's Office database; the others are from E&N directly.

Find all employees with salary >= 30000

select * from employee where salary >= 30000;

select e.fname, e.lname, e.salary from employee e where e.salary >=30000;

Note my use of the e name. I recommend this style for readabililty. You can think of e as a variable that ranges over all the rows, though it looks syntactically more like a table variable.

Query 2 of E&N

For every project located in Stafford, list the project number, the controlling department number, and the department manager's lname, address, bdate.

Solution as written:

select pnumber, dnum, lname, address, bdate
from project, department, employee
where dnum = dnumber and mgr_ssn = ssn and plocation = 'Stafford';

Note that this is a dual-join example.

Compare with the version with names:

select p.pnumber, dnum, e.lname, e.address, e.bdate
from project p, department d, employee e
where p.dnum = d.dnumber and d.mgr_ssn = e.ssn and p.plocation = 'Stafford';

Find the list of supervisor ssns (that is, omit duplicates)

select distinct e.super_ssn from employee e;

Looking at the output, it might be good to add "where e.super_ssn is not null"

Find all employees who have worked on project 2

(Note that in our works_on table, no employee works on the same project more than once. Compare the solution to obtaining a list of employees who have worked on project 2, where the table is works_on_by_week: ⟨essn, pno, week, hours⟩

select e.fname, e.lname from employee e, works_on w where e.ssn = w.essn and w.pno = 2;

Note this is a join example (not our first). We could also have written it:

select e.fname, e.lname from employee e join works_on w on e.ssn = w.essn where w.pno = 2;

Find all employees who have worked >20 hours on a project

select e.fname, e.lname, w.pno, w.hours from employee e, works_on w
where e.ssn = w.essn and w.hours >20;

E&N Query 4: List all project numbers of projects involving employee 'Smith', either as manager or worker

Note that projects are managed by the manager of the project's controlling department.

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = 'Smith')
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = 'Smith');

Smith, however, hasn't managed any projects. It works better for Wong:

set @NAME = 'Wong';

(select distinct p.pnumber
from project p, department d, employee e
where d.dnumber = p.dnum and d.mgr_ssn = e.ssn and e.lname = @NAME)
union
(select distinct p.pnumber
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and e.lname = @NAME);