Database week 1

Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays

Week 1

Read in Elmasri & Navathe (EN)

Chapter 1
Chapter 2 Section 1 on DB schemas
Chapter 3, Sections 1 & 2

Bookstore notice

Installation assignment:
MySQL on your own machine

Oracle v MySQL
Terms; demo; SQL
Consistency
Relational "lists"

Oracle v MySQL: two databases.
Oracle has more proprietary extensions. Some are not necessarily a good idea.

Note that, with Oracle's purchase of Sun Microsystems, Oracle now owns MySQL. Some people think this is why they bought Sun. (Others think it was for Java.)

Oracle tends to support more features than MySQL. Transactions are one, though these are now beginning to appear in MySQL. Another is key constraints, and foreign key constraints in particular. However, consider the following:

Application programs generally have to explicitly check for foreign key values, anyway; otherwise, it is difficult to respond naturally to a user error.
Now that Oracle has acquired the InnoDB database engine for MySQL, foreign key constraints are implemented whenever the DB administrator chooses the InnoDB engine.

Note also that MySQL is dual-licensed: as open-source (which means your additions must also be open-source) and as a proprietary product (meaning you can develop proprietary extensions to MySQL, provided you pay the licensing fee).

Installing Oracle 11g (more later)

bandwidth to download 2+ GB
disk space

SQL*Plus v web interface v MySQL
localhost:8080/apex
browse around to find SQL

Some notes on installing mysql

These assume that you will use a command-line interface.

Basically, you need to create a user, and then a database for that user. A database is a collection of tables.

Step 1: do the actual installation (eg apt-get install mysql-server, or run the .exe install file). During this step, you should be asked for a master password.

Downloads for the MySQL Community Server are at http://www.mysql.com/downloads/mysql. At a minimum, you will need MySQL Server.

Step 2. log in using the master password.

mysql --user root -p
(give master password)

Step 3. Now new users and databases can be created, eg pld. (note password is in quotes)

mysql> create user pld identified by 'cranberry';
mysql> create database plddb;
mysql> grant all on plddb.* to 'pld';

We have created a user pld, and a database plddb, and given user pld full control over plddb, and only over plddb. If you wanted to allow pld to create databases, you could use

mysql> grant all on *.* to 'pld';

You can also grant individual actions:

grant SELECT, INSERT, UPDATE, DELETE, CREATE, DROP ON *.* TO 'pld';

Step 4. Now log in as pld (after logging out as root):

mysql --user=pld --password=cranberry plddb

But let's assume you are user pld, in unix-world, and that you don't wish to type your password on the command line. Then use:

mysql -p plddb

Step 5: Create a table (while logged in as pld). Java/C users take note: the comma is a separator in the field declarations, not a terminator (ie there is no comma after the final declaration), and the field name precedes the type name.

mysql> create table foo (id int, name varchar(100));

mysql> insert into foo values (1, 'peter');
mysql> insert into foo values (2, 'paul');

mysql> show tables;

mysql> select * from foo;

The last is our first example of a SQL select statement.

History: relational databases came out of theoretical work of Edgar Codd, in a 1970 paper titled "A Relational Model of Data for Large Shared Data Banks". At the time, the approach was seen as too computationally expensive. But by the early 1980s, both IBM DB2 and (an early version of) Oracle were out, and it was becoming generally recognized that relational databases were the wave of the future.

Codd's relational approach solved a major data consistency problem. We will look at that below.

While there are now other types of databases, notably object-oriented databases, it is fair to say that no later development has offered a convincing solution to a general problem that is intractable with the relational model.

SEQUEL / SQL

In the early 1970's, IBM introduced SEQUEL: Structured English Query Language. For trademark reasons, the name had to be changed; IBM chose SQL (Structured Query Language). To this day, many people prefer to pronounce SQL as "sequel" rather than as "ess queue ell"; note that in this case the "sequel" pronunciation actually came first.

The "official" pronunciation for MySQL is "My ess queue ell", but they tolerate "mysequel". The following is from the MySQL reference manual, §1.3.2:

MySQL is named after co-founder Monty Widenius's daughter, My.

SQL is fundamentally command-line
Rather non-procedural language!

Wrapping a good Human-Computer Interface (HCI) around SQL is good practice (well-nigh essential); generally this is done in some high-level language with a SQL interface (eg JDBC). However, actual queries are still done in SQL.
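As a rough sketch of what "a high-level language with a SQL interface" looks like, the following uses Python's built-in sqlite3 module. This is an assumption made so the example is self-contained; it stands in for JDBC and a MySQL server. The function name find_by_name is hypothetical. Note that the query itself is still plain SQL.

```python
import sqlite3

def find_by_name(conn, name):
    """Return all rows of foo with the given name; the caller never writes SQL."""
    # The ? placeholder lets the driver handle quoting of user input.
    cur = conn.execute("select id, name from foo where name = ?", (name,))
    return cur.fetchall()

# In-memory database standing in for a real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table foo (id int, name varchar(100))")
conn.execute("insert into foo values (1, 'peter')")
conn.execute("insert into foo values (2, 'paul')")

rows = find_by_name(conn, "paul")
```

A user interface would call find_by_name() with whatever the user typed, and format the returned rows for display.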

Terminology

Student database, EN p 6

Tables
Rows == Records
Columns == attributes
Keys
Database == set of tables
Relationships between tables

Keys:

STUDENT: Student_number
COURSE: Course_number
SECTION: Section_identifier (this is not completely obvious)
GRADE_REPORT: ⟨Student_number, Section_identifier⟩

Tables with two columns as key often serve to represent relationships.

Basic operations: querying & updating

SQL: SELECT * FROM [tablename]
select * from STUDENT where Student_number = 17;
select * from SECTION where 101 <= Section_identifier and Section_identifier <= 115;

The second query above is somewhat misleading, as Section_identifiers are usually not intended for comparison.
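To try the two queries without a MySQL server, here is a small sketch using Python's built-in sqlite3 module (an assumption, chosen only for convenience). The table rows are illustrative placeholders, not a copy of the EN data, and only two columns per table are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table STUDENT (Student_number int, Name varchar(30))")
conn.execute("create table SECTION (Section_identifier int, Course_number varchar(10))")
conn.executemany("insert into STUDENT values (?, ?)",
                 [(17, "Smith"), (8, "Brown")])
conn.executemany("insert into SECTION values (?, ?)",
                 [(85, "MATH2410"), (102, "CS3320"), (112, "MATH2410"), (135, "CS3380")])

# First query: look up one student by key.
student_17 = conn.execute(
    "select * from STUDENT where Student_number = 17").fetchall()

# Second query: a range comparison on Section_identifier.
sections = conn.execute(
    "select * from SECTION "
    "where 101 <= Section_identifier and Section_identifier <= 115 "
    "order by Section_identifier").fetchall()
```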

Figure 2.1 on EN p 32: a schema for the students database

Some other issues

Database tables are usually "self-describing", in that the description of a table's columns, and often its relationships to other tables, is stored in the database along with the data. At a minimum, columns have types.

The table definitions, column names and their types are generally known as the database schema.

Database tables can be changed without the need to recompile programs using that table: new columns can be added, or entire new tables. Essentially this is because DBMS queries are interpreted, and table columns are identified by name rather than offset. (C programs that access record fields do need recompilation if fields are changed.)
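The point about columns being identified by name can be checked with a small experiment. The sketch below assumes Python's built-in sqlite3 module in place of MySQL; the email column is a hypothetical addition. The query string is written before the table is altered and is reused unchanged afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table foo (id int, name varchar(100))")
conn.execute("insert into foo values (1, 'peter')")

# This query was "written" before the schema change and is never edited.
QUERY = "select name from foo where id = 1"
before = conn.execute(QUERY).fetchall()

# Add a new column; existing rows get NULL for it.
conn.execute("alter table foo add column email varchar(100)")
after = conn.execute(QUERY).fetchall()

all_columns = conn.execute("select * from foo").fetchall()
```

The old query returns the same answer before and after the change; only a "select *" sees the new column.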

Databases often support different views, perhaps for users with different privilege levels. A view may be a subset of the original set of columns, or it may contain some computed columns in lieu of the original columns. Excluding columns such as social_security_num or salary is relatively common.
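Here is a minimal sketch of a view that hides a sensitive column. It assumes Python's built-in sqlite3 module rather than MySQL, and the employee table, its rows, and the view name employee_public are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (name varchar(30), dept varchar(20), salary int)")
conn.execute("insert into employee values ('alice', 'Research', 50000)")
conn.execute("insert into employee values ('bob', 'Sales', 40000)")

# The view exposes only the non-sensitive columns.
conn.execute("create view employee_public as select name, dept from employee")

cur = conn.execute("select * from employee_public order by name")
view_columns = [d[0] for d in cur.description]   # column names seen through the view
view_rows = cur.fetchall()
```

A user who is granted access to employee_public, but not to employee, can be queried against as if the salary column did not exist.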

Transactions are sets of related updates, eg removing money from one account and adding it to another, or perhaps dropping one class and adding another (sadly, remarkably many student DBMSs lack the latter transaction operation). EN defines a transaction to be the result of an executing program rather than a set of related updates; this latter definition might include adding multiple new rows to one table.

Transaction processing must satisfy the isolation property, that transactions appear to execute in isolation from one another, and the atomicity property, which says that each transaction is completely executed or not executed at all. (Transactions must meet the ACID test: Atomicity, Consistency, Isolation, Durability.)
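Atomicity can be sketched as follows, assuming Python's built-in sqlite3 module in place of MySQL. The account table, the CHECK constraint forbidding negative balances, and the helper names transfer and balances are hypothetical. The credit is deliberately done first, so that a failed debit leaves something to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table account (
    name varchar(10) primary key,
    balance int check (balance >= 0))""")
conn.execute("insert into account values ('A', 1000)")
conn.execute("insert into account values ('B', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("update account set balance = balance + ? where name = ?",
                         (amount, dst))
            conn.execute("update account set balance = balance - ? where name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a dict, eg {'A': 1000, 'B': 0}."""
    return dict(conn.execute("select name, balance from account"))
```

If the debit would overdraw the source account, the CHECK constraint rejects it and the earlier credit is rolled back along with it.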

Here is a simple failure of isolation: two transactions are transferring money to other accounts.

transaction 1                transaction 2     Acct A   Acct B   Acct C
                                               1000     0        0
get value of Acct A: $1000                     1000     0        0
add $100 to B                                  1000     100      0
                             Debit A by $50    950      100      0
Store $1000 - $100 in A                        900      100      0
                             Credit C by $50   900      100      50

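Transaction 1 overwrites A using the stale $1000 it read earlier, so transaction 2's $50 debit is lost: the three accounts now total $1050 instead of $1000. Multi-user DBMSs need concurrency control to prevent this.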
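The schedule in the table above can be replayed step by step. The sketch below is plain Python with no database involved; it only mimics the order of reads and writes, and the function names are hypothetical.

```python
def run_interleaved():
    """Replay the interleaved schedule from the table, one step per line."""
    acct = {"A": 1000, "B": 0, "C": 0}
    t1_saved_a = acct["A"]          # T1: get value of Acct A ($1000)
    acct["B"] += 100                # T1: add $100 to B
    acct["A"] -= 50                 # T2: debit A by $50
    acct["A"] = t1_saved_a - 100    # T1: store $1000 - $100 in A (stale value)
    acct["C"] += 50                 # T2: credit C by $50
    return acct

def run_serial():
    """The same two transfers, with T1 finishing before T2 starts."""
    acct = {"A": 1000, "B": 0, "C": 0}
    acct["A"] -= 100                # T1: debit A
    acct["B"] += 100                # T1: credit B
    acct["A"] -= 50                 # T2: debit A
    acct["C"] += 50                 # T2: credit C
    return acct
```

Comparing the totals of the two results shows the effect of the lost debit: the serial run conserves money, while the interleaved run does not.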

Section 1.6 of EN talks about some of the advantages of having a central DBMS rather than individual DBMSs maintained by each administrative group. This is pretty much a settled issue now, though it does mean that the "natural owners" of data in an organization (eg registration and records at Loyola, for student registration data) will not in fact own that data.

Constraints

Databases involve several kinds of constraints:

0. Fundamental rules enforced by the table structure

1. Type constraints on column values

2. Key constraints: in each table, any given declared key can occur in only one row. This is not a property of a table at a particular moment, but rather a rule that says that a second record with a duplicate key can never be added.

3. Referential-integrity constraints: Consider again the students database on p 8 of EN. In the SECTION table, the key is Section_identifier. A typical constraint would be that we are not allowed to have a record that has a Course_number value that is not found in the COURSE table.

Similarly, in the GRADE_REPORT table, the key is ⟨Student_number, Section_identifier⟩. We do not allow rows for which Student_number does not refer to a valid entry in the STUDENT table, or for which Section_identifier does not refer to a valid entry in the SECTION table.
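Key constraints and referential-integrity constraints can both be seen in action with a short experiment. This sketch assumes Python's built-in sqlite3 module rather than MySQL with InnoDB; the "pragma foreign_keys" line is SQLite-specific, the tables are cut down to the columns needed, and the rows and the helper try_insert are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")   # SQLite leaves FK checks off by default
conn.execute("create table COURSE (Course_number varchar(10) primary key)")
conn.execute("""create table SECTION (
    Section_identifier int primary key,
    Course_number varchar(10) not null references COURSE(Course_number))""")
conn.execute("insert into COURSE values ('CS1310')")
conn.execute("insert into SECTION values (92, 'CS1310')")

def try_insert(sql):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

dup_key_ok = try_insert("insert into SECTION values (92, 'CS1310')")    # duplicate key
bad_ref_ok = try_insert("insert into SECTION values (93, 'XX9999')")    # no such course
good_ok = try_insert("insert into SECTION values (119, 'CS1310')")      # valid row
```

The first two inserts are refused by the database itself; an application would still want to catch the error and report it to the user in a sensible way.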

Note the implicit constraint of having the STUDENT table. Suppose we add an address column. Because the key for the table is Student_number, we cannot have the same student with two addresses.

Suppose instead we enter the student name and address directly into the GRADE_REPORT table, along with the Student_number. This is slightly wasteful of space, but that is a minor concern. The serious problem is that this now allows inconsistency: we can have student 17 have two different addresses in two different records, or even two different names.

This is the data-consistency problem that the relational model was so successful at solving. If we design our tables appropriately, the potential for duplicate entries is simply eliminated.

Here's another view of inconsistency. Suppose we have a table with records like the following:

    Purchase, CustomerName, CustomerAddr

This allows us to have two records for two different purchases, one with
        Peter, LakeShore
and one with
        Peter, WaterTower
Oops! Peter is now getting duplicate mailings (one forwarded).

So the relationships are "factored" (more on this later) into multiple tables so as to prevent this. In this case, we would want a table of CustomerName and CustomerAddr (perhaps also with CustomerID), and a second table with columns Purchase and CustomerID. Now it is not possible to have one customer with two addresses.
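The two-table version might be sketched as follows, assuming Python's built-in sqlite3 module in place of MySQL. The table names customer and purchase and the purchase values are hypothetical; the column names follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table customer (
    CustomerID int primary key,
    CustomerName varchar(30),
    CustomerAddr varchar(50))""")
conn.execute("create table purchase (Purchase varchar(30), CustomerID int)")

conn.execute("insert into customer values (1, 'Peter', 'LakeShore')")
conn.execute("insert into purchase values ('textbook', 1)")
conn.execute("insert into purchase values ('notebook', 1)")

# Reassemble the original wide records with a join.
merged = conn.execute("""
    select p.Purchase, c.CustomerName, c.CustomerAddr
    from purchase p, customer c
    where p.CustomerID = c.CustomerID
    order by p.Purchase""").fetchall()

# A change of address touches exactly one row, so every purchase sees it.
conn.execute("update customer set CustomerAddr = 'WaterTower' where CustomerID = 1")
addresses = conn.execute("""
    select distinct c.CustomerAddr
    from purchase p, customer c
    where p.CustomerID = c.CustomerID""").fetchall()
```

After the update, both purchases report the same single address, because the address is stored in only one place.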

Core concept:
Divide data into multiple Tables (mathematically, RELATIONS)
in such a way that there is ENOUGH division to ensure consistency
and NOT TOO MUCH division to cause problems with reassembly

Major concepts for this course (references are to Parts of EN)

The basic relational-db model, and SQL (part 2)
DB modeling (part 3)
DB programming, JDBC (part 5)
Normalization & other design theory (part 6)
internal DB structures (part 7)
optimization & tuning (part 8)
concurrency, locking, and transactions (part 9)

Relational DBs (chapter 3)

A relation is any set of tuples
The set of all possible tuples is the CROSS PRODUCT of some domains
        col1 × col2 × col3 × ... × colN
Example: A = {1,2,3}, B = {x,y}, C = {1,2}
        A × B
        A × C
        < relation in A × C
        <= relation in A × C

DB relations are not defined by rule, but by tabulation!

Given attribute sets A₁, A₂, ..., A_n, a relation is a subset of the cartesian product A₁×A₂×...×A_n; that is, a set of tuples ⟨a₁,a₂,...,a_n⟩ where each a_i∈A_i. These tuples may also be called records.
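The small example with A, B and C can be worked out mechanically. This is a plain-Python sketch using sets of tuples; the variable names are hypothetical, and the last relation is an arbitrary tabulated subset chosen for illustration.

```python
from itertools import product

A = {1, 2, 3}
B = {"x", "y"}
C = {1, 2}

A_x_B = set(product(A, B))    # all pairs <a, b>
A_x_C = set(product(A, C))    # all pairs <a, c>

# Two relations in A x C, defined here by rule.
less_than = {(a, c) for (a, c) in A_x_C if a < c}
less_equal = {(a, c) for (a, c) in A_x_C if a <= c}

# A database relation is instead simply tabulated: any subset will do.
tabulated = {(1, "y"), (3, "x")}
```

Each relation is a subset of the corresponding cross product, whether it is defined by a rule or just listed.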

Relations in a DB are represented as tables. EN also uses the term relation state to refer to a specific set of records in a table.

STUDENT table, EN p 63

Name	SSn	Home_phone	Address	Office_phone	Age	GPA
Benjamin Bayer	305-61-2435	817-373-1616	2918 Bluebonnet Lane	NULL	19	3.21
Chung-cha Kim	381-62-1245	817-375-4409	125 Kirby Road	NULL	18	2.89
Dick Davidson	422-11-2320	NULL	3452 Elgin Road	817-749-1253	25	3.53
Rohan Panchal	489-22-1100	817-376-9821	265 Lark Lane	817-749-6492	28	3.93
Barbara Benson	533-69-1238	817-839-8461	7384 Fontana Lane	NULL	19	3.25

Note the

attributes
tuples

Also note that some entries are NULL. This means undefined or not available or not known; unfortunately, these three options are not synonymous or interchangeable. NULL values are essential, but they do introduce some complications. The first is that records with NULL entries are not in fact elements of A₁×A₂×...×A_n; they are elements of
(A₁ ∪ {NULL}) × (A₂ ∪ {NULL}) × ... × (A_n ∪ {NULL})
EN also gives an alternative definition of a relation, as a set of maps from the attribute set to the set of attribute values, where the attribute set is essentially the set of names of columns. With this approach, a null entry is represented by a partial map, undefined for some attributes.

Note that we must be careful when comparing null values: if two people have NULL as their Office_phone, it does not mean they have the same phone! Worse, we simply do not know if the NULL means we don't know their phone, or if they simply do not have one.
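SQL's treatment of NULL comparisons can be checked directly. The sketch below assumes Python's built-in sqlite3 module in place of MySQL, and uses only two columns and three rows from the STUDENT table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table STUDENT (Name varchar(30), Office_phone varchar(12))")
conn.executemany("insert into STUDENT values (?, ?)", [
    ("Chung-cha Kim", None),            # Python's None is stored as NULL
    ("Barbara Benson", None),
    ("Dick Davidson", "817-749-1253"),
])

# Pairs of distinct students with "the same" office phone.
same_phone = conn.execute("""
    select s1.Name, s2.Name from STUDENT s1, STUDENT s2
    where s1.Name < s2.Name and s1.Office_phone = s2.Office_phone""").fetchall()

# Finding missing phones needs IS NULL; "= NULL" is never true.
eq_null = conn.execute(
    "select Name from STUDENT where Office_phone = NULL").fetchall()
is_null = conn.execute(
    "select Name from STUDENT where Office_phone is null order by Name").fetchall()
```

The two students with NULL office phones are not reported as sharing a phone, which matches the caution above.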

Section 3.2: DB constraints

inherent constraints
schema-based (explicit) constraints

Domain constraints

NOT NULL
primary key
foreign key
CHECK
Triggers

semantic constraints (business logic)

The OFFICE database

Schema: EN p 71
Data: EN p 72
Spreadsheet

A JOIN is the operation of creating all records merged from two (or more) tables, where one attribute of one table is required to be equal to a corresponding attribute of another.

Examples:
Students:

Printing all of each student's grades, by joining the Student_number fields of STUDENT and GRADE_REPORT
Printing all sections including Course_name, joining COURSE and SECTION on the Course_number field

Office:

Printing the name and address of all employees who work in the 'Research' dept (Query 1 on EN p 100)
Printing the project number, dept number, and the dept manager's name, for all projects located in 'Stafford' (Query 2 on EN p 100)
Printing each employee's name and his or her supervisor's name (Query 8, EN p 101)

A full Cartesian product would be denoted in SQL by, eg,
select * from employee, department;
where there is no WHERE clause establishing a relation between the two tables.
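The difference between the full product and a join shows up clearly in the row counts. This sketch assumes Python's built-in sqlite3 module rather than MySQL; the two tables are cut down to two columns each, and the column names and rows are illustrative rather than the EN schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table department (dnumber int, dname varchar(20))")
conn.execute("create table employee (lname varchar(20), dno int)")
conn.executemany("insert into department values (?, ?)",
                 [(5, "Research"), (4, "Administration")])
conn.executemany("insert into employee values (?, ?)",
                 [("Smith", 5), ("Wong", 5), ("Zelaya", 4)])

# No WHERE clause: every employee is paired with every department.
product_rows = conn.execute("select * from employee, department").fetchall()

# The join keeps only the pairs whose department numbers agree.
join_rows = conn.execute("""
    select lname, dname from employee, department
    where dno = dnumber order by lname""").fetchall()
```

With 3 employees and 2 departments the product has 3 × 2 rows, while the join has one row per employee.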

Note on join: it is conceptually somewhat inefficient.
Lots of behind-the-scenes optimization makes it fast.

More on keys

A KEY is any set of columns that is guaranteed to uniquely determine a row.
Primary Key: the key designated as the main identifier for a table; often a single column
Composite Key: a key made up of multiple columns (eg the GRADE_REPORT table)

Secondary Keys: other column combinations that are keys, but not the one designated as primary

Lists in Relational DBs

Suppose I want a table of ⟨instructor, list-of-classes⟩ and want to enter ⟨pld, [317,343,353]⟩
If you look through your Oracle manual, you won't find it.

Lists are non-atomic and are a problem.

Better way:
Create table CLASSES_TAUGHT: ⟨instructor, class⟩
Add entries
    ⟨pld, 317⟩
    ⟨pld, 343⟩
    ⟨pld, 353⟩
    ⟨sam, 101⟩
    ⟨sam, 202⟩
    ⟨sam, 303⟩
etc

What is the key here?

Lists are sometimes said to be MULTIPLE-VALUED ENTRIES; that is, in the first case the value of the COURSES column for row "pld" consists of the multiple values 317, 343, and 353.

This still doesn't make them anything other than a bad idea.

Elimination of lists is basically the process of putting a table into FIRST NORMAL FORM (1NF).
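Here is a sketch of the list-free version, assuming Python's built-in sqlite3 module in place of MySQL. Declaring the pair ⟨instructor, class⟩ as the key is one reasonable answer to the question above, on the assumption that each pair is entered only once. The helper classes_for is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table CLASSES_TAUGHT (
    instructor varchar(10),
    class int,
    primary key (instructor, class))""")   # assumed key: the pair of columns
conn.executemany("insert into CLASSES_TAUGHT values (?, ?)", [
    ("pld", 317), ("pld", 343), ("pld", 353),
    ("sam", 101), ("sam", 202), ("sam", 303),
])

def classes_for(instructor):
    """Rebuild the 'list' for one instructor with an ordinary query."""
    rows = conn.execute(
        "select class from CLASSES_TAUGHT where instructor = ? order by class",
        (instructor,))
    return [r[0] for r in rows]
```

The list still exists, but only as the result of a query; no single table entry holds more than one value.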

A few complaints about "personal" DBs, like MS Access:

no transaction-processing across network; instead, the whole DB is copied
no table locking; allows for inconsistent updates
client failure does not result in unlocking
no log to ensure atomicity in a bunch of transactions (debit from account1, credit to account2)

Some of these may have been addressed by now.

Objects

By now, everyone does object-oriented programming. The basic relational model doesn't include that. Do we need it?

Some people think we do, and so there are "object-oriented" databases.

But note that method calls are tricky; SQL isn't about that.

Furthermore, we can usually simulate the data extension portion of an object by adding a table. Suppose we have a table PERSONS, and want to create a subobject STUDENTS, with additional student-specific fields MAJOR and ENROLL_DATE. We can accomplish this by creating a table STUDENTS, and filling it with records of the form

⟨person_id, major, enroll_date⟩
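The PERSONS/STUDENTS idea might look like this in practice. The sketch assumes Python's built-in sqlite3 module in place of MySQL; the name column in PERSONS and all the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table PERSONS (person_id int primary key, name varchar(30))")
conn.execute("""create table STUDENTS (
    person_id int primary key,
    major varchar(20),
    enroll_date date)""")
conn.execute("insert into PERSONS values (1, 'peter')")
conn.execute("insert into PERSONS values (2, 'paul')")
conn.execute("insert into STUDENTS values (2, 'CS', '2010-08-30')")

# A "student object" is a PERSONS row plus its matching STUDENTS row.
students = conn.execute("""
    select p.person_id, p.name, s.major, s.enroll_date
    from PERSONS p, STUDENTS s
    where p.person_id = s.person_id""").fetchall()
```

Persons who are not students simply have no row in STUDENTS, so they do not appear in the joined result.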

SQL (EN chapter 4)

create table: office example, with and without changes to allow foreign keys

SQL data types

INT (etc)
VARCHAR(n)
Boolean
DATE
blobs

SQL constraints: Section 4.2

The SQL select-from-where statement: 4.3

Examples

RELATIONAL ALGEBRA: we'll only do a quick look at this

Union: R1 ∪ R2, where R1, R2 are subsets of the same cross product. This is logically like OR

Intersection: Logically like AND

Difference: Logical "R1 but NOT R2"

    Proj2003 - Proj2002, other order

PROJECTION: ignoring some designated columns (= vertical slices)

SELECTION:

choosing subset of entries
corresponds to logical operations defining the subset
Sel(Parts: cost > 10.0)

Product:
    can be of two tables as well as two domains

Assignment
    syntactic sugar

Natural Join (equijoin):
    Records from Table1 and Table2 where Table1.colN = Table2.colM
    Join as Product,Selection,Projection
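The "Join as Product, Selection, Projection" recipe can be spelled out on relations represented as Python sets of tuples. This is only a sketch: the helper names select and project are hypothetical, the two small relations are illustrative, and columns are referred to by position rather than by name.

```python
from itertools import product

# Relations as sets of tuples.
SECTION = {(85, "MATH2410"), (92, "CS1310")}            # <section_id, course_number>
COURSE = {("MATH2410", "Discrete Mathematics"),         # <course_number, course_name>
          ("CS1310", "Intro to Computer Science")}

def select(rel, pred):
    """SELECTION: keep the tuples satisfying pred."""
    return {t for t in rel if pred(t)}

def project(rel, cols):
    """PROJECTION: keep only the listed column positions."""
    return {tuple(t[i] for i in cols) for t in rel}

# Product: concatenate every SECTION tuple with every COURSE tuple.
prod = {s + c for s, c in product(SECTION, COURSE)}
# Selection: SECTION's course_number (col 1) equals COURSE's (col 2).
matched = select(prod, lambda t: t[1] == t[2])
# Projection: drop the duplicated join column.
joined = project(matched, (0, 1, 3))
```

A real DBMS does not build the full product first; this is the conceptual definition, not the implementation.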

Division: A / B: which values in A appear paired with EVERY row of B
A records are presumed "extensions of" B records
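Division is the least familiar of these operations, so here is a small sketch in plain Python for the simplest case: A is a set of pairs ⟨x, y⟩ and B is a set of y values. The function name divide and the data are hypothetical.

```python
def divide(a, b):
    """A / B for A a set of pairs <x, y> and B a set of values y.

    Returns every x that is paired in A with EVERY y in B.
    """
    xs = {x for (x, _) in a}
    return {x for x in xs if all((x, y) in a for y in b)}

# Hypothetical data: <instructor, class> pairs and a set of required classes.
TAUGHT = {("pld", 317), ("pld", 343), ("pld", 353), ("sam", 317), ("sam", 101)}
REQUIRED = {317, 343}

result = divide(TAUGHT, REQUIRED)
```

Here the result answers "which instructors have taught every one of the required classes?"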

examples:
Note that solution is a SEQUENCE of steps

RELATIONAL CALCULUS

Like Algebra, except that there is no sequence; solutions are a single expression.

SQL is based on relational calculus

    (column list) : r1 in Table1, r2 in Table2 ... and EXPRESSION

Examples:
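One way to get a feel for the calculus pattern is to notice that a Python set comprehension has nearly the same shape: a column list, some range variables, and a condition, all in a single expression. The sketch below is plain Python, not SQL, and the two small relations are hypothetical.

```python
# Hypothetical relations as lists of dicts, so that columns are named.
STUDENT = [{"Student_number": 17, "Name": "Smith"},
           {"Student_number": 8, "Name": "Brown"}]
GRADE_REPORT = [{"Student_number": 17, "Section_identifier": 112, "Grade": "B"},
                {"Student_number": 8, "Section_identifier": 85, "Grade": "A"},
                {"Student_number": 8, "Section_identifier": 92, "Grade": "A"}]

# (column list) : r1 in Table1, r2 in Table2 ... and EXPRESSION
a_grades = {(s["Name"], g["Section_identifier"])
            for s in STUDENT
            for g in GRADE_REPORT
            if s["Student_number"] == g["Student_number"] and g["Grade"] == "A"}
```

There is no sequence of intermediate tables here; the whole query is one expression, just as in a SQL select-from-where statement.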