Comp 353/453: Database Programming, LT 410, 4:15 Tuesdays
Week 1
Read in Elmasri & Navathe (EN)
- Chapter 1
- Chapter 2 Section 1 on DB schemas
- Chapter 3, Sections 1 & 2
Bookstore notice
Installation assignment:
MySQL on your own machine
Oracle v MySQL
Terms; demo; SQL
Consistency
Relational "lists"
Oracle v MySQL: two databases.
Oracle has more proprietary extensions. Some are not necessarily a good idea.
Note that, with Oracle's purchase of Sun Microsystems, Oracle now
owns MySQL. Some people think this is why they bought Sun. (Others
think it was for Java.)
Oracle tends to support more features than MySQL. Transactions are one, though this is now beginning to appear in MySQL. Another is key constraints, and foreign key constraints in particular. However, consider the following:
- Application programs generally have to explicitly check for
foreign key values, anyway; otherwise, it is difficult to respond
natually to a user error.
- Now that Oracle has acquired the innodb database engine for MySQL, foreign key constraints are now implemented whenever the DB adminstrator chooses the innodb engine.
Note also that MySQL is dual-licensed: as open-source (which means your
additions must also be open-source) and as a proprietary product
(meaning you can develop proprietary extensions to MySQL, provided you
pay the licensing fee).
Installing Oracle 11g (more later)
- bandwidth to download 2+ GB
- disk space
SQL*Plus v web interface v MySQL
localhost:8080/apex
browse around to find SQL
Some notes on installing mysql
These assume that you will use a command-line interface.
Basically, you need to create a user, and then a database for that user. A database is a collection of tables.
Step 1: do the actual installation (eg apt-get install mysql-server, or
run the .exe install file). During this step, you should be asked for a
master password.
Downloads for the MySQL Community Server are at http://www.mysql.com/downloads/mysql. At a minimum, you will need MySQL Server.
Step 2. log in using the master password.
mysql --user root -p
(give master password)
Step 3. Now new users and databases can be created, eg pld. (note password is in quotes)
mysql> create user pld identified by 'cranberry';
mysql> create database plddb;
mysql> grant all on plddb.* to 'pld';
We have created a user pld, and a database plddb, and given user
pld full control over plddb, and only over plddb. If you wanted to
allow pld to create databases, you could use
mysql> grant all on *.* to 'pld';
You can also grant individual actions:
grant SELECT, INSERT, UPDATE, DELETE, CREATE, DROP ON *.* TO 'pld';
Step 4. Now log in as pld (after logging out as root):
mysql --user=pld --password=cranberry plddb
But let's assume you are user pld, in unix-world, and that you don't wish to type your password on the command line. Then use:
mysql -p plddb
Step 5: Create a table (while logged in as pld). Nava/C users take note: the comma is a separator in the field declarations, not a terminator (ie there is no comma after the final declaration, and the field name precedes the type name.
mysql> create table foo (id int, name varchar(100));
mysql> insert into foo values (1, 'peter');
mysql> insert into foo values (2, 'paul');
mysql> show tables;
mysql> select * from foo;
The last is our first example of a SQL select statement.
History: relational databases came out of theoretical work of Edgar
Codd, in a 1970 paper titled A Relational Model of Data for Large Shared Data Banks. At the time, the approach was seen as too computationally
expensive. But by 1980, both IBM DB2 and (early version of) Oracle were
out, and it was becoming generally recognized that relational databases were the wave of the future.
Codd's relational approach solved a major data consistency problem. We will look at that below.
While there are now other types of databases, notably object-oriented
databases, it is fair to say that no later development has offered a
convincing solution to a general problem that is intractable with the
relational model.
SEQUEL / SQL
In the early 1970's, IBM introduced SEQUEL: Structured English Query
Language. For trademark reasons, the name had to be changed; IBM chose
SQL (Structured Query Language). To this day, many people prefer to
pronounce SQL as "sequel" rather than as "ess queue ell"; note that in
this case the "sequel" pronunciation actually came first.
The "official" pronounciation for MySQL is "My ess queue ell", but they
tolerate "mysequel". The following is from the MySQL reference manual,
§1.3.2:
MySQL is named after co-founder Monty Widenius's daughter, My.
SQL is fundamentally command-line
Rather non-procedural language!
Wrapping a good Human-Computer Interface (HCI) around SQL is good practice (well-nigh essential); generally this is done in some
high-level language with a SQL interface (eg JDBC). However, actual
queries are still done in SQL.
Terminology
Student database, EN p 6
- Tables
- Rows == Records
- Columns == attributes
- Keys
- Database == set of tables
- Relationships between tables
Keys:
- STUDENT: Student_number
- COURSE: Course_number
- SECTION: Section_identifier (this is not completely obvious)
- GRADE_REPORT: ⟨Student_number, Section_identifier⟩
Tables with two columns as key often serve to represent relationships.
Basic operations: querying & updating
SQL: SELECT * FROM [tablename]
select * from STUDENT where Student_number = 17;
select * from SECTION where 101 <= Section_identifier and Section_identifier <= 115;
The second query above is somewhat misleading, as Section_identifiers are usually not intended for comparison.
Figure 2.1 on EN p 32: a schema for the students database
Some other issues
Database tables are usually "self-describing", in that the table
description and/or relationships to other tables is often embedded in
the table description. At a minimum, columns have types.
The table definitions, column names and their types is generally know as the database schema.
Database tables can be changed without the need to recompile programs using that
table: new columns can be added, or entire new tables. Essentially this
is because DBMS queries are interpreted, and table columns are identified by name rather than offset. (C programs that access record fields do need recompilation if fields are changed.)
Databases often support different views,
perhaps for users with different privilege levels. A view may be a
subset of the original set of columns, or it may contain some computed
columns in lieu of the original columns. Excluding columns such as social_security_num or salary is relatively common.
Transactions are sets of
related updates, eg removing money from one account and adding it to
another, or perhaps dropping one class and adding another (sadly,
remarkably many student DBMSs lack the latter transaction operation).
EN defines a transaction to be the result of an executing program
rather than a set of related updates; this latter definition might
include adding multiple new rows to one table.
Transaction processing must satisfy the isolation property, that transactions appear to execute in isolation from one another, and the atomicity property,
which says that each transaction is completely executed or not executed
at all. (Transactions must meet the ACID test: Atomicity, Consistency,
Isolation, Durability.)
Here is a simple failure of isolation: two transactions are transferring money to other accounts.
transaction 1
|
transaction 2
|
Acct A
|
Acct B
|
Acct C
|
|
|
1000
|
0
|
0
|
get value of Acct A: $1000
|
|
1000
|
0
|
0
|
add $100 to B
|
|
1000
|
100
|
0
|
|
Debit A by $50
|
950
|
100
|
0
|
Store $1000 - $100 in A
|
|
900
|
100
|
0
|
|
Credit C by $50
|
900
|
100
|
50
|
Multi-user DBMSs need concurrency control.
Section 1.6 of EN talks about some of the advantages of having a
central DBMS rather than individual DBMSs maintained by each
administrative group. This is pretty much a settled issue now, though
it does mean that the "natural owners" of data in an organization (eg
registration and records at Loyola, for student registration data) will
not in fact own that data.
Constraints
Databases involve several kinds of constraints:
0. Fundamental rules enforced by the table structure
1. Type constraints on column values
2. Key constraints: in each table, any given declared key
can occur in only one row. This is not a property of a table at a
particular moment, but rather a rule that says that a second record
with a duplicate key can never be added.
3. Referential-integrity constraints: Consider again the students database on p 8 of EN. In the SECTION table, the key is
Section_identifier. A typical constraint would be that we are not allowed to have a record that has a
Course_number value that is not found in the COURSE table.
Similarly, in the GRADE_REPORT table, the key is ⟨Student_number,
Section_identifier⟩. We do not allow rows for which Student_number does
not refer to a valid entry in the STUDENT table, or for which
Section_identifier does not refer to a valid entry in the SECTION table.
Note the implicit constraint of having the STUDENT table. Suppose we
add an address row. Because the key for the table is Student_number, we
cannot have the same student with two addresses.
Suppose instead we enter the student name and address directly into the
GRADE_REPORT table, along with the Student_number. This is slightly
wasteful of space, but that is a minor concern. The serious problem is
that this now allows inconsistency: we can have student 17 have two different addresses in two different records, or even two different names.
This is the data-consistency
problem that the relational model was so successful at solving. If we
design our tables appropriately, the potential for duplicate entries is
simply eliminated.
Here's another view of inconsistency. Suppose we have a table with records like the following::
Purchase, CustomerName, CustomerAddr
This allows us to have two records for two different purchases, one with
Peter, LakeShore
and one with
Peter, WaterTower
Oops! Peter is now getting duplicate mailings (one forwarded).
So the relationships are "factored" (more on this later) into multiple
tables so as to prevent this. In this case, we would want a table of
CustomerName and CustomerAddr (perhaps also with CustomerID), and a second table with columns Purchase and CustomerID. Now it is not possible to have one customer with two addresses.
Core concept:
Divide data into multiple Tables (mathematically, RELATIONS)
in such a way that there is ENOUGH division to ensure consistency
and NOT TOO MUCH division to cause problems with reassembly
Major concepts for this course (references are to Parts of EN)
- The basic relational-db model, and SQL (part 2)
- DB modeling (part 3)
- DB programming, JDBC (part 5)
- Normalization & other design theory (part 6)
- internal DB structures (part 7)
- optimization & tuning (part 8)
- concurrency, locking, and transactions (part 9)
Relational DBs (chapter 3)
A relation is any set of tuples
The set of all possible tuples is the CROSS PRODUCT of some domains
col1 × col2 × col3 × ... × colN
Example: A = {1,2,3}, B = {x,y} C = {1,2}
A × B
A × C
< relation in A × C
<= relation in A × C
DB relations are not defined by rule, but by tabulation!
Given attribute sets A1, A2, ..., An, a relation is a subset of the cartesian product A1×A2×...×An; that is, a set of tuples ⟨a1,a2,...,an⟩ where each ai∈Ai. These tuples may also be called records.
Relations in a DB are represented as tables. EN also uses the term relation state to refer to a specific set of records in a table.
STUDENT table, EN p 63
Name
|
SSn
|
Home_phone
|
Address
|
Office_phone
|
Age
|
GPA
|
Benjamine Bayer
|
305-61-2435
|
817-373-1616
|
2918 bluebonnet Lane
|
NULL
|
19
|
3.21
|
Chung-cha Kim
|
381-62-1245
|
817-375-4409
|
125 Kirby Road
|
NULL
|
18
|
2.89
|
Dick Davidson
|
422-11-2320
|
NULL
|
3452 Elgin Road
|
817-749-1253
|
25
|
3.53
|
Rohan Panchal
|
489-22-1100
|
817-376-9821
|
265 Lark Lane
|
817-749-6492
|
28
|
3.93
|
Barbara Benson
|
533-69-1238
|
817-839-8461
|
7384 Fontana Lane
|
NULL
|
19
|
3.25
|
Note the
Also note that some entries are NULL. This means undefined or not available or not known; unfortunately, these three options are not synonymous or interchangeable.
NULL values are essential, but they do introduce some complications.
The first is that records with NULL entries are not in fact elements of
A1×A2×...×An; they are elements of
(A1 ∪ {NULL}) × (A2 ∪ {NULL}) × ... × (An ∪ {NULL})
EN also gives an alternative definition of a relation, as a set of maps from the attribute set to the set of attribute values, where the attribute set is essentially the set of names of columns. With this approach, a null entry is represented by a partial map, undefined for some attributes.
Note that we must be careful when comparing null values: if two people have NULL as their Office_phone, it does not
mean they have the same phone! Worse, we simply do not know if the NULL
means we don't know their phone, or if they simply do not have one.
Section 3.2: DB constraints
- inherent constraints
- schema-based (explicit) constraints
- NOT NULL
- primary key
- foreign key
- CHECK
- Triggers
- semantic constraints (business logic)
The OFFICE database
Schema: EN p 71
Data: EN p 72
Spreadsheet
A JOIN is the operation of creating all records merged from two (or
more) tables, where one attribute of one table is required to be equal
to a corresponding attribute of another.
Examples:
Students:
- Printing all of each student's grades, by joining the Student_number fields of STUDENT and GRADE_REPORT
- Printing all sections including Course_name, joining COURSE and SECTION on the Course_number field
Office:
- Printing the name and address of all employees who work in the 'Research' dept (Query1 on EN p 100)
- Printing the project number, dept number, and the dept manager's
name, for all projects located in 'Stafford' (Query 2 on EN p 100)
- Printing each employee's name and his or her supervisor's name (Query 8, EN p 101)
A full Cartesian product would be denoted in SQL by, eg,
select * from employee, department;
where there is no WHERE clause establishing a relation between the two tables.
Note on join: it is conceptually somewhat inefficient.
Lots of behind-the-scenes optimization makes it fast.
More on keys
A KEY is any set of columns that is guaranteed to uniquely determine a row.
Primary Key: single column
Composite Key: multiple columns (eg the GRADE_REPORT table)
Secondary Keys: other column combinations that are keys, but not the one intended
Lists in Relational DBs
Suppose I want a table of ⟨instructor, list-of-classes⟩ and want to enter ⟨pld, [317,343,353]⟩
If you look through your oracle manual, you won't find it.
Lists are non-atomic and are a problem.
Better way:
Create table CLASSES_TAUGHT: ⟨instructor, class⟩
Add entries
⟨pld, 317⟩
⟨pld, 343⟩
⟨pld, 353⟩
⟨sam, 101⟩
⟨sam, 202⟩
⟨sam, 303⟩
etc
What is the key here?
Lists are sometimes said to be MULTIPLE-VALUED ENTRIES; that is, in the
first case the value of the COURSES column for row "pld" consists of
the multiple values 317, 343, and 353.
This still doesn't make them anything other than a bad idea.
Elimination of lists is basically the process of putting a table into FIRST NORMAL FORM (1NF).
A few complaints about "personal" DBs, like MS Access:
- no transaction-processing across network; instead, the whole DB is copied
- no table locking; allows for inconsistent updates
- client failure does not result in unlocking
- no log to ensure atomicity in a bunch of transactions (debit from account1, credit to account2)
Some of these may have been addressed by now.
Objects
By now, everyone does object-oriented programming. The basic relational model doesn't include that. Do we need it?
Some people think we do, and so there are "object-oriented" databases.
But note that method calls are tricky; SQL isn't about that.
Furthermore, we can usually simulate
the data extension portion of an object by adding a table. Suppose we
have a table PERSONS, and want to create a subobject STUDENTS, with
additional student-specific fields MAJOR and ENROLL_DATE. We can
accomplish this by creating a table STUDENTS, and filling it with
records of the form
⟨person_id, major, enroll_date⟩
SQL (EN chapter 4)
create table: office example, with and without changes to allow foreign keys
SQL data types
- INT (etc)
- VARCHAR(n)
- Boolean
- DATE
- blobs
SQL constraints: Section 4.2
The SQL select-from-where statement: 4.3
Examples
RELATIONAL ALGEBRA: we'll only do a quick look at this
Union: R1 ∪ R2, where R1, R2 are subsets of same cross product. This is logically like OR
Intersection: Logically like AND
Difference: Logical "R1 but NOT R2"
Proj2003 - Proj2002, other order
PROJECTION: ignoring some designated columns, = vertical slices
SELECTION:
- choosing subset of entries
- corresponds to logical operations defining the subset
- Sel(Parts:cost >10.0)
Product:
can be of two tables as well as two domains
Assignment
syntactic sugar
Natural Join (equijoin):
Records from Table1 and Table2 where Table1.colN = Table2.colM
Join as Product,Selection,Projection
Division: A / B: which records in A appear in EVERY row of B
A records are presumed "extensions of" B records
examples:
Note that solution is a SEQUENCE of steps
RELATIONAL CALCULUS
Like Algebra, except that there is no sequence; solutions are a single expression.
SQL is based on relational calculus
(column list) : r1 in Table1, r2 in Table2 ... and EXPRESSION
Examples: