Comp 353/453: Database Programming, Corboy 523, 7:00 Thursdays
Week 1
Read in Elmasri & Navathe (EN)
- Chapter 1, Databases and Database Users
- Chapter 2, Database System Concepts an Architecture, Section 1 on Data Models, Schemas and Instances
- Chapter 3, The Relational Data Model ...., Sections 1 & 2
Course work:
- midterm
- final
- SQL assignments
- programming project (PHP or Java)
Installation assignment:
Install MySQL on your own machine
Oracle v MySQL
Terms; demo; SQL
Consistency
Relational "lists"
Oracle v MySQL: two databases.
Oracle has more proprietary extensions. Some are there more to lock you into Oracle than because they are a good idea.
Note that, with Oracle's purchase of Sun Microsystems, Oracle now
owns MySQL. Some people think this is why they bought Sun. (Others
think it was for Java.)
Oracle tends to support more features than MySQL. Transactions are one, though this is now beginning to appear in MySQL. Another is key constraints, and foreign key constraints in particular. However, consider the following:
- Application programs generally have to explicitly check for
foreign key values, anyway; otherwise, it is difficult to respond
natually to a user error.
- Now that Oracle has acquired the innodb database engine for MySQL, foreign key constraints are now implemented whenever the DB adminstrator chooses the innodb engine.
Note also that MySQL is dual-licensed: as open-source (which means your
additions must also be open-source, if you release them) and as a proprietary product
(meaning you can develop and sell proprietary extensions to MySQL, provided you
pay the licensing fee).
Installing Oracle 11g (more later)
- bandwidth to download 2+ GB
- disk space
SQL*Plus v web interface v MySQL
localhost:8080/apex
browse around to find SQL
Some notes on installing mysql
These assume that you will use a command-line interface.
Basically, you need to create a user, and then a database for that user. A database is a collection of tables.
Step 1: do the actual installation (eg apt-get install mysql-server, or
run the .exe install file). During this step, you should be asked for a
master password.
Downloads for the MySQL Community Server are at http://www.mysql.com/downloads/mysql. At a minimum, you will need MySQL Server.
Step 2. log in using the master password.
mysql --user root -p
(give master password)
Step 3. Now new users and databases can be created, eg pld. (note password is in quotes)
mysql> create user pld identified by 'cranberry';
mysql> create database plddb;
mysql> grant all on plddb.* to 'pld';
We have created a user pld, and a database plddb, and given user
pld full control over plddb, and only over plddb. If you wanted to
allow pld to create databases, you could use
mysql> grant all on *.* to 'pld';
You can also grant individual actions:
grant SELECT, INSERT, UPDATE, DELETE, CREATE, DROP ON *.* TO 'pld';
Step 4. Now log in as pld (after logging out as root):
mysql --user=pld --password=cranberry plddb
But let's assume you are user pld, in unix-world, and that you don't wish to type your password on the command line. Then use:
mysql -p plddb
Step 5: Create a table (while logged in as pld). Nava/C users take note: the comma is a separator in the field declarations, not a terminator (ie there is no comma after the final declaration, and the field name precedes the type name.
mysql> create table foo (id int, name varchar(100));
mysql> insert into foo values (1, 'peter');
mysql> insert into foo values (2, 'paul');
mysql> show tables;
mysql> select * from foo;
The last is our first example of a SQL select statement.
History: relational databases came out of theoretical work of Edgar
Codd, in a 1970 paper titled A Relational Model of Data for Large Shared Data Banks. At the time, the approach was seen as too computationally
expensive. But by 1980, both IBM DB2 and (early version of) Oracle were
out, and it was becoming generally recognized that relational databases were the wave of the future.
Codd's relational approach solved a major data consistency problem. We will look at that below.
While there are now other types of databases, notably object-oriented
databases, it is fair to say that no later development has offered a
convincing solution to a general organizational problem that is intractable with the
relational model. That said, relational databases sometimes have serious scaling problems, and there is a significant No-SQL movement.
SEQUEL / SQL
In the early 1970's, IBM introduced SEQUEL: Structured English Query
Language. For trademark reasons, the name had to be changed; IBM chose
SQL (Structured Query Language). To this day, many people prefer to
pronounce SQL as "sequel" rather than as "ess queue ell"; note that in
this case the "sequel" pronunciation actually came first.
The "official" pronounciation for MySQL is "My ess queue ell", but they
tolerate "mysequel". The following is from the MySQL reference manual,
§1.3.2:
MySQL is named after co-founder Monty Widenius's daughter, My.
SQL is fundamentally command-line. It is also a rather non-procedural language!
Wrapping a good Human-Computer Interface (HCI) around SQL is good practice (well-nigh essential); generally this is done in some
high-level language with a SQL interface (eg JDBC). However, actual
queries are still done in SQL.
Terminology
University database, EN p 6
A record is a list of data fields.
The fields do not have to have the same type, and are usually
referenced by name rather than by position (or number). Example:
⟨'Peter', 'Dordal', '123561234', 32020, 15.7, true, 37⟩. The fields may
also be called columns or attributes.
A table is a set of records. In
some object-oriented databases we might identify a record with a class
type, and a single table might consist of records from a parent class
and from multiple base classes. However, in relational databases all
records of a table have the same type: all records have the same
fields, in the same order.
If we imagine a table laid out graphically, with records as rows, the use of the word column to describe a field makes more sense.
A database is a set of tables. The database may impose some constraints between tables.
A key field is a field we can
use for searching a table, that is guaranteed to return at most a
single record. That is, the SSN field of an Employee database is a key
(we hope!), but the FirstName field is probably not. Given a key field,
a database might maintain an index
to speed up lookups on that key; non-key fields can also have indexes
however, and without an index we can always use linear search.
In the University database of EN, here are the keys:
- STUDENT: Student_number
- COURSE: Course_number
- SECTION: Section_identifier (this is not completely obvious)
- GRADE_REPORT: ⟨Student_number, Section_identifier⟩
Tables with two columns as key often serve to represent relationships;
the GRADE_REPORT table identifies the "has-taken-the-class"
relationship between STUDENTs and SECTIONs: Peter has-taken-the-class
COMP-353-001.
We can use the four tables here to provide a list of what students have
taken what courses in what semesters, and what grades they earned. This
sort of mixing and matching of data from multiple tables is the
hallmark of the relational database approach.
Abstractly, a relation is a set
of lists/rows/tuples that are all of the same type: there are the same
number of fields in each row, and each field consists of elements of a
single type. This corresponds to a subset of the cartesian product A1×A2×...×An of the underlying attribute sets. Relational databases almost always enforce this restriction: rows of a given table must be homogeneous: all the rows must be of the same length and with the same respective field types.
Mathematically, however, one can have a relation that is a subset of A×B, where B is a union of a base class C and two derived classes C×D and C×E: C ∪ C×D ∪ C×E. This would allow the relation to have heterogeneous tuples like ⟨a,c⟩ ⟨a,c,d⟩ and ⟨a,c,e⟩, though these might be more accurately parsed as ⟨a,c⟩ ⟨a,⟨c,d⟩⟩ and ⟨a,⟨c,e⟩⟩. Some
object-oriented databases do in fact provide support for constructions
such as this, although relational databases usually do not.
Another hallmark of the relational approach is that all fields are atomic
types; that is, they are not lists or subrelations (strings, however,
are allowed, even with substring operations). For example, suppose we
want to store the student name, id, and list of courses:
⟨'peter', address, 23456, [343, 346, 353]⟩
⟨'paul', address2, 45678, [443, 372]⟩
Lists are not atomic; the relational approach would likely decompose
this into two tables, one of ⟨name, address, student_id⟩ data, indexed
by student_id, and a second table of ⟨student_id, coursenum⟩, indexed
by both fields and containing
23456
|
343
|
23456
|
346
|
23453
|
353
|
45678
|
443
|
45678
|
372
|
Note that doing this allows us to recover the original lists of courses
on a per-student_id basis, by mixing and matching from the two new
tables. Note also that the atomic-types rule would disallow the C ∪ C×D
∪ C×E example above, as union types are not atomic.
Relational DBs, SQL and NoSQL
We will define the concept of a relational database more below, though the rules above are a start.
The SQL language is designed as a way to ask questions about a
relational database without specifying exactly how the data is to be
retrieved. Consider the following from http://www.aosabook.org/en/nosql.html by Adam Marcus:
SQL is a declarative language for querying data. A declarative
language is one in which a programmer specifies what they want
the system to do, rather than procedurally defining how the
system should do it. A few examples include: find the record for
employee 39, project out only the employee name and phone number from
their entire record, filter employee records to those that work in
accounting, count the employees in each department, or join the data
from the employees table with the managers table.
That sounds dandy: you just tell the computer what to do, not how to do it. But there is a downside:
To a first approximation, SQL allows you to ask these questions without
thinking about how the data is laid out on disk, which indices to use
to access the data, or what algorithms to use to process the data. A
significant architectural component of most relational databases is a
query optimizer, which decides which of the many logically
equivalent query plans to execute to most quickly answer a query.
These optimizers are often better than the average database user, but
sometimes they do not have enough information or have too simple a
model of the system in order to generate the most efficient execution.
It turns out to be surprisingly easy to write inefficient queries,
particularly if complex "inner-query" searches are needed. SQL allows
you to search on any field, not just a key field for which there is an
index; for very large datasets, this is ill-advised. SQL allows, in
effect, any query regardless of how efficient it is. Some restrictions
imposed by the NoSQL world (or, more accurately, imposed by the real world and implemented by the NoSQL world) are:
- Allowing search only on key fields, for which an index is provided
- Limiting the use of table joins; avoiding table decomposition as an alternative
- Restricted guarantees on transactions, data consistency and durability
- Moving more-complex query parts from the query language to the application logic
- Supporting "document" data components (eg XML documents with a required structure)
- Supporting structured field data, as a way to keep complex data in a single place instead of spread over multiple tables
These limitations are essentially always done for the sake of efficiency and performance.
Basic SQL operations: querying & updating
SQL:
select * from STUDENT
select * from STUDENT where Student_number = 17;
select * from SECTION where 101 <= Section_identifier and Section_identifier <= 115;
The last query above is somewhat misleading, as Section_identifiers are usually not intended for linear comparison.
Figure 2.1 on EN p 32: a schema for the students database
In the queries above, the * means all columns of the selected rows; sometimes we want fewer:
select name from student;
select name, major from student where student_number = 17;
select course_number, semester from section where 101<= section_identifier and section_identifier <= 115;
Finally, when we are working with multiple tables, it is excellent practice to name
the tables, and then qualify all column names with the table names, as
follows. We will use this style consistently throughout the semester;
it is particularly useful when the where clause involves multiple tables (as in joins).
select s.name from student s;
select s.name, s.major from student s where s.student_number = 17;
Table Joins
Suppose in the university database we want to know
the names of everyone in section 112. (A peculiarity of the specific
data given as example is that no section has more than one student!)
The GRADE_REPORT table has only student numbers; we need to match these
up with names from the STUDENT table. This operation, of matching
corresponding rows of different tables, is known as the join. Here is the SQL for the query, where the join condition is in bold:
select s.name from student s, grade_report gr
where s.student_number = gr.student_number and gr.section_identifier = 112;
We are retrieving records from two tables here, but restricting attention to pairs of records that "match up" according to the join condition.
The join can also be done with the following alternative syntax:
select s.name from student s join grade_report gr
on s.student_number = gr.student_number
where gr.section_identifier = 112;
Finally, maybe we want the name and the grade:
select s.name, gr.grade from student s, grade_report gr
where s.student_number = gr.student_number and gr.section_identifier = 112;
The join operation was once derided as introducing too much
inefficiency. Technical advances in the 1980's made this issue less
important, but the rise of huge datasets in this century has made this
again relevant.
Some other issues
Database tables are usually "self-describing", in that the table
description and/or relationships to other tables is often embedded in
the table description. At a minimum, columns have types.
The table definitions, column names and their types are generally known, collectively, as the database schema.
Database tables can be changed without the need to recompile programs using that
table: new columns can be added, or entire new tables. Essentially this
is because DBMS queries are interpreted, and table columns are identified by name rather than offset. (C programs that access record fields do need recompilation if fields are changed.)
Databases often support different views,
perhaps for users with different privilege levels. A view may be a
subset of the original set of columns, or it may contain some computed
columns in lieu of the original columns. Excluding columns such as social_security_num or salary is relatively common.
Transactions are sets of
related updates, eg removing money from one account and adding it to
another, or perhaps dropping one class and adding another (sadly,
remarkably many student DBMSs lack the latter transaction operation).
EN defines a transaction to be the result of an executing program
rather than a set of related updates; this latter definition might
include adding multiple new rows to one table.
Transaction processing must satisfy the isolation property, that transactions appear to execute in isolation from one another, and the atomicity property,
which says that each transaction is completely executed or not executed
at all. (This is sometimes described as that transactions must meet the
ACID test: Atomicity, Consistency,
Isolation, Durability. However, consistency and durability are relevant
to all database updates; they are not particular to transactions.)
Here is a simple failure of isolation: two transactions are
transferring money to other accounts. Transaction 1 is transferring
$100 from A to B; transaction 2 is transferring $50 from A to C. The
total amount of money should be $1000.
transaction 1
|
transaction 2
|
Acct A
|
Acct B
|
Acct C
|
|
|
1000
|
0
|
0
|
get value of A: $1000
|
|
1000
|
0
|
0
|
add $100 to B
|
|
1000
|
100
|
0
|
|
Debit A by $50
|
950
|
100
|
0
|
Store $1000 - $100 in A
|
|
900
|
100
|
0
|
|
Credit C by $50
|
900
|
100
|
50
|
Multi-user DBMSs need concurrency control.
Section 1.6 of EN talks about some of the advantages of having a
central DBMS rather than individual DBMSs maintained by each
administrative group. This is pretty much a settled issue now, though
it does mean that the "natural owners" of data in an organization (eg
registration and records at Loyola, for student registration data) will
not in fact own that data.
Constraints
Databases involve several kinds of constraints:
0. Fundamental rules enforced by the table structure
1. Type constraints on column values
2. Key constraints: in each table, any given declared key
can occur in only one row. This is not a property of a table at a
particular moment, but rather a rule that says that a second record
with a duplicate key can never be added.
3. Referential-integrity constraints: Consider again the university database on p 8 of EN. In the SECTION table, the key is
Section_identifier. A typical constraint would be that we are not allowed to have a record that has a
Course_number value that is not found in the COURSE table.
Similarly, in the GRADE_REPORT table, the key is ⟨Student_number,
Section_identifier⟩. We do not allow rows for which Student_number does
not refer to a valid entry in the STUDENT table, or for which
Section_identifier does not refer to a valid entry in the SECTION table.
Note the implicit constraint of having the STUDENT table, with key Student_number. Suppose we
add an address column. Because the key for the table is Student_number, the key constraint means we
cannot have the same student with two addresses.
Now suppose instead we redesigned the database to include the student name and address directly into the
GRADE_REPORT table, along with the Student_number. This is slightly
wasteful of space, but that is a minor concern. The more serious problem is
that this now allows inconsistency: we can have student 17 have two different addresses in two different records, or even two different names.
This is the data-consistency
problem that the relational model was so successful at solving. If we
design our tables appropriately, the potential for duplicate entries is
simply eliminated.
Here's another view of inconsistency. Suppose we have a table with records like the following::
Purchase, CustomerName, CustomerAddr
This allows us to have two records for two different purchases, one with
purchase1, Peter, LakeShore
and one with
purchase2, Peter, WaterTower
Oops! Peter is now getting duplicate mailings (one forwarded).
So the relationships are "factored" (more on this later) into multiple
tables so as to prevent this. In this case, we would want a table of
CustomerName and CustomerAddr (perhaps also with CustomerID), and a second table with columns Purchase and CustomerID. Now it is not possible to have one customer with two addresses.
Core relational-database concept:
Divide data into multiple Tables (mathematically, RELATIONS)
in such a way that there is ENOUGH division to enforce consistency
and NOT TOO MUCH division to cause problems with reassembly
Major concepts for this course (references are to Parts of EN)
- The basic relational-db model, and SQL (part 2)
- DB modeling (part 3)
- DB programming, JDBC (part 5)
- Normalization & other design theory (part 6)
- internal DB structures (part 7)
- optimization & tuning (part 8)
- concurrency, locking, and transactions (part 9)
Relational DBs (chapter 3)
A relation is any set of tuples
The set of all possible tuples is the CROSS PRODUCT of some domains
col1 × col2 × col3 × ... × colN
Example: A = {1,2,3}, B = {x,y} C = {1,2}
A × B
A × C
< relation in A × C
<= relation in A × C
DB relations are not defined by rule, but by tabulation!
Given attribute sets A1, A2, ..., An, a relation is a subset of the cartesian product A1×A2×...×An; that is, a set of tuples ⟨a1,a2,...,an⟩ where each ai∈Ai. These tuples may also be called records.
Relations in a DB are represented as tables. EN also uses the term relation state to refer to a specific set of records in a table.
STUDENT table, EN p 63
Name
|
SSn
|
Home_phone
|
Address
|
Office_phone
|
Age
|
GPA
|
Benjamine Bayer
|
305-61-2435
|
817-373-1616
|
2918 bluebonnet Lane
|
NULL
|
19
|
3.21
|
Chung-cha Kim
|
381-62-1245
|
817-375-4409
|
125 Kirby Road
|
NULL
|
18
|
2.89
|
Dick Davidson
|
422-11-2320
|
NULL
|
3452 Elgin Road
|
817-749-1253
|
25
|
3.53
|
Rohan Panchal
|
489-22-1100
|
817-376-9821
|
265 Lark Lane
|
817-749-6492
|
28
|
3.93
|
Barbara Benson
|
533-69-1238
|
817-839-8461
|
7384 Fontana Lane
|
NULL
|
19
|
3.25
|
Note the
Also note that some entries are NULL. This means undefined or not available or not known; unfortunately, these three options are not synonymous or interchangeable.
NULL values are essential, but they do introduce some complications.
The first is that records with NULL entries are not in fact elements of
A1×A2×...×An; they are elements of
(A1 ∪ {NULL}) × (A2 ∪ {NULL}) × ... × (An ∪ {NULL})
EN also gives an alternative definition of a relation, as a set of maps from the attribute set to the set of attribute values, where the attribute set is essentially the set of names of columns. With this approach, a null entry is represented by a partial map, undefined for some attributes.
Note that we must be careful when comparing null values: if two people have NULL as their Office_phone, it does not
mean they have the same phone! Worse, we simply do not know if the NULL
means we don't know their phone, or if they simply do not have one.
Section 3.2: DB constraints
- inherent constraints
- schema-based (explicit) constraints
- NOT NULL
- primary key
- foreign key
- CHECK
- Triggers
- semantic constraints (business logic)
The COMPANY database
Schema: EN p 71
Data: EN p 72
basic table definitions
table definitions plus data, with ALTER
Spreadsheet
zip file
As stated above, a join is the operation of creating all records merged from two (or
more) tables, where one attribute of one table is required to be equal
to a corresponding attribute of another.
Examples:
University:
- Printing all of each student's grades, by joining the Student_number fields of STUDENT and GRADE_REPORT
- Printing all sections including Course_name, joining COURSE and SECTION on the Course_number field
Company:
- Printing the name and address of all employees who work in the 'Research' dept (Query1 on EN p 100)
- Printing the project number, dept number, and the dept manager's
name, for all projects located in 'Stafford' (Query 2 on EN p 100)
- Printing each employee's name and his or her supervisor's name (Query 8, EN p 101)
A full Cartesian product would be denoted in SQL by, eg,
select * from employee, department;
where there is no WHERE clause establishing a relation between the two tables.
Note on join: it is conceptually somewhat inefficient.
Lots of behind-the-scenes optimization makes it fast.
More on keys
A KEY is any set of columns that is guaranteed to uniquely determine a row.
Primary Key: the key the database developer thinks is most important; usually a single attribute if there is one
Composite Key: multiple columns (eg the GRADE_REPORT table). Note that there is no single-column key here.
Secondary Keys: other column combinations that are keys, but not the one intended
Note that keys are not properties of particular tables, but rather of the "table schema". They represent design constraints.
Foreign Keys
Key constraints are one kind of constraint. What about the use of dno
in table Employees? It should be clear that we probably want all dnos
to refer to real departments, that is, to match an existing dnumber in table Department. This is done through a foreign key constraint: we declare in table Employee that attribute dno is a foreign key: a reference to a key of another table. The declaration looks like
foreign key (dno) references department(dnumber)
Note that the constraint here applies to adding (or updating) records in Employee, and also to deleting records in Department.
Foreign keys are notorious for introducing circularity problems. What
happens if we enforce foreign keys and try to load the database as
originally written? With all tables empty, we can't add any employee
because no dno value we might use would appear in the empty Department
table, and we cannot add a department because the mgr_ssn is a foreign
key referencing Employee.
In principle, there is no reason to require that the foreign key
actually be a key in the other table. In practice, it almost always is;
in database schemas generated through so-called Entity-Relationship
diagrams it always is.
Other constraints
Examples might be that the employee salary is in a given range, or is
less than the supervisor's salary, etc. These non-key constraints can
sometimes be addressed at the DB level, but are often easier to address
at the level of the user interface to the DB; that is, the web
interface can contain the necessary business logic.