Comp 353/453: Database Programming, Corboy
L08, 4:15 Mondays
Week 1
Read in Elmasri & Navathe (EN)
- Chapter 1, Databases and Database Users
- Chapter 2, Database System
Concepts an Architecture, Section 1 on Data Models, Schemas and
Instances
- Chapter 3, The Relational Data Model ...., Sections 1 & 2
Course work:
- midterm
- final
- SQL assignments
- programming project (PHP or Java; I'm still thinking about Ruby)
Installation assignment:
Install MySQL on your own machine
Oracle v MySQL
Terms; demo; SQL
Consistency
Relational "lists"
Oracle v MySQL: two databases.
Oracle has more proprietary extensions. Some are there more to lock you into
Oracle than because they are a good idea.
Note that, with Oracle's purchase of Sun Microsystems, Oracle now owns
MySQL. Some people think this is why they bought Sun. (Others think it was
for Java.)
Oracle tends to support more features than MySQL. Transactions
are one, though this is now beginning to appear in MySQL. Another is key
constraints, and foreign key constraints
in particular. However, consider the following:
- Application programs generally have to explicitly check for foreign
key values, anyway; otherwise, it is difficult to respond natually to a
user error.
- Now that Oracle has acquired the innodb
database engine for MySQL, foreign key constraints are now implemented
whenever the DB adminstrator chooses the innodb engine.
Note also that MySQL is dual-licensed: as open-source (which means your
additions must also be open-source, if you release them) and as a
proprietary product (meaning you can develop and
sell proprietary extensions to MySQL, provided you pay the
licensing fee).
Installing Oracle 11g (more later)
- bandwidth to download 2+ GB
- disk space
SQL*Plus v web interface v MySQL
localhost:8080/apex
browse around to find SQL
Some notes on installing mysql
These assume that you will use a command-line interface.
Basically, you need to create a user,
and then a database for that user.
A database is a collection of tables.
Step 1: do the actual installation (eg apt-get install mysql-server, or run
the .exe install file). During this step, you should be asked for a master
password.
Downloads for the MySQL Community Server are at http://www.mysql.com/downloads/mysql.
At a minimum, you will need MySQL Server.
Step 2. log in using the master password.
mysql --user root -p
(give master password)
Step 3. Now new users and databases can be created, eg pld. (note password
is in quotes)
mysql> create user pld identified by
'cranberry';
(or: create user pld @ localhost identified by 'cranberry';
mysql> create database plddb;
mysql> grant all on plddb.* to pld;
We have created a user pld, and a database plddb, and given user pld full
control over plddb, and only over plddb. If you wanted to allow pld to
create databases, you could use
mysql> grant all on *.* to 'pld';
You can also grant individual actions:
grant SELECT, INSERT, UPDATE, DELETE,
CREATE, DROP ON *.* TO 'pld';
Step 4. Now log in as pld (after logging out as root):
mysql --user=pld --password=cranberry plddb
But let's assume you are user pld,
in unix-world, and that you don't wish to type your password on the command
line. Then use:
mysql -p plddb
Step 5: Create a table (while logged in as pld). Nava/C users take note: the
comma is a separator in the field
declarations, not a terminator (ie there is no comma after the final
declaration, and the field name precedes
the type name.
mysql> create table foo (id int, name
varchar(100));
mysql> insert into foo values (1, 'peter');
mysql> insert into foo values (2, 'paul');
mysql> show tables;
mysql> select * from foo;
The last is our first example of a SQL select
statement.
If you are using a unix-based system, you might find the following helpful
in your my.cnf file (eg in /etc/mysql). It must be in the [mysqld] section.
lower_case_table_names=1
With this setting, the following are equivalent:
select * from employee;
select * from EMPLOYEE;
History
Relational databases came out of theoretical work of Edgar Codd, in a 1970
paper titled A Relational Model of Data
for Large Shared Data Banks. At the time, the approach was seen as
too computationally expensive. But by 1980, both IBM DB2 and (early version
of) Oracle were out, and it was becoming generally recognized that
relational databases were the wave of the future.
Codd's relational approach solved a major data
consistency problem. We will look at that below.
While there are now other types of databases, notably object-oriented
databases, it is fair to say that no later development has offered a
convincing solution to a general organizational problem that is intractable
with the relational model. That said, relational databases sometimes have
serious scaling problems, and
there is a significant No-SQL movement.
SEQUEL / SQL
In the early 1970's, IBM introduced SEQUEL: Structured English Query
Language. For trademark reasons, the name had to be changed; IBM chose SQL
(Structured Query Language). To this day, many people prefer to pronounce
SQL as "sequel" rather than as "ess queue ell"; note that in this case the
"sequel" pronunciation actually came first.
The "official" pronounciation for MySQL is "My ess queue ell", but they
tolerate "mysequel". The following is from the MySQL reference manual,
§1.3.2:
MySQL is named after co-founder Monty
Widenius's daughter, My.
SQL is fundamentally command-line. It is also a rather non-procedural
language!
Wrapping a good Human-Computer Interface (HCI) around SQL is good practice
(well-nigh essential); generally
this is done in some high-level language with a SQL interface (eg JDBC).
However, actual queries are still done in SQL.
Maps
Underlying all databases is the Map data structure. A Map object has a key
field, say of type K, and a data field, say of type D,
with the following operations:
insert(K key, D data); // inserts or updates a
⟨key,data⟩ record
D lookup(K key);
// retrieves the data portion given the key
There are some minor variants; for example, sometimes insert() requires that
the key not already be present, and an update(K key, D data) is provided
when the key is already present. Sometimes there are separate methods to
look up to see if a key is present.
From this perspective, a database is a collection of Map structures, with
the following additions:
- The data type D -- and, for that matter, the key type K -- can consist
of multiple subfields
- We can have multiple keys.
- The lookup() operation generally returns the entire record, key
included.
Terminology
University database, EN p 6
A record is a list of data fields.
The fields do not have to have the same type, and are usually referenced by
name rather than by position (or number). Example: ⟨'Peter', 'Dordal',
'123561234', 32020, 15.7, true, 37⟩. The fields may also be called columns
or attributes.
A table is a set of records. In
some object-oriented databases we might identify a record with a class type,
and a single table might consist of records from a parent class and from
multiple base classes. However, in relational databases all records of a
table have the same type: all records have the same fields, in the same
order.
If we imagine a table laid out graphically, with records as rows, the use of
the word column to describe a field
makes more sense.
A database is a set of tables. The
database may impose some constraints between tables.
A key field is a field we can use
for searching a table, that is guaranteed to return at most a single record.
That is, the SSN field of an Employee database is a key (we hope!), but the
FirstName field is probably not. Given a key field, a database might
maintain an index to speed up
lookups on that key; non-key fields can also have indexes however, and
without an index we can always use linear search.
In the University database of EN, here are the keys:
- STUDENT: Student_number
- COURSE: Course_number
- SECTION: Section_identifier (this is not completely obvious)
- GRADE_REPORT: ⟨Student_number, Section_identifier⟩
Tables with two columns as key often serve to represent relationships;
the GRADE_REPORT table identifies the "has-taken-the-class" relationship
between STUDENTs and SECTIONs: Peter has-taken-the-class COMP-353-001.
We can use the four tables here to provide a list of what students have
taken what courses in what semesters, and what grades they earned. This sort
of mixing and matching of data from multiple tables is the hallmark of the relational database approach.
Abstractly, a relation is a set of
lists/rows/tuples that are all of the same type: there are the same number
of fields in each row, and each field consists of elements of a single type.
This corresponds to a subset of the cartesian product A1×A2×...×An
of the underlying attribute sets. Relational
databases almost always enforce this restriction: rows of a given
table must be homogeneous: all the
rows must be of the same length and with the same respective field types.
Mathematically, however, one can have a
relation that is a subset of A×B, where B is a union
of a base class C and two derived classes C×D and C×E: C ∪ C×D ∪ C×E. This
would allow the relation to have heterogeneous
tuples like ⟨a,c⟩ ⟨a,c,d⟩ and ⟨a,c,e⟩, though these might be more
accurately written as ⟨a,c⟩ ⟨a,⟨c,d⟩⟩ and ⟨a,⟨c,e⟩⟩. Some object-oriented
databases do in fact provide support for constructions such as this,
although relational databases usually do not.
Another hallmark of the relational approach is that all fields are atomic
types; that is, they are not lists or subrelations (strings, however, are
allowed, even with substring operations). For example, suppose we want to
store the student name, id, and list
of courses:
⟨'peter', address, 23456, [343,
346, 353]⟩
⟨'paul', address2, 45678, [443,
372]⟩
Lists are not atomic; the relational approach would likely decompose this
into two tables, one of ⟨name, address, student_id⟩ data, indexed by
student_id, and a second table of ⟨student_id, coursenum⟩, indexed by both fields and containing
23456
|
343
|
23456
|
346
|
23453
|
353
|
45678
|
443
|
45678
|
372
|
Note that doing this allows us to recover the original lists of courses on a
per-student_id basis, by mixing and matching from the two new tables. Note
also that the atomic-types rule would disallow the C ∪ C×D ∪ C×E example
above, as union types are not atomic.
Relational DBs, SQL and NoSQL
We will define the concept of a relational database more below, though the
rules above are a start.
The SQL language is designed as a way to ask questions about a relational
database without specifying exactly how the data is to be retrieved.
Consider the following from http://www.aosabook.org/en/nosql.html
by Adam Marcus:
SQL is a declarative language for querying
data. A declarative language is one in which a programmer specifies what
they want the system to do, rather than procedurally defining how
the system should do it. A few examples include: find the record for
employee 39, project out only the employee name and phone number from
their entire record, filter employee records to those that work in
accounting, count the employees in each department, or join the data from
the employees table with the managers table.
That sounds dandy: you just tell the computer what
to do, not how to do it. But there
is a downside:
To a first approximation, SQL allows you to
ask these questions without thinking about how the data is laid out on
disk, which indices to use to access the data, or what algorithms to use
to process the data. A significant architectural component of most
relational databases is a query optimizer, which decides which
of the many logically equivalent query plans to execute to most quickly
answer a query. These optimizers are often better than the average
database user, but sometimes they do not have enough information or have
too simple a model of the system in order to generate the most efficient
execution.
It turns out to be surprisingly easy to write inefficient queries,
particularly if complex "inner-query" searches are needed. SQL allows you to
search on any field, not just a key field for which there is an index; for
very large datasets, this is ill-advised. SQL allows, in effect, any query
regardless of how efficient it is. Some restrictions imposed by the NoSQL
world (or, more accurately, imposed by the real
world and implemented by the NoSQL
world) are:
- Allowing search only on key fields, for which an index is provided
- Limiting the use of table joins;
avoiding table decomposition as an alternative
- Restricted guarantees on transactions,
data consistency and durability
- Moving more-complex query parts from the query language to the
application logic
- Supporting "document" data components (eg XML documents with a
required structure)
- Supporting structured field data, as a way to keep complex data in a
single place instead of spread over multiple tables
These limitations are essentially always done for the sake of efficiency
and performance.
NoSQL example
The native database used by the Asterisk phone switch consists of
tables (actually called "families") that are ⟨key,value⟩ pairs. Neither
the key nor the value need be atomic, or for that matter the same type
as other entries. (This is done for simplicity, not because the DB is huge.)
Basic SQL operations: querying & updating
SQL:
select * from STUDENT
select * from STUDENT where Student_number = 17;
select * from SECTION where 101 <= Section_identifier and
Section_identifier <= 115;
The last query above is somewhat misleading, as Section_identifiers are
usually not intended for linear comparison.
Figure 2.1 on EN p 32: a schema for the students database
In the queries above, the * means "all columns of the selected rows";
sometimes we want fewer:
select name from student;
select name, major from student where student_number = 17;
select course_number, semester from section where 101<=
section_identifier and section_identifier <= 115;
Finally, when we are working with multiple tables, it is excellent practice
to name the tables, and then
qualify all column names with the table names, as follows. We will use this
style consistently throughout the semester; it is particularly useful when
the where clause involves multiple
tables (as in joins).
select s.name from student s;
select s.name, s.major from student s where s.student_number = 17;
Table Joins
Suppose in the university database we want to know the names of everyone in
section 112. (A peculiarity of the specific data given as example is that no
section has more than one student!) The GRADE_REPORT table has only student
numbers; we need to match these up with names from the STUDENT table. This
operation, of matching corresponding rows of different tables, is known as
the join. Here is the SQL for the
query, where the join condition is
in bold:
select s.name from student s, grade_report
gr
where s.student_number =
gr.student_number and gr.section_identifier = 112;
We are retrieving records from two
tables here, but restricting attention to pairs of records that "match up"
according to the join condition.
The join can also be done with the following alternative syntax:
select s.name from student s join
grade_report gr
on s.student_number = gr.student_number
where gr.section_identifier = 112;
Finally, maybe we want the name and the grade:
select s.name, gr.grade from student s,
grade_report gr
where s.student_number =
gr.student_number and gr.section_identifier = 112;
The join operation was once derided as introducing too much inefficiency.
Technical advances in the 1980's made this issue less important, but the
rise of huge datasets in this century has made this again relevant.
Some other issues
Database tables are usually "self-describing", in that the table description
and/or relationships to other tables is often embedded in the table
description. At a minimum, columns have types.
The table definitions, column names and their types are generally known,
collectively, as the database schema.
Database tables can be changed without the need to recompile programs using
that table: new columns can be added, or entire new tables. Essentially this
is because DBMS queries are interpreted,
and table columns are identified by name
rather than offset. (C programs that access record fields do
need recompilation if fields are changed.)
Databases often support different views,
perhaps for users with different privilege levels. A view may be a subset of
the original set of columns, or it may contain some computed columns in lieu
of the original columns. Excluding columns such as social_security_num
or salary is relatively common.
Transactions are sets of related
updates, eg removing money from one account and adding it to another, or
perhaps dropping one class and adding another (sadly, remarkably many
student DBMSs lack the latter transaction operation). EN defines a
transaction to be the result of an executing program rather than a set of
related updates; this latter definition might include adding multiple new
rows to one table.
Transaction processing must satisfy the isolation
property, that transactions appear to execute in isolation from one
another, and the atomicity
property, which says that each transaction is completely executed
or not executed at all. (This is sometimes described as that transactions
must meet the ACID test: Atomicity, Consistency, Isolation, Durability.
However, consistency and durability are relevant to all database updates;
they are not particular to transactions.)
Here is a simple failure of isolation: two transactions are transferring
money to other accounts. Transaction 1 is transferring $100 from A to B;
transaction 2 is transferring $50 from A to C. The total amount of money
should be $1000.
transaction 1
|
transaction 2
|
Acct A
|
Acct B
|
Acct C
|
|
|
1000
|
0
|
0
|
get value of A: $1000
|
|
1000
|
0
|
0
|
add $100 to B
|
|
1000
|
100
|
0
|
|
Debit A by
$50
|
950
|
100
|
0
|
Store $1000 - $100 in A
|
|
900
|
100
|
0
|
|
Credit C by $50
|
900
|
100
|
50
|
Multi-user DBMSs need concurrency control.
Section 1.6 of EN talks about some of the advantages of having a central
DBMS rather than individual DBMSs maintained by each administrative group.
This is pretty much a settled issue now, though it does mean that the
"natural owners" of data in an organization (eg registration and records at
Loyola, for student registration data) will not in fact own that data.