Database week 1

Comp 305/488: Database Administration, Corboy 201, 4:15 Tuesdays

Week 1

Read in Elmasri & Navathe (EN)

Chapter 1, Databases and Database Users
Chapter 2, Database System Concepts an Architecture, Section 1 on Data Models, Schemas and Instances
Chapter 3, The Relational Data Model ...., Sections 1 & 2

Course work:

midterm
final
SQL assignments
other homework

What does a DBA do?

Manage installation, configuration and upgrade
Manage resources (cpu, RAM, disk, network), and makes sure these are sufficient

Decide on file formats and file locations
Disk assignment; eg putting two files joined frequently on different physical disks
Deciding which files should go on the fast SSD versus the slower magnetic disk

Monitor performance, and adjust configuration to improve performance

Monitor logs
Monitor query performance; optimize frequently used queries

Participate in database design
Decide on database indexes
Make occasional table changes, or other configuration changes
Monitor security
Manage access: who gets to see what data, maintains access policies and documentation
Implement data backups

Our goal in this course is to cover the fundamentals of database operation, focusing primarily but not exclusively on the relational model, and addressing both theoretical and practical concerns. We will, in the process of doing this, cover a fair bit of SQL.

The Oracle Database Administrator's Guide lists the following steps as central for DB administration:

Task 1: Evaluate the Database Server Hardware
Task 2: Install the Oracle Database Software
Task 3: Plan the Database
Task 4: Create and Open the Database
Task 5: Back Up the Database
Task 6: Enroll System Users
Task 7: Implement the Database Design
Task 8: Back Up the Fully Functional Database
Task 9: Tune Database Performance
Task 10: Download and Install Patches
Task 11: Roll Out to Additional Hosts

This is a somewhat more "mechanical" list of tasks than mine.

Here is a rather simple division of database applications into two categories:

Online Transaction Processing (OLTP): eg inventory management, advertising, etc. Transactions are a set of record updates. They are simple but continual. Record locking is needed. Often an entire record is needed.
Data Warehouse: eg customer history. The database is often much larger, but most accesses are read-only, as part of data-mining analysis. The DW might be updated only once a week. Queries tend to be complex and infrequent. Queries often tend to involve just a few columns; retrieving entire records may mean unnecessary I/O.

The traditional Relational Database Management System (RDBMS): like Oracle and MySQL. Data is stored in rows; one disk fetch is needed for one row. The Structured Query Language (SQL) is used to access and update the data.

Problems with traditional RDBMS:

50%-90% overhead on record-level locks and the write-ahead recovery log.
Inefficient table-join operations

There are many alternatives to traditional RDBMSs; see below. These typically are developed to address RDBMS difficulties with massive data stores, although often people then try to apply the new models to ordinary-sized data stores.

The RDBMS-still-central contrarian view: www.mongodb-is-web-scale.com (warning: some (ok, a lot of) rude language, and some cluelessness regarding agriculture (for example, most "bulls" are castrated as calves, not as adults))

Installation assignment:
Install MySQL on your own machine

Oracle v MySQL
Terms; demo; SQL
Consistency
Relational "lists"

Oracle v MySQL: two databases.

Oracle has more proprietary extensions. Some are there more to lock you into Oracle than because they are a good idea. That said, Oracle does do well with big, transactional DBs.

Note that, with Oracle's purchase of Sun Microsystems, Oracle now owns MySQL. Some people think this is why they bought Sun. (Others think it was for Java.)

Oracle tends to support more features than MySQL. Transactions are one, though this is now beginning to appear in MySQL. Another is key constraints, and foreign key constraints in particular. However, consider the following:

Application programs generally have to explicitly check for foreign key values, anyway; otherwise, it is difficult to respond naturally to a user error.
Now that Oracle has acquired the innodb database engine for MySQL, foreign key constraints are now implemented whenever the DB administrator chooses the innodb engine (which is now the default).

Note also that MySQL is dual-licensed: as open-source (which means your additions must also be open-source, if you release them) and as a proprietary product (meaning you can develop and sell proprietary extensions to MySQL, provided you pay the licensing fee).

Installing Oracle 12c (more later)

bandwidth to download 2+ GB
disk space

Some notes on installing mysql

These assume that you will use a command-line interface.

Basically, you need to create a user, and then a database for that user. A database is a collection of tables.

Step 1: do the actual installation (eg apt-get install mysql-server, or run the .exe install file). During this step, you should be asked for a master password. Do not forget this password!

Downloads for the MySQL Community Server are at http://www.mysql.com/downloads/mysql. At a minimum, you will need MySQL Server.

Step 2. log in using the master password.

mysql --user root -p
(give master password)

Step 3. Now new users and databases can be created, eg pld. (note password is in quotes)

mysql> create user pld identified by 'cranberry';
(or: create user pld @ localhost identified by 'cranberry';
mysql> create database plddb;
mysql> grant all on plddb.* to pld;

We have created a user pld, and a database plddb, and given user pld full control over plddb, and only over plddb. If you wanted to allow pld to create databases, you could use

mysql> grant all on *.* to 'pld';

You can also grant individual actions:

grant SELECT, INSERT, UPDATE, DELETE, CREATE, DROP ON *.* TO 'pld';

Step 4. Now log in as pld (after logging out as root):

mysql --user=pld --password=cranberry plddb

But let's assume you are user pld, in unix-world, and that you don't wish to type your password on the command line. Then use:

mysql -p plddb

Step 5: Create a table (while logged in as pld). Java/C users take note: the comma is a separator in the field declarations, not a terminator (ie there is no comma after the final declaration, and the field name precedes the type name.

mysql> create table foo (id int, name varchar(100));

mysql> insert into foo values (1, 'peter');
mysql> insert into foo values (2, 'paul');

mysql> show tables;

mysql> select * from foo;

The last is our first example of a SQL select statement.

If you are using a unix-based system, you might find the following helpful in your my.cnf file (eg in /etc/mysql). It must be in the [mysqld] section.

    lower_case_table_names=1

With this setting, the following are equivalent:
    select * from employee;
    select * from EMPLOYEE;

All these privileges are stored in the mysql database (which you can access as root with "use mysql;"). The user table is where much of it resides. Examples:

use mysql;
show tables;
select user from user;
describe usr;
select user, password from user;

Why are there so many root accounts? This becomes clearer with

select user, host from user;

The pair (user,host) is the primary key of this table.

Some notes on the Oracle Developer Days virtual machine

If you install this, the machine login username/password is oracle/oracle.

Once logged in, create a terminal window, and start sqlplus as follows:

sqlplus / as sysdba

From there:

alter user sys identified by "mynewpassword"        # creating a stronger password is recommended
create user pld identified by "mypass"                # Use your own initials or name instead of 'pld'
grant connect to pld;                                        # Give your account the important privileges
grant create table to pld;                                 #
grant alter any table to pld;                             #
grant unlimited tablespace to pld;                      #

At this point you should be able to login as 'pld', from another terminal:

sqlplus pld

You can create tables, etc. To get information about tables, the following work:

select table_name from user_tables;
select owner, table_name from all_tables;

describe employee;

To get constraints: P = primary key, R = foreign key C = check (eg not null)

select constraint_name,constraint_type from user_constraints where table_name = 'WORKS_ON';
select constraint_name, table_name from user_constraints where constraint_type='R';

History

Relational databases came out of theoretical work of Edgar Codd, in a 1970 paper titled A Relational Model of Data for Large Shared Data Banks. At the time, the approach was seen as too computationally expensive. But by 1980, both IBM DB2 and (early version of) Oracle were out, and it was becoming generally recognized that relational databases were the wave of the future.

Codd's relational approach solved a major data consistency problem. We will look at that below.

While there are now other types of databases, notably object-oriented databases, it is fair to say that no later development has offered a convincing solution to a general organizational problem that is intractable with the relational model. That said, relational databases sometimes have serious scaling problems, and there is a significant No-SQL movement.

SEQUEL / SQL

In the early 1970's, IBM introduced SEQUEL: Structured English Query Language. For trademark reasons, the name had to be changed; IBM chose SQL (Structured Query Language). To this day, many people prefer to pronounce SQL as "sequel" rather than as "ess queue ell"; note that in this case the "sequel" pronunciation actually came first.

The "official" pronunciation for MySQL is "My ess queue ell", but they tolerate "mysequel". The following is from the MySQL reference manual, §1.3.2:

MySQL is named after co-founder Monty Widenius's daughter, My.

Widenius's youngest daughter is Maria. Widenius is currently working on MariaDB, a drop-in replacement for MySQL.

SQL is fundamentally command-line. It is also a rather non-procedural language!

Wrapping a good Human-Computer Interface (HCI) around SQL is good practice (well-nigh essential); generally this is done in some high-level language with a SQL interface (eg JDBC). However, actual queries are still done in SQL.

Maps

Underlying all databases is the Map data structure. A Map object has a key field, say of type K, and a data field, say of type D, with the following operations:

insert(K key, D data); // inserts or updates a ⟨key,data⟩ record
D lookup(K key); // retrieves the data portion given the key

There are some minor variants; for example, sometimes insert() requires that the key not already be present, and an update(K key, D data) is provided when the key is already present. Sometimes there are separate methods to look up to see if a key is present.

From this perspective, a database is a collection of Map structures, with the following additions:

The data type D -- and, for that matter, the key type K -- can consist of multiple subfields
We can have multiple keys.
The lookup() operation generally returns the entire record, key included.

Terminology

University database, EN p 6

A record is a list of data fields. The fields do not have to have the same type, and are usually referenced by name rather than by position (or number). Example: ⟨'Peter', 'Dordal', '123561234', 32020, 15.7, true, 37⟩. The fields may also be called columns or attributes.

A table is a set of records. In some object-oriented databases we might identify a record with a class type, and a single table might consist of records from a parent class and from multiple base classes. However, in relational databases all records of a table have the same type: all records have the same fields, in the same order.

If we imagine a table laid out graphically, with records as rows, the use of the word column to describe a field makes more sense.

A database is a set of tables. The database may impose some constraints between tables.

A key field is a field we can use for searching a table, that is guaranteed to return at most a single record. That is, the SSN field of an Employee database is a key (we hope!), but the FirstName field is probably not. Given a key field, a database might maintain an index to speed up lookups on that key; non-key fields can also have indexes however, and without an index we can always use linear search.

In the University database of EN, here are the keys:

STUDENT: Student_number
COURSE: Course_number
SECTION: Section_identifier (this is not completely obvious)
GRADE_REPORT: ⟨Student_number, Section_identifier⟩

Tables with two columns as key often serve to represent relationships; the GRADE_REPORT table identifies the "has-taken-the-class" relationship between STUDENTs and SECTIONs: Peter has-taken-the-class COMP-353-001.

We can use the four tables here to provide a list of what students have taken what courses in what semesters, and what grades they earned. This sort of mixing and matching of data from multiple tables is the hallmark of the relational database approach.

Abstractly, a relation is a set of lists/rows/tuples that are all of the same type: there are the same number of fields in each row, and each field consists of elements of a single type. This corresponds to a subset of the cartesian product A₁×A₂×...×A_n of the underlying attribute sets. Relational databases almost always enforce this restriction: rows of a given table must be homogeneous: all the rows must be of the same length and with the same respective field types.

Mathematically, however, one can have a relation that is a subset of A×B, where B is a union of a base class C and two derived classes C×D and C×E: C ∪ C×D ∪ C×E. This would allow the relation to have heterogeneous tuples like ⟨a,c⟩ ⟨a,c,d⟩ and ⟨a,c,e⟩, though these might be more accurately written as ⟨a,c⟩ ⟨a,⟨c,d⟩⟩ and ⟨a,⟨c,e⟩⟩. Some object-oriented databases do in fact provide support for constructions such as this, although relational databases usually do not.

Another hallmark of the relational approach is that all fields are atomic types; that is, they are not lists or subrelations (strings, however, are allowed, even with substring operations). For example, suppose we want to store the student name, id, and list of courses:

⟨'peter', address, 23456, [343, 346, 353]⟩
⟨'paul', address2, 45678, [343, 372]⟩

Lists are not atomic; the relational approach would likely decompose this into two tables, one of ⟨name, address, student_id⟩ data, indexed by student_id, and a second table of ⟨student_id, coursenum⟩, indexed by both fields together and containing

23456	343
23456	346
23453	353
45678	343
45678	372

Note that doing this allows us to recover the original lists of courses on a per-student_id basis, by mixing and matching from the two new tables. Note also that the atomic-types rule would disallow the C ∪ C×D ∪ C×E example above, as union types are not atomic.

Relational DBs, SQL and NoSQL

We will define the concept of a relational database more below, though the rules above are a start.

The SQL language is designed as a way to ask questions about a relational database without specifying exactly how the data is to be retrieved. Consider the following from http://www.aosabook.org/en/nosql.html by Adam Marcus:

SQL is a declarative language for querying data. A declarative language is one in which a programmer specifies what they want the system to do, rather than procedurally defining how the system should do it. A few examples include: find the record for employee 39, project out only the employee name and phone number from their entire record, filter employee records to those that work in accounting, count the employees in each department, or join the data from the employees table with the managers table.

That sounds dandy: you just tell the computer what to do, not how to do it. But there is a downside:

To a first approximation, SQL allows you to ask these questions without thinking about how the data is laid out on disk, which indices to use to access the data, or what algorithms to use to process the data. A significant architectural component of most relational databases is a query optimizer, which decides which of the many logically equivalent query plans to execute to most quickly answer a query. These optimizers are often better than the average database user, but sometimes they do not have enough information or have too simple a model of the system in order to generate the most efficient execution.

It turns out to be surprisingly easy to write inefficient queries, particularly if complex "inner-query" searches are needed (though [even] MySQL is making progress on these). SQL allows you to search on any field, not just a key field for which there is an index; for very large datasets, this is ill-advised. SQL allows, in effect, any query regardless of how efficient it is. Some restrictions imposed by the NoSQL world (or, more accurately, imposed by the real world and implemented by the NoSQL world) are:

Allowing search only on key fields, for which an index is provided
Limiting the use of table joins; avoiding table decomposition as an alternative
Restricted guarantees on transactions, data consistency and durability
Moving more-complex query parts from the query language to the application logic
Supporting "document" data components (eg XML documents with a required structure)
Supporting structured field data, as a way to keep complex data in a single place instead of spread over multiple tables

These limitations are essentially always done for the sake of efficiency and performance.

You can get something like the same effect by using a relational database, and just promising yourself not to use it in inefficient ways.

NoSQL example

The native database used by the Asterisk phone switch consists of tables (actually called "families") that are ⟨key,value⟩ pairs. Neither the key nor the value need be atomic, or for that matter the same type as other entries. (This is done for simplicity, not because the DB is huge.)

The MS Windows registry is a similar example.

Basic SQL operations: querying & updating

SQL:

select * from STUDENT
select * from STUDENT where Student_number = 17;
select * from SECTION where 101 <= Section_identifier and Section_identifier <= 115;

The last query above is somewhat misleading, as Section_identifiers are usually not intended for linear comparison.

Figure 2.1 on EN p 32: a schema for the students database

In the queries above, the * means "all columns of the selected rows"; sometimes we want fewer:

select name from student;
select name, major from student where student_number = 17;
select course_number, semester from section where 101<= section_identifier and section_identifier <= 115;

Finally, when we are working with multiple tables, it is excellent practice to name the tables, and then qualify all column names with the table names, as follows. We will use this style consistently throughout the semester; it is particularly useful when the where clause involves multiple tables (as in joins).

select s.name from student s;
select s.name, s.major from student s where s.student_number = 17;

Table Joins

Suppose in the university database we want to know the names of everyone in section 112. (A peculiarity of the specific data given as example is that no section has more than one student!) The GRADE_REPORT table has only student numbers; we need to match these up with names from the STUDENT table. This operation, of matching corresponding rows of different tables, is known as the join. Here is the SQL for the query, where the join condition is in bold:

select s.name from student s, grade_report gr
where s.student_number = gr.student_number and gr.section_identifier = 112;

We are retrieving records from two tables here, but restricting attention to pairs of records that "match up" according to the join condition.

The join can also be done with the following alternative syntax:

select s.name from student s join grade_report gr
on s.student_number = gr.student_number
where gr.section_identifier = 112;

Finally, maybe we want the name and the grade:

select s.name, gr.grade from student s, grade_report gr
where s.student_number = gr.student_number and gr.section_identifier = 112;

The join operation was once derided as introducing too much inefficiency. Technical advances in the 1980's made this issue less important, but the rise of huge datasets in this century has made this again relevant.

Here are two "classic" joins:

1. Names and grades for everyone in Section 112. The grade_report table has two-column key ⟨student_number,section_identifier⟩. The join is on the student_number attribute, which is the key to the student table. The join can be viewed as taking each grade_report record with section_identifier=112 and using the student table to look up the student number and find the corresponding name.

select s.name , gr.grade from student s, grade_report gr
where s.student_number = gr.student_number and gr.section_identifier = 112;

2. Names of courses for all courses taught in Fall 08. Both tables involved have single-column keys; course_number for course and section_identifier for section. The join involves the key field of the course table, so the join can be viewed as taking each suitable section record and looking up the course number in the course table to find the name.

select c.course_name from course c, section s
where s.course_number = c.course_number and s.year=2008 and s.semester='fall';

Joins "look" symmetrical, but in most cases one table is used as a direct map-type lookup for values appearing in the other table. Reversing the roles here (eg taking each name, and searching the grade_report table for the grades of all matching students in section 112), is a very different kind of operation.

In both the examples above, we used the shorthand s and gr. These are sometimes called table aliases or tuple variables or (in the SQL standard) correlation names. Their use greatly enhances readability, as it makes clear which table a given column comes from. I recommend their use in all queries. See E&N6, p 101.

Sometimes the keyword "as" is inserted: select s.name, gr.grade from student as s, grade_report as gr ....

Sometimes it is helpful to view a table alias as a cursor variable: a variable ranging over each record in the corresponding table. Thus:

    for gr in grade_report:
        for s in student:
               if s.student_number = gr.student_number and gr.section_identifier = 112
               then: print record

This is an oversimplification of how joins are actually implemented, however.

Some other issues

Database tables are usually "self-describing", in that the table description and/or relationships to other tables is often embedded in the table description. At a minimum, columns have names and types.

The table definitions, column names and their types are generally known, collectively, as the database schema.

Database tables can be changed without the need to recompile programs using that table: new columns can be added, or entire new tables. Essentially this is because DBMS queries are interpreted, and table columns are identified by name rather than offset. (C programs that access record fields do need recompilation if fields are changed.)

Databases often support different views, perhaps for users with different privilege levels. A view may be a subset of the original set of columns, or it may contain some computed columns in lieu of the original columns. Excluding columns such as social_security_num or salary is relatively common.

Transactions are sets of related updates, eg removing money from one account and adding it to another, or perhaps dropping one class and adding another (sadly, remarkably many student DBMSs lack the latter transaction operation). EN defines a transaction to be the result of an executing program rather than a set of related updates; this latter definition might include adding multiple new rows to one table.

Transaction processing must satisfy the isolation property, that transactions appear to execute in isolation from one another, and the atomicity property, which says that each transaction is completely executed or not executed at all. (This is sometimes described as that transactions must meet the ACID test: Atomicity, Consistency, Isolation, Durability. However, consistency and durability are relevant to all database updates; they are not particular to transactions.)

Here is a simple failure of isolation: two transactions are transferring money to other accounts.

Transaction 1 is transferring $100 from A to B
Transaction 2 is transferring $50 from A to C

The total amount of money should always remain $1000.

transaction 1	transaction 2	Acct A	Acct B	Acct C
		1000	0	0
get value of A: $1000		1000	0	0
add $100 to B		1000	100	0
	Debit A by $50	950	100	0
Store $1000 - $100 in A		900	100	0
	Credit C by $50	900	100	50

Multi-user DBMSs need concurrency control.

Concurrency control leads to locks, the other great performance bottleneck of RDBMSs.

Section 1.6 of EN talks about some of the advantages of having a central DBMS rather than individual DBMSs maintained by each administrative group. This is pretty much a settled issue now, though it does mean that the "natural owners" of data in an organization (eg registration and records at Loyola, for student registration data) will not in fact own that data.