Comp 305-001/488-305: Database Administration
Spring 2016: Tuesdays, 4:15-6:45, Corboy L08
Week 1
Read in Elmasri & Navathe (EN). Chapter numberings are (7th
edition)/(6th edition)
- Chapter 1/1, Databases and Database Users (skim)
- Chapter 2/2, Database System
Concepts an Architecture, Section 1 on Data Models, Schemas and
Instances
- Chapter 5/3, The Relational Data Model and SQL Sections 1
& 2
Course work:
- midterm
- final
- SQL assignments
- other homework?
Tables
DBA
Installation
History/Maps/Terminology
NoSQL
Simple queries
Joins
Keys and Foreign Keys
What is a database?
Mainly, a set of related tables. Each table is a set of records
(rows, items) of similar type.
Example: the company
database. Let's start with company.pdf.
The term "database" sometimes refers to the software package, eg MySQL or
Postgres, but most modern databases also use it to refer to a particular
collection of related tables. Thus Postgres might support multiple
databases, each with different names. The company database might be one; the
university
database might be another.
We interact with databases through queries, using the SQL
language. Here are a few very simple read-only (data access) queries,
involving only the employee table.
select * from employee e where e.ssn =
'333445555'; -- one unique result
select e.fname, e.lname, e.salary from employee e where e.super_ssn =
'333445555'; -- multiple results
select e.lname, e.ssn, e.salary from employee e where e.salary >=
30000;
We will consistently use table-alias variables, like e
above, although SQL does not absolutely require them. We'll return to SQL
below, but for now note that each query has three parts:
- a select clause specifying the columns
desired
- a from clause specifying the tables
involved
- a where clause specifying the rows
desired
The queries above involved just one table. The most interesting queries
involve joins of two or more tables. The join
condition defines a relationship between the tables.
list each employee with their department
name:
select e.fname, e.lname, d.dname from employee e join department d on
e.dno = d.dnumber;
We used the explicit join notation above, in which the from
clause lists a single, joined table. The implicit join notation would be as
follows; the join condition is now moved to the where
clause. It is deprecated and should not be used in this class.
select e.fname, e.lname, d.dname from
employee e, department d where e.dno = d.dnumber;
Why didn't we just put the department name in the employee table?
list employees working on project 2, with
hours:
select e.fname, e.lname, w.hours, w.pno from employee e join works_on w on
e.ssn = w.essn where w.pno = 2;
Why didn't we just put the employee's projects in the employee table? This
has a very different answer!
Here's a join involving three tables. How would you describe it in English?
select e.fname, e.lname, p.pname
from employee e join works_on w on e.ssn = w.essn join project p on w.pno
= p.pnumber
where w.hours >= 10;
Why don't we put everything into one table? Why do we have
multiple tables at all? Here's a query that would give us the employee,
works_on and project tables in one large table (with some columns edited
out)
select e.fname, e.lname, e.salary, w.hours,
p.pname, p.plocation
from employee e join works_on w on e.ssn = w.essn join project p on w.pno
= p.pnumber;
Suppose somebody tells you "we can just keep one spreadsheet with the above
information", and that therefore you don't need a database with multiple
tables! What could possibly go wrong?
What if we want to change the name of ProductX to FooMatic? What if we want
to give Franklin Wong a raise? What if Reorganization moves from Houston to
Stafford? In each case, we need to update multiple records
to make one change. What if we miss a record?
Division of a database into multiple "independent" tables is a very
important practice. The goal of the decomposition is to retain all the
original information while at the same time avoiding the data-consistency
problem of having multiple locations that should have
the same value, but might not. We "reconnect" separate tables using the join
operation. Unfortunately, if we have a lot of data, joins aren't
always as efficient as we'd like.
What does a DBA (Database Administrator) do?
Here's a traditional, somewhat Oracle-centric list:
- Manage installation, configuration and upgrade
- Manage resources (cpu, RAM, disk, network), and
makes sure these are sufficient
- Decide on file formats and file locations
- Disk assignment; eg putting two files joined frequently on different
physical disks
- Deciding which files should go on the fast SSD versus the slower
magnetic disk
- Monitor performance, and adjust configuration to improve performance
- Monitor logs
- Monitor query performance; optimize frequently used queries
- Participate in database design: what goes into each table, and how do
different tables relate?
- Decide on database indexes
- Make occasional table changes, or other configuration changes
- Monitor security
- Manage access: who gets to see what data, maintains
access policies and documentation
- Implement data backups
We might add a couple broader missions to this list:
- Identify information from the database that will have a significant
impact on the organization's mission
- Identify the database and tools that will allow the extraction of this
information quickly and inexpensively
Our goal in this course is to cover the fundamentals of database
operation, focusing primarily but not exclusively on the relational model,
and addressing both theoretical and practical concerns. We will, in the
process of doing this, cover a fair bit of SQL.
The Oracle
Database Administrator's Guide lists the following steps as central
for DB administration:
Task 1: Evaluate the Database Server
Hardware
Task 2: Install the Oracle Database Software
Task 3: Plan the Database
Task 4: Create and Open the Database
Task 5: Back Up the Database
Task 6: Enroll System Users
Task 7: Implement the Database Design
Task 8: Back Up the Fully Functional Database
Task 9: Tune Database Performance
Task 10: Download and Install Patches
Task 11: Roll Out to Additional Hosts
This is a somewhat more "mechanical" list of tasks than mine.
Here is a rather simple division of database applications into two
categories:
- Online Transaction Processing (OLTP): eg inventory
management, advertising, etc. Transactions are a set of record updates.
They are simple but continual. Record locking is needed. Often an entire
record is needed.
- Data Warehouse: eg customer history. The database is
often much larger, but most accesses are read-only, as part of
data-mining analysis. The DW might be updated only once a week. Queries
tend to be complex and infrequent. Queries often tend to involve just a
few columns; retrieving entire records may mean unnecessary I/O.
The traditional Relational Database Management System (RDBMS): like Oracle
and MySQL. Data is stored in rows; one disk fetch is
needed for one row. The Structured Query Language (SQL) is used to access
and update the data.
Problems with traditional RDBMS:
- 50%-90% overhead on record-level locks and the write-ahead recovery
log.
- Inefficient table-join operations
There are many alternatives to traditional RDBMSs; see below. These
typically are developed to address RDBMS difficulties with massive data
stores, although often people then try to apply the new models to
ordinary-sized data stores.
The RDBMS-still-central contrarian view: www.mongodb-is-web-scale.com
(warning: some (ok, a lot of) rude language, and some cluelessness regarding
agriculture (for example, most "bulls" are castrated as calves, not as
adults))
Installation
assignment:
Install
PostgreSQL on your own machine
Oracle v MySQL
Terms; demo; SQL
Consistency
Relational "lists"
Oracle v MySQL v PostgreSQL: three general-purpose databases.
Oracle has many proprietary extensions. Some are there more to lock you into
Oracle than because they are a good idea. That said, Oracle does
do well with big, transactional DBs. But Oracle is expensive to license.
Note that, with Oracle's purchase of Sun Microsystems, Oracle now owns
MySQL. Some people think this is why they bought Sun. (Others think it was
for Java.)
PostgreSQL has always tried to support (close to) the full SQL standard, and
to support it fairly strictly. In this it resembles Oracle, without the
latter's proprietary extensions.
Performance used to be an issue in Postgres. But in the last decade a great
deal of progress has been made; Postgres is now very fast. Postgres is also
much more similar to Oracle than MySQL (which is a little ironic because
Oracle now owns MySQL, though the usual assumption is that Oracle
doesn't want MySQL to compete with the family jewels).
Traditionally, MySQL often omitted support for some features. Transactions
are one, though this is now beginning to appear in MySQL. Another is key
constraints, and foreign key constraints
in particular. However, consider the following:
- Application programs generally have to explicitly check for foreign
key values, anyway; otherwise, it is difficult to respond naturally to a
user error.
- Now that Oracle has acquired the innodb
database engine for MySQL, foreign key constraints are now implemented
whenever the DB administrator chooses the innodb engine (which is now
the default).
MySQL is dual-licensed: as open-source (which means your additions must also
be open-source, if you release them) and as a proprietary product (meaning
you can develop and sell
proprietary extensions to MySQL, provided you pay the licensing fee).
History
Relational databases came out of theoretical work of Edgar Codd, in a 1970
paper titled A Relational Model of Data
for Large Shared Data Banks. At the time, the approach was seen as
too computationally expensive. But by 1980, both IBM DB2 and (early version
of) Oracle were out, and it was becoming generally recognized that
relational databases were the wave of the future.
Codd's relational approach solved a major data
consistency problem, by eliminating redundancy in data storage. We
looked at that above.
While there are now other types of databases, notably object-oriented
databases, it is fair to say that no later development has offered a
convincing solution to a general organizational problem that is intractable
with the relational model. That said, relational databases sometimes have
serious scaling problems, and
there is a significant No-SQL movement.
SEQUEL / SQL
In the early 1970's, IBM introduced SEQUEL: Structured English Query
Language. For trademark reasons, the name had to be changed; IBM chose SQL
(Structured Query Language). To this day, many people prefer to pronounce
SQL as "sequel" rather than as "ess queue ell"; note that in this case the
"sequel" pronunciation actually came first.
The "official" pronunciation for MySQL is "My ess queue ell", but they
tolerate "mysequel". The following is from the MySQL reference manual,
§1.3.2:
MySQL is named after co-founder Monty
Widenius's daughter, My.
Widenius's youngest daughter is Maria. Widenius is currently working on MariaDB, a drop-in replacement for MySQL.
SQL is fundamentally command-line. It is also a rather non-procedural
language!
Wrapping a good Human-Computer Interface (HCI) around SQL is good practice
(well-nigh essential); generally
this is done in some high-level language with a SQL interface (eg JDBC).
However, actual queries are still done in SQL.
Maps
Underlying all databases is the Map data structure. A Map object has a key
field, say of type K, and a data field, say of type D,
with the following operations:
insert(K key, D data); // inserts or updates a
⟨key,data⟩ record
D lookup(K key);
// retrieves the data portion given the key
There are some minor variants; for example, sometimes insert() requires that
the key not already be present, and an update(K key, D data) is provided
when the key is already present. Sometimes there are separate methods to
look up to see if a key is present.
From this perspective, a database is a collection of Map structures, with
the following additions:
- The data type D -- and, for that matter, the key type K -- can consist
of multiple subfields
- We can have multiple keys.
- The lookup() operation generally returns the entire record, key
included.
Terminology
University database, EN p 6
A record is a list of data fields.
The fields do not have to have the same type, and are usually referenced by
name rather than by position (or number). Example: ⟨'Peter', 'Dordal',
'123561234', 32020, 15.7, true, 37⟩. The fields may also be called columns
or attributes.
A table is a set of records. In
some object-oriented databases we might identify a record with a class type,
and a single table might consist of records from a parent class and from
multiple base classes. However, in relational databases all records of a
table have the same type: all records have the same fields, in the same
order.
If we imagine a table laid out graphically, with records as rows, the use of
the word column to describe a field
makes more sense.
A database is a set of tables. The
database may impose some constraints between tables.
A key field is a field we can use
for searching a table, that is guaranteed to return at most a single record.
That is, the SSN field of an Employee database is a key (we hope!), but the
FirstName field is probably not. Given a key field, a database might
maintain an index to speed up
lookups on that key; non-key fields can also have indexes however, and
without an index we can always use linear search.
In the University database of EN, here are the keys:
- STUDENT: Student_number
- COURSE: Course_number
- SECTION: Section_identifier (this is not completely obvious)
- GRADE_REPORT: ⟨Student_number, Section_identifier⟩
Tables with two columns as key often serve to represent relationships;
the GRADE_REPORT table identifies the "has-taken-the-class" relationship
between STUDENTs and SECTIONs: Peter has-taken-the-class COMP-305-001.
We can use the four tables here to provide a list of what students have
taken what courses in what semesters, and what grades they earned. This sort
of mixing and matching of data from multiple tables is the hallmark of the relational database approach.
Abstractly, a relation is a set of
lists/rows/tuples that are all of the same type: there are the same number
of fields in each row, and each field consists of elements of a single type.
This corresponds to a subset of the cartesian product A1×A2×...×An
of the underlying attribute sets. Relational
databases almost always enforce this restriction: rows of a given
table must be homogeneous: all the
rows must be of the same length and with the same respective field types.
Mathematically, however, one can have a
relation that is a subset of A×B, where B is a union
of a base class C and two derived classes C×D and C×E: C ∪ C×D ∪ C×E. This
would allow the relation to have heterogeneous
tuples like ⟨a,c⟩ ⟨a,c,d⟩ and ⟨a,c,e⟩, though these might be more
accurately written as ⟨a,c⟩ ⟨a,⟨c,d⟩⟩ and ⟨a,⟨c,e⟩⟩. Some object-oriented
databases do in fact provide support for constructions such as this,
although relational databases usually do not.
Another hallmark of the relational approach is that all fields are supposed
to be atomic types; that is, they
are not lists or sub-relations (strings, however, are allowed, even with
substring operations). For example, suppose we want to store the student
name, id, and list of courses:
⟨'peter', address, 23456, [343,
346, 353]⟩
⟨'paul', address2, 45678, [343,
372]⟩
Lists are not atomic; a strict relational approach would decompose this into
two tables, one of ⟨name, address, student_id⟩ data, indexed by student_id,
and a second table of ⟨student_id, coursenum⟩, indexed by both
fields together and containing
23456
|
343
|
23456
|
346
|
23453
|
353
|
45678
|
343
|
45678
|
372
|
Note that doing this allows us to recover the original lists of courses on a
per-student_id basis, by mixing and matching from the two new tables. Note
also that the atomic-types rule would disallow the C ∪ C×D ∪ C×E example
above, as union types are not atomic.
Sometimes the above decomposition is too inefficient. To get the list of
Peter's courses, we have to search a rather large secondary table. Postgres
does allow us to include lists as fields, violating the strict-relational
model.
Relational DBs, SQL and NoSQL
We will define the concept of a relational database more below, though the
rules above are a start.
The SQL language is designed as a way to ask questions about a relational
database without specifying exactly how the data is to be retrieved.
Consider the following from http://www.aosabook.org/en/nosql.html
by Adam Marcus:
SQL is a declarative language for querying
data. A declarative language is one in which a programmer specifies what
they want the system to do, rather than procedurally defining how
the system should do it. A few examples include: find the record for
employee 39, project out only the employee name and phone number from
their entire record, filter employee records to those that work in
accounting, count the employees in each department, or join the data from
the employees table with the managers table.
That sounds dandy: you just tell the computer what
to do, not how to do it. But there
is a downside:
To a first approximation, SQL allows you to
ask these questions without thinking about how the data is laid out on
disk, which indices to use to access the data, or what algorithms to use
to process the data. A significant architectural component of most
relational databases is a query optimizer, which decides which
of the many logically equivalent query plans to execute to most quickly
answer a query. These optimizers are often better than the average
database user, but sometimes they do not have enough information or have
too simple a model of the system in order to generate the most efficient
execution.
It turns out to be surprisingly easy to write inefficient queries,
particularly if complex "inner-query" searches are needed (though [even]
MySQL is making progress on these). SQL allows you to search on any field,
not just a key field for which there is an index; for very large datasets,
this is ill-advised. SQL allows, in effect, any query regardless of how
efficient it is. Some restrictions imposed by the NoSQL world (or, more
accurately, imposed by the real
world and implemented by the NoSQL
world) are:
- Allowing search only on key fields, for which an index is provided
- Limiting the use of table joins;
avoiding table decomposition as an alternative
- Restricted guarantees on transactions,
data consistency and durability
- Moving more-complex query parts from the query language to the
application logic
- Supporting "document" data components (eg XML documents with a
required structure)
- Supporting structured field data, as a way to keep complex data in a
single place instead of spread over multiple tables
These limitations are essentially always done for the sake of efficiency
and performance.
You can get something like the same effect by using a relational database,
and just promising yourself not to use it in inefficient ways.
NoSQL example
The native database used by the Asterisk phone switch consists of tables
(actually called "families") that are ⟨key,value⟩ pairs. Neither the key nor
the value need be atomic, or for that matter the same type as other entries.
(This is done for simplicity, not because the DB is huge.)
The MS Windows registry is a similar example.
Basic SQL operations: querying & updating
SQL:
select * from STUDENT
select * from STUDENT where Student_number = 17;
select * from SECTION where 101 <= Section_identifier and
Section_identifier <= 115;
The last query above is somewhat misleading, as Section_identifiers are
usually not intended for linear comparison.
Figure 2.1 on EN p 32: a schema for the students database
In the queries above, the * means "all columns of the selected rows";
sometimes we want fewer:
select name from student;
select name, major from student where student_number = 17;
select course_number, semester from section where 101<=
section_identifier and section_identifier <= 115;
Finally, when we are working with multiple tables, it is excellent practice
to name the tables, and then
qualify all column names with the table names, as follows. We will use this
style consistently throughout the semester; it is particularly useful when
the where clause involves multiple
tables (as in joins).
select s.name from student s;
select s.name, s.major from student s where s.student_number = 17;
Table Joins
Suppose in the university database we want to know the names of everyone in
section 112. (A peculiarity of the specific data given as example is that no
section has more than one student!) The GRADE_REPORT table has only student
numbers; we need to match these up with names from the STUDENT table. This
operation, of matching corresponding rows of different tables, is known as
the join. Here is the SQL for the
query, where the join condition is
in bold:
select s.name from student s, grade_report
gr
where s.student_number =
gr.student_number and gr.section_identifier = 112;
We are retrieving records from two
tables here, but restricting attention to pairs of records that "match up"
according to the join condition.
The join can also be done with the following alternative syntax:
select s.name from student s join
grade_report gr
on s.student_number = gr.student_number
where gr.section_identifier = 112;
Finally, maybe we want the name and the grade:
select s.name, gr.grade from student s,
grade_report gr
where s.student_number =
gr.student_number and gr.section_identifier = 112;
The join operation was once derided as introducing too much inefficiency.
Technical advances in the 1980's made this issue less important, but the
rise of huge datasets in this century has made this again relevant.
Here are two "classic" joins:
1. Names and grades for everyone in Section 112. The grade_report table has
two-column key ⟨student_number,section_identifier⟩. The join is on the
student_number attribute, which is the key to the student table. The join
can be viewed as taking each grade_report record with section_identifier=112
and using the student table to look up the student number and find the
corresponding name.
select s.name , gr.grade
from student s join grade_report gr on s.student_number
= gr.student_number
where gr.section_identifier = 112;
2. Names of courses for all courses taught in Fall 08. Both tables involved
have single-column keys; course_number for course and section_identifier for
section. The join involves the key field of the course
table, so the join can be viewed as taking each suitable section record and
looking up the course number in the course table to find
the name.
select c.course_name
from course c join section s on s.course_number = c.course_number
where s.year=2008 and s.semester='Fall';
Joins "look" symmetrical, but in most cases one table is used as a direct
map-type lookup for values appearing in the other table. Reversing the roles
here (eg taking each name, and searching the grade_report table for the
grades of all matching students in section 112), is a very different kind of
operation.
Most joins involve a primary-key attribute from one table joined to an
attribute on the other table that has a foreign-key constraint referencing
the primary key of the first table. The primary keys in the examples above
are student.student_number and course.course_number.
In both the examples above, we used the shorthand s and gr.
These are sometimes called table aliases or tuple
variables or (in the SQL standard) correlation names.
Their use greatly enhances readability, as it makes clear which table a
given column comes from. I recommend their use in all queries. See E&N6,
p 101.
Sometimes the keyword "as" is inserted: select s.name, gr.grade from student
as s, grade_report as gr ....
Sometimes it is helpful to view a table alias as a cursor variable:
a variable ranging over each record in the corresponding table. Thus:
for gr in grade_report:
for s in student:
if
s.student_number = gr.student_number and gr.section_identifier = 112
then:
print record
This is an oversimplification of how joins are actually
implemented, however; it is too inefficient for large tables.
Keys and Foreign Keys
Except in usual cases, every table will have a primary key.
For example, the primary key of table 'employee' is the ssn attribute, the
primary key of table 'department' is the dnumber attribute, and the primary
key of table 'works_on' is the pair of attributes (essn,pno).
The database enforces the rule that we cannot have two different records
with the same primary-key value. This is a constraint, not a description. It
happens to be the case that no two employees have the same lname
(or same fname), but lname is not a key field as we can easily add
Samuel Wong. But if we try to add Samuel Wong with ssn 333445555, the
insertion (or update) will fail.
As we will see later, generally an index is created on the
primary-key attribute. To enforce the primary-key constraint, every time a
primary-key value is inserted or updated the index is used to see if another
record has the same key value.
Several of the Company and University tables also have foreign-key
constraints. Such constraints are quite different, and don't involve a key
constraint at all. The classic example is employee.dno, which is intended as
a reference to a row in the department table that contains the full
information for that department. We don't want to allow an employee to have
dno=6, because over in the department table there is no
department with dnumber=6.
Another way to put this is to say that employee.dno has a foreign-key
constraint which references
department.dnumber. The constraint here is on employee.dno: any value for
employee.dno must appear somewhere in the dnumber column of department.
Usually the referenced attribute is a key for its table, as is
department.dnumber here. The attribute employee.dno is sometimes said to be
the foreign key, because it must match a value in a
"foreign" table that is, in that table, a key.
To enforce this foreign-key constraint, the database system must check
whenever a new employee is inserted, or an employee.dno value is updated. It
must also check whenever a department is deleted, or a
department.dnumber value is updated.
The Company database has several other foreign-key constraints:
- department.mgr_ssn has a foreign-key constraint referencing
employee.ssn
- employee.super_ssn has a foreign-key constraint referencing
employee.ssn
- works_on.essn has a foreign-key constraint referencing employee.ssn
- works_on.pno has a foreign-key constraint referencing project.pnumber
Sometimes foreign-key constraints lead to annoying circularities. When the
tables employee and department are created initially, both are empty. We
cannot add the first employee until the corresponding department already
exists. Similarly, due to the first foreign-key constraint in the bulleted
list above, we cannot add the first department until after the
department-manager employee already exists.
First Look at ER Diagrams
Programmers often use Booch diagrams or UML diagrams to display object
relationships visually. DBAs usually use Entity-Relationship, or ER,
diagrams. Entities are the "physical objects" represented by the database,
drawn with rectangles. Relationships are between entities; the easiest
relationships are the binary ones. In the Company database the entities are
- employee
- project
- department
- dependent
The diagram can be seen in ER.html.
The most interesting relationship is works_on, between
employee and project, because it is
many-to-many (one employee can work on many projects, and one project can
have many employees). There are also several one-to-many relationships:
- supervises (one supervisor to many supervisees)
- dept_member (one department to many employees)
- controls (one dept to many projects)
All these ended up without their own table. Where do we encode each of
these three?
There's also a one-to-one manages relationship between
departments and employees. Every department has exactly one manager, and
every employee manages at most one department. Most
employees don't manage any department, but that doesn't change the
one-to-one rule. We'll come back to this one later.