A Relational Database Introduction
    Tables
      DBA
      History/Maps/Terminology
      NoSQL
      Simple queries
      Joins
      Keys and Foreign Keys
      ER diagram intro
      Company DB
      Relational math
      Joins Again
      Keys Again
      Transactions
    
     What is a database?
    
    A database is, primarily, a set of related tables. Each
    table is a set of records (rows, items) of similar type.
    Example: the company
      database. Let's start with company.pdf.
    The company database has these tables:
    
      - employee           
        name, ssn, bdate, address, sex, salary,
        super_ssn, dno
 
      - department         dname, dnumber,
        mgr_ssn, mgr_start
 
      - dept_locations    dnumber, dlocation
 
      - project               
        pname, pnumber, plocation, dnum
 
      - works_on           
        essn, pno, hours
 
      - dependent          
        essn, dependent_name, sex, bdate,
        relationship
 
    
    Records are lists (or "tuples") of attributes; each attribute may have its
    own type though string types are by far the most common. The attributes
    above in bold are the "key" attributes. We will use the
    company database frequently.
    
    Mathematically, a table is a relation (or relationship),
    hence the name "relational database".
    
    Database systems such as Postgres, MySQL and Oracle are sometimes,
    colloquially, described as "databases", though the terms "database system"
    or "relational database management system" can be used when there might be
    ambiguity.
    
    Most modern database systems support multiple collections of related tables,
    any one of which may be referred to as a "database". Thus a Postgres
    installation might include multiple databases, each with different names.
    The company database might be one; the university
    database might be another.
    
    We interact with databases through queries, using SQL
    (Structured Query Language) . Here are a few very simple read-only (data
    access) queries, involving only the employee table from the company
    database.
    
    select * from employee e where e.ssn =
      '333445555';    -- one unique result
      select e.fname, e.lname, e.salary from employee e where e.super_ssn =
      '333445555';    -- multiple results
      select e.lname, e.ssn, e.salary from employee e where e.salary >=
      30000;    
    
    We will consistently use table-alias variables, like e
    above, although SQL does not absolutely require them. We'll return to SQL
    below, but for now note that each query above has three parts:
    
      - a select clause specifying the columns
        desired
 
      - a from clause specifying the tables
        involved
 
      - a where clause specifying the rows
        desired
 
    
    The queries above involved just one table. The most interesting queries
    involve joins of two or more tables. The join
      condition defines a relationship between the tables.
    
    list each employee with their department
      name:
      select e.fname, e.lname, d.dname from employee e join department
        d on e.dno = d.dnumber;  
    
    We used the explicit join notation above, in which the from
    clause lists a single, joined table. The implicit join notation would be as
    follows; the join condition is now moved to the where
    clause. It is deprecated and should not be used in this class.
    
    select e.fname, e.lname, d.dname from
        employee e, department d where e.dno = d.dnumber; 
    
    Why didn't we just put the department name in the employee table? 
    
    Now consider the following query involving a join of the employee and
    project tables, in which we list all employees working on project 2, with
    hours:
    
    select e.fname, e.lname, w.hours, w.pno from
      employee e join works_on w on e.ssn = w.essn where w.pno = 2;
    
    Why didn't we just put the employee's projects in the employee table? This
    has a very different answer! 
    
    Here's a join involving three tables. How would you describe it in English?
    
    select e.fname, e.lname, p.pname 
      from employee e join works_on w on e.ssn = w.essn join project p on w.pno
      = p.pnumber
      where w.hours >= 10;
    
    Why don't we put everything into one single table? Why do
    we have multiple tables at all? Here's a query that would give us the
    employee, works_on and project tables in one large table (with some columns
    edited out)
    
    select e.fname, e.lname, e.salary, w.hours,
      p.pname, p.plocation
      from employee e join works_on w on e.ssn = w.essn join project p on w.pno
      = p.pnumber;
    
    Suppose somebody tells you "we can just keep one spreadsheet with the above
    information", and that therefore you don't need a database with multiple
    tables! What could possibly go wrong?
    
    What if we want to change the name of ProductX to FooMatic? What if we want
    to give Franklin Wong a raise? What if Reorganization moves from Houston to
    Stafford? In each case, we need to update multiple records
    to make one change. What if we miss a record? 
    
    Division of a database into multiple "independent" tables is a very
    important practice. The goal of the decomposition is to retain all the
    original information while at the same time avoiding the data-consistency
      problem of having multiple locations that should have
    the same value, but might not. We "reconnect" separate tables using the join
    operation. Unfortunately, if we have a lot of data, joins aren't
    always as efficient as we'd like.
    
    
     What does a DBA (Database Administrator) do? 
    Here's a traditional, somewhat Oracle-centric list:
    
      - Manage installation, configuration and upgrade
 
      - Manage resources (cpu, RAM, disk, network), and
        makes sure these are sufficient
 
      
        - Decide on file formats and file locations
 
        - Disk assignment; eg putting two files joined frequently on different
          physical disks
 
        - Deciding which files should go on the fast SSD versus the slower
          magnetic disk
 
      
      - Monitor performance, and adjust configuration to improve performance
 
      
        - Monitor logs
 
        - Monitor query performance; optimize frequently used queries
 
      
      - Participate in database design: what goes into each table, and how do
        different tables relate?
 
      - Decide on database indexes
 
      - Make occasional table changes, or other configuration changes
 
      - Monitor security
 
      - Manage access: who gets to see what data, maintains
        access policies and documentation
 
      - Implement data backups
 
    
    We might add a few broader missions to this list:
    
      - Identify information from the database that will have a significant
        impact on the organization's mission
 
      - Identify the database and tools that will allow the extraction of this
        information quickly and inexpensively
 
      - Manage queries so as to maintain suitable overall performance
 
      - Figure out how to migrate away from Oracle
 
    
    Our goal in this course is to cover the fundamentals of database
      operation, focusing primarily but not exclusively on the relational model,
      and addressing both theoretical and practical concerns. We will, in the
      process of doing this, cover a fair bit of SQL. The last item above,
      "manage queries so as to maintain suitable overall performance", includes
      a lot of DB tuning and optimization. 
    And then there is one more item to add, that for sure is not on
      the Oracle list: 
    
      - Choose the most appropriate database for each new project
 
    
    Or, to put it another way, should this project use Postgres? MySQL?
      Oracle? Or should it abandon traditional RDBMSs altogether and use
      MongoDB? BigTable? Hadoop? Cassandra? This course will focus on
      traditional RDBMSs, but the so-called no-SQL databases have definitely
      made an impact.
    
    The Oracle
        Database Administrator's Guide lists the following steps as central
      for DB administration:
    
    Task 1: Evaluate the Database Server
      Hardware
      Task 2: Install the Oracle Database Software
      Task 3: Plan the Database
      Task 4: Create and Open the Database
      Task 5: Back Up the Database
      Task 6: Enroll System Users
      Task 7: Implement the Database Design
      Task 8: Back Up the Fully Functional Database
      Task 9: Tune Database Performance
      Task 10: Download and Install Patches
      Task 11: Roll Out to Additional Hosts
    
    This is a somewhat more "mechanical" list of tasks than mine.
    
    Here is a rather simple division of traditional database applications into
    two categories:
    
      - Online Transaction Processing (OLTP): eg inventory
        management, advertising, etc. Transactions are a set of record updates.
        They are simple but continual. Record locking is needed. Often an entire
        record is needed.
 
      - Data Warehouse: eg customer history. The database is
        often much larger, but most accesses are read-only, as part of
        data-mining analysis. The DW might be updated only once a week. Queries
        tend to be complex and infrequent. Queries often tend to involve just a
        few columns; retrieving entire records may mean unnecessary I/O.
 
    
    Another category, often taken to lie in between OLTP and Warehousing, is
      CRUD: Create, Read, Update, Delete. A CRUD application does all these
      things. OLTP is also sometimes viewed as a "simplified" form of CRUD, or
      else as CRUD plus large-scale concurrency.
    
    The traditional Relational Database Management System (RDBMS): like
      Oracle and MySQL. Data is stored in rows; one disk fetch
      is needed for one row. The Structured Query Language (SQL) is used to
      access and update the data.
    
    Problems with traditional RDBMSs:
    
      - Lots of overhead on record-level locks and the write-ahead recovery
        log (it's hard to quantify this, but overhead can easily exceed 50% of
        the total. On the other hand, you cannot manage without locking)
 
      - Inefficient table-join operations
 
    
    There now are many alternatives to traditional RDBMSs; see below. These
    typically are developed to address RDBMS difficulties with massive data
    stores, although often people then try to apply the new models to
    ordinary-sized data stores. Collectively, they are known as "no-SQL"
    databases, but recently the "no" has been taken to stand for "not
    only". 
    
    The RDBMS-still-central contrarian view: www.mongodb-is-web-scale.com
    (warning: some (ok, a lot of) rude language, and some cluelessness regarding
    animal husbandry (for example, most "bulls" are castrated as calves, not as
    adults))
    
    
    
    Some Databases
    
    Oracle v MySQL v PostgreSQL: three general-purpose databases.
    
    Oracle has many proprietary extensions. Some are there more to lock you into
    Oracle than because they are a good idea. That said, Oracle does
    do well with big, transactional DBs. But Oracle is expensive to license.
    
    Note that, with Oracle's purchase of Sun Microsystems, Oracle now owns
    MySQL. Some people think this is why they bought Sun. (Others think it was
    for Java.)
    
    PostgreSQL has always tried to support (close to) the full SQL standard, and
    to support it fairly strictly. In this it resembles Oracle, without the
    latter's proprietary extensions.
    
    Performance used to be an issue in Postgres. But in the last decade a great
    deal of progress has been made; Postgres is now very fast. Postgres is also
    much more similar to Oracle than MySQL (which is a little ironic because
    Oracle now owns MySQL, though the usual assumption is that Oracle
    doesn't want MySQL to compete with the family jewels).
    
    Traditionally, MySQL often omitted support for some features. Transactions
    are one, though this is now beginning to appear in MySQL. Another is key
    constraints, and foreign key constraints
    in particular. However, consider the following:
    
      - Application programs generally have to explicitly check for foreign
        key values, anyway; otherwise, it is difficult to respond naturally to a
        user error.
 
      - Now that Oracle has acquired the innodb
        database engine for MySQL, foreign key constraints are now implemented
        whenever the DB administrator chooses the innodb engine (which is now
        the default).
 
    
    MySQL is dual-licensed: as open-source (which means your additions must also
    be open-source, if you release them) and as a proprietary product (meaning
    you can develop and sell
    proprietary extensions to MySQL, provided you pay the licensing fee).
    
    
    
    
    History
    Relational databases came out of theoretical work of Edgar Codd, in a 1970
    paper titled A Relational Model of Data
      for Large Shared Data Banks. At the time, the approach was seen as
    too computationally expensive. But by 1980, both IBM DB2 and (early version
    of) Oracle were out, and it was becoming generally recognized that
    relational databases were the wave of the future.
    
    Codd's relational approach solved a major data
      consistency problem, by eliminating redundancy in data storage. We
    looked at that above.
    
    While there are now other types of databases, notably object-oriented
    databases, it is fair to say that no later development has offered a
    convincing solution to a general organizational problem that is intractable
    with the relational model. That said, relational databases sometimes have
    serious scaling problems, and
    there is a significant No-SQL movement.
    
     SEQUEL / SQL
    In the early 1970's, IBM introduced SEQUEL: Structured English Query
    Language. For trademark reasons, the name had to be changed; IBM chose SQL
    (Structured Query Language). To this day, many people prefer to pronounce
    SQL as "sequel" rather than as "ess queue ell"; note that in this case the
    "sequel" pronunciation actually came first.
    
    The "official" pronunciation for MySQL is "My ess queue ell", but they
    tolerate "mysequel". The following is from the MySQL reference manual,
    §1.3.2:
    
    MySQL is named after co-founder Monty
      Widenius's daughter, My. 
    
    
    Widenius's youngest daughter is Maria. Widenius is currently working on MariaDB, a drop-in replacement for MySQL.
    
    SQL is fundamentally command-line. It is also a rather non-procedural
    language!
    
    Wrapping a good Human-Computer Interface (HCI) around SQL is good practice
    (well-nigh essential); generally
    this is done in some high-level language with a SQL interface (eg JDBC).
    However, actual queries are still done in SQL.
    
    
    Mappings
    Underlying all databases is the Map data structure. A Map object has a key
    field, say of type K, and a data field, say of type D,
    with the following operations:
    insert(K key, D data);    // inserts or updates a
      ⟨key,data⟩ record
      D lookup(K key);           
      // retrieves the data portion given the key
    
    There are some minor variants; for example, sometimes insert() requires that
    the key not already be present, and an update(K key, D data) is provided
    when the key is already present. Sometimes there are separate methods to
    look up to see if a key is present. 
    
    From this perspective, a database is a collection of Map structures, with
    the following additions:
    
      - The data type D -- and, for that matter, the key type K -- can consist
        of multiple subfields
 
      - We can have multiple keys.
 
      - The lookup() operation generally returns the entire record, key
        included.
       
    
    Each database table has a primary key, that can be used to
    search for records. Sometimes the primary key is a single column, sometimes
    it's multiple columns. A table may also have one or more secondary
      keys, which are also keys, but not designated "primary".
    
    
     Terminology
    A record is a list of data fields.
    The fields do not have to have the same type, and are usually referenced by
    name rather than by position (or number). Example: ⟨'Peter', 'Dordal',
    '123561234', 32020, 15.7, true, 37⟩. The fields may also be called columns
    or attributes.
    
    A table is a set of records. In
    some object-oriented databases we might identify a record with a class type,
    and a single table might consist of records from a parent class and from
    multiple base classes. However, in relational databases all records of a
    table have the same type: all records have the same fields, in the same
    order.
    
    If we imagine a table laid out graphically, with records as rows, the use of
    the word column to describe a field
    makes more sense.
    
    A database is a set of tables. The
    database may impose some constraints between tables.
    
    A key field is a field we can use
    for searching a table, that is guaranteed to return at most a single record.
    That is, the SSN field of an Employee database is a key (we hope!), but the
    FirstName field is probably not. Given a key field, a database might
    maintain an index to speed up
    lookups on that key; non-key fields can also have indexes however, and
    without an index we can always use linear search. In the University database
    of EN (below), here are the keys:
    
      - STUDENT: Student_number
 
      - COURSE: Course_number
 
      - SECTION: Section_identifier (this is not completely obvious)
 
      - GRADE_REPORT: ⟨Student_number, Section_identifier⟩
 
    
    Tables with two columns as key often serve to represent relationships;
    the GRADE_REPORT table identifies the "has-taken-the-class" relationship
    between STUDENTs and SECTIONs: Peter has-taken-the-class COMP-305-001.
    
    We can use the four tables here to provide a list of what students have
    taken what courses in what semesters, and what grades they earned. This sort
    of mixing and matching of data from multiple tables is the hallmark of the relational database approach.
    
    Abstractly, a relation is a set of
    lists/rows/tuples that are all of the same type: there are the same number
    of fields in each row, and each field consists of elements of a single type.
    This corresponds to a subset of the cartesian product A1×A2×...×An
    of the underlying attribute sets. Relational
      databases almost always enforce this restriction: rows of a given
    table must be homogeneous: all the
    rows must be of the same length and with the same respective field types. 
    
    Mathematically, however, one can have a
      relation that is a subset of A×B, where B is a union
      of a base class C and two derived classes C×D and C×E: C ∪ C×D ∪ C×E. This
      would allow the relation to have heterogeneous
      tuples like ⟨a,c⟩ ⟨a,c,d⟩ and ⟨a,c,e⟩, though these might be more
      accurately written as ⟨a,c⟩ ⟨a,⟨c,d⟩⟩ and ⟨a,⟨c,e⟩⟩. Some object-oriented
      databases do in fact provide support for constructions such as this,
      although relational databases usually do not.
    
    
    Another hallmark of the relational approach is that all fields are supposed
    to be atomic types; that is, they
    are not lists or sub-relations (strings, however, are allowed, even with
    substring operations). For example, suppose we want to store the student
    name, id, and list of courses:
    
        ⟨'peter', address, 23456, [343,
      346, 353]⟩
        ⟨'paul', address2, 45678, [343,
      372]⟩
    
    Lists are not atomic; a strict relational approach would decompose this into
    two tables, one of ⟨name, address, student_id⟩ data, indexed by student_id,
    and a second table of ⟨student_id, coursenum⟩, indexed by both
    fields together and containing
    
    
      
        
          
            23456 
             | 
            343 
             | 
          
          
            23456 
             | 
            346 
             | 
          
          
            23453 
             | 
            353 
             | 
          
          
            45678 
             | 
            343 
             | 
          
          
            45678 
             | 
            372 
             | 
          
        
      
     
    
    Note that doing this allows us to recover the original lists of courses on a
    per-student_id basis, by mixing and matching from the two new tables. Note
    also that the atomic-types rule would disallow the C ∪ C×D ∪ C×E example
    above, as union types are not atomic.
    
    Sometimes the above decomposition is too inefficient. To get the list of
    Peter's courses, we have to search a rather large secondary table. Postgres
    does allow us to include lists as fields, violating the strict-relational
    model.
    
    
    Core relational-database concept:
    
     Divide data into multiple tables
      (mathematically, RELATIONS) in such a way that there is enough
      division to enforce consistency and not too much
      division to cause problems with reassembly
      
      See the sql1 notes on implicit constraints
    
    
    
    Relational DBs, SQL and NoSQL
    We will define the concept of a relational database more below, though the
    rules above are a start.
    
    The SQL language is designed as a way to ask questions about a relational
    database without specifying exactly how the data is to be retrieved.
    Consider the following from http://www.aosabook.org/en/nosql.html
    by Adam Marcus:
    
    SQL is a declarative language for querying
      data. A declarative language is one in which a programmer specifies what
      they want the system to do, rather than procedurally defining how
      the system should do it. A few examples include: find the record for
      employee 39, project out only the employee name and phone number from
      their entire record, filter employee records to those that work in
      accounting, count the employees in each department, or join the data from
      the employees table with the managers table.
    
    
    That sounds dandy: you just tell the computer what
    to do, not how to do it. But there
    is a downside:
    
    To a first approximation, SQL allows you to
      ask these questions without thinking about how the data is laid out on
      disk, which indices to use to access the data, or what algorithms to use
      to process the data. A significant architectural component of most
      relational databases is a query optimizer, which decides which
      of the many logically equivalent query plans to execute to most quickly
      answer a query. These optimizers are often better than the average
      database user, but sometimes they do not have enough information or have
      too simple a model of the system in order to generate the most efficient
      execution.
    
    
    It turns out to be surprisingly easy to write inefficient queries,
    particularly if complex "inner-query" searches are needed (though [even]
    MySQL is making progress on these). SQL allows you to search on any field,
    not just a key field for which there is an index; for very large datasets,
    this is ill-advised. SQL allows, in effect, any query regardless of how
    efficient it is. Some restrictions imposed by the NoSQL world (or, more
    accurately, imposed by the real
    world and implemented by the NoSQL
    world) are:
    
      - Allowing search only on key fields, for which an index is provided
 
      - Limiting the use of table joins;
        avoiding table decomposition as an alternative
       
      - Restricted guarantees on transactions,
        data consistency and durability
 
      - Moving more-complex query parts from the query language to the
        application logic
 
      - Supporting "document" data components (eg XML documents with a
        required structure)
 
      - Supporting structured field data, as a way to keep complex data in a
        single place instead of spread over multiple tables
       
    
    These limitations are essentially always done for the sake of efficiency
    and performance.
     You can get something like the same effect by using a relational
      database, and just promising yourself not to use it in inefficient ways.
    Alice: Why do NoSQL developers eat lunch
      alone?
      Bob: I don't know. Why?
      Alice: They don't know how to join tables.
    
    
    NoSQL example
    The native database used by the Asterisk phone switch consists of tables
    (actually called "families") that are ⟨key,value⟩ pairs. Neither the key nor
    the value need be atomic, or for that matter the same type as other entries.
    (This is done for simplicity, not because the DB is huge.)
    
    The MS Windows registry is a similar example.
    
    
     Basic SQL select
    SQL: 
    select * from STUDENT
      select * from STUDENT where Student_number = 17;
      select * from SECTION where 101 <= Section_identifier and
      Section_identifier <= 115;   
    
    
    The last query above is somewhat misleading, as Section_identifiers are
    usually not intended for linear comparison.
    
    Figure 2.1 on EN p 32: a schema for the students database
    
    In the queries above, the * means "all columns of the selected rows";
    sometimes we want fewer:
    
    select name from student;
      select name, major from student where student_number = 17;
      select course_number, semester from section where 101<=
      section_identifier and section_identifier <= 115;
    
    
    Finally, when we are working with multiple tables, it is excellent practice
    to name the tables, and then
    qualify all column names with the table names, as follows. We will use this
    style consistently throughout the semester; it is particularly useful when
    the where clause involves multiple
    tables (as in joins).
    
    select s.name from student s;
      select s.name, s.major from student s where s.student_number = 17;
    
    
    Table Joins
    Suppose in the university database we want to know the names of everyone in
    section 112. (A peculiarity of the specific data given as example is that no
    section has more than one student!) The GRADE_REPORT table has only student
    numbers; we need to match these up with names from the STUDENT table. This
    operation, of matching corresponding rows of different tables, is known as
    the join. Here is the SQL for the
    query, where the join condition is
    in bold:
    
    select s.name from student s join
        grade_report gr
      on s.student_number = gr.student_number
      where gr.section_identifier = 112;
    
    We are retrieving records from two
    tables here, but restricting attention to pairs of records that "match up"
    according to the join condition.
    
    The join can also be done with the following alternative syntax, but it's
    clearer to use the explicit-join syntax above:
    
    select s.name from student s, grade_report
      gr
      where s.student_number =
        gr.student_number and gr.section_identifier = 112;
    
    
    The join operation was once derided as introducing too much inefficiency.
    Technical advances in the 1980's made this issue less important, but the
    rise of huge datasets in this century has made this again relevant.
    
    
    Here are two "classic" joins:
    
    1. Names and grades for everyone in Section 112. The grade_report table has
    two-column key ⟨student_number,section_identifier⟩. The join is on the
    student_number attribute, which is the key to the student table. The join
    can be viewed as taking each grade_report record with section_identifier=112
    and using the student table to look up the student number and find the
    corresponding name.
    
    select s.name , gr.grade 
      from student s join grade_report gr on s.student_number
        = gr.student_number
      where gr.section_identifier = 112;
    
    
    2. Names of courses for all courses taught in Fall 08. Both tables involved
    have single-column keys; course_number for course and section_identifier for
    section. The join involves the key field of the course
    table, so the join can be viewed as taking each suitable section record and
    looking up the course number in the course table to find
    the name.
    
    select c.course_name 
      from course c join section s on s.course_number = c.course_number
      where s.year=2008 and s.semester='Fall';
    
    Joins "look" symmetrical, but in most cases one table is used as a direct
    map-type lookup for values appearing in the other table. Reversing the roles
    here (eg taking each name, and searching the grade_report table for the
    grades of all matching students in section 112), is a very different kind of
    operation.
    
    Most joins involve a primary-key attribute from one table joined to an
    attribute on the other table that has a foreign-key constraint referencing
    the primary key of the first table. The primary keys in the examples above
    are student.student_number and course.course_number.
    
    In both the examples above, we used the shorthand s and gr.
    These are sometimes called table aliases or tuple
      variables or (in the SQL standard) correlation names.
    Their use greatly enhances readability, as it makes clear which table a
    given column comes from. I recommend their use in all queries. See E&N6,
    p 101.
    
    Sometimes the keyword "as" is inserted: select s.name, gr.grade from student
    as s, grade_report as gr ....
    
    Sometimes it is helpful to view a table alias as a cursor variable:
    a variable ranging over each record in the corresponding table. Thus:
    
        for gr in grade_report:
            for s in student:
                   if
    s.student_number = gr.student_number and gr.section_identifier = 112
                   then:
    print record
    
    This is an oversimplification of how joins are actually
    implemented, however; it is too inefficient for large tables.
    
    
    
    Keys and Foreign Keys
    Except in usual cases, every table will have a primary key.
    For example, the primary key of table 'employee' is the ssn attribute, the
    primary key of table 'department' is the dnumber attribute, and the primary
    key of table 'works_on' is the pair of attributes (essn,pno). 
    
    The database enforces the rule that we cannot have two different records
    with the same primary-key value. This is a constraint, not
    a description. It happens to be the case that no two employees
    have the same lname (or same fname), but lname is not a key field  as
    we can easily add Samuel Wong. But if we try to add Samuel Wong with ssn
    333445555, the insertion (or update) will fail.
    
    As we will see later, generally an index is created on the
    primary-key attribute. To enforce the primary-key constraint, every time a
    primary-key value is inserted or updated the index is used to see if another
    record has the same key value.
    
    Several of the Company and University tables also have foreign-key
    constraints. Such constraints are quite different, and don't involve a key
    constraint at all. The classic example is employee.dno, which is intended as
    a reference to a row in the department table that contains the full
    information for that department. We don't want to allow an employee to have
    dno=6, because over in the department table there is no
    department with dnumber=6.
    
    Another way to put this is to say that employee.dno has a foreign-key
    constraint which references
    department.dnumber. The constraint here is on employee.dno: any value for
    employee.dno must appear somewhere in the dnumber column of department.
    Usually the referenced attribute is a key for its table, as is
    department.dnumber here. The attribute employee.dno is sometimes said to be
    the foreign key, because it must match a value in a
    "foreign" table that is, in that table, a key.
    
    To enforce this foreign-key constraint, the database system must check
    whenever a new employee is inserted, or an employee.dno value is updated. It
    must also check whenever a department is deleted, or a
    department.dnumber value is updated.
    
    The Company database has several other foreign-key constraints:
    
      - department.mgr_ssn has a foreign-key constraint referencing
        employee.ssn
 
      - employee.super_ssn has a foreign-key constraint referencing
        employee.ssn
 
      - works_on.essn has a foreign-key constraint referencing employee.ssn
 
      - works_on.pno has a foreign-key constraint referencing project.pnumber
 
    
    
    Sometimes foreign-key constraints lead to annoying circularities. When the
    tables employee and department are created initially, both are empty. We
    cannot add the first employee until the corresponding department already
    exists. Similarly, due to the first foreign-key constraint in the bulleted
    list above, we cannot add the first department until after the
    department-manager employee already exists.
    
    
    First Look at ER Diagrams
    Programmers often use Booch diagrams or UML diagrams to display object
    relationships visually. DBAs usually use Entity-Relationship, or ER,
    diagrams. Entities are the "physical objects" represented by the database,
    drawn with rectangles. Relationships are between entities; the easiest
    relationships are the binary ones. In the Company database the entities are
    
      - employee
 
      - project
 
      - department
 
      - dependent
 
    
    The diagram can be seen in ER.html.
    
    The most interesting relationship is works_on, between
      employee and project, because it is
      many-to-many (one employee can work on many projects, and one project can
      have many employees). There are also several one-to-many relationships:
    
      - supervises (one supervisor to many supervisees)
 
      - dept_member (one department to many employees)
 
      - controls (one dept to many projects)
 
    
    All these ended up without their own table. Where do we encode each of
      these three?
    There's also a one-to-one manages relationship between
    departments and employees. Every department has exactly one manager, and
    every employee manages at most one department. Most
    employees don't manage any department, but that doesn't change the
    one-to-one rule. We'll come back to this one later.
    
    
    
    The COMPANY database
    Schema: EN7 p 161 / EN6 p 71
    Data: EN7 p 162 / EN6 p 72
    basic
      table definitions, with all foreign-key constraints
    table
      definitions plus data, with ALTER (some FK constraints are added
    later, to allow initial data loading)
    Spreadsheet
    zip file
     pdf view of
      tables
    
    The tables are (with primary key in bold):
        employee:         name,
    ssn, bdate, address, sex, salary, super_ssn, dno
        department:     dname, dnumber,
    mgr_ssn, mgr_start
        dept_locations: dnumber, dlocation
       
    project:            
    pname, pnumber, plocation, dnum
        works_on:        essn,
    pno, hours
        dependent:       essn,
    dependent_name, sex, bdate, relationship
    
    The university
      database is here (EN7 p 8 / EN6 p 8)
    
    The university database here has all its foreign-key constraints. There is
    no constraint "circularity", so this should not be a problem. I did give
    names to the university FK constraints.
    
    (brief review of create table)
    Loading the databases
    If you have a command-line window, and want to load up a file of SQL
    statements (say company.alter.text),
    
      - use "cd" in the shell window to move to the directory where your files
        are located
 
      - Start Postgres (eg with psql
          -U myname) or MySQL (eg with \mysql\bin\mysql
          -u myname -p) 
 
      - After you've connected to the right database, type the following:
 
      
        - Postgres: \i company.alter.text
 
        - mysql: source company.alter.text;
 
      
    
     Alternatively, you can paste the entire file into a
      command window (you will probably need the menu paste command, as CNTL-V
      is likely to mean something else). It helps if there are no tab characters
      in the file.
    To load the University database in Postgres:
        \i university.text
     
    To drop the Company tables:
    drop table works_on cascade;
      drop table project cascade;
      drop table employee cascade;
      drop table department cascade;
      drop table dependent;
      drop table dept_locations;
    To drop the University tables, use these. The "on cascade" option is not
      necessary, as the foreign-key constraints are not circular.
    drop table grade_report;
      drop table section;
      drop table prerequisite;
      drop table course;
      drop table student;
    
    
     Relational Math
    (EN7 chapter 5 / EN6 chapter 3)
    A relation is any set of tuples
    The set of all possible tuples is the cross product of
    some domains
            col1 × col2 × col3 × ... × colN
    Example: A = {1,2,3}, B = {x,y}  C = {1,2}
            A × B
            A ×
    C
            < relation in A × C
            <= relation in A × C
            
    DB relations are not defined by rule, but by tabulation!
    
    Given attribute sets A1,
    A2, ..., An, a relation
    is a subset of the cartesian product A1×A2×...×An;
    that is, a set of tuples ⟨a1,a2,...,an⟩
    where each ai∈Ai. These tuples may also be called records.
    
    Relations in a DB are represented as tables.
    EN also uses the term relation state
    to refer to a specific set of records in a table.
    
    STUDENT table, EN p 63
    
    
      
        
          Name 
           | 
          SSn 
           | 
          Home_phone 
           | 
          Address 
           | 
          Office_phone 
           | 
          Age 
           | 
          GPA 
           | 
        
        
          Benjamine Bayer 
           | 
          305-61-2435 
           | 
          817-373-1616 
           | 
          2918 bluebonnet Lane 
           | 
          NULL 
           | 
          19 
           | 
          3.21 
           | 
        
        
          Chung-cha Kim 
           | 
          381-62-1245 
           | 
          817-375-4409 
           | 
          125 Kirby Road 
           | 
          NULL 
           | 
          18 
           | 
          2.89 
           | 
        
        
          Dick Davidson 
           | 
          422-11-2320 
           | 
          NULL 
           | 
          3452 Elgin Road 
           | 
          817-749-1253 
           | 
          25 
           | 
          3.53 
           | 
        
        
          Rohan Panchal 
           | 
          489-22-1100 
           | 
          817-376-9821 
           | 
          265 Lark Lane 
           | 
          817-749-6492 
           | 
          28 
           | 
          3.93 
           | 
        
        
          Barbara Benson 
           | 
          533-69-1238 
           | 
          817-839-8461 
           | 
          7384 Fontana Lane 
           | 
          NULL 
           | 
          19 
           | 
          3.25 
           | 
        
      
    
    
    Note the 
    
    Also note that some entries are NULL. This means undefined
    or not available or not
      known; unfortunately, these three options are not
    synonymous or interchangeable. NULL values are essential, but they do
    introduce some complications. The first is that records with NULL entries
    are not in fact elements of A1×A2×...×An;
    they are elements of
        (A1 ∪ {NULL}) × (A2 ∪ {NULL}) × ...
    × (An ∪ {NULL})
    
    EN also gives an alternative definition of a relation, as a set of maps
    from the attribute set to the set of attribute values, where the attribute
    set is essentially the set of names
    of columns. With this approach, a null entry is represented by a partial
    map, undefined for some attributes.
    
    Note that we must be careful when comparing null values: if two people have
    NULL as their Office_phone, it does not
    mean they have the same phone! Worse, we simply do not know if the NULL
    means we don't know their phone, or if they simply do not have one, or if
    they have no office at all and so the "office_phone" is irrelevant.
    
    
    Joins Again
    As we saw above, the join is the
    operation of creating all records merged from two (or more) tables, where
    one attribute of one table is required to match a corresponding attribute of
    another. Usually, but not always, the column-matching is based on equality
    of corresponding attributes.
    
    Examples:
    University: 
    
      - Listing all students in Section 112 (my data), example above
 
      - Printing all of each student's grades, by joining the Student_number
        fields of STUDENT and GRADE_REPORT
 
      - Printing all sections including Course_name, joining COURSE and
        SECTION on the Course_number field
 
    
    The first example we did above as follows:
    
    select s.name from student s join
      grade_report gr on s.student_number =
        gr.student_number 
      where gr.section_identifier = 112;
    
    Here's the second. We also want sections!
    
    select s.name, gr.section_identifier,
      gr.grade 
      from student s join grade_report gr on s.student_number =
      gr.student_number;
    
    But this isn't really complete. How do we translate section_identifier to
    meaningful course names?
    
    select s.name, sec.course_number, gr.grade 
      from student s join grade_report gr on s.student_number =
      gr.student_number
      join section sec on gr.section_identifier = sec.section_identifier;
    
    Better? Those are still actually "course_number" entries.
    
    Here's the third:
    
    select c.course_name, c.course_number from
      course c join section sec on c.course_number = sec.course_number;
    
    Are we using sec at all? What is the output if we remove the join, and just
    take data from table course? (select * from course;)  What additional
    fields can we add to the output to make the query clearer?
    
    For the examples above, which join fields are part of the keys?
    
    
    Now some joins involving the Company DB:
    
      - Printing the name and address of all employees who work in the
        'Research' dept (Query1 from EN ).Use employee and department tables
 
      - Printing the project number, dept number, and the dept manager's name,
        for all projects located in 'Stafford' (Query 2 from EN) Use project,
        department and employee tables
 
      - Printing each employee's name and his or her supervisor's name (Query
        8, EN)  Use employee table joined to itself
       
    
     Demos of these
    
    A full Cartesian product would be denoted in SQL by, eg,
    
        select * from employee, department;
    
    where there is no join condition establishing a relation between the two
    tables.
    
    The join is conceptually somewhat inefficient. Lots of
    behind-the-scenes optimization makes it fast.
    
    
    More on keys
    A KEY is any set of columns that is guaranteed to uniquely determine a row.
    
    Primary Key: the key the database developer thinks is most
    important; usually a single attribute if there is one
    Composite Key: multiple columns (eg the GRADE_REPORT
    table). Note that there is no single-column key here.
    
    Secondary Keys (or "candidate keys"): other column
    combinations that are keys, but not the primary one.
    
    A column can't be a key just because values in that column have no
    duplicates, and so the column value determines a unique row. There has to be
    a constraint, acting on potential future row insertions.
    
    Note that keys are not properties of particular tables, but rather of the
    "table schema". They represent design constraints.
    
     Foreign Keys
    Key constraints are one kind of constraint. What about the use of dno
    in table Employees? We probably want all dnos to refer to real departments,
    that is, to match an existing dnumber
    in table Department. This is done through a foreign
      key constraint: we declare in table Employee that attribute dno is
    a foreign key: a reference to a key
    of another table. The declaration looks like
    
    foreign key (dno) references
      department(dnumber)
    
    
    We can also give this constraint a name:
    
            constraint FK_employee_department
    foreign key (dno) references department(dnumber)
    
    (This is a simpler naming convention from the earlier example; only the
    parent table name is given.) Note that the constraint here applies to adding
    (or updating) records in Employee, and also to deleting records in
    Department.
    
    Foreign keys are notorious for introducing circularity problems. What
    happens if we enforce foreign keys and try to load the COMPANY database as
    originally written? With all tables empty, we can't add any employee because
    no dno value we might use would appear in the empty Department table, and we
    cannot add a department because the mgr_ssn is a foreign key referencing
    Employee. 
    
    In principle, there is no reason to require that the foreign key actually be
    a key in the other table. In practice, it almost always is; in database
    schemas generated through so-called Entity-Relationship diagrams it always
    is.
    
    Life can be quite frustrating if you forget the circularity problem.
    Once two tables with a "foreign-key embrace" (each uses the other as a
    foreign key) are created, they can be difficult to remove. Sometimes one has
    to resort to dropping the entire database. If I load my file
    company.brokenalter.text, these all fail:
    
      - drop table employee;
 
      - drop table department;
 
      - alter table employee drop foreign key dno;
 
    
    The last one above, however, fails simply because it is wrong;
      I shouldn't have used the column name (dno), but rather the constraint
      name (in this case, department_ibfk_1). Some people like foreign-key
      constraint names for this reason.
    To drop table T, you must first drop all foreign key constraints from other
      tables to T.
    
    The command
    
      - alter table department drop foreign key department_ibfk_1
 
    
    does work. The constraint name can be determined from show
        create table department.
    Another thing that does work (though only for MySQL) is this:
    
      - set foreign_key_checks=0;
 
    
    Here is an insert command that should fail due to a foreign-key
      violation, as there is no department 6 (the delete command right after
      undoes the addition):
    insert into employee values ('ralph', null,
      'wiggums', '121212121', null, null, null, null, null, 6);
      delete from employee where lname = 'wiggums';
    If this succeeds, the employee table probably has foreign key constraints
      removed. You can see the constraints
      with the MySQL command
    
    show create table employee;
    
    They can be added back with:
    
    alter table employee ADD foreign key
      (super_ssn) references employee(ssn);
      alter table employee ADD foreign key (dno) references department(dnumber);
    
    They can be added back and given names with the following:
    
    alter table employee ADD constraint
      FK_employee_employee foreign key (super_ssn) references employee(ssn);
      alter table employee ADD constraint FK_employee_department foreign key
      (dno) references department(dnumber);
    
    (The naming convention here is FK_childtable_parenttable. It is common, but
    not universal; some add the referenced column in the parent table as well.)
    
    
    The FK declaration goes into the child table, and includes
    a reference to a parent table: some column of the child
    table is restricted to values that appear in the designated column of the
    parent table. That is, with the second FK constraint above, involving dno,
    table EMPLOYEE is the child table and table DEPARTMENT is the parent table.
    (Of course, there is a different FK constraint, on DEPARTMENT.mgr_ssn,
    making department the child and employee the parent!)
    Other constraints
    Examples might be that the employee salary is in a given range, or is less
    than the supervisor's salary, etc. These non-key constraints can sometimes
    be addressed at the DB level, but are often easier to address at the level
    of the user interface to the DB; that is, the web interface can contain the
    necessary business logic.
    
    
    
    
     A few more basic concepts
    Database tables are usually "self-describing", in that the table description
    and/or relationships to other tables is often embedded in the table
    description. At a minimum, columns have names and types.
    
    The table definitions, column names and their types are generally known,
    collectively, as the database schema.
    
    Database tables can be changed without the need to recompile programs using
    that table: new columns can be added, or entire new tables. Essentially this
    is because DBMS queries are interpreted,
    and table columns are identified by name
    rather than offset. (C programs that access record fields do
    need recompilation if fields are changed.)
    
    Databases often support different views,
    perhaps for users with different privilege levels. A view may be a subset of
    the original set of columns, or it may contain some computed columns in lieu
    of the original columns. Excluding columns such as social_security_num
    or salary is relatively common.
    
    Transactions are sets of related updates, eg removing money
    from one account and adding it to another, or perhaps dropping one class and
    adding another (sadly, remarkably many student DBMSs lack the latter
    transaction operation). EN defines a transaction to be the result of an
    executing program rather than a set of related updates; this latter
    definition might include adding multiple new rows to one table. 
    
    Transaction processing must satisfy the isolation
      property, that transactions appear to execute in isolation from one
    another, and the atomicity
      property, which says that each transaction is completely executed
    or not executed at all. (This is sometimes described as that transactions
    must meet the ACID test: Atomicity, Consistency,
    Isolation, Durability. However, consistency and durability are relevant to
    all database updates; they are not particular to transactions.)
    
    Here is a simple failure of isolation: two transactions are transferring
    money to other accounts. 
    
      - Transaction 1 is transferring $100 from A to B
 
      - Transaction 2 is transferring   $50 from A to C
 
    
    The total amount of money should always remain $1000.
    
    
      
        
          transaction 1 
           | 
          transaction 2 
           | 
          Acct A 
           | 
          Acct B 
           | 
          Acct C 
           | 
        
        
           
           | 
           
           | 
          1000 
           | 
          0 
           | 
          0 
           | 
        
        
          get value of A:
            $1000 
           | 
           
           | 
          1000 
           | 
          0 
           | 
          0 
           | 
        
        
          add $100 to B 
           | 
           
           | 
          1000 
           | 
          100 
           | 
          0 
           | 
        
        
           
           | 
          Debit A by
            $50            
             
           | 
          950 
           | 
          100 
           | 
          0 
           | 
        
        
          Store $1000 - $100 in A 
           | 
           
           | 
          900 
           | 
          100 
           | 
          0 
           | 
        
        
           
           | 
          Credit C by $50 
           | 
          900 
           | 
          100 
           | 
          50 
           | 
        
      
    
    
    
    Multi-user DBMSs need concurrency control.
    
     Concurrency control leads to locks, the other great
      performance bottleneck of RDBMSs.
    One of the advantages of a DBMS, however, is that it provides automatic
        concurrency control. Consider, for example, the typical Unix
      password file, /etc/passwd. This contains userids and hashed passwords
      (the password hashes have almost entirely moved now to /etc/shadow, but
      that doesn't affect the issue.) Verifying passwords is a trivial matter of
      searching the file for the matching entry.
    But what about updates? This is sort of a mess, specifically because of
      the rare case of two users updating passwords at the same time. If both
      run a naive password-update command at the same time, then the two
      instances of the command will each make a copy of the file, and will each
      update their own copy, and then will each write the file back. If we are lucky,
      one of the two updates will be overwritten by the other. If we are
      unlucky, the two writes will interleave, leading to a completely corrupted
      password file.
    So the actual password-update command has to lock the
      password file. This is time-consuming, and means the program logic has to
      deal with waiting for the file to become unlocked.
    However, if passwords were stored in a database, the concurrent-write
      problem will be shifted to the database, which has presumably already
        solved it in an efficient manner. This alone is sometimes worth the
      price of admission. 
    Once Upon A Time, proponents of corporate central DBMSs had to argue against
    individual DBMSs maintained by each administrative group. This is pretty
    much a settled issue now, though it does mean that the "natural owners" of
    data in an organization (eg registration and records at Loyola, for student
    registration data) will not in fact "own" that data.