Comp 353/453: Database Programming, Corboy
L08, 4:15 Mondays
Week 13, Apr 22
Transactions
A transaction is a set of operations, idealized as read(X) and
write(X), that at the end we either commit
(save) or rollback/abort (discard).
Generally we hope rollbacks are rare.
Abstract transactions: operations are read(X) and write(X), for X a
field/record/block/table.
Fig 21.2: two example transactions.
T1: read(X), write(X), read(Y), write(Y)
T2: read(X), write(X)
Ideally, we execute the transactions one at a time, with no overlap. In
practice, because of the considerable I/O-wait for disk retrieval, we cannot
afford that. Transactions will overlap.
We need to figure out how to arrange for this to happen safely.
Here are two problems:
Lost-update (write/write) problem:
this occurs when one transaction writes a value that another then
overwrites. See Fig 21.3a. T1 is
transferring N dollars from X to Y, while T2 is adding M dollars to X.
T2.write(X) overwrites
T1.write(X).
    T1                          T2
    read(X)
    X = X-N
                                read(X)
                                X = X+M
    write(X)
    read(Y)
                                write(X)
    Y = Y+N
    write(Y)
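The lost update is easy to reproduce. A minimal Python sketch of the interleaving above, simulating the two transactions' local copies of X (the values N=10, M=20 are illustrative):

```python
# Simulate Fig 21.3a's interleaving on a single shared value X.
# Run serially, X should end at 100 - N + M = 110.
X = 100
N, M = 10, 20

t1_local = X        # T1: read(X)
t1_local -= N       # T1: X = X - N
t2_local = X        # T2: read(X), still sees the old value 100
t2_local += M       # T2: X = X + M
X = t1_local        # T1: write(X)
X = t2_local        # T2: write(X), clobbering T1's write

# X is now 120, not 110: T1's subtraction of N is lost
```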
Dirty-Read (read/write) problem:
this occurs when one transaction reads an item that another then overwrites.
This would end up as a lost-update, but consider Fig
21.3b where T1 actually aborts after write(X).
    T1                          T2
    read(X)
    X = X-N
    write(X)
                                read(X)
                                X = X+M
                                write(X)
    read(Y)
    abort/rollback
Fig 21.3c shows a corrupted result; an
abort would have been an improvement!
The system log, or journal,
is a master list of all operations performed, used in the event of rollback.
Failures leading to rollback may be system crash, lack of disk space, lookup
error (eg SQL error), locking problems, or other physical or logical errors.
Items read here can be individual record attributes, entire records, or
entire tables. When joining tables, the entire table is read.
The ACID test
ACID is an acronym enumerating the following set of features relating to
support for parallel transactions.
- Atomicity: either all steps of
a transaction are performed, or none are. We either commit the entire
thing or rollback completely.
- Consistency preservation: the
individual transaction should preserve database consistency and database
invariants
- Isolation: transactions should
appear to be executing in isolation.
- Durability: committed
transactions survive DB crashes
On the face of it, durability has little to do with parallelism, and
consistency is an attribute for individual transactions. There are other
redundancies here as well; the main points for transactions are atomicity
and isolation, with the latter understood as a special case of correctness.
Note that Consistency is closely related to Isolation: if each transaction
preserves consistency, and the transactions appear to execute in Isolation,
then a series of transactions will be consistency-preserving, at least if
Isolation means something like Serializability. However, serializability
might be a stronger condition than necessary, if we really had a good
definition for isolation. (Which we in fact do not; think about it. Query 1
may affect the outcome of Query 2 even if they are executed one after the
other.)
Moral: beware of choosing your terms to fit a clever acronym.
(the above we actually covered Week 12)
Transaction schedules
Transaction schedules are lists of read/write operations of each
transaction, plus commit/abort. The operations of each individual
transaction must be in their original order.
Eventually we will enforce
schedules by using locking, but for
now we just consider schedules abstractly.
Examples:
Sa: Fig 21.3a:
r1(X), r2(X), w1(X), r1(Y), w2(X), w1(Y), c1, c2
Sb: Fig 21.3b:
r1(X), w1(X), r2(X), w2(X), r1(Y), a1
Two operations in a schedule conflict
(where the word is to be taken as indicating potential
conflict) if:
- They belong to different transactions
- They access the same data item (eg X)
- At least one of the operations is a write()
For example, in Sa, r1(X) and w2(X) conflict. So do r2(X) and w1(X). What
does not conflict? r1(X) and r2(X), r1(X) and w1(X).
Conflicting operations are those where interchanging the order can result in
a different outcome. A conflict is read-write
if it involves a read(X) in one transaction and write(X) in the other, and write-write if it involves two
transactions each doing write(X).
A complete schedule is a total
order of all the operations, including aborts/commits. (The book allows some
partial ordering of irrelevant parts.)
Given a schedule S, the committed
projection C(S) is those operations of S that are part of
transactions that have committed. This is sometimes useful in analysing
schedules when transactions are continuously being added.
We can build a transaction precedence
graph based on schedule conflict, given a schedule S. Let us
say Ti ⟶ Tj if Ti and Tj conflict as above, and one of the following occurs:
- Ti executes write(X) and later Tj executes read(X)
- Ti executes read(X) and later Tj executes write(X)
- Ti executes write(X) and later Tj executes write(X)
These are the three specific cases of conflict that can occur; note that at
this level we distinguish between read-write and write-read.
Demo: build graphs for Fig 21.3(a), (b).
Note that the figure determines the schedule.
Recoverable and nonrecoverable schedules
Given two transactions T and T', we say T reads
from T', or T depends on
T', if T' does a write(X) and then later T does a read(X) (this is Case 1 of
T'⟶T above). Thus, the outcome of T may depend, at least in part, on the
earlier T'.
A schedule is recoverable if no
transaction T can commit until all T' where T reads from T' have already
committed. Otherwise a schedule is nonrecoverable.
The point is that if T reads from T', then if T' aborts, T must abort
too. So T waits to commit until T' commits. Given a recoverable
schedule, we never have to abort a
committed transaction. This is, in practice, an essential rule.
Consider the following modification of Sa above:
S'a: r1(X), r2(X), w1(X), r1(Y), w2(X), c2, w1(Y), c1.
This is recoverable even though there is a risk of lost
update. But now consider
Sc: r1(X), w1(X), r2(X), r1(Y), w2(X), c2, a1;
nonrecoverable: T2 reads X from T1 but T2 commits first
Sd: r1(X), w1(X), r2(X), r1(Y), w2(X), w1(Y), c1,
c2; recoverable because T2 waited to commit until T1 did
Suppose T1 aborts in Sd. Then T2 must abort as well, since it read X from
T1; we get Se:
Se: r1(X), w1(X), r2(X), r1(Y), w2(X), w1(Y), a1, a2
Though recoverability implies no committed transaction needs to be rolled
back, we still may have cascading rollback.
A schedule is said to avoid cascading
rollback if every transaction T only reads items that were earlier
written by committed transactions.
In this case, an abort of some other transaction will not cause T to
abort. How can we achieve this in Sd/Se? By delaying r2(X) until after
c1.
There is also a notion of strict
schedule, in which transactions do not use X for either read or write until
all transactions that wrote X have committed. This is more expensive to
require.
strict implies cascadeless implies recoverable, but not the reverse.
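The recoverability condition can be checked mechanically. A minimal Python sketch, assuming a schedule is represented as a list of (op, txn, item) tuples with op one of 'r', 'w', 'c', 'a' (item is None for commit/abort), and ignoring the bookkeeping that aborts would require:

```python
from collections import defaultdict

def is_recoverable(schedule):
    """True if no transaction commits before every transaction
    it read from has committed."""
    last_writer = {}               # item -> most recent writer
    reads_from = defaultdict(set)  # txn -> txns it read from
    committed = set()
    for op, txn, item in schedule:
        if op == 'w':
            last_writer[item] = txn
        elif op == 'r':
            writer = last_writer.get(item)
            if writer is not None and writer != txn:
                reads_from[txn].add(writer)
        elif op == 'c':
            # txn may not commit before everyone it read from
            if any(t not in committed for t in reads_from[txn]):
                return False
            committed.add(txn)
    return True

# Sc above: T2 reads X from T1 but commits first -- nonrecoverable
Sc = [('r',1,'X'), ('w',1,'X'), ('r',2,'X'), ('r',1,'Y'),
      ('w',2,'X'), ('c',2,None), ('a',1,None)]
# Sd above: T2 waits to commit until T1 does -- recoverable
Sd = [('r',1,'X'), ('w',1,'X'), ('r',2,'X'), ('r',1,'Y'),
      ('w',2,'X'), ('w',1,'Y'), ('c',1,None), ('c',2,None)]
```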
Consider the following three transactions:
T1: X+=10, Y*=1.1;
T2: Y+=10, Z *= 1.1;
T3: Z+=10, X *= 1.1;
What happens if we do these in order T1, T2, T3? If X, Y and Z are initially
100 (so adding 10 is the same as multiplying by 1.1), then in the order
above we finish with X=121 (multiplication was second), Y=120
(multiplication was first) and Z=120 (again, multiplication was first).
We get similar end results in any
serialized order: one of the values will be 121, and the other two will be
120.
What happens if we create a schedule in which we do all the additions and
then all the multiplications? Note that, if we defer the commits, this can
be a recoverable schedule. And there are no lost updates. But it is still
not equivalent to any serial schedule: we end up with X=Y=Z=121.
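We can verify the arithmetic directly. A quick Python check of a serial order versus the add-then-multiply interleaving (the helper names are illustrative; the multiplications are subject to floating-point rounding):

```python
def run(db, ops):
    # apply a list of (key, function) operations in order
    for key, f in ops:
        db[key] = f(db[key])
    return db

add10 = lambda v: v + 10
mul11 = lambda v: v * 1.1

# serial order T1, T2, T3
serial = run({'X': 100, 'Y': 100, 'Z': 100},
             [('X', add10), ('Y', mul11),    # T1
              ('Y', add10), ('Z', mul11),    # T2
              ('Z', add10), ('X', mul11)])   # T3
# serial: X = 121, Y = 120, Z = 120 (up to rounding)

# all the additions first, then all the multiplications
mixed = run({'X': 100, 'Y': 100, 'Z': 100},
            [('X', add10), ('Y', add10), ('Z', add10),
             ('Y', mul11), ('Z', mul11), ('X', mul11)])
# mixed: X = Y = Z = 121, matching no serial order
```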
Serializability
The ideal situation with transactions is that they are executed serially,
but that is too inefficient. Instead, we look for schedules that are in some
sense equivalent to a serial
schedule. In Fig 21.5, (a) and (b) show
two serial schedules. Schedule (c) has a lost update (T1's write(X) is
clobbered by T2's later write(X)); schedule (d) is ok (in a sense to be
defined below) because we can move T2 to the end of T1.
A schedule is serializable if it is
equivalent to a serial schedule. For equivalence, we can use result
equivalence, but that is intractable. Instead we use conflict
equivalence. Two schedules are conflict-equivalent
if the order of any two conflicting
operations is the same in both schedules. We can reorder the other
(nonconflicting) parts to get what we need.
Conflict-serializability algorithm
The first step is to build the transaction precedence (directed) graph
above, in which we have an arc Ti⟶Tj if any of these three cases holds:
- Ti executes write(X) and later Tj executes read(X)
- Ti executes read(X) and later Tj executes write(X)
- Ti executes write(X) and later Tj executes write(X)
If the graph has no cycles, the schedule is conflict-serializable. To
look for cycles, we do a topological sort (ie find a total order of the
nodes consistent with the Ti⟶Tj ordering; see below). The original schedule
is conflict-equivalent to the schedule of executing the transactions in that
order.
The actual topological-sort algorithm is as follows: first, find all Ti that
do not have any predecessors, ie no Tk with Tk⟶Ti. Schedule these first, and
remove all their outgoing Ti⟶Tj links. Now repeat, until there are no more
nodes. If at some stage nodes remain but none is without a predecessor, then
there must be a cycle.
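The whole test fits in a short Python sketch. Schedules are lists of (op, txn, item) tuples (commits omitted); the function builds the precedence graph from the conflict rules above and runs Kahn's topological sort, returning an equivalent serial order or None if there is a cycle:

```python
def serializable_order(schedule):
    """Return a serial order conflict-equivalent to the schedule,
    or None if the precedence graph has a cycle."""
    edges = set()
    for i, (op1, t1, x1) in enumerate(schedule):
        for op2, t2, x2 in schedule[i+1:]:
            # conflict: different txns, same item, at least one write
            if t1 != t2 and x1 == x2 and 'w' in (op1, op2):
                edges.add((t1, t2))     # Ti -> Tj
    txns = {t for _, t, _ in schedule}
    indegree = {t: sum(1 for _, b in edges if b == t) for t in txns}
    order = []
    ready = [t for t in txns if indegree[t] == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        for a, b in edges:              # remove t's outgoing links
            if a == t:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return order if len(order) == len(txns) else None

# Sa (Fig 21.3a): the cycle T1 -> T2 -> T1 means not serializable
Sa = [('r',1,'X'), ('r',2,'X'), ('w',1,'X'), ('r',1,'Y'),
      ('w',2,'X'), ('w',1,'Y')]
# a schedule equivalent to the serial order T1, T2
Sb = [('r',1,'X'), ('w',1,'X'), ('r',2,'X'), ('w',2,'X')]
```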
Fig 21.7 shows the graphs for the schedules
in Fig 21.5.
Debit-credit transactions
These are transactions built from pieces of the form read(X), X+=N,
write(X). We do not have to serialize these! As long as we avoid the
lost-update problem, we can interleave increments and decrements from
different transactions without harm, because addition is commutative. The
schedule below is not conflict-serializable, but it yields the same result,
namely X = X+N−M, Y=Y+M−N.
    T1                          T2
    read(X)
    X += N
    write(X)
                                read(Y)
                                Y += M
                                write(Y)
    read(Y)
    Y -= N
    write(Y)
                                read(X)
                                X -= M
                                write(X)
Locking
We will only consider relatively simple locks; complex locks tend to be
inefficient. A binary lock has two
states: locked and unlocked.
We can implement a per-field/record/block/table lock with a locking
table; at any one time, we are likely to have relatively few locks.
When we lock something, we may have to wait.
See Fig 22.1. Ideally, we wait in a queue on the lock.
The actual inner details of lock()
and unlock() must be atomic. On
multi-processor systems, it is not
enough simply to disable interrupts; we must actually involve some kernel
locking.
A straightforward improvement on binary locks is read-write
locks: write locks imply exclusive access, but several transactions
can read-lock the same data value concurrently. Note that if we intend to
write X eventually, then we will need to write-lock X before
reading X; otherwise we may have lost updates.
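A non-blocking read-write lock table along these lines is easy to sketch in Python (the class and method names are illustrative); each method reports whether the lock was granted, rather than queueing the requester:

```python
class LockTable:
    """Per-item read-write locks: many readers OR one writer."""
    def __init__(self):
        self.readers = {}   # item -> set of txns holding read locks
        self.writer = {}    # item -> txn holding the write lock

    def read_lock(self, txn, item):
        w = self.writer.get(item)
        if w is not None and w != txn:
            return False                 # another txn holds a write lock
        self.readers.setdefault(item, set()).add(txn)
        return True

    def write_lock(self, txn, item):
        w = self.writer.get(item)
        others = self.readers.get(item, set()) - {txn}
        if (w is not None and w != txn) or others:
            return False                 # conflicting reader or writer
        self.writer[item] = txn          # also upgrades txn's own read lock
        return True

    def unlock(self, txn, item):
        self.readers.get(item, set()).discard(txn)
        if self.writer.get(item) == txn:
            del self.writer[item]

locks = LockTable()
assert locks.read_lock(1, 'X') and locks.read_lock(2, 'X')  # shared reads
assert not locks.write_lock(1, 'X')     # denied: T2 still reads X
```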
Two-phase locking
Two-phase locking is a simple protocol in which all locking operations
(including upgrades of read-lock(X) to write-lock(X)) precede the first
unlock operation. The locking phase is called the expanding
phase; the unlock is the shrinking
phase.
If all transactions use two-phase locking, then
any schedule will be conflict-serializable.
In fig 22.3(a), the two-phase locking
protocol is not followed. The
schedule in 22.3(c) is allowed by the locks, but it is not serializable.
Fig 22.4 shows the same transactions using
the two-phase locking protocol. Here, however, we have the potential for deadlock.
Two-phase locking comes in flavors. The version above is called basic.
We also have:
- conservative: transactions
must lock all the items to be accessed at the start of the transaction.
This is deadlock-free, but sometimes the transaction does not know every
item to be read until partway through.
- strict: no write locks are
released until commit/rollback. This means that no other transaction
that reads from this one can do so until the first one has committed.
This leads to a strict schedule,
which is then cascadeless.
Deadlocks may occur.
- rigorous: no locks at all are
released until commit/rollback. Rigorous two-phase locking is sort of a
mirror image of conservative two-phase locking.
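The two-phase condition itself is purely syntactic, so it can be checked mechanically. A Python sketch, representing each transaction's lock activity as a flat sequence of (txn, 'lock'/'unlock') events (the representation is illustrative; lock upgrades count as locks):

```python
def obeys_2pl(ops):
    """True if, for every transaction, all lock operations
    precede its first unlock operation."""
    shrinking = set()          # transactions that have begun unlocking
    for txn, action in ops:
        if action == 'unlock':
            shrinking.add(txn)
        elif action == 'lock' and txn in shrinking:
            return False       # lock after unlock violates two-phase rule
    return True
```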
Deadlock
Locking protocols generally introduce the risk of deadlock:
    T1                          T2
    write_lock(X)
                                write_lock(Y)
    write_lock(Y)
                                write_lock(X)
    ...                         ...
Perhaps the simplest strategy for deadlock prevention is to lock everything
according to one master order. This is impractical for the programmer to
enforce; the database can only do it if we ask for all the locks together:
write_lock(X,Y,Z).
In practical terms, we also have to deal with deadlock detection,
and then decide which transaction will die and which will wait. Note that it
soon becomes obvious in practice that deadlock has occurred: any transaction
that waits for more than a few seconds on a lock, and is not moving up in
the queue for that lock, is likely deadlocked. More formally, we can
construct a wait-for graph, and
look for cycles.
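A wait-for-graph cycle check is a few lines of Python (a sketch; the graph is assumed to be a dict mapping each transaction to the set of transactions whose locks it is waiting on):

```python
def has_deadlock(waits_for):
    """Detect a cycle in the wait-for graph via depth-first search."""
    visiting, done = set(), set()
    def dfs(t):
        if t in visiting:
            return True                 # back edge: cycle found
        if t in done:
            return False
        visiting.add(t)
        for u in waits_for.get(t, ()):
            if dfs(u):
                return True
        visiting.discard(t)
        done.add(t)
        return False
    return any(dfs(t) for t in waits_for)

# the write_lock example above: T1 waits for T2 and vice versa
assert has_deadlock({1: {2}, 2: {1}})
```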
Timestamps
An alternative to locks entirely is the use of timestamps. We assign each
transaction a timestamp, and then only allow schedules that are equivalent
to executing the transactions in timestamp order. To do this, each database
item has associated with it a read-timestamp and a write-timestamp; the
former is the latest timestamp of all transactions that have read X, and the
latter is the latest timestamp of all transactions that have written X. If a
transaction T attempts read(X), we first compare T's timestamp with
write-timestamp(X): if T's timestamp is earlier, T would be reading a value
written "in its future", so we abort T and put it back in the transaction
queue with a later timestamp. Similarly, if T attempts write(X) and T's
timestamp is earlier than either read-timestamp(X) or write-timestamp(X), we
abort and restart T. This protocol can involve cascading rollback, so further
constraints are needed to get it to generate cascadeless or strict schedules
(or even recoverable schedules, for that matter).
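The basic timestamp-ordering rules can be sketched in Python (the names are illustrative); an operation either succeeds, updating the item's timestamps, or signals that the transaction must abort and restart with a later timestamp:

```python
class Abort(Exception):
    """The transaction must roll back and restart with a later timestamp."""

class Item:
    def __init__(self):
        self.read_ts = 0    # latest timestamp to have read this item
        self.write_ts = 0   # latest timestamp to have written it

def to_read(item, ts):
    if ts < item.write_ts:
        raise Abort         # the value was written "in our future"
    item.read_ts = max(item.read_ts, ts)

def to_write(item, ts):
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort         # a later transaction already used the item
    item.write_ts = ts
```

For example, if the transaction with timestamp 5 reads X, a subsequent write of X by the older transaction with timestamp 3 is rejected.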
Transactions in PHP
See http://php.net/manual/en/pdo.transactions.php. Here's an example from
that page:
try {
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->beginTransaction();
    $db->exec("insert into staff (id, first, last) values (23, 'Joe', 'Bloggs')");
    $db->exec("insert into salary (id, amount) values (23, 50000)");
    $db->commit();
} catch (Exception $e) {
    $db->rollBack();
    echo "Failed: " . $e->getMessage();
}
Bear in mind that you should not interpose any wait for user input between
the beginTransaction() and the commit(); otherwise you may be holding a lock
"indefinitely".
Crash Recovery
A crucial feature of transactions (the "atomic" property) is that if the
database crashes, or is restarted (eg due to deadlocks), or one transaction
fails, then no transaction is left "partially" executed: every transaction
either completes or is rolled back.
The typical way to implement this is with a "transaction journal file",
which is carefully written to disk after every update. This journal might
contain:
- a transaction ID
- original values of the data fields to be updated
- a flag indicating that the transaction is pending
The flag is set to "committed" at the end. Then, if the DB is restarted
after a crash, we can restore the records involved to their pre-transaction
state.
Note that we also will now need to roll back any other transaction that
depended on this one; we can use the transaction ID for that.
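A minimal sketch of this style of journal-based recovery in Python, assuming an undo log of ('old', txn, item, old_value) records plus ('commit', txn) markers (the record layout is illustrative):

```python
def recover(db, log):
    """Undo every write belonging to an uncommitted transaction,
    newest first, restoring the pre-transaction values."""
    committed = {rec[1] for rec in log if rec[0] == 'commit'}
    for rec in reversed(log):
        if rec[0] == 'old' and rec[1] not in committed:
            _, txn, item, old_value = rec
            db[item] = old_value
    return db

# T1 added 10 to X and committed; T2 added 15 but crashed first
db = {'X': 125}
log = [('old', 1, 'X', 100), ('commit', 1),
       ('old', 2, 'X', 110)]
recover(db, log)
# X is back to 110: T1's committed update survives, T2's is undone
```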
NoSQL
There is currently a movement that suggests that the days of relational
databases are waning, and that emerging needs to manage very large databases
means that new approaches must be found. A typical NoSQL database supports
tables that allow lookup via one primary key column, but cannot necessarily
be joined to other tables, or searched by other columns. Such a structure --
sometimes called a ⟨key,value⟩ store --
looks something like a giant version of a Java HashMap, though often with
multiple data columns.
Not everyone is in agreement with this: see I
Can't Wait for NoSQL to Die, by Ted Dziuba (archived here).
Also note that, more and more, NoSQListas claim that their acronym stands
for "Not only SQL", as if to
recognize that SQL and Relationalism will always have a role to play.
But for the Facebooks and Amazons and Googles out there, a traditional RDBMS
has limitations. The issue, however, isn't simply a matter of size: a big
part of it is the need to partition the database to support huge numbers of
users.
Also, there are several different ways to think about size:
- Raw number of records
- Number of records being written;
large distributed read-only (or read-mostly) databases are relatively
easy
- Number of simultaneous locks,
related to the number of simultaneous writes.
Large distributed email databases, where everyone is constantly having new
email written to their inbox, are relatively easy too: any inbox is accessed
by only two entities: the mail deliverer (adding mail) and the user
(deleting it).
As a second example, consider the user "walls" on Facebook. These are viewed
from numerous points, and updated from numerous sources. However, the
updates are timestamped, and we can rely on the updates arriving eventually.
Thus, if we replicate a given user's wall over several databases, there may
be moments when different versions of the wall show different comments.
However, we are assured of eventual
consistency: a given comment will eventually appear in every wall
instance, in the correct order.
As a third example, consider Amazon's online-ordering database. A single
user placing an order can be assigned to one particular, smaller database.
The items are entered, the payment is made, and then Amazon needs to process
the order. But the latter can be done in a relatively leisurely fashion. The
important part, however, is making sure the inventory
is updated; in particular, we don't want to sell the last copy of
something to two different people.
At first glance this suggests that the inventory database needs to be consistent at all times. But there's a
problem: the CAP theorem.
CAP theorem
Eric Brewer's CAP theorem: Consider the following three attributes for large
distributed databases:
- Consistency of view across all
locations
- Availability at all times
- Partition tolerance: if a link
breaks, the database pieces can keep going
Theorem: you can only have two of them.
Argument: Suppose you have the last two. A link does break, and the two
halves continue in isolation. Then we cannot possibly hope for consistency.
But why do we even care? Why ever build a distributed database?
Because you're tired of being poor, that's why. You want to build an Amazon,
an eBay, a Google, not a single-server DB that can process 100 queries a
second when you have 10,000 arriving.
Brewer's talk is here.
Gilbert & Lynch's proof is here.
If we insist on a single master DB, we forfeit partition tolerance. Brewer's
theorem does not mention this explicitly, but for databases to be really
large we want to be able to distribute them.
A classic distributed database forfeits availability: if the link between
nodes goes down, the database is unavailable to anyone, even those able to
access one of the nodes.
Consider DNS as a big database: we do not
have consistency.
One solution is eventual consistency:
when connectivity returns, we resolve conflicting transactions. DNS does
have eventual consistency. The problem case, of course, is when two
transactions involve updates to the same item. Note that for the Facebook
wall example above, there aren't really any conflicting transactions (unless
you count this: A posts, B comments on A's post, A deletes his/her post
without seeing B's. But a related form of this happens all the time even
when the data is consistent: A posts, B comments, A deletes
without refreshing the page).
The worst situation is when Amazon has one book left, and two people have
just bought it. We can guarantee this won't happen with a central database
(or even a separate central database for each of several disjoint sets
of products, eg DB[N] is for all items where SKU mod 137 = N). But eventual
consistency is not good here: it leads to scenarios where the last copy is
sold twice.
Note, however, that while eventual consistency may not be good here, it may
in fact be good enough. What
should Amazon do if the last copy is sold twice? They can just tell one of
the buyers that, too bad, the item they ordered turned out to be out of
stock. This happens often enough anyway, due to discrepancies between the
online inventory and the physical inventory.
In other words, eventual consistency may not be ideal, but it may be a
workable arrangement.
Formally, eventual consistency means that if no new
updates occur, then eventually all accesses will return the last-updated
value, or the winner of the conflicting-update resolution. DNS does this.
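One common conflict-resolution rule is last-writer-wins, keyed on update timestamps. A minimal Python sketch of merging two diverged replicas under that rule (the data layout is illustrative):

```python
def lww_merge(a, b):
    """Merge two replicas mapping key -> (timestamp, value);
    for each key, the later-timestamped update wins."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# two replicas of a wall diverged; both merge orders converge
replica1 = {'wall': (1, 'first post')}
replica2 = {'wall': (2, 'first post + comment')}
```

Note that last-writer-wins silently discards the losing update, which is exactly why it is acceptable for a wall comment but not for selling the last book.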
Denormalization
Aside from the CAP theorem, there are other problems with RDBMSs. Joins, for
one. Joining two big tables can be expensive. The advantage of joins is
traditionally that the underlying database is protected from anomalies. But
suppose we eliminate those through careful front-end programming? (As an
example, suppose we move the constraint that SSN is a key of the EMPLOYEE
table out of MySQL and into our PHP front end. It would be straightforward
to implement.)
For as long as RDBMSs have been around, DB managers have engaged in
occasional "denormalization" to avoid a particular join for performance
reasons. One example might be a TOTAL field in the INVOICE table: the total
for a specific invoice_num is calculated by finding all items ordered, as
listed in INVITEM:
select sum(quantity*price) from INVITEM ii where
ii.invnum = invoice_num;
This is a relatively expensive join and search! Storing a TOTAL field in the
INVOICE table introduces a potential inconsistency, but it is also rather
common practice for performance reasons.
EN6 gives a denormalization example on pp 731-732. Consider a view ASSIGN:
⟨emp_id, project_id, emp_lname, percent_assigned,
proj_name, proj_mgr_id, proj_mgr_name⟩
This is a complex join, even after replacing percentages with project_hours
and total_hours:
select e.ssn, p.pnumber, e.lname, w.hours,
sum(w2.hours), p.pname, d.mgr_ssn, m.lname
from employee e, employee m, project p, works_on w, works_on w2,
department d
where e.ssn = w.essn and p.pnumber = w.pno and e.ssn = w2.essn and p.dnum
= d.dnumber and d.mgr_ssn = m.ssn
group by e.ssn;
Sometimes materialized views are
used for this. Note that this amounts to a denormalized table, though it is
now one that the DBMS is responsible for keeping synchronized.
select e.ssn, p.pnumber, e.lname, w.hours,
(select sum(w2.hours) from works_on w2 where w2.essn = e.ssn) as total,
p.pname, d.mgr_ssn, m.lname
from employee e, employee m, project p, works_on w, department d
where e.ssn = w.essn and p.pnumber = w.pno and p.dnum = d.dnumber and
d.mgr_ssn = m.ssn
group by e.ssn;
Big Data Models
The basic big-data model is the ⟨key,value⟩ store, meaning
that from the key we can retrieve a single matching record. Lists are often
directly supported as values because the relational-DB way of representing
lists (as a new table indexed by key and listitem) is too slow.
Often these models are not only NoSQL but are "schema free", usually meaning
that they store ⟨key,value⟩ pairs where the value can be some structured
data, often in JavaScript Object Notation (JSON), as in the example below
(note the hierarchical subtyping and the support for list structures). Some
data models might allow two different keys: ⟨key1,key2,value⟩.
{
    "ssn"     : "123456789",
    "name"    : { "fname" : "John", "lname" : "Smith" },
    "bdate"   : "1965-01-09",
    "address" : { "street" : "731 Fondren", "city" : "Houston", "state" : "TX" },
    "salary"  : 30000,
    "dno"     : 5,
    "projects": [ { "pno" : "3", "hours" : 30 }, { "pno" : "20", "hours" : 10 } ]
}
The main point is that for a given key the rest of the data can be any type;
that is, we in effect have a map of key strings to Object.
Some big-data models support sharding: splitting up rows
of a dataset among multiple locations.
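A toy Python version of a sharded ⟨key,value⟩ store (the class and the hash-modulo placement scheme are illustrative; real systems use range or consistent hashing across machines):

```python
import hashlib

class ShardedStore:
    """Rows are split among n shards by a hash of the key."""
    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard(self, key):
        # stable hash so the same key always maps to the same shard
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value   # value can be any structured object

    def get(self, key):
        return self._shard(key).get(key)

store = ShardedStore(4)
store.put('123456789', {'fname': 'John', 'lname': 'Smith', 'dno': 5})
```

Lookup by the primary key touches exactly one shard; this is why such stores scale easily, and also why joins and secondary-column searches are hard.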
Examples:
BigTable: a Google
internal product. Tables can be subdivided (and then distributed). To look
up a value, you supply the usual row key, and also a column
key: an identifier for the particular column (or columns) you
want to retrieve.
Hadoop: a public Apache
project. The framework is typically used to implement distributed
filesystems, or other large projects with a high degree of distributedness.
Cassandra: also
an Apache project. Cassandra is specifically a database intended to handle
very large amounts of data spread over a large number of servers. The data
model was taken from BigTable. A given key can map to multiple columns,
often grouped into column families. Column families are in effect declared
through a database schema, but the actual columns within families are not.
Until 2010, the Facebook messaging system used Cassandra; it then moved to
HBase.
HBase: another Apache
database with close ties to the Hadoop framework.
Hive: the database
running on top of Hadoop that is used by Facebook.
MongoDB: A
"document-centric" database using JSON-like syntax (above). But field names
do exist (even if they are now dynamic). Searches can be done on any field,
and indexes can be created. Data distribution is again a major goal; Mongo
also includes load-balancing features.
CouchDB: Couch ("Cluster
Of Unreliable Commodity Hardware") was an earlier NoSQL effort. Data
consists of free-form "documents", with JSON structuring. Keys are internal
row identifiers. Couch has a form of support for ACID, but substituting
eventual consistency for strict consistency. Couch has specific tools to
handle resynchronization of separated databases (eg an mp3 player that
spends significant amounts of time "offline" but that is merged at intervals
with the master database). Documents in the database contain revision
information, intended to facilitate the merger of two documents from two
separated Couch instances. It is left up to the application to figure out
how to merge two documents that have been independently modified; sometimes
this is straightforward and other times it is impossible to do in a
meaningful way.
Mongo and Couch are considered to be "document" databases; Cassandra and
Hbase are considered to be "wide-column" databases. This often means that
many columns are allowed, with grouping of columns into "super-columns",
etc. It may also mean, however, that the data is organized on disk by
column, rather than by row. This allows faster searching of individual
columns (as a much larger number of column values will be in a single
block), but is slower for retrieving an entire record.
More on Hadoop
Hadoop (http://hadoop.apache.org/)
is the combination of the following:
- Hadoop kernel (for Hadoop-specific I/O and CPU scheduling)
- Hadoop Distributed File System, which is fault-tolerant
- MapReduce Engine
- upper layers like Hbase and Hive
The goal is for the system to have an awareness of where the data is
located. Operations on a given collection of data are scheduled, where
possible, on the same LAN as the data, avoiding backbone traffic. Hadoop
is generally "rack-aware", really meaning it knows when two nodes are on
the same high-speed switch.
Smaller-scale Hadoop setups have one master node (CPU) and multiple
"worker" nodes that are both DataNodes (repositories of data) and also
TaskTracker nodes. Data is replicated on multiple nodes; the default
degree of replication is 3.
Hadoop is designed to support a lot more reading than writing. Examples
might include Google's search index, Facebook pages or Amazon inventory.
In June 2012, Facebook had a 100 petabyte Hadoop installation, growing at
0.5 PB/day.
The Hadoop Filesystem is just that. Most ⟨key,value⟩ stores will be
implemented as Hadoop tables. Note, however, that any filesystem
is a ⟨key,value⟩ store where keys are filenames and values are file
contents.
MapReduce
MapReduce is a large-data paradigm developed by Google and based on a
common functional-language operation. The original overview paper by Dean
and Ghemawat is mapreduce-2004.pdf.
Not every Hadoop installation runs MapReduce, but this is the "native"
data-analysis tool. The MapReduce engine has one master JobTracker node,
which receives MapReduce tasks from clients and organizes them onto
TaskTracker nodes. TaskTracker nodes generally run a small number
(typically 4) of MapReduce tasks.
The idea behind MapReduce is that the programmer must supply two
components:
- an easily parallelizable map() operation that gets
run on every data unit
- a reduce() operation that combines the results of
the map() calls
The map() operation usually takes a ⟨key,value⟩ pair and returns a list
of related ⟨key,value⟩ pairs. A more specific description of the interface
is
map (in_key, in_value) -> list(out_key,
intermediate_value)
The out_keys in a single list need not be the same.
A reduce() operation typically takes a single out_key, and a list of all
(or some of) the intermediate_values associated with that out_key. The
result is a list of output values.
Here is one of google's basic examples of MapReduce. This example counts
the occurrence of each word in a set of documents.
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
Notice the map() operations generate long lists of words with a count of
1. The reduce() phase is applied to all the map() results with the
same key (that is, all map() results for the same word).
Notice also that the ⟨key,value⟩ input to map() is
⟨filename,file_contents⟩, quite different from the ⟨key,value⟩ pairs
generated as output.
Here's an example in which we assume that the map() operations actually
count all the individual words of each document:
doc1: here is a list of words
doc2: list of words with words in list and another list
doc3: here here here is words and words and another words
Map() outputs:
doc1: (here, 1) (is, 1) (a, 1) (list, 1) (of, 1) (words, 1)
doc2: (list, 3) (of, 1) (words, 2) (with, 1) (in, 1) (and, 1) (another, 1)
doc3: (here, 3) (is, 1) (words, 3) (and, 2) (another, 1)
Reduce inputs:
(here, 1) (here, 3)
(is, 1) (is, 1)
(a, 1)
(list, 1) (list, 3)
(of, 1) (of, 1)
(words, 1) (words, 2) (words, 3)
(with, 1)
(in, 1)
(and, 1) (and, 2)
(another, 1) (another, 1)
The reduce tasks then add up each word over each list, below. Note how
the MapReduce infrastructure must take the results returned by the map()s
and reorganize them by key (the shuffle) to feed into the
reduce() operations.
here 4
is 2
a 1
list 4
of 2
words 6
with 1
in 1
and 3
another 2
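The whole pipeline on the three documents above fits in a few lines of Python; map emits a ⟨word,1⟩ pair per occurrence, the shuffle groups pairs by word, and reduce sums the counts (function names are illustrative):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # emit (word, 1) for every word occurrence
    return [(word, 1) for word in contents.split()]

def shuffle(pairs):
    # group intermediate values by key, as the framework would
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets

def reduce_fn(word, counts):
    return word, sum(counts)

docs = {
    'doc1': 'here is a list of words',
    'doc2': 'list of words with words in list and another list',
    'doc3': 'here here here is words and words and another words',
}

intermediate = [pair for name, text in docs.items()
                for pair in map_fn(name, text)]
counts = dict(reduce_fn(w, vs) for w, vs in shuffle(intermediate).items())
# counts matches the totals above, eg counts['words'] == 6
```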
Search is another operation that lends itself to
MapReduce formulation. The map()s just output records containing the
value; the reduce() just consolidates the map output.
Reverse web-link graph: the map() searches source pages and
outputs a ⟨target,source⟩ pair each time a link to target is
found on page source. The reduce operation concatenates all
⟨target,source⟩ pairs with the same target into a pair
⟨target,list(source)⟩
A fair bit of machinery is needed to organize the map() outputs in a way
that can be fed into the reduce() inputs; this is sometimes called the shuffle
stage. This shuffle, though, is a generic, easily parallelizable
operation. Typically this phase uses extensive hashing on intermediate_key
values: each map() output pair ⟨ikey,ivalue⟩ is immediately put into
bucket hash(ikey) by the system. Then, in the reduce() phase,
⟨ikey,ivalue⟩ pairs with the same intermediate_key in one bucket are fed
to one reduce() task; of course, that bucket will contain all
the ⟨ikey,ivalue⟩ pairs with ikey = intermediate_key. The system keeps
track of the order of intermediate_key values in one bucket, for easier
later sorting.
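A minimal sketch of the hash-bucketing, with R = 4 reduce tasks chosen arbitrarily:

```python
import hashlib

R = 4  # number of reduce tasks / buckets

def bucket_of(ikey):
    # A stable hash, so bucket assignment is reproducible across runs
    # (Python's built-in hash() is salted per process).
    return int(hashlib.md5(ikey.encode()).hexdigest(), 16) % R

map_output = [("list", 3), ("of", 1), ("list", 1), ("words", 2)]
buckets = [[] for _ in range(R)]
for ikey, ivalue in map_output:
    buckets[bucket_of(ikey)].append((ikey, ivalue))

# Every pair with the same intermediate key lands in the same bucket,
# so one reduce() task sees all the ("list", n) pairs together.
print(buckets[bucket_of("list")])
```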
Hive
Hive (http://hive.apache.org/) is a
database application built on top of Hadoop. It supports ⟨key,value⟩
tables with indexes. It also supports complex data types (eg
large-file data types such as text files, subtables, and binary files).
Tables may contain partition columns (sometimes
"virtual" columns) that provide tags for partitioning the table among
multiple nodes; in the data-definition language this is indicated by a
PARTITIONED BY clause. Tables within one node may be organized into buckets,
eg by date or userid; these in turn are indicated by a CLUSTERED BY clause.
Here is an example table declaration from https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-WhatisHive:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '1'
    COLLECTION ITEMS TERMINATED BY '2'
    MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
Unlike in relational systems, a primitive array type is
supported; note the friends field above.
Arguably Hive's primary feature is that it supports an SQL-like query
language, HQL. It is not as general as SQL, but HQL queries are very
unlikely to run excessively slowly on huge data. HQL queries are
translated by Hive into MapReduce jobs.
HQL supports
- row selection using a WHERE clause
- column selection using a SELECT clause
- aggregation functions using a GROUP BY clause
- limited equijoins[!]
Query output always goes into an intermediate table, rather than to the
console. You can do this in SQL, too.
For joins, it is recommended to list the largest table last: Hive streams
the last table in the join and buffers the others.
Reservations problem
Not on the final!
Suppose we want to create a reservation system for trains (the traditional
mode of transportation for the Seat
Reservation Problem, which has an extensive literature). Here is
some information:
The train stops in Argenta, Brocton, Colp, Dongola, Elsah (4 links). (These
are actual Illinois towns, but almost certainly NOT on any kind of line.)
Suppose the following reservations are made:
seat 1 is reserved A-B
seat 2 is reserved B-C
seat 3 is reserved C-D
seat 4 is reserved D-E
Problem: none of seats 1-4 is now available for an end-to-end trip!
Note that there is always a
solution that involves no seat-changing, as long as no link is ever
overbooked: have passengers get on the train and find an empty seat. If
there is no seat, that link is overbooked. If they find a seat, they keep it
until they get off. The problem with this is that we would like to grant
reservations in the order made, rather than in the order of embarkation.
Some algorithms:
OPT (optimum): Essentially what was outlined above. We get all
reservation requests and find the arrangement that seats everyone. (A
variation may involve reservations made by groups, the "find a seat" method
may split up groups, which we might not want to allow.)
Worst:
Assign the first seat that has no reservations attached at all, as in the
example above.
First-Fit:
Assign the min seatnumber that is free for the desired interval.
Best-Fit:
Minimize the size of the empty space containing the seat. We need lots of
tie-breaking rules to make this precise.
With five seats, do we prefer _ _ x x _ or _ x x _ _? (x = reserved, _ =
vacant)
Worst-Fit:
maximize the empty space left open around a seat, on the theory that we're
leaving large blocks for later.
The problem case above is due to worst-fit.
Other algorithms used in practice:
Long-haul-only: Accept reservations for travelers who are traveling > 50%
of the route, say, first. A few days before travel, open up the reservation
system to other travelers.
Pooling:
have separate seat pools for short-haul and long-haul reservations. In
general, K pools where K <= #legs
Usual performance measure: worst-case ratio of # of passengers accommodated
using the algorithm to # of passengers that could be accommodated using OPT.
These algorithms are related to the memory-allocation problem, except that
we assume
- NO CANCELLATIONS (or few, at any rate)
- RESERVATIONS MUST BE CONTIGUOUS
Suppose there are only two legs on the schedule.
We can prove that first-fit is optimal: suppose some reservation is blocked.
Claim: it's because on one leg or the other, the train is full.
Take away the 2-leg (through) reservations. The remaining one-leg
reservations for leg 1 occupy an initial segment of the seats; similarly for
leg 2. Whichever leg had more reservations must have used ALL the remaining
seats.
Three legs, TWO SEATS
p1: leg 1
p2: leg 1,2
p3: leg 3
p4: leg 2,3
First-fit assigns:
p1: seat 1, leg 1
p2: seat 2, leg 1,2
p3: seat 1, leg 3
p4: no available seats
but best-fit assigns:
p1: seat 1, leg 1
p2: seat 2, leg 1,2
p3: seat 2, leg 3 (a tighter fit than seat 1)
p4: seat 1, leg 2,3
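The two heuristics on this example can be checked with a short simulation; first_fit, best_fit, and the tightness measure used to rank seats are my own sketch, not code from the text:

```python
# First-fit vs best-fit on the 3-leg, 2-seat example above.
# occupied[seat] is the set of legs already reserved for that seat.

requests = [("p1", [1]), ("p2", [1, 2]), ("p3", [3]), ("p4", [2, 3])]
NLEGS, NSEATS = 3, 2

def free(occupied, seat, legs):
    return not any(l in occupied[seat] for l in legs)

def first_fit(requests):
    occupied = {s: set() for s in range(1, NSEATS + 1)}
    result = {}
    for p, legs in requests:
        for s in range(1, NSEATS + 1):        # lowest free seat wins
            if free(occupied, s, legs):
                occupied[s].update(legs)
                result[p] = s
                break
        else:
            result[p] = None                   # blocked
    return result

def best_fit(requests):
    occupied = {s: set() for s in range(1, NSEATS + 1)}
    result = {}
    for p, legs in requests:
        # Among seats that fit, pick the one leaving the fewest free legs
        # after this reservation (the "tightest" fit).
        fits = [s for s in range(1, NSEATS + 1) if free(occupied, s, legs)]
        if fits:
            s = min(fits, key=lambda s: NLEGS - len(occupied[s]) - len(legs))
            occupied[s].update(legs)
            result[p] = s
        else:
            result[p] = None
    return result

print(first_fit(requests))  # p4 gets None: blocked
print(best_fit(requests))   # everyone is seated
```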
What if we just allow reservations for trips of length >= 2? This also
works.
Representing reservations:
Attempt 1:
⟨passenger, seat, leg, date⟩
create table reservations (
    passno integer primary key,   -- one passenger can't reserve two seats
    seat integer,
    leg integer,
    unique (seat, leg)            -- can't reserve the same seat twice
);
I left out DATE for simplicity.
Problem: queries for available seats are very difficult in pure SQL. A query
can only select fields from a subset of the records actually present; with
only the table above you cannot ask about vacancies (but see below).
To ask for seats reserved on leg 3:
select r.seat from reservations r where r.leg = 3;
To ask for seats reserved on legs 2 or 3:
select r.seat from reservations r where r.leg between 2 and 3;
But how to ask for free seats?
Attempt 2:
⟨passenger, seat, leg, date⟩ plus VACANCIES: ⟨seat, leg, date⟩, initially
full.
To ask for an open seat on leg 3 (again ignoring date, so just ⟨SEAT, LEG⟩):
select v.seat from VACANCIES v where v.leg = 3;
To ask for a seat open on both legs 2 and 3:
select v.seat from VACANCIES v where v.leg = 2
and exists (select * from VACANCIES v2 where v.seat = v2.seat and v2.leg = 3);
Attempt 1: table Reservations
Hard to find vacant seats on leg 3
Attempt 2: separate table of Vacancies
Easy to find vacant seats on leg 3, BUT tricky to guarantee consistency
Back to attempt 1:
create table reservations (
    passno integer,
    seat integer,
    leg integer,
    primary key (seat, leg)   -- can't reserve the same seat twice
);
create table seats (seat integer primary key);
create table legs (leg integer primary key);
Now, to find seats that are vacant on both legs 2 and 3 (MINUS is Oracle's
spelling; standard SQL calls it EXCEPT):
select seat from seats
minus
select r.seat from reservations r where r.leg between 2 and 3;
To implement the First-Fit algorithm:
select min(seat) from
  ( select seat from seats
    minus
    select r.seat from reservations r
    where r.leg between 2 and 3 );
First-fit: to reserve a seat for legs A-B:
select min(seat) from
  ( select seat from seats
    minus
    select r.seat from reservations r
    where r.leg between A and B );
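A quick way to check this query is to run it in SQLite, which spells MINUS as EXCEPT; the data below reproduces the 4-leg example where seats 1-4 each block one leg:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table reservations (
        passno integer, seat integer, leg integer,
        primary key (seat, leg));
    create table seats (seat integer primary key);
    create table legs  (leg  integer primary key);
""")
db.executemany("insert into seats values (?)", [(s,) for s in range(1, 6)])
db.executemany("insert into legs values (?)", [(l,) for l in range(1, 5)])
# Seat 1 reserved A-B (leg 1), seat 2 B-C (leg 2), etc.
db.executemany("insert into reservations values (?, ?, ?)",
               [(100 + s, s, s) for s in range(1, 5)])

# First-fit for legs 2-3: lowest seat not reserved on either leg.
row = db.execute("""
    select min(seat) from
      ( select seat from seats
        except
        select r.seat from reservations r
        where r.leg between 2 and 3 )
""").fetchone()
print(row[0])   # seat 1: it is only reserved on leg 1
```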
Best-fit: I couldn't do this in pure SQL. The idea: for each seat, get the
length of the open block of legs containing A-B (zero for some seats); then
order the seats by that size and consider only the blocks of minimum size.
This is almost expressible as a SQL query; the pieces follow.
AVAIL(s, lo, hi): seat s is available on legs lo through hi:
not exists (select * from reservations r
            where r.seat = s and r.leg between lo and hi)
For each seat, the biggest block of free legs at or below 3:
select s.seat, min(l.leg) from SEATS s, LEGS l
where l.leg <= 3
and not exists (select * from reservations r
                where r.seat = s.seat and r.leg between l.leg and 3)
group by s.seat;
select s.seat, min(lo.leg), max(hi.leg) from SEATS s, LEGS lo, LEGS hi
where lo.leg <= 3 and hi.leg >= 3
and not exists (select * from reservations r
                where r.seat = s.seat
                and r.leg between lo.leg and hi.leg)
group by s.seat;
select s.seat, min(lo.leg), max(hi.leg), max(hi.leg) - min(lo.leg) as gap
from SEATS s, LEGS lo, LEGS hi
where lo.leg <= 3 and hi.leg >= 3
and not exists (select * from reservations r
                where r.seat = s.seat
                and r.leg between lo.leg and hi.leg)
group by s.seat
order by gap;
Range query (here for legs 3-4):
select s.seat, min(lo.leg), max(hi.leg), max(hi.leg) - min(lo.leg) as gap
from SEATS s, LEGS lo, LEGS hi
where lo.leg <= 3 and hi.leg >= 4
and not exists (select * from reservations r
                where r.seat = s.seat
                and r.leg between lo.leg and hi.leg)
group by s.seat
order by gap;
Best fit:
1. Put the rows above into an array, stopping when the value of GAP exceeds
the value of GAP in the first row.
2. Apply tie-breaker rules to the entries in the array.
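The gap query can likewise be checked in SQLite; the sample reservations here (seats 1, 2, and 4 each blocked on one leg, seat 3 wide open) are made up for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table reservations (
        passno integer, seat integer, leg integer,
        primary key (seat, leg));
    create table seats (seat integer primary key);
    create table legs  (leg  integer primary key);
""")
db.executemany("insert into seats values (?)", [(s,) for s in range(1, 5)])
db.executemany("insert into legs values (?)", [(l,) for l in range(1, 5)])
db.executemany("insert into reservations values (?, ?, ?)",
               [(101, 1, 1), (102, 2, 2), (104, 4, 4)])

# For each seat free on leg 3: the free block of legs around leg 3 and
# its size; best-fit then takes the row with the smallest gap.
rows = db.execute("""
    select s.seat, min(lo.leg), max(hi.leg), max(hi.leg) - min(lo.leg) as gap
    from seats s, legs lo, legs hi
    where lo.leg <= 3 and hi.leg >= 3
      and not exists (select * from reservations r
                      where r.seat = s.seat
                        and r.leg between lo.leg and hi.leg)
    group by s.seat
    order by gap
""").fetchall()
print(rows)   # first row is seat 2, the tightest fit (gap 1)
```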