Comp 353/453: Database Programming, Corboy
L08, 4:15 Mondays
Week 13, Apr 22
Transactions
A transaction is a set of operations, idealized as read(X) and
write(X), that at the end we either commit
(save) or rollback/abort (discard).
Generally we hope rollbacks are rare.
Abstract transactions: operations are read(X) and write(X), for X a
field/record/block/table.
Fig 21.2: two example transactions.
T1: read(X), write(X), read(Y), write(Y)
T2: read(X), write(X)
Ideally, we execute the transactions one at a time, with no overlap. In
practice, because of the considerable I/O-wait for disk retrieval, we cannot
afford that. Transactions will overlap.
We need to figure out how to arrange for this to happen safely.
Here are two problems:
Lost-update (write/write) problem:
this occurs when one transaction writes a value that another then
overwrites. See Fig 21.3a. T1 is
transferring N dollars from X to Y, while T2 is adding M dollars to X.
T2.write(X) overwrites
T1.write(X).
    T1                          T2
    read(X)
    X = X-N
                                read(X)
                                X = X+M
    write(X)
    read(Y)
                                write(X)
    Y = Y+N
    write(Y)
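The lost update is easy to reproduce. A minimal Python sketch of the interleaving above, simulating the two transactions' local copies of X (the values N=10, M=20 are illustrative):

```python
# Simulate Fig 21.3a's interleaving on a single shared value X.
# Run serially, X should end at 100 - N + M = 110.
X = 100
N, M = 10, 20

t1_local = X        # T1: read(X)
t1_local -= N       # T1: X = X - N
t2_local = X        # T2: read(X), still sees the old value 100
t2_local += M       # T2: X = X + M
X = t1_local        # T1: write(X)
X = t2_local        # T2: write(X), clobbering T1's write

# X is now 120, not 110: T1's subtraction of N is lost
```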
Dirty-Read (read/write) problem:
this occurs when one transaction reads an item that another then overwrites.
This would end up as a lost-update, but consider Fig
21.3b where T1 actually aborts after write(X).
    T1                          T2
    read(X)
    X = X-N
    write(X)
                                read(X)
                                X = X+M
                                write(X)
    read(Y)
    abort/rollback
Fig 21.3c shows a corrupted result; an
abort would have been an improvement!
The system log, or journal,
is a master list of all operations performed, used in the event of rollback.
Failures leading to rollback may be system crash, lack of disk space, lookup
error (eg SQL error), locking problems, or other physical or logical errors.
Items read here can be individual record attributes, entire records, or
entire tables. When joining tables, the entire table is read.
The ACID test
ACID is an acronym enumerating the following set of features relating to
support for parallel transactions.
- Atomicity: either all steps of
a transaction are performed, or none are. We either commit the entire
thing or rollback completely.
- Consistency preservation: the
individual transaction should preserve database consistency and database
invariants
- Isolation: transactions should
appear to be executing in isolation.
- Durability: committed
transactions survive DB crashes
On the face of it, durability has little to do with parallelism, and
consistency is an attribute for individual transactions. There are other
redundancies here as well; the main points for transactions are atomicity
and isolation, with the latter understood as a special case of correctness.
Note that Consistency is closely related to Isolation: if each transaction
preserves consistency, and the transactions appear to execute in Isolation,
then a series of transactions will be consistency-preserving, at least if
Isolation means something like Serializability. However, serializability
might be a stronger condition than necessary, if we really had a good
definition for isolation. (Which we in fact do not; think about it. Query 1
may affect the outcome of Query 2 even if they are executed one after the
other.)
Moral: beware of choosing your terms to fit a clever acronym.
(the above we actually covered Week 12)
Transaction schedules
Transaction schedules are lists of read/write operations of each
transaction, plus commit/abort. The operations of each individual
transaction must be in their original order.
Eventually we will enforce
schedules by using locking, but for
now we just consider schedules abstractly.
Examples:
Sa: Fig 21.3a:
r1(X), r2(X), w1(X), r1(Y), w2(X), w1(Y), c1, c2
Sb: Fig 21.3b:
r1(X), w1(X), r2(X), w2(X), r1(Y), a1
Two operations in a schedule conflict
(where the word is to be taken as indicating potential
conflict) if:
- They belong to different transactions
- They access the same data item (eg X)
- At least one of the operations is a write()
For example, in Sa, r1(X) and w2(X) conflict. So do r2(X) and w1(X). What
does not conflict? r1(X) and r2(X), r1(X) and w1(X).
Conflicting operations are those where interchanging the order can result in
a different outcome. A conflict is read-write
if it involves a read(X) in one transaction and write(X) in the other, and write-write if it involves two
transactions each doing write(X).
A complete schedule is a total
order of all the operations, including aborts/commits. (The book allows some
partial ordering of irrelevant parts.)
Given a schedule S, the committed
projection C(S) is those operations of S that are part of
transactions that have committed. This is sometimes useful in analysing
schedules when transactions are continuously being added.
We can build a transaction precedence
graph based on schedule conflict, given a schedule S. Let us
say Ti ⟶ Tj if Ti and Tj conflict as above, and one of the following occurs:
- Ti executes write(X) and later Tj executes read(X)
- Ti executes read(X) and later Tj executes write(X)
- Ti executes write(X) and later Tj executes write(X)
These are the three specific cases of conflict that can occur; note that at
this level we distinguish between read-write and write-read.
Demo: build graphs for Fig 21.3(a), (b).
Note that the figure determines the schedule.
Recoverable and nonrecoverable schedules
Given two transactions T and T', we say T reads
from T', or T depends on
T', if T' does a write(X) and then later T does a read(X) (this is Case 1 of
T'⟶T above). Thus, the outcome of T may depend, at least in part, on the
earlier T'.
A schedule is recoverable if no
transaction T can commit until all T' where T reads from T' have already
committed. Otherwise a schedule is nonrecoverable.
The point is that if T reads from T', then if T' aborts, T must abort
too. So T waits to commit until T' commits. Given a recoverable
schedule, we never have to abort a
committed transaction. This is, in practice, an essential rule.
Consider the following modification of Sa above:
S'a: r1(X), r2(X), w1(X), r1(Y), w2(X), c2, w1(Y), c1.
This is recoverable even though there is a risk of lost
update. But now consider
Sc: r1(X), w1(X), r2(X), r1(Y), w2(X), c2, a1;
nonrecoverable: T2 reads X from T1 but T2 commits first
Sd: r1(X), w1(X), r2(X), r1(Y), w2(X), w1(Y), c1,
c2; recoverable because T2 waited to commit until T1 did
Suppose T1 aborts in Sd. Then T2 must abort as well, since it read X from
T1; we get Se:
Se: r1(X), w1(X), r2(X), r1(Y), w2(X), w1(Y), a1, a2
Though recoverability implies no committed transaction needs to be rolled
back, we still may have cascading rollback.
A schedule is said to avoid cascading
rollback if every transaction T only reads items that were earlier
written by committed transactions.
In this case, an abort of some other transaction will not cause T to
abort. How can we achieve this in Sd/Se? By delaying r2(X) until after
c1.
There is also a notion of strict
schedule, in which transactions do not use X for either read or write until
all transactions that wrote X have committed. This is more expensive to
require.
strict implies cascadeless implies recoverable, but not the reverse.
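The recoverability condition can be checked mechanically. A minimal Python sketch, assuming a schedule is represented as a list of (op, txn, item) tuples with op one of 'r', 'w', 'c', 'a' (item is None for commit/abort), and ignoring the bookkeeping that aborts would require:

```python
from collections import defaultdict

def is_recoverable(schedule):
    """True if no transaction commits before every transaction
    it read from has committed."""
    last_writer = {}               # item -> most recent writer
    reads_from = defaultdict(set)  # txn -> txns it read from
    committed = set()
    for op, txn, item in schedule:
        if op == 'w':
            last_writer[item] = txn
        elif op == 'r':
            writer = last_writer.get(item)
            if writer is not None and writer != txn:
                reads_from[txn].add(writer)
        elif op == 'c':
            # txn may not commit before everyone it read from
            if any(t not in committed for t in reads_from[txn]):
                return False
            committed.add(txn)
    return True

# Sc above: T2 reads X from T1 but commits first -- nonrecoverable
Sc = [('r',1,'X'), ('w',1,'X'), ('r',2,'X'), ('r',1,'Y'),
      ('w',2,'X'), ('c',2,None), ('a',1,None)]
# Sd above: T2 waits to commit until T1 does -- recoverable
Sd = [('r',1,'X'), ('w',1,'X'), ('r',2,'X'), ('r',1,'Y'),
      ('w',2,'X'), ('w',1,'Y'), ('c',1,None), ('c',2,None)]
```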
Consider the following three transactions:
T1: X+=10, Y*=1.1;
T2: Y+=10, Z *= 1.1;
T3: Z+=10, X *= 1.1;
What happens if we do these in order T1, T2, T3? If X, Y and Z are initially
100 (so adding 10 is the same as multiplying by 1.1), then in the order
above we finish with X=121 (multiplication was second), Y=120
(multiplication was first) and Z=120 (again, multiplication was first).
We get similar end results in any
serialized order: one of the values will be 121, and the other two will be
120.
What happens if we create a schedule in which we do all the additions and
then all the multiplications? Note that, if we defer the commits, this can
be a recoverable schedule. And there are no lost updates. But it is still
not equivalent to any serial schedule: we end up with X=Y=Z=121.
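We can verify the arithmetic directly. A quick Python check of a serial order versus the add-then-multiply interleaving (the helper names are illustrative; the multiplications are subject to floating-point rounding):

```python
def run(db, ops):
    # apply a list of (key, function) operations in order
    for key, f in ops:
        db[key] = f(db[key])
    return db

add10 = lambda v: v + 10
mul11 = lambda v: v * 1.1

# serial order T1, T2, T3
serial = run({'X': 100, 'Y': 100, 'Z': 100},
             [('X', add10), ('Y', mul11),    # T1
              ('Y', add10), ('Z', mul11),    # T2
              ('Z', add10), ('X', mul11)])   # T3
# serial: X = 121, Y = 120, Z = 120 (up to rounding)

# all the additions first, then all the multiplications
mixed = run({'X': 100, 'Y': 100, 'Z': 100},
            [('X', add10), ('Y', add10), ('Z', add10),
             ('Y', mul11), ('Z', mul11), ('X', mul11)])
# mixed: X = Y = Z = 121, matching no serial order
```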
Serializability
The ideal situation with transactions is that they are executed serially,
but that is too inefficient. Instead, we look for schedules that are in some
sense equivalent to a serial
schedule. In Fig 21.5, (a) and (b) show
two serial schedules. Schedule (c) has a lost update (T1's write(X) is
clobbered by T2's later write(X)); schedule (d) is ok (in a sense to be
defined below) because we can move T2 to the end of T1.
A schedule is serializable if it is
equivalent to a serial schedule. For equivalence, we can use result
equivalence, but that is intractable. Instead we use conflict
equivalence. Two schedules are conflict-equivalent
if the order of any two conflicting
operations is the same in both schedules. We can reorder the other
(nonconflicting) parts to get what we need.
Conflict-serializability algorithm
The first step is to build the transaction precedence (directed) graph
above, in which we have an arc Ti⟶Tj if any of these three cases holds:
- Ti executes write(X) and later Tj executes read(X)
- Ti executes read(X) and later Tj executes write(X)
- Ti executes write(X) and later Tj executes write(X)
If the graph has no cycles, the schedule is conflict-serializable. To
look for cycles, we do a topological sort (ie find a total order of the
nodes consistent with the Ti⟶Tj ordering; see below). The original schedule
is conflict-equivalent to the schedule of executing the transactions in that
order.
The actual topological-sort algorithm is as follows: first, find all Ti that
do not have any predecessors, ie no Tk with Tk⟶Ti. Schedule these first, and
remove all their outgoing Ti⟶Tj links. Now repeat, until there are no more
nodes. If at some stage nodes remain but none is without a predecessor, then
there must be a cycle.
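The whole test fits in a short Python sketch. Schedules are lists of (op, txn, item) tuples (commits omitted); the function builds the precedence graph from the conflict rules above and runs Kahn's topological sort, returning an equivalent serial order or None if there is a cycle:

```python
def serializable_order(schedule):
    """Return a serial order conflict-equivalent to the schedule,
    or None if the precedence graph has a cycle."""
    edges = set()
    for i, (op1, t1, x1) in enumerate(schedule):
        for op2, t2, x2 in schedule[i+1:]:
            # conflict: different txns, same item, at least one write
            if t1 != t2 and x1 == x2 and 'w' in (op1, op2):
                edges.add((t1, t2))     # Ti -> Tj
    txns = {t for _, t, _ in schedule}
    indegree = {t: sum(1 for _, b in edges if b == t) for t in txns}
    order = []
    ready = [t for t in txns if indegree[t] == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        for a, b in edges:              # remove t's outgoing links
            if a == t:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return order if len(order) == len(txns) else None

# Sa (Fig 21.3a): the cycle T1 -> T2 -> T1 means not serializable
Sa = [('r',1,'X'), ('r',2,'X'), ('w',1,'X'), ('r',1,'Y'),
      ('w',2,'X'), ('w',1,'Y')]
# a schedule equivalent to the serial order T1, T2
Sb = [('r',1,'X'), ('w',1,'X'), ('r',2,'X'), ('w',2,'X')]
```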
Fig 21.7 shows the graphs for the schedules
in Fig 21.5.
Debit-credit transactions
These are transactions built from pieces of the form read(X), X+=N,
write(X). We do not have to serialize these! As long as we avoid the
lost-update problem, we can interleave increments and decrements from
different transactions without harm, because addition is commutative. The
schedule below is not conflict-serializable, but it yields the same result,
namely X = X+N−M, Y=Y+M−N.
    T1                          T2
    read(X)
    X += N
    write(X)
                                read(Y)
                                Y += M
                                write(Y)
    read(Y)
    Y -= N
    write(Y)
                                read(X)
                                X -= M
                                write(X)
Locking
We will only consider relatively simple locks; complex locks tend to be
inefficient. A binary lock has two
states: locked and unlocked.
We can implement a per-field/record/block/table lock with a locking
table; at any one time, we are likely to have relatively few locks.
When we lock something, we may have to wait.
See Fig 22.1. Ideally, we wait in a queue on the lock.
The actual inner details of lock()
and unlock() must be atomic. On
multi-processor systems, it is not
enough simply to disable interrupts; we must actually involve some kernel
locking.
A straightforward improvement on binary locks is read-write
locks: write locks imply exclusive access, but several transactions
can read-lock the same data value concurrently. Note that if we intend to
write X eventually, then we will need to write-lock X before
reading X; otherwise we may have lost updates.
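A non-blocking read-write lock table along these lines is easy to sketch in Python (the class and method names are illustrative); each method reports whether the lock was granted, rather than queueing the requester:

```python
class LockTable:
    """Per-item read-write locks: many readers OR one writer."""
    def __init__(self):
        self.readers = {}   # item -> set of txns holding read locks
        self.writer = {}    # item -> txn holding the write lock

    def read_lock(self, txn, item):
        w = self.writer.get(item)
        if w is not None and w != txn:
            return False                 # another txn holds a write lock
        self.readers.setdefault(item, set()).add(txn)
        return True

    def write_lock(self, txn, item):
        w = self.writer.get(item)
        others = self.readers.get(item, set()) - {txn}
        if (w is not None and w != txn) or others:
            return False                 # conflicting reader or writer
        self.writer[item] = txn          # also upgrades txn's own read lock
        return True

    def unlock(self, txn, item):
        self.readers.get(item, set()).discard(txn)
        if self.writer.get(item) == txn:
            del self.writer[item]

locks = LockTable()
assert locks.read_lock(1, 'X') and locks.read_lock(2, 'X')  # shared reads
assert not locks.write_lock(1, 'X')     # denied: T2 still reads X
```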
Two-phase locking
Two-phase locking is a simple protocol in which all locking operations
(including upgrades of read-lock(X) to write-lock(X)) precede the first
unlock operation. The locking phase is called the expanding
phase; the unlock is the shrinking
phase.
If all transactions use two-phase locking, then
any schedule will be conflict-serializable.
In fig 22.3(a), the two-phase locking
protocol is not followed. The
schedule in 22.3(c) is allowed by the locks, but it is not serializable.
Fig 22.4 shows the same transactions using
the two-phase locking protocol. Here, however, we have the potential for deadlock.
Two-phase locking comes in flavors. The version above is called basic.
We also have:
- conservative: transactions
must lock all the items to be accessed at the start of the transaction.
This is deadlock-free, but sometimes the transaction does not know every
item to be read until partway through.
- strict: no write locks are
released until commit/rollback. This means that no other transaction
that reads from this one can do so until the first one has committed.
This leads to a strict schedule,
which is then cascadeless.
Deadlocks may occur.
- rigorous: no locks at all are
released until commit/rollback. Rigorous two-phase locking is sort of a
mirror image of conservative two-phase locking.
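The two-phase condition itself is purely syntactic, so it can be checked mechanically. A Python sketch, representing each transaction's lock activity as a flat sequence of (txn, 'lock'/'unlock') events (the representation is illustrative; lock upgrades count as locks):

```python
def obeys_2pl(ops):
    """True if, for every transaction, all lock operations
    precede its first unlock operation."""
    shrinking = set()          # transactions that have begun unlocking
    for txn, action in ops:
        if action == 'unlock':
            shrinking.add(txn)
        elif action == 'lock' and txn in shrinking:
            return False       # lock after unlock violates two-phase rule
    return True
```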
Deadlock
Locking protocols generally introduce the risk of deadlock:
    T1                          T2
    write_lock(X)
                                write_lock(Y)
    write_lock(Y)
                                write_lock(X)
    ...                         ...
Perhaps the simplest strategy for deadlock prevention is to lock everything
according to one master order. This is impractical for the programmer to
enforce; the database can only do it if we ask for all the locks together:
write_lock(X,Y,Z).
In practical terms, we also have to deal with deadlock detection,
and then decide which transaction will die and which will wait. Note that it
soon becomes obvious in practice that deadlock has occurred: any transaction
that waits for more than a few seconds on a lock, and is not moving up in
the queue for that lock, is likely deadlocked. More formally, we can
construct a wait-for graph, and
look for cycles.
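A wait-for-graph cycle check is a few lines of Python (a sketch; the graph is assumed to be a dict mapping each transaction to the set of transactions whose locks it is waiting on):

```python
def has_deadlock(waits_for):
    """Detect a cycle in the wait-for graph via depth-first search."""
    visiting, done = set(), set()
    def dfs(t):
        if t in visiting:
            return True                 # back edge: cycle found
        if t in done:
            return False
        visiting.add(t)
        for u in waits_for.get(t, ()):
            if dfs(u):
                return True
        visiting.discard(t)
        done.add(t)
        return False
    return any(dfs(t) for t in waits_for)

# the write_lock example above: T1 waits for T2 and vice versa
assert has_deadlock({1: {2}, 2: {1}})
```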
Timestamps
An alternative to locks entirely is the use of timestamps. We assign each
transaction a timestamp, and then only allow schedules that are equivalent
to executing the transactions in timestamp order. To do this, each database
item has associated with it a read-timestamp and a write-timestamp; the
former is the latest timestamp of all transactions that have read X, and the
latter is the latest timestamp of all transactions that have written X. If a
transaction T attempts read(X), we first compare T's timestamp with
write-timestamp(X): if T's timestamp is earlier, T would be reading a value
written "in its future", so we abort T and put it back in the transaction
queue with a later timestamp. Similarly, if T attempts write(X) and T's
timestamp is earlier than either read-timestamp(X) or write-timestamp(X), we
abort and restart T. This protocol can involve cascading rollback, so further
constraints are needed to get it to generate cascadeless or strict schedules
(or even recoverable schedules, for that matter).
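The basic timestamp-ordering rules can be sketched in Python (the names are illustrative); an operation either succeeds, updating the item's timestamps, or signals that the transaction must abort and restart with a later timestamp:

```python
class Abort(Exception):
    """The transaction must roll back and restart with a later timestamp."""

class Item:
    def __init__(self):
        self.read_ts = 0    # latest timestamp to have read this item
        self.write_ts = 0   # latest timestamp to have written it

def to_read(item, ts):
    if ts < item.write_ts:
        raise Abort         # the value was written "in our future"
    item.read_ts = max(item.read_ts, ts)

def to_write(item, ts):
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort         # a later transaction already used the item
    item.write_ts = ts
```

For example, if the transaction with timestamp 5 reads X, a subsequent write of X by the older transaction with timestamp 3 is rejected.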
Transactions in PHP
See http://php.net/manual/en/pdo.transactions.php. Here's an example from
that page:
try {
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->beginTransaction();
    $db->exec("insert into staff (id, first, last) values (23, 'Joe', 'Bloggs')");
    $db->exec("insert into salary (id, amount) values (23, 50000)");
    $db->commit();
} catch (Exception $e) {
    $db->rollBack();
    echo "Failed: " . $e->getMessage();
}
Bear in mind that you should not interpose any wait for user input between
the beginTransaction() and the commit(); otherwise you may be holding a lock
"indefinitely".
Crash Recovery
A crucial feature of transactions (the "atomic" property) is that if the
database crashes, or is restarted (eg due to deadlocks), or one transaction
fails, then no transaction is left "partially" executed: every transaction
either completes or is rolled back.
The typical way to implement this is with a "transaction journal file",
which is carefully written to disk after every update. This journal might
contain:
- a transaction ID
- original values of the data fields to be updated
- a flag indicating that the transaction is pending
The flag is set to "committed" at the end. Then, if the DB is restarted
after a crash, we can restore the records involved to their pre-transaction
state.
Note that we also will now need to roll back any other transaction that
depended on this one; we can use the transaction ID for that.
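A minimal sketch of this style of journal-based recovery in Python, assuming an undo log of ('old', txn, item, old_value) records plus ('commit', txn) markers (the record layout is illustrative):

```python
def recover(db, log):
    """Undo every write belonging to an uncommitted transaction,
    newest first, restoring the pre-transaction values."""
    committed = {rec[1] for rec in log if rec[0] == 'commit'}
    for rec in reversed(log):
        if rec[0] == 'old' and rec[1] not in committed:
            _, txn, item, old_value = rec
            db[item] = old_value
    return db

# T1 added 10 to X and committed; T2 added 15 but crashed first
db = {'X': 125}
log = [('old', 1, 'X', 100), ('commit', 1),
       ('old', 2, 'X', 110)]
recover(db, log)
# X is back to 110: T1's committed update survives, T2's is undone
```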
NoSQL
There is currently a movement that suggests that the days of relational
databases are waning, and that emerging needs to manage very large databases
means that new approaches must be found. A typical NoSQL database supports
tables that allow lookup via one primary key column, but cannot necessarily
be joined to other tables, or searched by other columns. Such a structure --
sometimes called a ⟨key,value⟩ store --
looks something like a giant version of a Java HashMap, though often with
multiple data columns.
Not everyone is in agreement with this: see I
Can't Wait for NoSQL to Die, by Ted Dziuba (archived here).
Also note that, more and more, NoSQListas claim that their acronym stands
for "Not only SQL", as if to
recognize that SQL and Relationalism will always have a role to play.
But for the Facebooks and Amazons and Googles out there, a traditional RDBMS
has limitations. The issue, however, isn't simply a matter of size: a big
part of it is the need to partition the database to support huge numbers of
users.
Also, there are several different ways to think about size:
- Raw number of records
- Number of records being written;
large distributed read-only (or read-mostly) databases are relatively
easy
- Number of simultaneous locks,
related to the number of simultaneous writes.
Large distributed email databases, where everyone is constantly having new
email written to their inbox, are relatively easy too: any inbox is accessed
by only two entities: the mail deliverer (adding mail) and the user
(deleting it).
As a second example, consider the user "walls" on Facebook. These are viewed
from numerous points, and updated from numerous sources. However, the
updates are timestamped, and we can rely on the updates arriving eventually.
Thus, if we replicate a given user's wall over several databases, there may
be moments when different versions of the wall show different comments.
However, we are assured of eventual
consistency: a given comment will eventually appear in every wall
instance, in the correct order.
As a third example, consider Amazon's online-ordering database. A single
user placing an order can be assigned to one particular, smaller database.
The items are entered, the payment is made, and then Amazon needs to process
the order. But the latter can be done in a relatively leisurely fashion. The
important part, however, is making sure the inventory
is updated; in particular, we don't want to sell the last copy of
something to two different people.
At first glance this suggests that the inventory database needs to be consistent at all times. But there's a
problem: the CAP theorem.
CAP theorem
Eric Brewer's CAP theorem: Consider the following three attributes for large
distributed databases:
- Consistency of view across all
locations
- Availability at all times
- Partition tolerance: if a link
breaks, the database pieces can keep going
Theorem: you can only have two of them.
Argument: Suppose you have the last two. A link does break, and the two
halves continue in isolation. Then we cannot possibly hope for consistency.
But why do we even care? Why ever build a distributed database?
Because you're tired of being poor, that's why. You want to build an Amazon,
an eBay, a Google, not a single-server DB that can process 100 queries a
second when you have 10,000 arriving.
Brewer's talk is here.
Gilbert & Lynch's proof is here.
If we insist on a single master DB, we forfeit partition tolerance. Brewer's
theorem does not mention this explicitly, but for databases to be really
large we want to be able to distribute them.
A classic distributed database forfeits availability: if the link between
nodes goes down, the database is unavailable to anyone, even those able to
access one of the nodes.
Consider DNS as a big database: we do not
have consistency.
One solution is eventual consistency:
when connectivity returns, we resolve conflicting transactions. DNS does
have eventual consistency. The problem case, of course, is when two
transactions involve updates to the same item. Note that for the Facebook
wall example above, there aren't really any conflicting transactions (unless
you count this: A posts, B comments on A's post, A deletes his/her post
without seeing B's. But a related form of this happens all the time even
when the data is consistent: A posts, B comments, A deletes
without refreshing the page).
The worst situation is when Amazon has one book left, and two people have
just bought it. We can guarantee this won't happen with a central database
(or even a separate central database for each of several disjoint sets
of products, eg DB[N] is for all items where SKU mod 137 = N). But eventual
consistency is not good here: it leads to scenarios where the last copy is
sold twice.
Note, however, that while eventual consistency may not be good here, it may
in fact be good enough. What
should Amazon do if the last copy is sold twice? They can just tell one of
the buyers that, too bad, the item they ordered turned out to be out of
stock. This happens often enough anyway, due to discrepancies between the
online inventory and the physical inventory.
In other words, eventual consistency may not be ideal, but it may be a
workable arrangement.
Formally, eventual consistency means that if no new
updates occur, then eventually all accesses will return the last-updated
value, or the winner of the conflicting-update resolution. DNS does this.
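One common conflict-resolution rule is last-writer-wins, keyed on update timestamps. A minimal Python sketch of merging two diverged replicas under that rule (the data layout is illustrative):

```python
def lww_merge(a, b):
    """Merge two replicas mapping key -> (timestamp, value);
    for each key, the later-timestamped update wins."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# two replicas of a wall diverged; both merge orders converge
replica1 = {'wall': (1, 'first post')}
replica2 = {'wall': (2, 'first post + comment')}
```

Note that last-writer-wins silently discards the losing update, which is exactly why it is acceptable for a wall comment but not for selling the last book.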
Denormalization
Aside from the CAP theorem, there are other problems with RDBMSs. Joins, for
one. Joining two big tables can be expensive. The advantage of joins is
traditionally that the underlying database is protected from anomalies. But
suppose we eliminate those through careful front-end programming? (As an
example, suppose we move the constraint that SSN is a key of the EMPLOYEE
table out of MySQL and into our PHP front end. It would be straightforward
to implement.)
For as long as RDBMSs have been around, DB managers have engaged in
occasional "denormalization" to avoid a particular join for performance
reasons. One example might be a TOTAL field in the INVOICE table: the total
for a specific invoice_num is calculated by finding all items ordered, as
listed in INVITEM:
select sum(quantity*price) from INVITEM ii where
ii.invnum = invoice_num;
This is a relatively expensive join and search! Storing a TOTAL field in the
INVOICE table introduces a potential inconsistency, but it is also rather
common practice for performance reasons.
EN6 gives a denormalization example on pp 731-732. Consider a view ASSIGN:
⟨emp_id, project_id, emp_lname, percent_assigned,
proj_name, proj_mgr_id, proj_mgr_name⟩
This is a complex join, even after replacing percentages with project_hours
and total_hours:
select e.ssn, p.pnumber, e.lname, w.hours,
sum(w2.hours), p.pname, d.mgr_ssn, m.lname
from employee e, employee m, project p, works_on w, works_on w2,
department d
where e.ssn = w.essn and p.pnumber = w.pno and e.ssn = w2.essn and p.dnum
= d.dnumber and d.mgr_ssn = m.ssn
group by e.ssn;
Sometimes materialized views are
used for this. Note that this amounts to a denormalized table, though it is
now one that the DBMS is responsible for keeping synchronized.
select e.ssn, p.pnumber, e.lname, w.hours,
(select sum(w2.hours) from works_on w2 where w2.essn = e.ssn) as total,
p.pname, d.mgr_ssn, m.lname
from employee e, employee m, project p, works_on w, department d
where e.ssn = w.essn and p.pnumber = w.pno and p.dnum = d.dnumber and
d.mgr_ssn = m.ssn
group by e.ssn;
Big Data Models
The basic big-data model is the ⟨key,value⟩ store, meaning
that from the key we can retrieve a single matching record. Lists are often
directly supported as values because the relational-DB way of representing
lists (as a new table indexed by key and listitem) is too slow.
Often these models are not only NoSQL but are "schema free", usually meaning
that they store ⟨key,value⟩ pairs where the value can be some structured
data, often in JavaScript Object Notation (JSON), as in the example below
(note the hierarchical subtyping and the support for list structures). Some
data models might allow two different keys: ⟨key1,key2,value⟩.
{
    "ssn"     : "123456789",
    "name"    : { "fname" : "John", "lname" : "Smith" },
    "bdate"   : "1965-01-09",
    "address" : { "street" : "731 Fondren", "city" : "Houston", "state" : "TX" },
    "salary"  : 30000,
    "dno"     : 5,
    "projects": [ { "pno" : "3", "hours" : 30 }, { "pno" : "20", "hours" : 10 } ]
}
The main point is that for a given key the rest of the data can be any type;
that is, we in effect have a map of key strings to Object.
Some big-data models support sharding: splitting up rows
of a dataset among multiple locations.
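A toy Python version of a sharded ⟨key,value⟩ store (the class and the hash-modulo placement scheme are illustrative; real systems use range or consistent hashing across machines):

```python
import hashlib

class ShardedStore:
    """Rows are split among n shards by a hash of the key."""
    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard(self, key):
        # stable hash so the same key always maps to the same shard
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value   # value can be any structured object

    def get(self, key):
        return self._shard(key).get(key)

store = ShardedStore(4)
store.put('123456789', {'fname': 'John', 'lname': 'Smith', 'dno': 5})
```

Lookup by the primary key touches exactly one shard; this is why such stores scale easily, and also why joins and secondary-column searches are hard.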
Examples:
BigTable: a Google
internal product. Tables can be subdivided (and then distributed). To look
up a value, you supply the usual row key, and also a column
key: an identifier for the particular column (or columns) you
want to retrieve.
Hadoop: a public Apache
project. The framework is typically used to implement distributed
filesystems, or other large projects with a high degree of distributedness.
Cassandra: also
an Apache project. Cassandra is specifically a database intended to handle
very large amounts of data spread over a large number of servers. The data
model was taken from BigTable. A given key can map to multiple columns,
often grouped into column families. Column families are in effect declared
through a database schema, but the actual columns within families are not.
Until 2010, the Facebook messaging system used Cassandra; it then moved to
HBase.
HBase: another Apache
database with close ties to the Hadoop framework.
Hive: the database
running on top of Hadoop that is used by Facebook.
MongoDB: A
"document-centric" database using JSON-like syntax (above). But field names
do exist (even if they are now dynamic). Searches can be done on any field,
and indexes can be created. Data distribution is again a major goal; Mongo
also includes load-balancing features.
CouchDB: Couch ("Cluster
Of Unreliable Commodity Hardware") was an earlier NoSQL effort. Data
consists of free-form "documents", with JSON structuring. Keys are internal
row identifiers. Couch has a form of support for ACID, but substituting
eventual consistency for strict consistency. Couch has specific tools to
handle resynchronization of separated databases (eg an mp3 player that
spends significant amounts of time "offline" but that is merged at intervals
with the master database). Documents in the database contain revision
information, intended to facilitate the merger of two documents from two
separated Couch instances. It is left up to the application to figure out
how to merge two documents that have been independently modified; sometimes
this is straightforward and other times it is impossible to do in a
meaningful way.
Mongo and Couch are considered to be "document" databases; Cassandra and
Hbase are considered to be "wide-column" databases. This often means that
many columns are allowed, with grouping of columns into "super-columns",
etc. It may also mean, however, that the data is organized on disk by
column, rather than by row. This allows faster searching of individual
columns (as a much larger number of column values will be in a single
block), but is slower for retrieving an entire record.
More on Hadoop
Hadoop (http://hadoop.apache.org/)
is the combination of the following:
- Hadoop kernel (for Hadoop-specific I/O and CPU scheduling)
- Hadoop Distributed File System, which is fault-tolerant
- MapReduce Engine
- upper layers like Hbase and Hive
The goal is for the system to have an awareness of where the data is
located. Operations on a given collection of data are scheduled, where
possible, on the same LAN as the data, avoiding backbone traffic. Hadoop
is generally "rack-aware", really meaning it knows when two nodes are on
the same high-speed switch.
Smaller-scale Hadoop setups have one master node (CPU) and multiple
"worker" nodes that are both DataNodes (repositories of data) and also
TaskTracker nodes. Data is replicated on multiple nodes; the default
degree of replication is 3.
Hadoop is designed to support a lot more reading than writing. Examples
might include Google's search index, Facebook pages or Amazon inventory.
In June 2012, Facebook had a 100 petabyte Hadoop installation, growing at
0.5 PB/day.
The Hadoop Filesystem is just that. Most ⟨key,value⟩ stores will be
implemented as Hadoop tables. Note, however, that any filesystem
is a ⟨key,value⟩ store where keys are filenames and values are file
contents.
MapReduce
MapReduce is a large-data paradigm developed by Google and based on a
common functional-language operation. The original overview paper by Dean
and Ghemawat is mapreduce-2004.pdf.
Not every Hadoop installation runs MapReduce, but this is the "native"
data-analysis tool. The MapReduce engine has one master JobTracker node,
which receives MapReduce tasks from clients and organizes them onto
TaskTracker nodes. TaskTracker nodes generally run a small number
(typically 4) of MapReduce tasks.
The idea behind MapReduce is that the programmer must supply two
components:
- an easily parallelizable map() operation that gets
run on every data unit
- a reduce() operation that combines the results of
the map() calls
The map() operation usually takes a ⟨key,value⟩ pair and returns a list
of related ⟨key,value⟩ pairs. A more specific description of the interface
is
map (in_key, in_value) -> list(out_key,
intermediate_value)
The out_keys in a single list need not be the same.
A reduce() operation typically takes a single out_key, and a list of all
(or some of) the intermediate_values associated with that out_key. The
result is a list of output values.
Here is one of google's basic examples of MapReduce. This example counts
the occurrence of each word in a set of documents.
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
Notice the map() operations generate long lists of words with a count of
1. The reduce() phase is applied to all the map() results with the
same key (that is, all map() results for the same word).
Notice also that the ⟨key,value⟩ input to map() is
⟨filename,file_contents⟩, quite different from the ⟨key,value⟩ pairs
generated as output.
Here's an example in which we assume that the map() operations actually
count all the individual words of each document:
doc1: here is a list of words
doc2: list of words with words in list and another list
doc3: here here here is words and words and another words
Map() outputs:
doc1: (here, 1) (is, 1) (a, 1) (list, 1) (of, 1) (words, 1)
doc2: (list, 3) (of, 1) (words, 2) (with, 1) (in, 1) (and, 1) (another, 1)
doc3: (here, 3) (is, 1) (words, 3) (and, 2) (another, 1)
Reduce inputs:
(here, 1) (here, 3)
(is, 1) (is, 1)
(a, 1)
(list, 1) (list, 3)
(of, 1) (of, 1)
(words, 1) (words, 2) (words, 3)
(with, 1)
(in, 1)
(and, 1) (and, 2)
(another, 1) (another, 1)
The reduce tasks then add up each word over each list, below. Note how
the MapReduce infrastructure must take the results returned by the map()s
and reorganize them by key (the shuffle) to feed into the
reduce() operations.
here 4
is 2
a 1
list 4
of 2
words 6
with 1
in 1
and 3
another 2
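The whole pipeline on the three documents above fits in a few lines of Python; map emits a ⟨word,1⟩ pair per occurrence, the shuffle groups pairs by word, and reduce sums the counts (function names are illustrative):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # emit (word, 1) for every word occurrence
    return [(word, 1) for word in contents.split()]

def shuffle(pairs):
    # group intermediate values by key, as the framework would
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets

def reduce_fn(word, counts):
    return word, sum(counts)

docs = {
    'doc1': 'here is a list of words',
    'doc2': 'list of words with words in list and another list',
    'doc3': 'here here here is words and words and another words',
}

intermediate = [pair for name, text in docs.items()
                for pair in map_fn(name, text)]
counts = dict(reduce_fn(w, vs) for w, vs in shuffle(intermediate).items())
# counts matches the totals above, eg counts['words'] == 6
```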
Search is another operation that lends itself to
MapReduce formulation. The map()s just output records containing the
value; the reduce() just consolidates the map output.
Reverse web-link graph: the map() searches source pages and
outputs a ⟨target,source⟩ pair each time a link to target is
found on page source. The reduce operation concatenates all
⟨target,source⟩ pairs with the same target into a pair
⟨target,list(source)⟩
A fair bit of machinery is needed to organize the map() outputs in a way
that can be fed into the reduce() inputs; this is sometimes called the shuffle
stage. This shuffle, though, is a generic, easily parallelizable
operation. Typically this phase uses extensive hashing on intermediate_key
values: each map() output pair ⟨ikey,ivalue⟩ is immediately put into
bucket hash(ikey) by the system. Then, in the reduce() phase,
⟨ikey,ivalue⟩ pairs with the same intermediate_key in one bucket are fed
to one reduce() task; of course, that bucket will contain all
the ⟨ikey,ivalue⟩ pairs with ikey = intermediate_key. The system keeps
track of the order of intermediate_key values in one bucket, for easier
later sorting.
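A minimal sketch of the hash-bucketing, with R = 4 reduce tasks chosen arbitrarily:

```python
import hashlib

R = 4  # number of reduce tasks / buckets

def bucket_of(ikey):
    # A stable hash, so bucket assignment is reproducible across runs
    # (Python's built-in hash() is salted per process).
    return int(hashlib.md5(ikey.encode()).hexdigest(), 16) % R

map_output = [("list", 3), ("of", 1), ("list", 1), ("words", 2)]
buckets = [[] for _ in range(R)]
for ikey, ivalue in map_output:
    buckets[bucket_of(ikey)].append((ikey, ivalue))

# Every pair with the same intermediate key lands in the same bucket,
# so one reduce() task sees all the ("list", n) pairs together.
print(buckets[bucket_of("list")])
```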
Hive
Hive (http://hive.apache.org/) is a
database application built on top of Hadoop. It supports ⟨key,value⟩
tables with indexes. It also supports complex data types (eg
large-file data types such as text files, subtables, and binary files).
Tables may contain partition columns (sometimes
"virtual" columns) that provide tags for partitioning the table among
multiple nodes; in the data-definition language this is indicated by a
PARTITIONED BY clause. Tables within one node may be organized into buckets,
eg by date or userid; these in turn are indicated by a CLUSTERED BY clause.
Here is an example table declaration from https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-WhatisHive:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '1'
    COLLECTION ITEMS TERMINATED BY '2'
    MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
Unlike in relational systems, a primitive array type is
supported; note the friends field above.
Arguably Hive's primary feature is that it supports an SQL-like query
language, HQL. It is not as general as SQL, but HQL queries are very
unlikely to run excessively slowly on huge data. HQL queries are
translated by Hive into MapReduce jobs.
HQL supports
- row selection using a WHERE clause
- column selection using a SELECT clause
- aggregation functions using a GROUP BY clause
- limited equijoins[!]
Query output always goes into an intermediate table, rather than to the
console. You can do this in SQL, too.
For joins, it is recommended to list the largest table last: Hive streams
the last table in the join and buffers the others.
Reservations problem
Not on the final!
Suppose we want to create a reservation system for trains (the traditional
mode of transportation for the Seat
Reservation Problem, which has an extensive literature). Here is
some information:
The train stops in Argenta, Brocton, Colp, Dongola, Elsah (4 links). (These
are actual Illinois towns, but almost certainly NOT on any kind of line.)
Suppose the following reservations are made:
seat 1 is reserved A-B
seat 2 is reserved B-C
seat 3 is reserved C-D
seat 4 is reserved D-E
Problem: none of seats 1-4 is now available for an end-to-end trip!
Note that there is always a
solution that involves no seat-changing, as long as no link is ever
overbooked: have passengers get on the train and find an empty seat. If
there is no seat, that link is overbooked. If they find a seat, they keep it
until they get off. The problem with this is that we would like to grant
reservations in the order made, rather than in the order of embarkation.
Some algorithms:
OPT (optimum): Essentially what was outlined above. We get all
reservation requests and find the arrangement that seats everyone. (A
variation may involve reservations made by groups, the "find a seat" method
may split up groups, which we might not want to allow.)
Worst:
Assign the first seat that has no reservations attached at all, as in the
example above.
First-Fit:
Assign the min seatnumber that is free for the desired interval.
Best-Fit:
Minimize the size of the empty space containing the seat. We need lots of
tie-breaking rules to make this precise.
With five seats, do we prefer _ _ x x _ or _ x x _ _? (x = reserved, _ =
vacant)
Worst-Fit:
maximize the empty space left open around a seat, on the theory that we're
leaving large blocks for later.
The problem case above is due to worst-fit.
Other algorithms used in practice:
Long-haul-only: Accept reservations for travelers who are traveling > 50%
of the route, say, first. A few days before travel, open up the reservation
system to other travelers.
Pooling:
have separate seat pools for short-haul and long-haul reservations. In
general, K pools where K <= #legs
Usual performance measure: worst-case ratio of # of passengers accommodated
using the algorithm to # of passengers that could be accommodated using OPT.
These algorithms are related to the memory-allocation problem, except that
we assume
- NO CANCELLATIONS (or few, at any rate)
- RESERVATIONS MUST BE CONTIGUOUS
Suppose there are only two legs on the schedule.
We can prove that first-fit is optimal: suppose some reservation is blocked.
Claim: it's because on one leg or the other, the train is full.
Take away the 2-leg (through) reservations. The remaining one-leg
reservations for leg 1 occupy an initial segment of the seats; similarly for
leg 2. Whichever leg had more reservations must have used ALL the remaining
seats.
Three legs, TWO SEATS
p1: leg 1
p2: leg 1,2
p3: leg 3
p4: leg 2,3
First-fit assigns:
p1: seat 1, leg 1
p2: seat 2, leg 1,2
p3: seat 1, leg 3
p4: no available seats
but best-fit assigns:
p1: seat 1, leg 1
p2: seat 2, leg 1,2
p3: seat 2, leg 3 (a tighter fit than seat 1)
p4: seat 1, leg 2,3
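The two heuristics on this example can be checked with a short simulation; first_fit, best_fit, and the tightness measure used to rank seats are my own sketch, not code from the text:

```python
# First-fit vs best-fit on the 3-leg, 2-seat example above.
# occupied[seat] is the set of legs already reserved for that seat.

requests = [("p1", [1]), ("p2", [1, 2]), ("p3", [3]), ("p4", [2, 3])]
NLEGS, NSEATS = 3, 2

def free(occupied, seat, legs):
    return not any(l in occupied[seat] for l in legs)

def first_fit(requests):
    occupied = {s: set() for s in range(1, NSEATS + 1)}
    result = {}
    for p, legs in requests:
        for s in range(1, NSEATS + 1):        # lowest free seat wins
            if free(occupied, s, legs):
                occupied[s].update(legs)
                result[p] = s
                break
        else:
            result[p] = None                   # blocked
    return result

def best_fit(requests):
    occupied = {s: set() for s in range(1, NSEATS + 1)}
    result = {}
    for p, legs in requests:
        # Among seats that fit, pick the one leaving the fewest free legs
        # after this reservation (the "tightest" fit).
        fits = [s for s in range(1, NSEATS + 1) if free(occupied, s, legs)]
        if fits:
            s = min(fits, key=lambda s: NLEGS - len(occupied[s]) - len(legs))
            occupied[s].update(legs)
            result[p] = s
        else:
            result[p] = None
    return result

print(first_fit(requests))  # p4 gets None: blocked
print(best_fit(requests))   # everyone is seated
```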
What if we just allow reservations for trips of length >= 2? This also
works.
Representing reservations:
Attempt 1:
⟨passenger, seat, leg, date⟩
create table reservations (
    passno integer primary key,   -- one passenger can't reserve two seats
    seat integer,
    leg integer,
    unique (seat, leg)            -- can't reserve the same seat twice
);
I left out DATE for simplicity.
Problem: queries for available seats are very difficult in pure SQL. A query
can only select fields from a subset of the records actually present; with
only the table above you cannot ask about vacancies (but see below).
To ask for seats reserved on leg 3:
select r.seat from reservations r where r.leg = 3;
To ask for seats reserved on legs 2 or 3:
select r.seat from reservations r where r.leg between 2 and 3;
But how to ask for free seats?
Attempt 2:
⟨passenger, seat, leg, date⟩ plus VACANCIES: ⟨seat, leg, date⟩, initially
full.
To ask for an open seat on leg 3 (again ignoring date, so just ⟨SEAT, LEG⟩):
select v.seat from VACANCIES v where v.leg = 3;
To ask for a seat open on both legs 2 and 3:
select v.seat from VACANCIES v where v.leg = 2
and exists (select * from VACANCIES v2 where v.seat = v2.seat and v2.leg = 3);
Attempt 1: table Reservations
Hard to find vacant seats on leg 3
Attempt 2: separate table of Vacancies
Easy to find vacant seats on leg 3, BUT tricky to guarantee consistency
Back to attempt 1:
create table reservations (
    passno integer,
    seat integer,
    leg integer,
    primary key (seat, leg)   -- can't reserve the same seat twice
);
create table seats (seat integer primary key);
create table legs (leg integer primary key);
Now, to find seats that are vacant on both legs 2 and 3 (MINUS is Oracle's
spelling; standard SQL calls it EXCEPT):
select seat from seats
minus
select r.seat from reservations r where r.leg between 2 and 3;
To implement the First-Fit algorithm:
select min(seat) from
  ( select seat from seats
    minus
    select r.seat from reservations r
    where r.leg between 2 and 3 );
First-fit: to reserve a seat for legs A-B:
select min(seat) from
  ( select seat from seats
    minus
    select r.seat from reservations r
    where r.leg between A and B );
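A quick way to check this query is to run it in SQLite, which spells MINUS as EXCEPT; the data below reproduces the 4-leg example where seats 1-4 each block one leg:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table reservations (
        passno integer, seat integer, leg integer,
        primary key (seat, leg));
    create table seats (seat integer primary key);
    create table legs  (leg  integer primary key);
""")
db.executemany("insert into seats values (?)", [(s,) for s in range(1, 6)])
db.executemany("insert into legs values (?)", [(l,) for l in range(1, 5)])
# Seat 1 reserved A-B (leg 1), seat 2 B-C (leg 2), etc.
db.executemany("insert into reservations values (?, ?, ?)",
               [(100 + s, s, s) for s in range(1, 5)])

# First-fit for legs 2-3: lowest seat not reserved on either leg.
row = db.execute("""
    select min(seat) from
      ( select seat from seats
        except
        select r.seat from reservations r
        where r.leg between 2 and 3 )
""").fetchone()
print(row[0])   # seat 1: it is only reserved on leg 1
```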
Best-fit: I couldn't do this in pure SQL. The idea: for each seat, get the
length of the open block of legs containing A-B (zero for some seats); then
order the seats by that size and consider only the blocks of minimum size.
This is almost expressible as a SQL query; the pieces follow.
AVAIL(s, lo, hi): seat s is available on legs lo through hi:
not exists (select * from reservations r
            where r.seat = s and r.leg between lo and hi)
For each seat, the biggest block of free legs at or below 3:
select s.seat, min(l.leg) from SEATS s, LEGS l
where l.leg <= 3
and not exists (select * from reservations r
                where r.seat = s.seat and r.leg between l.leg and 3)
group by s.seat;
select s.seat, min(lo.leg), max(hi.leg) from SEATS s, LEGS lo, LEGS hi
where lo.leg <= 3 and hi.leg >= 3
and not exists (select * from reservations r
                where r.seat = s.seat
                and r.leg between lo.leg and hi.leg)
group by s.seat;
select s.seat, min(lo.leg), max(hi.leg), max(hi.leg) - min(lo.leg) as gap
from SEATS s, LEGS lo, LEGS hi
where lo.leg <= 3 and hi.leg >= 3
and not exists (select * from reservations r
                where r.seat = s.seat
                and r.leg between lo.leg and hi.leg)
group by s.seat
order by gap;
Range query (here for legs 3-4):
select s.seat, min(lo.leg), max(hi.leg), max(hi.leg) - min(lo.leg) as gap
from SEATS s, LEGS lo, LEGS hi
where lo.leg <= 3 and hi.leg >= 4
and not exists (select * from reservations r
                where r.seat = s.seat
                and r.leg between lo.leg and hi.leg)
group by s.seat
order by gap;
Best fit:
1. Put the rows above into an array, stopping when the value of GAP exceeds
the value of GAP in the first row.
2. Apply tie-breaker rules to the entries in the array.
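The gap query can likewise be checked in SQLite; the sample reservations here (seats 1, 2, and 4 each blocked on one leg, seat 3 wide open) are made up for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table reservations (
        passno integer, seat integer, leg integer,
        primary key (seat, leg));
    create table seats (seat integer primary key);
    create table legs  (leg  integer primary key);
""")
db.executemany("insert into seats values (?)", [(s,) for s in range(1, 5)])
db.executemany("insert into legs values (?)", [(l,) for l in range(1, 5)])
db.executemany("insert into reservations values (?, ?, ?)",
               [(101, 1, 1), (102, 2, 2), (104, 4, 4)])

# For each seat free on leg 3: the free block of legs around leg 3 and
# its size; best-fit then takes the row with the smallest gap.
rows = db.execute("""
    select s.seat, min(lo.leg), max(hi.leg), max(hi.leg) - min(lo.leg) as gap
    from seats s, legs lo, legs hi
    where lo.leg <= 3 and hi.leg >= 3
      and not exists (select * from reservations r
                      where r.seat = s.seat
                        and r.leg between lo.leg and hi.leg)
    group by s.seat
    order by gap
""").fetchall()
print(rows)   # first row is seat 2, the tightest fit (gap 1)
```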