Beyond awesome | 越而胜己

Overall System Architecture

The SimpleDB database system is a minimal relational database system with the basic DBMS features. It contains two essential modules: the storage manager and the query processor.

The query processor is considered the “front end” of SimpleDB, and consists of the query parser, the query optimizer, and the query executor. The query parser is the direct interface between users and SimpleDB, and is in charge of parsing SQL queries into logical query plans. The query optimizer converts the logical query plan into an optimal physical query plan. The query executor then executes the physical plan.

The query plan is basically a tree of operators. SimpleDB implements the most common relational algebra operators, including Filter, Join, Group-By, and Project. It also supports data manipulation operators, including Insert and Delete. At the bottom of the query plan are the access methods, which is just SeqScan in the current SimpleDB.

The storage manager acts as the “back end” of SimpleDB. The storage manager provides an series of interfaces and abstractions for SimpleDB to access data tables stored on the disk. The top-level abstraction is the SeqScan access method, which serves as the source for the upper-level operators in a query plan.

At the core of the storage manager is the buffer pool manager. The main purpose of the buffer pool manager is to cache pages on the disk in memory. To allow concurrent transactions, the storage manager also contains a lock manager. The lock manager makes sure that no conflicts happen while allowing multiple transactions to potentially share access to the same page. The log manager maintains a log file that allows the database system to recover if the system crashes, and allows transactions to roll back when it aborts.

The buffer pool manager is the core in that it acts as the point-of-contact for transactions to use the lock manager and the log manager. When a transaction requests a page, the buffer pool manager informs the lock manager to let the transaction try to acquire a lock on that page; when a transaction begins, changes a page, commits, or aborts, the buffer pool manager informs the log manager to write a corresponding log entry. The log manager only operates independently when recovering from a crash.

Architecture
Overall Architecture of SimpleDB

Essential Components

Buffer Manager (Lab 1)

The buffer manager takes care of caching disk pages in memory. To understand how the buffer manager works, one needs to understand how SimpleDB tables are stored on disk.

Each table in SimpleDB is a file. SimpleDB has an interface for different implementations of the file, but the default implementation uses HeapFile s. Each HeapFile is a collection of HeapPages, and each HeapPage contains a collection of Tuple s. To access any tuple on disk, the buffer manager reads the page that the tuple is in from the disk, and caches it in the buffer pool. Because the buffer pool can't grow indefinitely, the buffer manager must decide when to flush or discard a page in order to admit new pages.

My implementation of the buffer pool uses a LRU (least recently used) eviction policy. I implemented a LRUCache class with $O (1)$ put/get operations, and used a LRUCache as the underlying data structure for the buffer pool. When the buffer pool needs to evict, it can simply query the LRUCache for the page to evict in constant time.

Operators (Lab 2)

The operators in SimpleDB are the standard, most common relational algebra operators: Filter, Join, Group-By (aggregates), and Project. SimpleDB also supports insertion and deletion of tuples via the Insert and Delete operators.

A query plan is basically operators organized in a tree. The operators all implement the OpIterator interface, and parent operators call next() or on their children to pass tuples up the query plan. The core method to implement in each operator class was the fetchNext() method, which returns the next tuple to return in next() . We will briefly discuss the implementation for each operator.

The Filter operator has a single child and is parametrized with a predicate. It reads in tuples from its child, and only return tuples to its parent that satisfy its predicate.

The Join operator has two children, and is parametrized by a join predicate. The Join operator in SimpleDB implements a nested loop join (with its left child child1 as the outer relation). It reads in tuples from both its children (and rewinds on the inner relation when it is depleted), and only returns to its parent joined tuples that satisfy the join predicate. In addition to the naive nested-loop join operator, SimpleDB also as a more efficient version of Join — the HashEquiJoin operator (which I did not implement, but came as provided code). HashEquiJoin uses a hash table on the inner (build) relation, and is much faster than nested-loop joins. However, as it name suggests, it only works for equi-joins because of how hash tables work.

The Group-By ($\gamma$) relational algebra operator is implemented as the Aggregate class, which uses the Aggregator classes at its core: IntegerAggregator and StringAggregator (since SimpleDB only supports integer and string fields). They work by first merging the tuples into their aggregation groups (each group would have the same aggregation key). While doing the merging, it also records some statistics about each group depending on the operator. Later when it returns tuples, it returns aggregation results for each group, by doing simple calculations with the pre-calculated statistics.

The Project operator is straightforward. It doesn't discard tuples from its child, but transforms them and only keeps the selected columns before passing them to its parent.

The Insert and Delete operators are rather special, because they only return a result of their execution. The important thing to note about these operators is that they are guaranteed to execute only once. This is implemented by maintaining a executed field, which is set to true after the first execution; later executions will not occur because the executed flag is set.

SeqScan is technically not an operator, but an access method, but it follows the same interface, and act as the descendant for all operators. It calls a DbFileIterator to read tuples from the relation, and feeds the tuples to its operator parent. In short, it servers the contacting point between the query executor and the buffer pool.

Lock Manager (Lab 3)

The lock manager in SimpleDB allows multiple transactions to happen concurrently. Its purpose is to make sure that pages in the buffer pool are accessed in the fashion such that no conflict exists.

Quite literally, the core of lock manager is implemented in the LockManager
class. The LockManager class maintains a mapping from PageId s to PageLock s. If any transaction is using a page, then its PageId will map to either a SharedLock or an ExclusiveLock . The SharedLock class contains a set of holding TransactionId s, while the ExclusiveLock class only contains one holding TransactionId . One or more pages that have read-only access to a page can share a SharedLock together, or a single transaction can have exclusive access to that page with an ExclusiveLock . The transactions are served on a first-come-first-serve basis. That is, if a later transaction tries to acquire an incompatible lock and gets denied, it will be blocked and must wait for the page to become available again. The LockManager class also takes care of deadlock detection with a wait-for graph. Whenever a transaction attempts to acquire a lock but is denied, a cycle detection is run on the wait-for graph, and if a cycle exists, the LockManager aborts the denied transaction.

The LockManager acts as a submodule of the BufferPool class, and the BufferPool is in charge of adding locks in LockManager when a transaction requests a page with some permission (read-only or read-write). One thing to note is that SimpleDB enforces a strict 2PL locking policy, i.e. all locks held by a transaction are only released when the transaction finishes.

Log Manager (Lab 4)

The log manager is mainly realized in the LogFile class, the majority of which came as provided code. The log manager allows us to switch from FORCE/NO-STEAL from lab 3 to NO-FORCE/STEAL. NO-FORCE means that pages no longer needs to be flushed when a transaction commits, and STEAL means that dirty pages can be evicted from the buffer pool. The NO-FORCE/STEAL policy is much more flexible, and is now possible because we now write to log whenever we make changes, and the log will ensure that no changes are lost even if the system crashes.

The log manager writes to disk transaction beginnings, any updates, checkpoints as well as committing and aborting. With the log, a transaction can abort after rolling back all its changes, and the rollback can be done easily by undoing all “update” entries of that transaction. Whenever an update is undone, the log manager appends an compensating log record (a CLR entry) that reverts the update.

Of course, the most important feature of the log is to allow the system to recover after a crash. After a crash, the log manager takes an ARIES-like approach to recover. The recovery is a two-phase procedure. The analysis-redo phase works from the latest checkpoint to the end of the log, and computes the loser transactions. The following undo phase then undoes all the changes done by loser transactions.

Query Optimizer (Lab 5)

SimpleDB adopts a Selinger-style query optimizer, and makes decisions based on table statistics. A TableStats class is in charge of keeping track of the statistics of a table. It uses histograms — IntHistogram s for integer fields and StringHistogram s — to keep track of how values in each column are distributed, and uses them to perform selectivity estimation. It also estimates the cost to scan the table based on the size of the table.

These statistics are mainly used for ordering joins, in the JoinOptimizer
class. The JoinOptimizer uses a dynamic programming approach, similar to Selinger . It first calculates best orders for two-way joins based on estimated table stats, and then uses the costs for these optimal joins to evaluate best orders for three-way joins, and so on and so forth. The JoinOptimizer is then used for query optimization. See my lab 5 writeup for more details on how the query optimizer works.

Discussion

Overall, I think SimpleDB is an okay DBMS. It has what a minimal RDBMS needs: concurrency control, crash recovery and query optimization. I am satisfied with how reliable the system is. Its storage manager is pretty well-designed and implemented: the lock manager offers robust concurrency control, and the log manager allows the system to recover whenever the system crashes. The query executor part of SimpleDB is also implemented cleanly.

In terms of performance, I am happy that it is able to handle complicated queries on medium-sized datasets (1% IMDB) in seconds, when using HashEquiJoin for equi-joins. But personally, I'm not quite satisfied with the query optimizer. Based on experimental results from lab 5 (running four-way join queries on the IMDB dataset), table statistics estimation doesn't seem to work very well for string fields, and could lead to poorly selected physical plans. If possible, I would try to tune table statistics estimation further so the query optimizer can make better decisions.

One big addition that would speed up query execution would be indexing. SimpleDB currently doesn't support indices, and adding indexed-scan and index-based joins would definitely allow complicated queries to go much faster.

Another thing that I would like to improve is the buffer pool's eviction policy. It currently uses LRU, but LRU is not optimal for a nested-loop-join-based disk access pattern. Hellerstein and Stonebraker
mentioned the LRU2 eviction policy which is more optimal for database systems, and I would like to try implementing that if time permitted.

Bibliography

[1] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107–113. DOI:https://doi.org/10.1145/1327452.1327492

[2] David J DeWitt, Jim Gray, and San Francisco. Parallel Database Systems: The Future of High Performance Database Processing. 26.

[3] Joseph M Hellerstein and Michael Stonebraker. Anatomy of a Database System.

[4] P Griffiths Selinger, M M Astrahan, D D Chamberlin, R A Lorie, and T G Price. Access Path Selection in a Relational Database Management System. 12.

[5] Michael Stonebraker and Joseph M Hellerstein. What Goes Around Comes Around. What Goes Around Comes Around, 40.

You’ve successfully subscribed to Skyward
Welcome back! You’ve successfully signed in.
Great! You’ve successfully signed up.
Your link has expired
Success! Check your email for magic link to sign-in.