10

Tokyo Cabinet

February 28, 2009 tagged database, dbm, performance, python, software, tokyocabinet and tyrant. filed under software

Lately I’ve been researching some into the holy grail of keyed data storage – best combination of performance, scalability, efficiency and availability. There are many, many options available ranging from the Berkeley DB to BigTable implementations like Hypertable.

Last weekend I spent some time looking into using BDB in a BigTable fashion for managing schema-free tables. However my tests revealed many problems with a solution like that. For instance, BDB is really slow when writing random keys into databases of >100k row size. In the beginning of this week I had a chat with Jon Åslund regarding this idea and he introduced me to Tokyo Cabinet – a modern, battle-tested and extremely high-performance DBM.

Despite the somewhat uncool name, Tokyo Cabinet is a silent beast developed by Mikio Hirabayashi and used in the high-load environment of Japanese Facebook-equivalent Mixi. TC (short for Tokyo Cabinet) is written in C99 C, sporting a clean and modern API.

Mikio states TC improves on other DBMs in the following areas:

  • Improves space efficiency – smaller size of database file.
  • Improves time efficiency – faster processing speed.
  • Improves parallelism – higher performance in multi-thread environment.
  • Improves usability – simplified API.
  • Improves robustness – database file is not corrupted even under catastrophic situation.
  • Supports 64-bit architecture – enormous memory space and database file are available.

Python bindings

As you might know Python lies close to my heart, so despite the multitude of language bindings available for TC I will only talk about the Python bindings pytc by Tasuku Suenaga.

At the time writing this pytc is only available from PyPI and the main repository, thus you need to build this trivial module yourself.

$ svn co http://svn.coderepos.org/share/lang/python/pytc/trunk/ pytc
$ cd pytc
$ sudo python setup.py install

Unfortunately there is no documentation of pytc – you need to look through pytc.c manually, which might give you pleasure or pain, depending on if you have a life or not :P. Anyhow, Tasuku have created two examples giving you most clues needed. Good thing is the Python API is almost a clean port of the C API, meaning you can refer to the C library reference and apply some common sense.

Here’s a very simple program using a hash database, setting three pairs and asserting one of them are retrieved in a proper manner:

import pytc
db = pytc.HDB('casket.hdb', pytc.HDBOWRITER | pytc.HDBOCREAT)
db.put('potato', 'potatis')
db.put('carrot', 'morot')
db.put('banana', 'banan')
assert db.get('carrot') == 'morot'

As the data is mapped into shared memory and the library is thread-safe (locking per row or per database), multiple processes and/or threads of control can operate on the database at the same time. By default whole-database locking is performed, in which case there can be only one writer at any given moment, blocking any other readers or writers. Unless a writer is doing its thing, there can be multiple concurrent readers without any locking.

The performance of the library is very good – almost as fast as working directly with the C API – randomly reading 2 000 000 records took less than 15 seconds (~140 000 reads/sec) using a single thread of control. The database read from was a hash database with 10M records with a variable key length of 10-15 characters.

Update: I’m now working on the Python bindings under the name tc – currently available for Python 2.4, 2.5, 2.6 and 3.0 in MacPorts and PyPI. Source resides in the ‘Hub.


Server – Tokyo Tyrant

While Tokyo Cabinet is best thought of as an API to the native database routines, you’ll be happy to know that you can, in fact, run Tokyo Cabinet in stand alone mode with the help of Tokyo Tyrant - a high concurrency network interface to the underlying database. Backup, replication, and failover are all part of the package.

I ran a few benchmark tests of Tyrant, which proved that this solution indeed provide extremely high performance. This test was run on a MacPro with 8×2.8 GHz Intel Xeon processors with almost no other load, reserving 4 threads for the server.

Server started like this:

# Server with 4 threads, using a hash table with 10M buckets
$ ttserver -thnum 4 "*#bnum=1000000"

Test 1 – 1 000 000 records, 4 threads:
(Result: About 22 000 operations per second for any of reading, writing or deleting records)

tc-tyrant-4threads-1mrecs

Using the mget we can fetch several records in one call, drastically increasing performance:

tc-tyrant-mget-performance

The above numbers was aquired by running tcrmttest read against the local Tyrant server with 1M records:

$ tcrmttest read -tnum 4 -mul X localhost

Library benchmark

Mikio provides a pretty good benchmark document in which he has compared many DBMs with Tokyo Cabinet:

Write performance (Writing 1 000 000 records)
tc-bench-write

Read performance (Fetch all 1 000 000 records)
tc-bench-read


Conclusion

A very interesting project which state imo. are in the transition between “new and experimental” and “good old reliable software”. I’ve tried to contact Tasuku Suenaga (without any success) about getting his Python bindings into MacPorts. Tokyo Cabinet and Tyrant are perfect complements to my Smisk web services framework – maybe I’ll continue working on Tasuku Suenaga Python bindings and bring TC to Debian and MacPorts as well as Smisk.

Further reading

Ilya Grigorik has a pretty good introduction to Tokyo Cabinet covering parts of TC not mentioned in this article (for instance the array and table engine).


10 comments

  • Avatar
    Kalle Haglunds February 28, 2009

    Hmm, whole-db-locking per default? Sounds as if that could potentially completely kill performance in a production environment?

  • Avatar

    There is no production ready system which can handle concurrent writes to the same data set. However; Tokyo Tyrant provides a replication mechanism which can be used to allow concurrent writes and reads. From the TC documentation:

    Discrete database object can be operated in parallel entirely. For simultaneous operations of the same database object, read-write lock is used for exclusion control. That is, while a writing thread is operating the database, other reading threads and writing threads are blocked. However, while a reading thread is operating the database, reading threads are not blocked.

    Memcached have the same “problem” as Tyrant. Tyrant can speak both the tyrant protocol as well as the memcached protocol. It’s also interaction-able over HTTP REST.

  • Avatar
    Bharani March 2, 2009

    Is there a support for Table engine of tc in pytc?

  • Avatar

    Unfortunately not, but there will be support for table in a future release together with support for array/fixed length databases.

  • Avatar
    Kalle Haglunds March 2, 2009

    Yes but *database level locking* for writes, most modern dbs have table level locking for writes.

  • Avatar

    Oh, a database is a table in this case. It’s bare bones, fast and minimal ;)

  • Avatar
  • Avatar

    Got your name from Jonas Beckman, I have done a similar “trip” like you, but first played with CouchDB (which is still cool) and now I have implemented the binary socket protocol for TT in Squeak. Would be fun to exchange some ideas around TT/TC usage.

  • Avatar
    Cowtown Coder March 4

    “There is no production ready system which can handle concurrent writes to the same data set.”

    Not sure about that — doesn’t BDB-JE do exactly this? It certainly does not block whole BDB table for writes, and supposedly has rather good concurrency.
    It is a trade-off of course: speedwise TC is probably faster. But still.. such coarse-grained granularity seems problematic for many things, including building of distributed key-value stores.

    • Avatar

      I was referring to true concurrent (as in at the same time) writes. A system can of course “yield” in the middle of a write (and store some kind of state) while someone else gets to write a little. That’s how hard disk drives work afaik. There are so many smart people out there — I guess someone will or have already come up with a solution :P

Comment