-
Website
http://randomfoo.net/ -
Original page
http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores -
Subscribe
All Comments -
Community
-
Top Commenters
-
lhl
97 comments · 6 points
-
jacobian
2 comments · 3 points
-
Rod Begbie
1 comment · 6 points
-
jmason
6 comments · 1 points
-
Jason Dusek
1 comment · 1 points
-
-
Popular Threads
Some issues with your system:
- what happens if a node fails? Without using Tokyo Tyrant's master-master replication you are pretty doomed
- what happens if you need to scale beyond 150M keys?
As I mentioned given the tradeoff in write i/o, snapshotting was acceptable for the client, so not doom, but more like minor inconvenience w/ acceptable data loss.
Updated: For scaling, lookups for collisions are O(log n), which seems acceptable while nodes are added and data redistributed.
http://neo4j.org
http://github.com/andreasronge/neo4j/tree/master (Ruby bindings with REST support)
(Disclaimer: I'm involved.)
-EE
I may do a clean-room rewrite and add in the dynamic expansion features and all that, but based on the ridiculousness of my near-term schedule... that probably wouldn't be anytime soon.
Were you looking for any backup capability? can you snapshot the state of the Tokyo Cabinet store to take a backup of that? or are you just relying on doing that via EBS?
(also: S3 as a k-v store: slooooow)
ie, for people that couldn't afford the data loss, I'm assuming that they can run M-M (w/ ulog'ing on another EBS volume if disk I/O is write-limited) and create a cleanup daemon that will check the rts's and delete expired ulogs? Based on my understanding, anything older than the rts timestamp on the corresponding master could be safely dispensed of? I didn't really test and I couldn't find that in the documentation so I punted. But if there are any Tokyo Cabinet experts reading this (or people that have tested) it'd be great to hear.
Also: agrreeeeed. :)
I know a lot more about non-rdbms than I did 20 minutes ago.
cheers Gavin
WRT voldemort. You are correct that all of these systems are very alpha. However, you should not see an decrease in performance based on the number of nodes. We are using a number of voldemort clusters at LinkedIn and we have not seen this problem, it sounds like a bug. Could you send me a little more information about your test setup so we can try to reproduce?
Thanks!
interestingly enough, i came to the exact same conclusion - Tokyo Cabinet/Tyrant with custom routing to multiple nodes is the only available solution (that doesn't cost an arm and a leg) to be fast enough and rock-solid.
Redis is surely a beta product, there are people using it in production but still we are entering in few days the feature-freeze stage now that Redis-git includes non-blocking replication. After we enter the feature freeze stage Redis will be stress-tested for weeks, then 1.0.0-rc1 will be released. My goal is to provide a rock solid product to the market.
So my hint is: handle with care since it's young code, but we are moving very fast, and feature-freeze stage is near. Also to make people safer Redis 1.0 will include a tool to dump a Redis DB into SQL format.
Redis apart, 1600 inserts / second are very poor performances. I think Tokyo cabinet is ways faster, probably it's the networking layer that is slow? Even MySQL is capable of 1600 inserts/second so if you really care about stability, replication, and things like this and you can live without very fast performances a table with an unique key ID and a blob value can really be a good alternative, especially in contexts where all you need is a plain key-value DB like Tokyo.
This is why Redis is stressing a lot on the data structures bit, that are things that are hard to model otherwise.
Redis looks cool so it'll definitely be something I'll keep an eye out in the coming months. The data structure approach is interesting...
You're correct that the insert/s numbers are much lower than the typically published numbers. Part of it is that it is going over the network, another part is that the items sizes are much bigger than those typically used in the benchmarks published. And it's EC2, so the I/O is crap. You're right that MySQL is the baseline there - I think lots of people don't know how fast it can be w/ simple queries -- although it tends to like lots of spindles. Lots and lots of them.
Btw thank you for this article, I understand your findings can be complete or always accurate, but to find non biased data on this stuff is really hard. I hope many other guys in the field will try different key-value stores under real world load and publish their findings. This is the only way all this projects can mature faster, start to be more reliable, and understand what the real user feeling is.
"Partionable", and "distributed" are also tall claims for most of them. I looked at redis too and can't understand where the distributed part comes in.
"Based on the maturity of projects out there, you could write your own in less than a day. It’ll perform as well and at least when it breaks, you’ll be more fond of it. Alternatively, you could go on the conference circuit and talk about how awesome your half-baked distributed keystore is"
Completely agree. At the end of the day, its not rocket science to write your own memory hash-map and have a thread write backups to a disk file or just embed BDB and be done with it. And you can tune it to do exactly what you need for your own domain, including managing relationships if necessary.
thanks for writing this up. Can you elaborate on the CouchDB vaporware bit, though?
Cheers
Jan
--
I also couldn't really wrap my head about the benefit of not having indexes but having to recalculate a view anytime the data changed, but I'd say mostly that at the time (and based on the Q&A at Bob Ippolito's talk maybe still) that CouchDB fanboys and developers were all over the Internet taking up oxygen about Couch while like I mentioned, what I assumed were core components didn't exist workably, much less being suitable for anything but the most toy test projects.
Next time, when you have no clue how a system works it would be best to refrain from talking trash about it and revealing the depth of your ignorance. CouchDB is not the greatest thing since sliced bread, is not a key-value datastore (which makes me wonder why it was even in your list other than to justify your petty little rant), and has a ways to go to meet some of its design goals, but within its niche it is a rather interesting tool that people should pay attention to.
As for trash talk, I have to say that you've been engaging in a fair amount of it. I'm posting my experiences (and I don't claim it to be anything more than that). It's not rocket science, but it's real data w/ real world usage in an area where there's significantly more smoke than fire (or published results). So, what's your skin in the game, and what's your contribution?
Cheers!
What problems did you run into? We recently fleshed out the docs at http://incubator.apache.org/cassandra/; we'd appreciate feedback as to what needs to be added.
If you'd like to see what I ran into, you could spin up an EC2 instance running a Rightscale Debian or Alestic Ubuntu instance and make sure that a user is able to get a blank system up and running. I was able to compile Cassandra, but the Thrift bits gave me some trouble. Once that was supposedly all up and running, I couldn't actually talk to the Cassandra server to test, so my assumption was that I'd missed something in setup.
I got distracted by some other tests at that point and ended up never pushing further having found a better solution. Also, I'm not sure if your intentions are for widespread production use at this point, but if so, I think that I'm like most sane people in getting totally skeeved out by running trunk checkouts in prod. Some packages/releases would probably also be really helpful in that regard.
We're about to turn the corner from mostly-developer-focused to trying to get something that works out of the box for most people. Getting a release out is part of that. Thrift is a bitch and there's unfortunately not much we can do about that, but maybe providing a vmware image with sample single-node and 3-node configurations (for instance) would help there.
Of course then there's the whole RPM side of things, not to mention things like Gentoo or even (shudder) Windows. :)
It'd be nice to get the s/n ratio up a bit for people that actually need to run something into production (I mean sure it's the Interweb, but the amount of fanboy/hater hot air has been pretty insufferable in this area).
http://www.bycast.com/
I used to work there, so I know the product intimately. But I haven't compared it to the open source stuff out there. I doubt any of the free solutions are the type of thing you'd run a hospital on (due to support, documentation, etc.)
Also, my understanding is that HBase mostly uses HDFS (the distributed file system) as opposed to Hadoop.
Good writeup.
Just to add to the list of 'what about X' posts : You mention MySQL, but not MySQL Cluster.
(I am a MySQL Cluster developer btw.)
MySQL Cluster is at heart a distributed key-value store using hash based partitioning.
It supports in-memory or disk storage of key-value pairs as primary key and attributes.
Additionally it supports :
Multi kv pair transactional reads/updates
Synchronous replication of updates within a cluster
Disk-persistence of in-memory data via Redo and checkpointing.
Automatic node failure and recovery handling
Asynchronous replication of updates to other clusters/MySQL databases, including Master-Master with conflict detection/resolution
Online addition of storage nodes and data repartitioning (from version 7.0)
Secondary indices on data (unique, ordered).
SQL access supporting MySQL SQL syntax
Access from all MySQL supported connectors (JDBC, PHP, Perl, etc..)
Latency+throughput optimised API for remote clients
Online snapshot backup, optionally compressed
e.t.c...
It is open source, licensed through GPL, with support available if required.
I suspect that MySQL Cluster could meet or beat the latencies and throughputs of the other systems discussed here, especially when accessed via a native API rather than through MySQLD. Internally it uses a message-passing state machine architecture (similar to the CSP style of Erlang) which gives really nice properties w.r.t. latency, throughput and system efficiency.
Perhaps because MySQL Cluster is associated with MySQL it appears to be 'relational' and therefore does not get included in open-source kv store comparisons?
Hope this doesn't sound too much like an advert :),
Frazer
Hopefully someone gives MySQL Cluster a spin, would love to see how it compares.
Yes, MySQL Cluster is the name we give a system of MySQL servers connected to an Ndb Cluster.
I think there's some confusion with the definition of disk-persistence and durability.
MySQL Cluster has always had disk-persistence. All changes are redo-logged to disk and checkpoints to disk are used to allow the Redo log to be trimmed. Checkpointing has a few percent impact on achievable throughput - the disk write bandwidth used can be traded off against checkpoint duration and hence redo log size. The redo log is not fsynced at every transaction commit, but periodically - usually every 2s, and down to every 100millis. This tradeoff allows high throughput on Cluster's internal 2PC.
This window means that committed transactions are not immediately disk-durable, but when running with 2 or more replicas, all data is synchronously replicated at commit time, so committed transactions are machine-failure durable, and become disk durable (on all replicas) within ~2s. This is a three-way trade off between tolerance to total cluster failure (requiring disk durability), tolerance to machine failure (requiring machine-failure durability) and throughput (requiring control of fsyncs/s).
Prior to MySQL 5.1, all data was held (and had to fit) in memory.
From MySQL 5.1, non indexed data (i.e. the values in a kvp) can be stored on disk. This means that when they are read/written they are fetched from disk into an in-memory LRU cache in the same way as most databases. This allows data sets larger than the memory size to be handled by a single cluster node, at the cost of some performance. Persistence/Durability is the same, with Redo log flushed periodically etc.
Over time we will add support for disk-storage of indexed data (keys in a kvp), disk-durable transactions etc.
I take your point about complexity. Getting a system that has 'just enough' complexity to meet your needs is always hard. I think MySQL Cluster could suit some folks but it's not the simplest system out there.
Frazer
http://www.persvr.org/
http://groups.google.com/group/project-voldemor...
Rebalancing, lI believe, is the next feature to be released.
Both the partition-rebalance2 and protobuf branches look promising, so we'll just have to see.
Personally, not knowing Bob at all, I found his presentation to be useful as a good overview for people that haven't been playing around with the various packages (and it jibed well enough with my own experiences) - I think he was pretty up front about where he was approaching it from (as someone who needed a solution that worked and his experiences - not as any domain expert, whatever that means). Most of the data is going to be out of date anyway since projects have been moving pretty quickly. And the plain fact of the matter is that his presentation has gotten attention precisely because there's so little published out there. In that respect, I think that it's a pretty big contribution to the community and I wish there was *more* of that out there, not less.
Anyway, if want to correct the errata in the presentation, why not just drop a line and it'd get fixed? If it just offends your sensibilities that "anyone" can go around, test some stuff and talk about it... well, that's err, usually how that works. At least he's put his real name to it (that's what I'm a firm believer in).
Anyway, since we've all said our pieces, I'd like to consider this conversation closed unless Bob jumps in. Life's too short.
cheers,
Terry
For the idle socket hanging issue, do you think it's an issue with pytyrant or tyrant itself?
I submitted a patch to pytyrant that could potentially be related to it, basically the client hangs when the socket is closed (which could happen on idle connections.)
http://code.google.com/p/pytyrant/issues/detail...
See if it fixes your issue?
=wil
Rough comparison of MySQL performance against Redis performance
http://colinhowe.wordpress.com/2009/04/27/redis...
Probably similar numbers for other KVS... but yet to find out.
Will be looking at Tokyo Cabinet later and adding in a similar test :)
http://code.google.com/p/schemafree/
It uses Mysql as storage, has key based distribution, versioning, being able to make incremental changes to lists and other features. It has built-in integration with Memcached.
From what I have just learned seems I have been working on kind of Voldemort clone. I mean, working in isolation I've come to very similiar concept/architecture however still different. Nice :)
I know the pain of having to build my own fundamental library as what existed is just not quite what I needed.
On the topic of key/value stores you missed one which is a little more obscure but I like it a lot. SkipDB from the author of IO.