FlockDB: Twitter's distributed, fault-tolerant graph database

amix · on April 12, 2010

By the looks of the code this isn't really a graph database. It's more a graph database emulation built on top of a SQL database. For distribution it uses sharding so it isn't really distributed either, at least not like Cassandra or other distributed databases.

I think Redis would perform a lot better than SQL for graph-like structures, since sets are a native datatype in Redis. And you can go A LONG way with just one Redis database (currently we are storing over 20 million keys in our Redis database and I know some that are storing 100 million keys on _one_ server). And with the new Redis VM coming up, I would guess that the scalability of Redis is going to be even better.

Other than this, neo4j seems very interesting and would probably also have been a better choice than using a relational database.
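
To make the set-based modelling mentioned above concrete, here is a minimal sketch of follower edges kept as sets. It uses plain Python sets as a stand-in for Redis sets (no Redis server or client is involved), and the names `follow` and `mutual_followers` are made up for illustration. In Redis the same operations would map roughly onto the `SADD` and `SINTER` commands.

```python
from collections import defaultdict

# Plain Python sets stand in for Redis sets; this is an illustration only.
following = defaultdict(set)   # user -> users they follow
followers = defaultdict(set)   # user -> users who follow them


def follow(src, dst):
    """Record the edge src -> dst in both directions (roughly two SADD calls)."""
    following[src].add(dst)
    followers[dst].add(src)


def mutual_followers(a, b):
    """Users who follow both a and b (roughly a SINTER of two follower sets)."""
    return followers[a] & followers[b]


follow("alice", "carol")
follow("bob", "carol")
follow("alice", "dave")
follow("bob", "dave")
follow("erin", "dave")

common = mutual_followers("carol", "dave")
```

Storing each edge twice (forward and backward) costs extra space but makes both "who do I follow" and "who follows me" a single set lookup.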

lsb · on April 12, 2010

You can go a long way with a SQL database too. I'm writing up an article about getting the two billion words of Wikipedia into an inverted index in a SQLite database entirely in memory, and that's another order of magnitude bigger.
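
For readers unfamiliar with the idea, an inverted index in an in-memory SQLite database can be sketched in a few lines. This is a toy illustration, not the article's approach: the table name `postings`, the sample documents, and the `lookup` helper are all assumptions.

```python
import sqlite3

# Three tiny sample documents (made up for illustration).
docs = {
    1: "graph database on sql",
    2: "redis sets for graph data",
    3: "sql inverted index",
}

conn = sqlite3.connect(":memory:")  # the whole database lives in RAM
conn.execute(
    "CREATE TABLE postings ("
    " term TEXT NOT NULL,"
    " doc_id INTEGER NOT NULL,"
    " PRIMARY KEY (term, doc_id))"
)

# One row per (term, document) pair; the primary key doubles as the index.
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT OR IGNORE INTO postings VALUES (?, ?)",
        [(term, doc_id) for term in text.split()],
    )


def lookup(term):
    """Return the ids of the documents containing term, in ascending order."""
    rows = conn.execute(
        "SELECT doc_id FROM postings WHERE term = ? ORDER BY doc_id", (term,)
    )
    return [row[0] for row in rows]
```

With this data, `lookup("graph")` returns `[1, 2]` and `lookup("sql")` returns `[1, 3]`.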

amix · on April 13, 2010

If you only know how to use a hammer, then everything else looks like a nail. In other words, re-implementing an inverted index in SQL is a waste of time when you can use tools like Sphinx and Lucene, which are highly optimized for inverted indexes and can easily handle 2 billion words. The same can be said about FlockDB: it's possible to emulate a graph database, but is the effort really worth it when there are tools like Redis and neo4j that seem to be optimized for graph-like structures?

al_james · on April 12, 2010

Hmmm... This looks to only store first-order relations efficiently; it seems that to traverse many nodes, you would need to repeatedly query the database (e.g. I can only get my friends, not the friends of my friends, etc.). This severely limits its use for most problem domains you would want to use a graph DB for. Still, possibly useful if you have to solve a problem that looks a lot like Twitter's.
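
The "repeatedly query the database" point can be sketched as follows. The store here is a hypothetical in-memory dict that only answers first-order lookups (`friends_of`); it is not FlockDB's API. A two-hop traversal then costs one query for the start node plus one per direct friend.

```python
# Hypothetical edge store that can only answer "who are X's friends?".
EDGES = {
    "me": {"ann", "bob"},
    "ann": {"cat", "bob"},
    "bob": {"dan"},
    "cat": set(),
    "dan": {"me"},
}

query_count = 0  # counts round trips to the (pretend) database


def friends_of(user):
    """First-order lookup: the only query the store supports."""
    global query_count
    query_count += 1
    return EDGES.get(user, set())


def friends_of_friends(user):
    """Second-order lookup built from repeated first-order queries."""
    direct = friends_of(user)
    second = set()
    for friend in direct:
        second |= friends_of(friend)
    # Exclude the user and people who are already direct friends.
    return second - direct - {user}


result = friends_of_friends("me")
```

Here `result` is `{"cat", "dan"}` after 3 queries; with thousands of direct friends the query count grows accordingly, which is the limitation described above.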

wheels · on April 12, 2010

Once you start getting out to second order it becomes a much more complicated (and interesting) problem -- one that I've been kicking around for a while.

Data-locality is the kicker in a distributed graph database; when doing traversals that cross multiple nodes you need to have a partitioning scheme that coordinates with your traversal algorithms so that you need the minimum number of machine-to-machine hops in a multi-level traversal. Getting that right is far more difficult than traditional database sharding.
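
A rough way to see why the partitioning scheme matters: count how many edges cross shard boundaries under two different node-to-shard assignments. Everything below (the edge list, both assignments, the helper name) is made up for illustration, and real systems also have to balance shard sizes, which this sketch ignores.

```python
# A small hypothetical graph as a list of (source, destination) edges.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("e", "f")]


def cross_shard_edges(edges, shard_of):
    """Count edges whose endpoints live on different shards.

    Each such edge is a potential machine-to-machine hop during traversal.
    """
    return sum(1 for src, dst in edges if shard_of[src] != shard_of[dst])


# Assignment 1: nodes alternate between two shards, ignoring graph structure.
naive = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

# Assignment 2: connected nodes are kept together on the same shard.
locality_aware = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}

naive_hops = cross_shard_edges(EDGES, naive)
local_hops = cross_shard_edges(EDGES, locality_aware)
```

For this graph the structure-blind assignment leaves 4 of the 5 edges crossing shards, while the locality-aware one leaves none.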

al_james · on April 12, 2010

Yeah sure.... it is much harder and involves minimizing the number of relations across shard boundaries. Not easy. However, to call a system that only allows depth 1 traversals a 'graph database' is slightly pushing the definition. To me, it's more a "key value database with relations between keys".

Everyone has different requirements though; if depth 1 and huge scale are what you need, FlockDB might be for you.

hendler · on April 12, 2010

I've yet to try out Gizzard. Wasn't expecting FlockDB to be released so soon.

Wondering if FlockDB is truly abstracted from MySQL/Cassandra. And also wondering how performance compares to Neo4j.

nkallen · on April 12, 2010

Note that it's in the process of being released: it's as yet unusable by outsiders. Honestly, I did not expect this to make Hacker News so soon. :(

I have not used Neo4J first hand. It has really cool features, but it is not a distributed database and has expensive memory usage. FlockDB is distributed, uses little memory, and has a very limited feature-set that is highly highly optimized for OLTP. It's not really an apples/apples comparison. Theoretically, Neo4J could be used as a back-end data-store in FlockDB.

emileifrem · on April 12, 2010

I agree that it's not apples/apples. From the first few minutes, I think the main strength of Neo4j is the rich ecosystem and functionality on top of it, and the fact that it stores an infinite-levels deep graph. In comparison FlockDB stores one level (e.g. user -> followers). The main strength of FlockDB seems to be that it has built-in distribution, which is something we're working on for Neo4j but it's not yet generally available.

All this of course based on just a quick glance, so I may come back all the wiser and revise my opinion later. :)

-EE [http://neo4j.org]

labria · on April 12, 2010

I didn't look too deep into the "distributed" features (no docs yet, the code suggests sharding), but the feature set looks a lot like Redis sets.

qhoxie · on April 12, 2010

Many (all?) of the distributed features (including the sharding) are part of gizzard, which it sits on top of.

http://github.com/twitter/gizzard

labria · on April 12, 2010

Makes even less sense to me, then.

simonw · on April 12, 2010

Redis is less than a year old. I doubt it was a serious contender when Twitter started building their own solution.

emileifrem · on April 12, 2010

Well, the data model seems very similar to Redis' from first glances [1], but FlockDB certainly seems to have completely different durability characteristics. So even if they started anew today they may end up building their own.

[1] Which would make FlockDB less a graph db and more a key-value store with social network semantics for the values.

-EE [http://neo4j.org]

labria · on April 12, 2010

Redis is a bit more than a year old. And it was tagged 1.0 last September. I doubt that the FlockDB project is much older than that.

riffraff · on April 12, 2010

I took a look at the code, but I'm not sure I understand one thing: why one class per file for case classes? Scala is much cooler than that :)

labria · on April 12, 2010

Scala again? Damn, my bet was on Clojure this time! =)

jseifer · on April 12, 2010

The contributors section names four people. If only four people wrote something like this, that's ridiculously impressive.

wheels · on April 12, 2010

It's only about 2000 lines of code. More like, "that it took four people to write this leaves an impression". ;-)

(In all seriousness, no dig on the authors, planning on poking through some of the source in the next bit.)

brown9-2 · on April 12, 2010

size of source code is not a good measurement of size of achievement

wheels · on April 12, 2010

I'm just going to post a link to my response to that comment the last time it came up:

http://news.ycombinator.com/item?id=1155026

"Lines of code doesn't say anything" is one of those flawed mantras that people keep repeating as an overreaction to the too often used assumption that it's the most important metric.

brown9-2 · on April 12, 2010

Not sure if I get your point here. My response to a flawed statement is flawed?

moe · on April 12, 2010

"Fault tolerant" and "Twitter" in the same sentence?