FlockDB: Twitter's distributed, fault-tolerant graph database (github.com/twitter)
79 points by qhoxie on April 12, 2010 | 23 comments


By the looks of the code, this isn't really a graph database. It's more of a graph database emulation built on top of a SQL database. For distribution it uses sharding, so it isn't really distributed either, at least not like Cassandra or other distributed databases.

I think Redis would perform a lot better than SQL for graph-like structures, since sets are a native datatype in Redis. And you can go A LONG way with just one Redis database (currently we are storing over 20 million keys in our Redis database, and I know some that are storing 100 million keys on _one_ server). And with the new Redis VM coming up, I would guess that the scalability of Redis is going to be even better.
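To make the "sets are a native datatype" point concrete, here is a minimal sketch that uses plain Python sets as a stand-in for Redis sets. The key names (`following:<user>`, `followers:<user>`) and the helper functions are assumptions made up for illustration; against a real Redis server the corresponding commands would be SADD, SISMEMBER and SINTER.

```python
from collections import defaultdict

# In-memory stand-in for Redis sets: key -> set of members.
# The key naming scheme is hypothetical, chosen only for this example.
store = defaultdict(set)

def follow(src, dst):
    """Record that src follows dst (two SADD calls in Redis terms)."""
    store[f"following:{src}"].add(dst)
    store[f"followers:{dst}"].add(src)

def is_following(src, dst):
    """Membership test (SISMEMBER in Redis terms)."""
    return dst in store[f"following:{src}"]

def mutual_followers(a, b):
    """Users who follow both a and b (SINTER in Redis terms)."""
    return store[f"followers:{a}"] & store[f"followers:{b}"]

follow("alice", "carol")
follow("bob", "carol")
follow("bob", "dave")
follow("alice", "dave")
follow("erin", "dave")
```

Each relation is stored twice (forward and backward) so that both "who do I follow" and "who follows me" are single set lookups.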

Other than this, neo4j seems very interesting and would probably also have been a better choice than using a relational database.


You can go a long way with a SQL database too. I'm writing up an article about getting the two billion words of Wikipedia into an inverted index in a SQLite database entirely in memory, and that's another order of magnitude bigger.
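For readers unfamiliar with the idea, a toy version of an inverted index in an in-memory SQLite database might look like the sketch below. This is not the commenter's implementation; the `postings` table, the `index_document` and `search` helpers, and the sample documents are all assumptions for illustration, and nothing here says anything about how it behaves at Wikipedia scale.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings ("
    " word TEXT NOT NULL,"
    " doc_id INTEGER NOT NULL,"
    " PRIMARY KEY (word, doc_id))"
)

def index_document(doc_id, text):
    """Add one (word, doc_id) row per distinct word in the text."""
    words = set(text.lower().split())
    conn.executemany(
        "INSERT OR IGNORE INTO postings (word, doc_id) VALUES (?, ?)",
        [(w, doc_id) for w in words],
    )

def search(*words):
    """Return sorted ids of documents that contain every given word."""
    terms = sorted({w.lower() for w in words})
    placeholders = ",".join("?" * len(terms))
    rows = conn.execute(
        f"SELECT doc_id FROM postings WHERE word IN ({placeholders}) "
        "GROUP BY doc_id HAVING COUNT(DISTINCT word) = ? ORDER BY doc_id",
        (*terms, len(terms)),
    )
    return [row[0] for row in rows]

# Hypothetical sample documents.
index_document(1, "twitter released a graph database")
index_document(2, "redis stores sets natively")
index_document(3, "a graph database built on mysql")
```

The primary key on `(word, doc_id)` doubles as the lookup index, so a query for a word only touches that word's rows.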


If you only know how to use a hammer, then everything looks like a nail. In other words, re-implementing an inverted index in SQL is a waste of time when you can use tools like Sphinx and Lucene, which are highly optimized for inverted indexes and can easily handle 2 billion words. The same can be said about FlockDB: it's possible to emulate a graph database, but is the effort really worth it when there are tools like Redis and neo4j, which seem to be optimized for graph-like structures?


Hmmm... This looks to only store first-order relations efficiently; it seems that to traverse many nodes, you would need to repeatedly query the database (e.g. I can only get my friends, not the friends of my friends, etc.). This severely limits its use for most problem domains you would want a graph DB for. Still, possibly useful if you have to solve a problem that looks a lot like Twitter's.
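The cost of "repeatedly query the database" can be sketched as follows. This assumes a store whose only operation is a depth-1 lookup, modeled here by a hypothetical `neighbors` callable; it is not FlockDB's API. Getting friends of friends then takes one lookup for the first hop plus one per friend.

```python
def friends_of_friends(node, neighbors):
    """Return (second-degree contacts, number of depth-1 lookups issued).

    `neighbors` is any callable mapping a node to its direct contacts;
    it stands in for a single round trip to a depth-1-only store.
    """
    lookups = 1
    first = set(neighbors(node))
    second = set()
    for friend in first:
        second.update(neighbors(friend))  # one extra round trip per friend
        lookups += 1
    # Exclude the start node and anyone who is already a direct friend.
    return second - first - {node}, lookups

# Hypothetical sample graph as adjacency lists.
EDGES = {
    "me": ["ann", "bob"],
    "ann": ["bob", "cat"],
    "bob": ["dan"],
}
```

With N direct friends this issues N + 1 lookups for depth 2, and the count grows with the fan-out at every further level.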


Once you start getting out to second order it becomes a much more complicated (and interesting) problem -- one that I've been kicking around for a while.

Data-locality is the kicker in a distributed graph database; when doing traversals that cross multiple nodes you need to have a partitioning scheme that coordinates with your traversal algorithms so that you need the minimum number of machine-to-machine hops in a multi-level traversal. Getting that right is far more difficult than traditional database sharding.
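A small sketch of why the partitioning scheme matters: count how many edges cross shard boundaries under two different assignments of nodes to shards. The graph, the two shard functions and the `cut_edges` helper are made up for illustration; real partitioners are far more involved.

```python
# Two triangles (0-1-2 and 3-4-5) joined by the single edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

def cut_edges(edges, shard_of):
    """Number of edges whose endpoints live on different shards."""
    return sum(1 for u, v in edges if shard_of(u) != shard_of(v))

def modulo_shard(node):
    """Locality-blind placement, as with hashing on the node id."""
    return node % 2

def cluster_shard(node):
    """Placement that keeps each triangle on a single shard."""
    return 0 if node < 3 else 1
```

Every cut edge is a potential machine-to-machine hop during a traversal, so the second placement needs far fewer of them on this graph than the first.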


Yeah sure... it is much harder and involves minimizing the number of relations across shard boundaries. Not easy. However, to call a system that only allows depth 1 traversals a 'graph database' is slightly pushing the definition. To me, it's more a "key value database with relations between keys".

Everyone has different requirements though. If depth 1 and huge scale are what you need, FlockDB might be for you.


I've yet to try out Gizzard. Wasn't expecting FlockDB to be released so soon.

Wondering if FlockDB is truly abstracted from MySQL/Cassandra. And also wondering how performance compares to Neo4j.


Note that it's in the process of being released: it's as yet unusable by outsiders. Honestly, I did not expect this to make Hacker News so soon. :(

I have not used Neo4J first hand. It has really cool features, but it is not a distributed database and has expensive memory usage. FlockDB is distributed, uses little memory, and has a very limited feature-set that is highly highly optimized for OLTP. It's not really an apples/apples comparison. Theoretically, Neo4J could be used as a back-end data-store in FlockDB.


I agree that it's not apples/apples. From the first few minutes, I think the main strength of Neo4j is the rich ecosystem and functionality on top of it, and the fact that it stores an infinite-levels deep graph. In comparison FlockDB stores one level (e.g. user -> followers). The main strength of FlockDB seems to be that it has built-in distribution, which is something we're working on for Neo4j but it's not yet generally available.

All this of course based on just a quick glance, so I may come back all the wiser and revise my opinion later. :)

-EE [http://neo4j.org]


I didn't look too deep into the "distributed" features (no docs yet, the code suggests sharding), but the feature set looks a lot like Redis sets.


Many (all?) of the distributed features (including the sharding) are part of gizzard, which it sits on top of.

http://github.com/twitter/gizzard


Makes even less sense to me, then.


Redis is less than a year old. I doubt it was a serious contender when Twitter started building their own solution.


Well, the data model seems very similar to Redis' from first glances [1], but FlockDB certainly seems to have completely different durability characteristics. So even if they started anew today they may end up building their own.

[1] Which would make FlockDB less a graph db and more a key-value store with social network semantics for the values.

-EE [http://neo4j.org]


Redis is a bit more than a year old, and it was tagged 1.0 last September. I doubt that the FlockDB project is much older than that.


I took a look at the code, but I'm not sure I understand one thing: why one class per file for case classes? Scala is much cooler than that :)


Scala again? Damn, my bet was on Clojure this time! =)


The contributors section names four people. If only four people wrote something like this, that's ridiculously impressive.


It's only about 2000 lines of code. More like, "that it took four people to write this leaves an impression". ;-)

(In all seriousness, no dig on the authors, planning on poking through some of the source in the next bit.)


Size of source code is not a good measurement of size of achievement.


I'm just going to post a link to my response to that comment the last time it came up:

http://news.ycombinator.com/item?id=1155026

"Lines of code doesn't say anything" is one of those flawed mantras that people keep repeating as an overreaction to the too-often-used assumption that it's the most important metric.


Not sure if I get your point here. My response to a flawed statement is flawed?


"Fault tolerant" and "Twitter" in the same sentence?



