We'll have to agree to disagree. I think the visualization you linked to is much harder to read, because the visual weight accorded to each edge is a non-linear (and somewhat arbitrary) function of the 'true' weight. It also doesn't scale well with number of vertices.
Whereas with the chord diagram, your eye is naturally drawn to the big arrows, and you can easily follow them. It's also bidirectional in a more straightforward way.
Data wrangling. So I wrote my own "DataFrame" -- we have an official one coming to Mathematica 10, too.
Also, binning. There is a nice theory for multidimensional binning and aggregation [that I haven't seen anyone describe explicitly so far]. So I wrote primitives. They play nicely with plotting, statistics, etc. That'll also be in Mathematica 10.
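For illustration, here is a rough sketch in Python of what a multidimensional binning/aggregation primitive might look like. The function names and the regular-bin design are hypothetical, not the actual Mathematica 10 API:

```python
from collections import defaultdict

def bin_index(value, lo, width):
    """Map a value to a regular bin index."""
    return int((value - lo) // width)

def multidim_histogram(rows, specs):
    """Aggregate rows into a sparse n-dimensional histogram.

    rows  -- iterable of tuples, one coordinate per dimension
    specs -- per-dimension (lo, width) bin specifications
    """
    counts = defaultdict(int)
    for row in rows:
        key = tuple(bin_index(v, lo, w) for v, (lo, w) in zip(row, specs))
        counts[key] += 1
    return dict(counts)

# Example: bin (age, friend_count) pairs into 10-year x 100-friend cells.
data = [(23, 150), (27, 180), (34, 90), (25, 420)]
hist = multidim_histogram(data, specs=[(0, 10), (0, 100)])
```

A sparse dictionary keyed by bin-index tuples plays nicely with plotting and statistics because marginals are just sums over a subset of the key.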
1. DataFrames themselves? Well, I think they'll get interesting when they can 'know' about high-level entities like cities, countries, zip codes, ip addresses, etc. Basically, everything that Alpha knows and can compute about, we want Mathematica to know and compute with.
2. I used Go because I am very productive in Go and like a lot of things about it. Goroutines are neat. Java is fine, it's just very boilerplatey, and I'm not practiced enough at it to get past that. And I don't see why we can't develop a GoLink as well.
3. Probably not the whole stack, at least in the beginning. But we'll get there. We want to make it really easy to spider websites and so on.
How long did the whole piece take to put together, and what's the rough break-down of time spent on each component (data wrangling, finding useful sorts, visualizations, write-up)? Thanks!
Full-time, around 6 weeks. The breakdown is hard to say.
I wasted a lot of time trying to do things the "traditional" way by loading into SQL, querying, etc, but it was actually much faster to process things in memory (I have a 16 gig machine). Intensive stuff was parallelized in Go and used ordinary filesystem with directory prefix tries for performance.
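The directory-prefix-trie idea can be sketched like this (a hypothetical Python version; the actual implementation was in Go): shard records into nested directories keyed by a hash prefix, so no single directory accumulates enough entries to slow down the filesystem.

```python
import os
import hashlib

def shard_path(root, key, depth=2, width=2):
    """Return a nested directory path for `key`, using the first
    `depth` chunks of `width` hex characters of its SHA-1 hash as
    the trie path. Keeps any one directory from holding too many files."""
    h = hashlib.sha1(key.encode()).hexdigest()
    parts = [h[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, key)

# e.g. shard_path("data", "user_12345") -> "data/xx/yy/user_12345",
# where xx/yy depends on the hash of the key.
p = shard_path("data", "user_12345")
```

With depth 2 and width 2 you get 256 x 256 buckets, which is plenty for millions of files.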
Writeup was mostly SW. He's worked on it maybe an afternoon a week for the last month.
I really enjoy visualizations and can iterate extremely fast (e.g. ChordPlot took half an hour). Don't know why M is not the de facto standard for dataviz people. Tweaking takes a long time, though, and the design team iterated with me on getting things looking really nice.
All in all, most of my time was spent building tools to easily create multidimensional histograms. The nice thing is that those tools are clearly useful enough we'll integrate them into Mathematica, so the cost is somewhat amortized.
NLP took a few weeks of Etienne's time... once again, amortized. Most of that is wrangling, really, and building tools to understand the deficiencies of your training set. Naive Bayes works surprisingly well, the magic is in the tooling and "human intelligence" you iterate with.
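A minimal sketch of the kind of Naive Bayes text classifier being described (illustrative Python only; the actual tooling and features were of course more elaborate):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            score = math.log(self.class_counts[label])
            for w in doc.split():
                # Laplace smoothing so unseen words don't zero out a class.
                score += math.log((self.word_counts[label][w] + 1) /
                                  (total + len(self.vocab)))
            return score
        return max(self.class_counts, key=log_score)

# Toy training set with made-up topic labels.
nb = NaiveBayes().fit(
    ["great trip to paris", "loved the beach vacation",
     "new phone released", "chip benchmarks out"],
    ["travel", "travel", "tech", "tech"])
```

The model itself is a few lines; as the comment says, the real work is in inspecting misclassifications and iterating on the training set.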
1. I noticed the interactive slider, embedded into the webpage. That's not what vanilla Mma 9 can do. Is there a simple way we can do the same without the CDF plugin (e.g. a package I'm not aware of) or is this future functionality?
2. The graph with the migration data looks nice. Mma 9 can't do an edge layout like this (curved edges) by default. Is this again custom code (custom Graphics or custom EdgeRenderingFunction) or is it future functionality?
3. There's the part with the frequency of various graph motifs (the number of edges, triangles, and other weird shapes in graphs). How did you count these? Was it done using Mma? Some of these motifs are easy to count (there are simple expressions in terms of the adjacency matrix), but some others, like the (1-2, 2-3, 1-3, 3-4) subgraph, are not so easy.
2. Custom Graphics. We don't (yet) support doing anything interesting with graph weights, which this relies heavily on. I think this is a good candidate for M10 -- name would probably be ChordPlot. I came up with a "GraphForm" construct that allows you to patch various graph properties into various visual parameters (size, color, edge weight, etc). That turns out to be quite useful.
3. Tally[list, IsomorphicGraphQ]. Isn't that cool?
4. Awesome! Nice code. We plan to create ImageCloud and WordCloud functions for M10. WordCloud will be specialized for representing word frequencies and so on. ImageCloud will be the general case: accept a list of images [potentially with transparency], and then find a nice layout given desired sizes. So much cool dataviz will be possible with this! Like country flags...
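The Tally[list, IsomorphicGraphQ] trick above can be sketched in plain Python: reduce each small graph to a canonical form (brute force over vertex relabelings, so only sensible for tiny motifs) and tally the canonical forms. This is an illustration of the idea, not the actual Mathematica mechanism:

```python
from itertools import permutations
from collections import Counter

def canonical(edges, n):
    """Canonical form of a small undirected graph on vertices 0..n-1:
    the lexicographically smallest edge list over all vertex relabelings."""
    best = None
    for perm in permutations(range(n)):
        key = tuple(sorted(tuple(sorted((perm[a], perm[b]))) for a, b in edges))
        if best is None or key < best:
            best = key
    return best

def tally_isomorphic(graphs):
    """Count graphs up to isomorphism -- the analogue of
    Tally[list, IsomorphicGraphQ] in Mathematica."""
    return Counter(canonical(edges, n) for edges, n in graphs)

# Three 3-vertex motifs: a path, the same path relabeled, and a triangle.
motifs = [([(0, 1), (1, 2)], 3),
          ([(0, 2), (2, 1)], 3),
          ([(0, 1), (1, 2), (0, 2)], 3)]
counts = tally_isomorphic(motifs)
```

The two paths collapse onto one canonical form, so the tally comes out as two distinct motifs: the path (twice) and the triangle (once).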
How would you effectively compete with distributed computing frameworks such as Pregel, MapReduce, and Dremel, when Mathematica is primarily used as a desktop application for in-RAM datasets? I know that Mathematica supports various parallelism options (such as multicore and grid), but frankly, gathering real information requires much deeper probing: far higher numbers of people, graph clustering/centrality on billions of nodes and edges, etc. Mathematica's core routines seem to provide multicore implementations, but distributed algorithms require you to implement your own code on top of Mathematica, meaning you'll never see the full performance/behavior of the finely tuned Mathematica implementation.
There is already HadoopLink. LibraryLink allows you to write C or C++ that gets dynamically linked into the kernel at runtime (no restart required), which gives you freedom to create your own threads and do your own thing [and crash the kernel]. A lot of kernel development happens that way now.
You can even synthesize C code from Mathematica (there is a symbolic subset of C in it already) and have Mathematica run the appropriate build process for you, so things can get pretty interesting with that alone.
Out-of-core processing of large datasets is already on the roadmap for Mathematica 10. We plan to have a domain-specific language to describe and work with external [or in-memory] datasets in an efficient way, translating as appropriate to the native database query languages. Our 'native' format will be HDF5.
Ultimately, though, I think we'll rely on code generation to compile Mathematica to LLVM or transpile it to Go, so that we can distribute chunks of computation out to a cluster using M as command-and-control.
The idea would be that you can create and test large processing pipelines from inside Mathematica and then distribute them across a cluster in an ad-hoc way, then visualize the progress, track errors, and analyze the results. Notebooks are really good for that kind of lightweight UI.
This isn't a new idea, but in a language as dynamic as Mathematica, I think it could be especially powerful. Of course, it is also tricky because type inference would be a big part of making this idea possible in a dynamically typed, symbolic language like Mathematica. But not impossible, I don't think. And functional languages already have demonstrated advantages in this type of situation -- take stream fusion in Haskell.
The out-of-core data processing is something that was sorely needed, and I'd been wishing for it for a long time. One of the big drawbacks of Mathematica for data processing was that it's only convenient to use if all the data can be read into memory.
What you're saying about distributing a computation on a cluster sounds very interesting. I used Mathematica for a hybrid Mathematica/C++ calculation (LibraryLink) where the complexity was handled by Mathematica and the (simple) heavy lifting by C++. I used the standard parallel tools to run it on a cluster, which means that communication was done through MathLink.
Another possible problem with my solution (LibraryLink, then Mma parallelization) was that it required a Mma license for as many kernels as I was running, even though most of them were only running the C++ code. But that's easy to fix on WRI's side.
When it comes to "questions I would ask [the data]": which interests are likely shared by friends? (Can you make a correlation table, or a graph out of it?)
And: is the correlation of one's interests with friends' interests the same as the correlation of one's interests with themselves?
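A correlation table of the kind asked about could start from raw co-occurrence counts across friendships. A minimal Python sketch (all names and data here are made up for illustration):

```python
from collections import Counter
from itertools import product

def shared_interest_table(friend_pairs, interests):
    """Count, for each (interest_a, interest_b) pair, how often one friend
    has interest_a while the other has interest_b -- the raw counts behind
    a 'which interests go together across friendships' table."""
    table = Counter()
    for u, v in friend_pairs:
        for a, b in product(interests.get(u, ()), interests.get(v, ())):
            table[tuple(sorted((a, b)))] += 1
    return table

interests = {"alice": {"hiking", "jazz"}, "bob": {"hiking"}, "carol": {"jazz"}}
table = shared_interest_table([("alice", "bob"), ("alice", "carol")], interests)
```

Normalizing these counts by the marginal frequency of each interest would turn them into something correlation-like, and the strong off-diagonal entries would make a natural graph.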
That's a good idea! Though if I remember correctly, interests aren't canonicalized, so it might be pretty messy. And I'm not sure if people fill them in non-ironically any more.
Well, it is always "people who clicked on FB that they like X", regardless of whether it is a shallow or deep interest; genuine, ironic, random, or "because my friends like it and I want to be as cool as them", etc.
If more fine grained, I wouldn't be surprised to see ties between seemingly exclusive things... e.g. in this http://meta.stackoverflow.com/questions/157976/map-of-all-se... (not interesting, but participation in particular Q&A sites) "christianity", "judaism" and "islam" are in the same category (as opposed to people apathetic to that topic).
That's not precisely correct. In the old days, we typed in our favorite bands/books/music. The attempted canonicalization created some amusing interpretations of song as band, book as author, band as book, etc.
It has also made it exceptionally annoying to get feed updates from the 900 bands, actors, movies, songs, etc. that I said I liked and now have to "unlike".
I think you could correlate people within a certain confidence, but because of the nature of the data, you would have to expect a surprisingly large dissimilarity of interests within a clique over a certain account age based on this noise. Not 90%, but higher than the real value.
Sure, I expect data to be extremely noisy, but I am thinking about looking at very robust things (e.g. this guy is in rock-like perhaps-indie music), not making it too fine-grained.
If you look under "Method", there are a bunch of different methods to use that I'm told correspond to various landmark papers in the field. If you know about community detection, you'll recognize which methods correspond to which papers, but if you don't, why do you care? At least, that's our philosophy for documentation, but I'm not sure I entirely agree with that philosophy.
Quite a bit -- I've been analyzing my own data for years now.
I did some network analysis stuff like this for Twitter long before it was built into Mathematica (rant: someone at Twitter needs to make a 'graph query' API call so that it doesn't take 3 hours to get a single graph of your own network).
I would link to the relevant posts at taliesinb.net, but Posterous is down 7 days ahead of schedule.
I think it can bring out real world insights. You just have to be very cautious and not leap to conclusions because they seem to tell an interesting story. Although it is somewhat disturbing how "gendered" the wall post topic distributions are.
I'll have to check. Certainly there can be no harm in releasing some of the more aggregated data (i.e. that was behind the plots). Perhaps tweet at me so I don't forget -- @taliesinb
I found this interesting. What I would love to have seen, however, is a probe into the dynamics. You did a nice abstraction over time, measuring property X as age varied. I would have loved to have seen the manner in which topics and ideas spread over your network.
For instance: If an event occurred in New York, say, how long would it have taken to spread to San Francisco? If there were no progression, topic times would center around the same time. This would indicate that people were getting their information from national, not local sources (e.g. the evening news), then talking about it on facebook. On the other hand, if a local topic was spread on facebook alone, we should see some sort of progression.
It's possible that this progression could take more interesting forms besides geolocation, but that might require a more extensive network. A simple experiment would work like this: A few thousand people who are not friends but have a similar interest (say an interest in Elizabeth Warren) post independently a video of her. This particular esoteric interest is unlikely to be valued a priori by their friends, but perhaps they are compelled to repost the information. What's the threshold of "esotericness" such that it won't "go viral?" Is there a way to predict virality as a function of how popular it is to begin with? Is there no actual progression across the network, but rather a small bump in topic expression, until it is picked up by larger media sources at which point the entire network is inundated with people reposting Elizabeth Warren recaps from HuffPo et al?
The reason this is interesting is that it sheds insight into the role of social networks: are we fundamentally disposed toward central sources like the NYTimes, or is facebook a fundamental sharing mechanism? That is, do I post on facebook just to have my views expressed, validated, and challenged, so that they might change the world over a few years? Or do I post on facebook to have my views propagate across the world much more quickly?
Finally, a question: How did you estimate the power law? I know how difficult it is to do this (e.g. not linear regression on a log-log scale). Did you compare the power law fit to other, similar distributions, like lognormal? Preferential attachment is indeed a beautiful theoretical result, because it implies the existence of power law degree distributions. Unfortunately, many networks are not as well represented by power laws as by alternative distributions, which casts doubt on the preferential attachment hypothesis as is. (Also, many sampling methods give rise to fictive power laws). That said, a fat tail can still be interesting.
You're right, that would be very interesting. The most obvious way we could have done this is by looking at the spread of our app itself as people started to use it. Unfortunately, we only started recording anonymized stats for the second release, so we've somewhat missed the boat there.
To do it with links and general "memes" would be technically much harder, because we'd have to periodically rescrape walls of all the donors to see time evolution. It was somewhat out of scope of the blog post, given all the more basic stuff we could do instead.
I'd be surprised if Facebook didn't already do an analysis of this when they "cracked down" on app virality a while back.
Bit.ly's Hilary Mason might have looked at this question too, and I'm sure it has been done to death with Twitter, though the demographic info is much sparser there.
2. This not being a scientific paper, we estimated it by drawing a line on the log-log CDF. Barring the noise that "deparadoxing" the friend's friend count distribution induces on the low end of the distribution, it was very linear over two decades. We didn't think the exact number was all that interesting, so we didn't spend any more effort than that. Facebook's anatomy paper probably has a very accurate number.
I'd heard about the fictive power law stuff. What makes me even more skeptical is that FB friends are probably a poor proxy for 'true' friends. You'd be better off looking at number of friends as defined by some cross-commenting threshold.
About the "fictive power law" thing: this is THE paper to read: http://arxiv.org/abs/0706.1062 (it's an easy read, explaining what the maximum likelihood method is, etc.). Despite what they say, fitting the log-log CDF usually gives pretty good results when done right (fitting the PDF does not).
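The maximum-likelihood estimator that paper describes is a one-liner for the continuous case. A Python sketch, checked against synthetic data drawn by inverse-transform sampling:

```python
import math
import random

def powerlaw_alpha_mle(xs, xmin):
    """Maximum-likelihood estimate of the exponent alpha for a
    continuous power law p(x) ~ x^-alpha over x >= xmin."""
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    return 1 + n / sum(math.log(x / xmin) for x in tail)

# Synthetic check: draw from a known power law via the inverse CDF,
# x = xmin * (1 - u)^(-1/(alpha - 1)), then recover alpha.
random.seed(0)
alpha_true, xmin = 2.5, 1.0
samples = [xmin * (1 - random.random()) ** (-1 / (alpha_true - 1))
           for _ in range(20000)]
alpha_hat = powerlaw_alpha_mle(samples, xmin)
```

In practice you'd also need to estimate xmin (the paper uses a KS-statistic scan) and compare against alternatives like the lognormal before claiming a power law.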
I wonder if the higher friends count of Brazilian users is caused by the previous use of Orkut, where it was popular to try to have as many friends as possible.
The attitude of companies like Facebook, Google, and Twitter is: if the product isn't addictive or useful to billions of people, it's not worth doing.
Hence there are vastly more resources dedicated to assimilating eg photos and games into their ecosystem, than into something computationally innovative.
This is IMHO a huge mistake, since they could instead be introducing simple forms of programming that take you on a continuous curve from using the product to developing for it. There is a huge hunger in the masses for better forms of programming.
This point will probably become obvious if the Wolfram Language is successful.
I'm sure Facebook's Data Science team does a lot of interesting things internally. They do in fact have some interesting papers [0] and [1], though obviously with more of an 'academic' feel than the blog post.
Nice! Do you know if Gephi can do something similar to the summarization that we did using cluster diagrams? The whole "ball of hair" problem doesn't have any other real solution, I don't think (well, unless you use edge clustering, but that doesn't help in-group connections).
I'm 90% certain that it can, but I haven't worked with it in a while. I remember a setting in the display options that simply combined all the dots of one color into one large group.
Cool! I wonder how one could combine the best of both worlds... what we're really talking about here is a hierarchy of graph plots in which you can drill down to each node = graph at a lower level.
Wow- this is awesome! It's really cool how people's friend distribution by age is a convolution of their age and the age of the general facebook population. It's also scary in a way to see a snapshot of how I'm likely to change in the future with regards to my clusters of friends, my relationship status, and what I'll talk about.
The traditional way to plot the assortativity by age is using a scatter plot / heatmap. This is similar to what they did for country homophily on p12 of the Facebook anatomy paper. The result would be a plot with a prominent diagonal, illustrating that "same attracts same".
That aside, imo, Facebook is an incredibly idiosyncratic "app", which makes almost no sense. And yet, it gave us so many opportunities for interesting discussions, like the insights in this blog post. Nice job.
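The age-age heatmap described above amounts to a symmetric count matrix over friendship pairs. A minimal Python sketch with made-up data:

```python
from collections import Counter

def age_heatmap(friend_pairs):
    """Count friendships by (age of ego, age of friend). A strong
    diagonal indicates age homophily ('same attracts same'). Each
    undirected friendship is counted in both directions so the
    resulting matrix is symmetric."""
    counts = Counter()
    for a, b in friend_pairs:
        counts[(a, b)] += 1
        counts[(b, a)] += 1
    return counts

pairs = [(20, 21), (20, 20), (45, 47), (21, 20)]
heat = age_heatmap(pairs)
```

Plotting `heat` as an intensity map over the two age axes gives the scatter plot / heatmap in question.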
Yeah, we tried a couple. Those heatmaps I think are quite hard to read, because it is natural to want to take marginals, but you can't easily do that visually.
I think this whole "octile plot" thing turned out quite nicely. It's in a sense a way of 'slicing' the CDF into 8 even strips and projecting them onto a single axis. It's quite intuitive to read, too. Facebook seems to use it too for some of their papers.
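Computing the strips for such an octile plot boils down to extracting seven evenly spaced quantiles per x-bin. A Python sketch of the core step:

```python
def octiles(values):
    """The seven cut points that slice a sample's CDF into 8 even strips,
    using linear interpolation between order statistics."""
    xs = sorted(values)
    n = len(xs)
    cuts = []
    for k in range(1, 8):
        pos = k / 8 * (n - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        cuts.append(xs[lo] * (1 - frac) + xs[hi] * frac)
    return cuts

# For the plot, compute these cut points within each x-bin (e.g. each age)
# and draw the 7 resulting curves on a single axis.
cuts = octiles(range(9))  # 9 uniformly spaced points, 0..8
```

On a uniform sample the cuts come out evenly spaced, which is a handy sanity check.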
One thing that bugs me is how comments are linked to "interest". There are many topics that interest people (passive consumption), that do not necessarily translate into engaging in a conversation with others publicly.
As a marketing term - sure, that would be a good indicator of interest. Since this article is more scientific than marketing-oriented, I would clarify what some of the metrics mean (or don't mean).
You're right. A more ambitious approach that might get a bit closer to people's "real interests" would be to follow posted links and topic-model the contents of those links.
How much of the "friends with zero friends" effect is simply because that information is blocked? If my friends "donated" their data, I would show as having 0 friends if I've blocked that information to apps.
Actually, NONE of the people in our dataset had zero friends. The x-axis starts at 1, not 0. The point is that resampling to remove the friendship paradox shows that there are many more people with single-digit friends than we expected.
Given mcintyre1994's comment, I think this still explains the same situation. People with single-digit friends are simply people who have friends blocked to apps but have multiple friends who've donated data.
Introducing "data science for Facebook" in 2013 is ... odd.
All the more so because Jeff Hammerbacher is often credited with coining the term "data science", and he started doing it at -- that's right -- Facebook.
Very nice looking graphs, but running "Wolfram Alpha Personal Analytics for Facebook" for my own profile comes with a rather nerve-wracking warning:
Wolfram Connection would like to access your public profile, friend list, email address, custom friends lists, News Feed, relationships, birthday, status updates, checkins, education history, hometown, current city, photos, religious and political views, videos, likes and your friends' relationships, birthdays, education histories, hometowns, current cities, photos, religious and political views and videos.
You have to opt in to being a data donor for us to store any of it.
Otherwise we just record basic anonymized statistics -- like number of friends, sex, age, etc... and throw all the detailed stuff away. Our privacy policy has more: http://www.wolframalpha.com/fbfaqs.html
We also encrypt with public keys like there's no tomorrow.
The Mathematica system makes some beautiful, informative graphs, and presumably users can make those graphs with a minimum of fuss and bother. It's technically very nice.
Yet, in the entire blog post, is there one insight that wasn't a priori obvious? Maybe the bits about migration.
The progression of interests over time was non-obvious to me. For instance the explosion of interest in travel in the 20s, and the temporary dip in interests like philosophical quotes also in the 20s.
There's a lot of stuff in the post; I wouldn't dismiss it just because it's TL;DR.
They said they'll use it to support their 'personal analytics' programme [0], which is free via wolframalpha.com - I don't see how this data would help with Mathematica or anything else they charge for?
Even though WA is closed source and for-profit, I view them (and the company) as a kind of fellow scientist, not an evil big corporation (like Oracle, Microsoft, etc.).
If anyone would like to ask questions about what we did, I'd be happy to answer them.
There's still lots more interesting stuff to do, but it was enough for a blog post. Suggest away if you think we missed something obvious!