
Data wrangling. So I wrote my own "DataFrame" -- we have an official one coming to Mathematica 10, too.

Also, binning. There is a nice theory for multidimensional binning and aggregation [that I haven't seen anyone describe explicitly so far]. So I wrote primitives. They play nicely with plotting, statistics, etc. That'll also be in Mathematica 10.
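For readers who want something concrete, here is a minimal Go sketch of multidimensional binning with aggregation. It is not the theory or the primitives mentioned above, which the comment does not spell out; it is just a generic dense histogram over equal-width bins. The names `Axis`, `Hist`, `NewHist`, `Add`, `At`, and `Marginal` are all hypothetical and chosen for illustration.

```go
package main

import "math"

// Axis describes equal-width bins over the half-open range [Min, Max).
// (Hypothetical type for illustration only.)
type Axis struct {
	Min, Max float64
	N        int // number of bins
}

// Index returns the bin index for x, or -1 if x is NaN or outside [Min, Max).
func (a Axis) Index(x float64) int {
	if math.IsNaN(x) || x < a.Min || x >= a.Max {
		return -1
	}
	i := int((x - a.Min) / (a.Max - a.Min) * float64(a.N))
	if i >= a.N { // guard against floating-point rounding near the upper edge
		i = a.N - 1
	}
	return i
}

// Hist is a dense multidimensional histogram stored in row-major order.
type Hist struct {
	Axes   []Axis
	Counts []float64
}

// NewHist allocates one cell per combination of per-axis bins.
func NewHist(axes ...Axis) *Hist {
	size := 1
	for _, a := range axes {
		size *= a.N
	}
	return &Hist{Axes: axes, Counts: make([]float64, size)}
}

// Add accumulates weight into the cell containing point. It returns false
// if the point has the wrong dimension or falls outside any axis.
func (h *Hist) Add(point []float64, weight float64) bool {
	if len(point) != len(h.Axes) {
		return false
	}
	flat := 0
	for d, a := range h.Axes {
		i := a.Index(point[d])
		if i < 0 {
			return false
		}
		flat = flat*a.N + i
	}
	h.Counts[flat] += weight
	return true
}

// At returns the aggregate at the given per-axis bin indices.
// It does no bounds checking beyond what slice indexing provides.
func (h *Hist) At(idx ...int) float64 {
	flat := 0
	for d, a := range h.Axes {
		flat = flat*a.N + idx[d]
	}
	return h.Counts[flat]
}

// Marginal sums out every axis except keep, giving a 1-D histogram.
func (h *Hist) Marginal(keep int) []float64 {
	out := make([]float64, h.Axes[keep].N)
	stride := 1 // row-major stride of axis keep: product of later axis sizes
	for d := keep + 1; d < len(h.Axes); d++ {
		stride *= h.Axes[d].N
	}
	for flat, c := range h.Counts {
		out[(flat/stride)%h.Axes[keep].N] += c
	}
	return out
}
```

Calling `NewHist(Axis{0, 10, 5}, Axis{0, 1, 2})` and then `Add` for each record gives a 5x2 grid of sums; `Marginal` collapses it to one dimension, which is the kind of result that feeds straight into a plot.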

Lots of Go for data egress. It's perfect for it.



What can Mathematica bring to the DataFrame concept that hasn't been done before?

Also: why Go instead of a JVM lang that can interop directly with Mathematica via JLink?

Finally: Will Mathematica directly support doing the full stack of this kind of work, including the egress?


1. DataFrames themselves? Well, I think they'll get interesting when they can 'know' about high-level entities like cities, countries, ZIP codes, IP addresses, etc. Basically, everything that Alpha knows and can compute about, we want Mathematica to know and compute with.

2. I used Go because I am very productive in Go and like a lot of things about it. Goroutines are neat. Java is fine, it's just very boilerplatey, and I'm not practiced enough at it to get past that. And I don't see why we can't develop a GoLink as well.

3. Probably not the whole stack, at least in the beginning. But we'll get there. We want to make it really easy to spider websites and so on.


How long did the whole piece take to put together, and what's the rough breakdown of time spent on each component (data wrangling, finding useful sorts, visualizations, write-up)? Thanks!


Fulltime, around 6 weeks. Breakdown is hard to say.

I wasted a lot of time trying to do things the "traditional" way by loading into SQL, querying, etc., but it was actually much faster to process things in memory (I have a machine with 16 GB of RAM). The intensive stuff was parallelized in Go and used the ordinary filesystem with directory prefix tries for performance.

Writeup was mostly SW. He's worked on it maybe an afternoon a week for the last month.

I really enjoy visualizations and can iterate extremely fast (e.g. ChordPlot took half an hour). Don't know why M is not the de facto standard for dataviz people. Tweaking takes a long time, and design iterated with me on getting things looking really nice.

All in all, most of my time was spent building tools to easily create multidimensional histograms. The nice thing is that those tools are clearly useful enough we'll integrate them into Mathematica, so the cost is somewhat amortized.

NLP took a few weeks of Etienne's time... once again, amortized. Most of that is wrangling, really, and building tools to understand the deficiencies of your training set. Naive Bayes works surprisingly well; the magic is in the tooling and the "human intelligence" you iterate with.
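To make the Naive Bayes remark concrete, here is a minimal Go sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a textbook version, not the classifier described above; the whitespace tokenizer and the names `NaiveBayes`, `Train`, and `Classify` are assumptions made for illustration.

```go
package main

import (
	"math"
	"strings"
)

// NaiveBayes holds the counts needed for multinomial Naive Bayes.
// (Hypothetical type for illustration only.)
type NaiveBayes struct {
	docCount   map[string]int            // class -> documents seen
	wordCount  map[string]map[string]int // class -> word -> occurrences
	totalWords map[string]int            // class -> total word occurrences
	vocab      map[string]bool           // every word seen in training
	totalDocs  int
}

func NewNaiveBayes() *NaiveBayes {
	return &NaiveBayes{
		docCount:   map[string]int{},
		wordCount:  map[string]map[string]int{},
		totalWords: map[string]int{},
		vocab:      map[string]bool{},
	}
}

// tokenize lowercases and splits on whitespace; a real pipeline would do more.
func tokenize(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// Train records one labeled document.
func (nb *NaiveBayes) Train(class, text string) {
	nb.docCount[class]++
	nb.totalDocs++
	if nb.wordCount[class] == nil {
		nb.wordCount[class] = map[string]int{}
	}
	for _, w := range tokenize(text) {
		nb.wordCount[class][w]++
		nb.totalWords[class]++
		nb.vocab[w] = true
	}
}

// Classify returns the class with the highest log posterior, or "" if the
// model has seen no training data. Ties go to the alphabetically first class
// so the result does not depend on map iteration order.
func (nb *NaiveBayes) Classify(text string) string {
	words := tokenize(text)
	v := float64(len(nb.vocab))
	best, bestScore := "", math.Inf(-1)
	for class, n := range nb.docCount {
		score := math.Log(float64(n) / float64(nb.totalDocs)) // log prior
		denom := float64(nb.totalWords[class]) + v
		for _, w := range words {
			// Add-one smoothing keeps unseen words from zeroing the score.
			score += math.Log((float64(nb.wordCount[class][w]) + 1) / denom)
		}
		if score > bestScore || (score == bestScore && class < best) {
			best, bestScore = class, score
		}
	}
	return best
}
```

The model itself is only a few counters, which fits the point above: most of the effort goes into inspecting misclassified examples and fixing the training set rather than into the classifier.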



