Anyone know if there's an easy way to get the 2010 census data all at once the way they've made the 2000 census data available? (http://www2.census.gov/census_2000/datasets)
For ease of exploring new ideas it would be super handy to be able to push it all into a hadoop cluster on aws or some such nonsense.
That's easy for one-off processing (I did it recently), but in my opinion it's a lot more of a pain if you want to run queries on demand from your web server or mobile app in response to user input. Maybe less importantly, an API means the Census Bureau has written the handy search tools (filter by state, zip code, etc.) so that you don't have to; writing those yourself was never a big deal, but it was still annoying.
$ scp census_data.csv admin@my_postgres_server:/tmp/census_data.csv
$ ssh admin@my_postgres_server
$ psql
# CREATE TABLE census_data ... ;
# COPY census_data FROM '/tmp/census_data.csv' (DELIMITER ',');
# CREATE INDEX idx_census_zipcode ON census_data(zipcode);
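If you'd rather script the load than type it into psql, the same steps from Python look roughly like this. The column list is a guess for illustration; only id and zipcode are actually referenced by the index and the query below, so swap in whatever columns the census CSV really has:

import psycopg2

conn = psycopg2.connect("dbname=census_data user=postgres")
cur = conn.cursor()
# Hypothetical schema -- only id and zipcode are used later.
cur.execute("""
    CREATE TABLE census_data (
        id      serial PRIMARY KEY,
        zipcode text,
        age     int,
        sex     text,
        income  int
    );
""")
# Bulk-load the CSV server-side, same as psql's COPY.
with open("/tmp/census_data.csv") as f:
    cur.copy_expert(
        "COPY census_data (zipcode, age, sex, income) FROM STDIN WITH (FORMAT csv)", f)
cur.execute("CREATE INDEX idx_census_zipcode ON census_data (zipcode);")
conn.commit()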
To do the actual search:
import psycopg2

def count_by_zipcode(zipcode):
    conn = psycopg2.connect("dbname=census_data user=postgres")
    cur = conn.cursor()
    # No quotes around %s -- psycopg2 substitutes a properly quoted literal.
    cur.execute("SELECT count(id) FROM census_data WHERE zipcode = %s;", (zipcode,))
    return cur.fetchone()
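Calling it with a (made-up) zip code:

count_by_zipcode("94110")   # -> one-element tuple, (count,)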
What advantage does http/json have over this?
(Yes, I realize I'm missing GROUP BY's.)
[edit: I don't mean to be negative about government transparency AT ALL. I'm only criticizing the particular technical choice here - for small structured data sets, a bunch of csv's in a zip file is the clear winner. Pandas/excel/etc >> json over http for ad-hoc work, and postgres >> json over http for interactive queries (or ad-hoc work).]
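For the ad-hoc case, that workflow really is a couple of lines. The file and column names here are made up for illustration, since the real SF1 files have their own layout:

import pandas as pd

# Hypothetical file and column names -- substitute the real SF1 layout.
df = pd.read_csv("census_data.csv")
print(df.groupby("zipcode")["population"].sum().nlargest(10))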
CA is 973 MB as a zipped CSV. CA is a bit over 10% of the US population, so the whole data set should be about 10 GB. You can fit that on one of the cheaper Linodes pretty easily.
With a few indices and maybe even a materialized view (or even pruning data you don't need), you can answer most queries so fast that the network latency to your Linode exceeds the query time.
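If one query shape dominates (like the zip-code count above), the materialized view is a one-time setup. This sketch assumes the hypothetical census_data table from before and a Postgres new enough for CREATE MATERIALIZED VIEW:

import psycopg2

conn = psycopg2.connect("dbname=census_data user=postgres")
cur = conn.cursor()
# Precompute per-zip counts once; lookups then read the small view
# instead of scanning the big table.
cur.execute("CREATE MATERIALIZED VIEW pop_by_zip AS "
            "SELECT zipcode, count(id) AS n FROM census_data GROUP BY zipcode;")
cur.execute("CREATE UNIQUE INDEX ON pop_by_zip (zipcode);")
conn.commit()
# After reloading data: cur.execute("REFRESH MATERIALIZED VIEW pop_by_zip;")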
I know munging csv's and writing sql isn't as sexy as JSON APIs or mongodb, but sometimes the simple solutions are the right ones.
>CA is 973 MB as a zipped CSV. CA is a bit over 10% of the US population, so the whole data set should be about 10 GB. You can fit that on one of the cheaper Linodes pretty easily.
You forgot about unzipping. The total is 13.49 GB zipped, or 133.4 GB unzipped. [0]
As I pointed out in my first comment, you can also just load it with pandas/excel/etc. I brought up postgres for the specific case of "queries on-demand on your web server or mobile app".
To be clear, it's not the data they've collected that's the problem, it's the access the Census has provided so far. That's why it's so important that they've made an attempt to update their portal and API. That said... their dev forum still makes devs wait for manual moderation at sign-up, for instance.
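For comparison, hitting the JSON API looks something like this. The endpoint path and variable name (P001001, total population in the 2010 SF1 documentation) are my best guess, so check them against the official variable list before relying on them:

import json
import urllib.request

# Assumed endpoint and variable -- verify against the Census API docs.
# Heavier request volumes also need a free API key appended as &key=...
url = "https://api.census.gov/data/2010/dec/sf1?get=NAME,P001001&for=state:*"
with urllib.request.urlopen(url) as resp:
    rows = json.load(resp)
header, data = rows[0], rows[1:]   # first row is the column header
for name, pop, fips in data:
    print(name, pop)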
Another resource that has existed for a while is the "Integrated Public Use Microdata Series". Though I don't think they have an API, they do provide a vast amount of data and have tools for online analysis.
IPUMS is often used by researchers (myself included) who download the entire dataset and run regressions on it, which might be one reason they haven't bothered to create an API.
I've been hammering on code for this all weekend. If anyone is interested in discussing, feel free to contact me. I'm mainly working on queries for county-level information on age, race, sex, occupation, and income.
EDIT: Found it! Hopefully... http://www2.census.gov/census_2010/04-Summary_File_1/