That is easy to do for a one-off processing (I did it recently), but in my opini...

yummyfajitas · on July 28, 2012

Queries on demand, with handy search tools:

    $ scp census_data.csv admin@my_postgres_server:/tmp/census_data.csv
    $ ssh admin@my_postgres_server
    $ psql
    # CREATE TABLE census_data ... ;
    # COPY census_data FROM '/tmp/census_data.csv' (DELIMITER ',');
    # CREATE INDEX idx_census_zipcode ON census_data(zipcode);

To do the actual search:

    conn = psycopg2.connect("dbname=census_data user=postgres")
    cur = conn.cursor()
    cur.execute("SELECT count(id) FROM census_data WHERE zipcode='%s';", (zipcode, ))
    return cur.fetchone()

What advantage does http/json have over this?

(Yes, I realize I'm missing GROUP BY's.)

[edit: I don't mean to be negative about government transparency AT ALL. I'm only criticizing the particular technical choice here - for small structured data sets, a bunch of csv's in a zip file is the clear winner. Pandas/excel/etc >> json over http for ad-hoc work, and postgres >> json over http for interactive queries (or ad-hoc work).]

sholanozie · on July 28, 2012

The retrieval and processing of the data is done externally (i.e. not on your server).

yummyfajitas · on July 28, 2012

CA is 973M as zipped csv. CA is a bit over 10% of the US population, so the whole data set will be about 10gb. You can fit that on one of the cheaper linodes pretty easily.

With a few indices and maybe even a materialized view (or even pruning data you don't need), you can answer most queries so fast that latency between linode and the census > query time.

I know munging csv's and writing sql isn't as sexy as JSON APIs or mongodb, but sometimes the simple solutions are the right ones.

mkopinsky · on July 29, 2012

>CA is 973M as zipped csv. CA is a bit over 10% of the US population, so the whole data set will be about 10gb. You can fit that on one of the cheaper linodes pretty easily.

You forgot about unzipping. The total is 13.49 GB zipped, or 133.4 GB unzipped. [0]

[0] http://www2.census.gov/census_2010/04-Summary_File_1/Califor...

sholanozie · on July 28, 2012

I don't disagree with you, but for many people it's far more convenient to query an external API than to create their own.

_tggb · on July 29, 2012

Why should I have to install Postgres to be able to play around with data? This is one less step to think about.

yummyfajitas · on July 29, 2012

As I pointed out in my first comment, you can also just load it with pandas/excel/etc. I brought up postgres for the specific case of "queries on-demand on your web server or mobile app".