
> Big data analytics with Pandas and SQLite

> A Large Data Workflow with Pandas

> Data Analysis of 8.2 Million Rows with Python and SQLite

> This notebook explores a 3.9Gb CSV file

Big data?



Yeah, like others point out, this is small data. My rules are:

1) Does it fit in a desktop computer's memory? If it does, it is "small data".

2) Does it fit on a hard drive in a desktop? Then it is just "data". Or medium data.

3) If you need a cluster or some centralized network storage to fit and manage it, you might have big data.

4) Next level up is streaming data. Your data doesn't even sit anywhere because it is accumulated faster than you could ever process it.


The best definition I've seen of Big Data is "the database indices don't fit in memory on the beefiest single server you have access to". (And this doesn't mean you can claim big data because you only have access to your 4 GB RAM laptop.)


My usual rule of thumb is "does AWS sell a single instance big enough to store and process data in real time?" If you can't find an instance big enough, you have big data.


Infrastructure-wise - no.

Methodology-wise - yes. Basically, anything that you cannot confidently put into a spreadsheet, manually investigate a chunk of, and search with regex is (buzzword alert) "big data". Also, you need to forget about all your O(n^2) scripts.
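A toy sketch of the O(n^2) point (hypothetical data, not from the article): a pairwise duplicate check that is tolerable at spreadsheet scale becomes hopeless at millions of rows, while a single hash-based pass stays linear.

  import itertools

  # Hypothetical sample: 10,000 rows with deliberate duplicates.
  rows = ["user_%d" % (i % 1000) for i in range(10000)]

  # O(n^2): ~50 million pairwise comparisons for 10k rows; completely
  # infeasible at millions of rows.
  dupes_quadratic = {a for a, b in itertools.combinations(rows, 2) if a == b}

  # O(n): one pass with a set finds the same duplicates.
  seen, dupes_linear = set(), set()
  for r in rows:
      if r in seen:
          dupes_linear.add(r)
      seen.add(r)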

If you do any kind of analysis (and you are of size < Google), processing is the easy part. The hard part is getting the maximum out of the data while disregarding noise. (That is also why, when you ask data scientists what they do, a common answer is that 80-90% of their time is data cleaning.)


  Methodology-wise - yes
No. Big Data methodology has traditionally (going back to MapReduce) been all about distributed computing, and figuring out how to get anything meaningful at all out of your data when it can't be comfortably processed on a giant server. When you talk about "size < Google" - Big Data as a term was ORIGINALLY COINED to describe what Google has to deal with. For example, Wikipedia says:

  Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
People have been dealing with multi-GB datasets for a lot longer than the term "Big Data" has been around, and it was not intended to refer to existing techniques. Unfortunately, lots of people (especially non-programmers) have tried to adopt it to describe what they do because it's buzzword-y and because Google used it.

  80-90% of time is data-cleaning
That's Data, period. I have spent 80-90% of my time cleaning when dealing with a 1MB flat text file. That doesn't make it Big Data.


I won't argue about the particular phrase, since 1) buzzwords are misused more often than they are used properly and 2) there is no single, strict definition of it.

But going from ~MBs to ~GBs changes methodology (of course, it's more about the number of items and their structure than sheer size). Even at this scale there is some quality/quantity tradeoff, and the feasibility of using more complex algorithms changes.

Data cleaning - sure, for all data. But when the volume is in MBs, it is still possible to open it in a text editor and manually look into abnormalities. Once it's GBs, some artifacts will remain undetected.


The fact remains that nothing that can be analyzed on a single computer, in a single-threaded application, in a single memory space is anything like "big data", unless you just want to tap into the trend of buzzword-laden link-bait inbound marketing.


> Methodology-wise - yes. Basically, anything that you cannot confidently put into a spreadsheet...

Why just spreadsheets? This is essentially loading data in SQLite and manually investigating it with SELECT queries.
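For what it's worth, that workflow is easy to sketch (hypothetical file, table, and column names; roughly the shape of the approach, not a copy of the notebook):

  import sqlite3
  import pandas as pd

  # Load the CSV into SQLite once, in chunks, so memory use stays bounded.
  con = sqlite3.connect("data.db")
  for chunk in pd.read_csv("data.csv", chunksize=100000):
      chunk.to_sql("rows", con, if_exists="append", index=False)

  # Then "manually investigate" with ordinary SELECT queries.
  sample = pd.read_sql_query("SELECT * FROM rows LIMIT 20", con)
  counts = pd.read_sql_query(
      "SELECT some_column, COUNT(*) AS n FROM rows "
      "GROUP BY some_column ORDER BY n DESC LIMIT 10", con)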


Haha, 3.9Gb, that's nothing. I'd argue that anything that can fit into the RAM of a relatively modern laptop is small data.


The word around here is "if it fits onto a computer you own, it's not 'big data'".

I'd look strangely at anyone claiming "big data" for anything smaller than double digit terabytes.


Yeah. I would hesitate to call anything under 60 terabytes big data.


This is an interesting point because the typical amount of RAM in a modern computer is 8GB, with 16GB being a rare upgrade.

For data analyses I've done, my 8GB computers will complain when analyzing 2-3GB datasets, because when processing the data the computer needs to make a copy in memory (the analysis isn't in-place), and there are memory leaks (stupid R). So 4-6GB of memory usage causes hangs (since the OS is using a lot of memory too), which is why I've had to resort to SQL for datasets that hit that size (which is perfectly fine).
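For what it's worth, chunked processing in pandas is another way around the copy problem (a sketch with hypothetical file and column names): only one chunk is ever resident in memory, never the full multi-GB dataset.

  import pandas as pd

  # Hypothetical names; only one 100k-row chunk is in memory at a time.
  totals = None
  for chunk in pd.read_csv("big.csv", chunksize=100000):
      partial = chunk.groupby("category")["value"].sum()
      totals = partial if totals is None else totals.add(partial, fill_value=0)

  print(totals.sort_values(ascending=False).head())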


I think it was submitted as "SQLite and Pandas for Medium Data Workflows" (that's what I see in my RSS reader), but may have been changed to reflect the article title.


OP here - yup, you're right, submitted as "SQLite and Pandas for Medium Data Workflows"


That was not particularly big data in 1995. You'd just load it into Oracle running on a desktop machine and query it in SQL, draw some graphs in Developer/2000.


I mean, using this approach I imagine you can do the same with much more data...

(Also, Big Data is one of those terms whose definition is hard to pin down.)


Like what? Loading multiple terabytes of data into sqlite on a single server?


For a lot of companies that is a lot of data, yes. Maybe not big data, but the term is relative so if a business sits on 1000 times as much data as they have played with before, is it that unreasonable for them to say it is big data?


> ...is it that unreasonable for them to say it is big data?

Let's say I have an organization and we run into a dataset that is 1,000 times larger than anything we've dealt with. Should we put out a help-wanted ad for a "big data" developer? What if the largest dataset we had previously dealt with was 100 rows? The reason we have terms for things like this is to facilitate communication. If the definition is highly sensitive to context, then the term doesn't facilitate any communication; the whole context must still be spelled out. If the term is to have any meaning at all, it can't be a relative thing. Of course, I'm of the opinion that the term is already meaningless, so I guess do whatever you want :)


Big data does not mean "a lot of data". It's about variety and velocity, in addition to volume.


As a rule of thumb, if the data can fit on something smaller than your literal thumbnail (i.e. a single microSD card), it's definitely not big data.


So, 512GB?


It's so big, that you can't even fit it in a single Excel sheet! /s


I think the limit for PowerPivot is 4Gb of data when saved, so you might be able to do this in Excel that way; the row limit is just under 2 billion.


Agreed.

Please don't call any dataset smaller than 100 million rows big data.


A hundred billion more like.



