
> Big data analytics with Pandas and SQLite

> A Large Data Workflow with Pandas

> Data Analysis of 8.2 Million Rows with Python and SQLite

> This notebook explores a 3.9Gb CSV file

Big data?



Yeah, like others point out, this is small data. My rules are:

1) Does it fit in a desktop computer's memory? If it does, it is "small data".

2) Does it fit on a hard drive in a desktop? Then it is just "data". Or medium data.

3) If you need a cluster or some centralized network storage to fit and manage it, you might have big data.

4) Next level up is streaming data. Your data doesn't even sit anywhere because it is accumulated faster than you could ever process it.


The best definition I've seen of Big Data is "the database indices don't fit in memory on the beefiest single server you have access to". (And this doesn't mean you can claim big data because you only have access to your 4 GB RAM laptop.)


My usual rule of thumb is "does AWS sell a single instance big enough to store and process data in real time?" If you can't find an instance big enough, you have big data.


Infrastructure-wise - no.

Methodology-wise - yes. Basically, anything that you cannot confidently put into a spreadsheet, manually investigate a chunk of, and search with regex is (buzzword alert) "big data". Also, you need to forget about all your O(n^2) scripts.
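A toy sketch of the O(n^2) point (hypothetical data, not from the article): a pairwise duplicate check that is tolerable at spreadsheet scale becomes hopeless at millions of rows, while a single hash-based pass stays linear.

  import itertools

  # Hypothetical sample: 10,000 rows with deliberate duplicates.
  rows = ["user_%d" % (i % 1000) for i in range(10000)]

  # O(n^2): ~50 million pairwise comparisons for 10k rows; completely
  # infeasible at millions of rows.
  dupes_quadratic = {a for a, b in itertools.combinations(rows, 2) if a == b}

  # O(n): one pass with a set finds the same duplicates.
  seen, dupes_linear = set(), set()
  for r in rows:
      if r in seen:
          dupes_linear.add(r)
      seen.add(r)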

If you do any kind of analysis (and you are of size < Google), processing is the easy part. The hard part is getting the maximum out of the data while disregarding noise. (That is also why, when you ask data scientists what they do, a common answer is that 80-90% of their time is data cleaning.)


  Methodology-wise - yes
No. Big Data methodology has traditionally (going back to MapReduce) been all about distributed computing, and figuring out how to get anything meaningful at all out of your data when it can't be comfortably processed on a giant server. When you talk about "size < Google" - Big Data as a term was ORIGINALLY COINED to describe what Google has to deal with. For example, Wikipedia says:

  Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
People have been dealing with multi-GB datasets for a lot longer than the term "Big Data" has been around, and it was not intended to refer to existing techniques. Unfortunately, lots of people (especially non-programmers) have tried to adopt it to describe what they do because it's buzzword-y and because Google used it.

  80-90% of time is data-cleaning
That's Data, period. I have spent 80-90% of my time cleaning when dealing with a 1MB flat text file. That doesn't make it Big Data.


I won't argue about the particular phrase, since 1) buzzwords are misused more often than they are used properly and 2) there is no single, strict definition of it.

But going from ~MBs to ~GBs changes methodology (of course, it's more about the number of items and their structure than sheer size). Even at this scale there is some quality/quantity tradeoff, and the feasibility of using more complex algorithms changes.

Data cleaning - sure, for all data. But when the volume is in MBs, it is still possible to open it in a text editor and manually look into abnormalities. Once it's GBs, some artifacts will remain undetected.


The fact remains that nothing that can be analyzed on a single computer, in a single-threaded application, in a single memory space is anything like "big data", unless you just want to tap into the trend of buzzword-laden link-bait inbound marketing.


> Methodology-wise - yes. Basically, anything that you cannot confidently put into a spreadsheet...

Why just spreadsheets? This is essentially loading data in SQLite and manually investigating it with SELECT queries.
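For what it's worth, that workflow is easy to sketch (hypothetical file, table, and column names; roughly the shape of the approach, not a copy of the notebook):

  import sqlite3
  import pandas as pd

  # Load the CSV into SQLite once, in chunks, so memory use stays bounded.
  con = sqlite3.connect("data.db")
  for chunk in pd.read_csv("data.csv", chunksize=100000):
      chunk.to_sql("rows", con, if_exists="append", index=False)

  # Then "manually investigate" with ordinary SELECT queries.
  sample = pd.read_sql_query("SELECT * FROM rows LIMIT 20", con)
  counts = pd.read_sql_query(
      "SELECT some_column, COUNT(*) AS n FROM rows "
      "GROUP BY some_column ORDER BY n DESC LIMIT 10", con)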


Haha, 3.9Gb, that's nothing. I'd argue that anything that can fit into the RAM of a relatively modern laptop is small data.


The word around here is "if it fits onto a computer you own, it's not 'big data'".

I'd look strangely at anyone claiming "big data" for anything smaller than double digit terabytes.


Yeah. I would hesitate to call anything under 60 terabytes big data.


This is an interesting point because the typical amount of RAM in a modern computer is 8GB, with 16GB being a rare upgrade.

For data analyses I've done, my 8GB computers will complain when analyzing 2-3GB datasets, because when processing the data the computer needs to make a copy in memory (the analysis isn't in-place), and there are memory leaks (stupid R). So 4-6GB of memory usage causes hangs (since the OS is using a lot of memory too), which is why I've had to resort to SQL for datasets that hit that size (which is perfectly fine).
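For what it's worth, chunked processing in pandas is another way around the copy problem (a sketch with hypothetical file and column names): only one chunk is ever resident in memory, never the full multi-GB dataset.

  import pandas as pd

  # Hypothetical names; only one 100k-row chunk is in memory at a time.
  totals = None
  for chunk in pd.read_csv("big.csv", chunksize=100000):
      partial = chunk.groupby("category")["value"].sum()
      totals = partial if totals is None else totals.add(partial, fill_value=0)

  print(totals.sort_values(ascending=False).head())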


I think it was submitted as "SQLite and Pandas for Medium Data Workflows" (that's what I see in my RSS reader), but may have been changed to reflect the article title.


OP here - yup, you're right, submitted as "SQLite and Pandas for Medium Data Workflows"


That was not particularly big data in 1995. You'd just load it into Oracle running on a desktop machine and query it in SQL, draw some graphs in Developer/2000.


I mean, using this approach I imagine you can do the same with much more data...

(Also, Big Data is one of those terms whose definition is hard to pin down.)


Like what? Loading multiple terabytes of data into sqlite on a single server?


For a lot of companies that is a lot of data, yes. Maybe not big data, but the term is relative so if a business sits on 1000 times as much data as they have played with before, is it that unreasonable for them to say it is big data?


> ...is it that unreasonable for them to say it is big data?

Let's say I have an organization and we run into a dataset that is 1,000 times larger than anything we've dealt with. Should we put out a help-wanted ad for a "big data" developer? What if the largest dataset we had previously dealt with was 100 rows? The reason we have terms for things like this is to facilitate communication. If the definition is highly sensitive to context, then the term doesn't facilitate any communication; the whole context must still be spelled out. If the term is to have any meaning at all, it can't be a relative thing. Of course, I'm of the opinion that the term is already meaningless, so I guess do whatever you want :)


Big data does not mean "a lot of data". It's about variety and velocity, in addition to volume.


As a rule of thumb, if the data can fit on something smaller than your literal thumbnail (i.e. a single microSD card), it's definitely not big data.


So, 512GB?


It's so big, that you can't even fit it in a single Excel sheet! /s


I think the limit for PowerPivot is 4Gb of data when saved, so you might be able to do this in Excel that way; the row limit is just under 2 billion.


Agreed.

Please don't call any dataset smaller than 100 million rows big data.


A hundred billion more like.



