The best definition I've seen of Big Data is "the database indices don't fit in memory on the beefiest single server you have access to". (And this doesn't mean you can claim big data because you only have access to your 4 GB RAM laptop.)
My usual rule of thumb is "does AWS sell a single instance big enough to store and process data in real time?" If you can't find an instance big enough, you have big data.
Methodology-wise - yes. Basically, anything that you cannot confidently put into a spreadsheet, manually investigate a chunk of, and search with regex is (as a buzzword) "big data". Also, you need to forget about all O(n^2) scripts.
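To make the O(n^2) point concrete, here is a hypothetical sketch (not from the article): a pairwise duplicate check that is fine on a few thousand rows becomes hopeless on tens of millions, while a hash-based single pass stays linear.

```python
# Hypothetical illustration of why O(n^2) scripts stop scaling:
# pairwise duplicate detection vs. a linear hash-based pass.

def duplicates_quadratic(rows):
    """O(n^2): ~10^16 comparisons for 100M rows -- never finishes."""
    dupes = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i] == rows[j]:
                dupes.append(rows[i])
    return dupes

def duplicates_linear(rows):
    """O(n): a single pass, remembering what has been seen."""
    seen, dupes = set(), []
    for row in rows:
        if row in seen:
            dupes.append(row)
        else:
            seen.add(row)
    return dupes
```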
If you do any kind of analysis (and you are of size < Google), processing is the easy part. The hard part is to get the maximum out of it while disregarding the noise. (It is also why, when you ask data scientists what they do, a common answer is that 80-90% of their time is data cleaning.)
No. Big Data methodology has traditionally (going back to MapReduce) been all about distributed computing, and figuring out how to get anything meaningful at all out of your data when it can't be comfortably processed on a giant server. When you talk about "size < Google" - Big Data as a term was ORIGINALLY COINED to describe what Google has to deal with. For example, Wikipedia says:
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
People have been dealing with multi-GB datasets for a lot longer than the term "Big Data" has been around, and it was not intended to refer to existing techniques. Unfortunately, lots of people (especially non-programmers) have tried to adopt it to describe what they do because it's buzzword-y and because Google used it.
80-90% of time is data-cleaning
That's Data, period. I have spent 80-90% of my time cleaning when dealing with a 1MB flat text file. That doesn't make it Big Data.
I won't argue about a particular phrase, as 1) buzzwords are more often misused than used properly, and 2) there is no single, strict definition of it.
But going from ~MBs to ~GBs changes the methodology (of course, it's more about the number of items and their structure than sheer size). Even at this scale there is some quality/quantity tradeoff, and a question of whether more complex algorithms are still feasible.
Data cleaning - sure, for all data. But when the volume is in MBs, it is still possible to open it in a text editor and manually look into abnormalities. Once it's GBs, some artifacts will remain undetected.
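One way to keep a bit of that manual inspection at GB scale is to stream the file and count or sample abnormalities instead of eyeballing it. A rough sketch, assuming a plain CSV; the filename and expected column count are made up for illustration:

```python
# Rough sketch: stream a large CSV and surface abnormalities
# (wrong field counts, empty fields) instead of eyeballing the file.
# The filename and expected column count are hypothetical.
import csv
from collections import Counter

EXPECTED_COLS = 12          # assumption for illustration
stats = Counter()
samples = []

with open("events.csv", newline="") as f:
    for i, row in enumerate(csv.reader(f)):
        if len(row) != EXPECTED_COLS:
            stats["bad_field_count"] += 1
            if len(samples) < 10:   # keep a few examples to inspect by hand
                samples.append((i, row))
        elif any(field == "" for field in row):
            stats["empty_field"] += 1

print(stats)
for line_no, row in samples:
    print(line_no, row)
```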
The fact remains that nothing that can be analyzed on a single computer in a single threaded application in a single memory space is anything like "big data", unless you just want to tap into the trend of buzzword-laden link bait inbound marketing.
This is an interesting point because the typical amount of RAM in a modern computer is 8GB, with 16GB being a rare upgrade.
For the data analyses I've done, my 8GB computers will complain when analyzing 2-3GB datasets, because when processing the data the computer needs to make a copy in memory, since the analysis isn't in-place, and there are memory leaks (stupid R). So 4GB-6GB of memory usage causes hangs (since the OS is using a lot of memory too), which is why I've had to resort to SQL for data sets that hit that size (which is perfectly fine).
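For what it's worth, a minimal sketch of that kind of SQL fallback, in the pandas + SQLite spirit of the linked article; the file, table, and column names here are hypothetical:

```python
# Minimal sketch: push a too-big-for-RAM CSV into SQLite with pandas,
# then aggregate in SQL. File, table, and column names are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("data.db")

# Load the CSV in 100k-row chunks so memory use stays roughly flat.
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    chunk.to_sql("events", conn, if_exists="append", index=False)

# The aggregation happens inside SQLite; only the small result comes back.
summary = pd.read_sql_query(
    "SELECT category, COUNT(*) AS n FROM events GROUP BY category", conn
)
print(summary)
conn.close()
```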
I think it was submitted as "SQLite and Pandas for Medium Data Workflows" (that's what I see in my RSS reader), but may have been changed to reflect the article title.
That was not particularly big data in 1995. You'd just load it into Oracle running on a desktop machine and query it in SQL, draw some graphs in Developer/2000.
For a lot of companies that is a lot of data, yes. Maybe not big data, but the term is relative so if a business sits on 1000 times as much data as they have played with before, is it that unreasonable for them to say it is big data?
> ...is it that unreasonable for them to say it is big data?
Let's say I have an organization and we run into a dataset that is 1,000 times larger than anything we've dealt with. Should we put out a help-wanted ad for a "big data" developer? What if the largest dataset we had previously dealt with was 100 rows? The reason we have terms for things like this is to facilitate communication. If the definition is highly sensitive to context, then the term doesn't facilitate any communication, the whole context must still be spelled out. If the term is to have any meaning at all, it can't be a relative thing. Of course I'm of the opinion that the term is already meaningless, so I guess do whatever you want :)
> A Large Data Workflow with Pandas
> Data Analysis of 8.2 Million Rows with Python and SQLite
> This notebook explores a 3.9Gb CSV file
Big data?