Wikidata.org, the structured database behind Wikipedia, is a good alternative as a general database that includes films and their crews.
For instance, Avatar (2009):
https://www.wikidata.org/wiki/Q24871
Indeed, the 2013 Spike Jonze movie was the first thing that popped into my mind when I saw those videos.
It's amazing to watch this movie 10 years after its release in the light of today's "futuristic" tools (AI assistants and such).
An initial import of all 15 files with DuckDB takes only 36 seconds on a regular (6-year-old) desktop computer with 32 GB of RAM, and 26 seconds (5 times quicker than QuestDB) on a Dell PowerEdge 450 with a 20-core Intel Xeon and 256 GB of RAM.
Here is the command to import the files:
    CREATE TABLE ecommerce_sample AS SELECT * FROM read_csv_auto('ecommerce_*.csv');
DuckDB is awesome. A couple of comments here. First of all, this is totally my fault, as I didn't explain it properly.
I am trying to simulate performance for streaming ingestion, which is the typical use case for QuestDB, ClickHouse, and Timescale. The three of them can do batch processing as well, but they shine when data is coming in at high throughput in real time. So, while the data is provided as CSV (for compatibility reasons), I am reading it line by line and sending the data from a Python script using the streaming-ingestion API exposed by those databases.
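Roughly this pattern; a minimal sketch, assuming QuestDB's InfluxDB Line Protocol (ILP) endpoint on its default TCP port 9009. The file, table, and column names are hypothetical, and this is not the actual script from the article:

    import csv
    import socket
    import time

    # Stream one ILP line per CSV row to QuestDB's default ILP TCP port.
    # File, table, and column names are made up for the example.
    sock = socket.create_connection(("localhost", 9009))
    with open("ecommerce_0.csv", newline="") as f:
        for row in csv.DictReader(f):
            line = (
                f"ecommerce_sample,country={row['country']} "
                f"price={row['price']},quantity={row['quantity']}i "
                f"{time.time_ns()}\n"
            )
            sock.sendall(line.encode())
    sock.close()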
I guess the equivalent in DuckDB would be writing data via INSERTs in batches of 10K records, which I am sure would still be very performant!
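Something like this minimal sketch, using DuckDB's Python API and the same hypothetical file and columns as above:

    import csv
    import duckdb

    con = duckdb.connect("ecommerce.duckdb")
    con.execute(
        "CREATE TABLE IF NOT EXISTS ecommerce_sample "
        "(country VARCHAR, price DOUBLE, quantity INTEGER)"
    )

    # Accumulate rows and insert them in batches of 10K.
    batch, BATCH_SIZE = [], 10_000
    with open("ecommerce_0.csv", newline="") as f:
        for row in csv.DictReader(f):
            batch.append((row["country"], float(row["price"]), int(row["quantity"])))
            if len(batch) == BATCH_SIZE:
                con.executemany("INSERT INTO ecommerce_sample VALUES (?, ?, ?)", batch)
                batch.clear()
    if batch:
        con.executemany("INSERT INTO ecommerce_sample VALUES (?, ?, ?)", batch)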
The three databases in the article have more efficient methods for ingesting batch data (in QuestDB's case, the COPY keyword, or even importing directly from a Pandas dataframe using our Python client, would be faster than ingesting streaming data). I know ClickHouse and Timescale can also ingest CSV data way faster than by sending streaming inserts.
But that's not the typical use case, or the point of the article, as removing duplicates in batch is way easier than in streaming. I should have made that clearer and will probably update the article, so thank you for your feedback.
Other than that, I ran the batch experiment you mention on the box I used for the article (I had actually already done it in the past), and the time I am getting is 37 seconds, which is slower than your numbers. The reason is that we are using a cloud instance with an EBS drive, and those are slower than a local SSD on your laptop.
You can use local drives on AWS and other cloud providers, but those are way more expensive and have no persistence or snapshots, so they are not ideal for a database (we sometimes use them for large one-off imports, to store the original CSV and speed up reads while writing to the EBS drive).
Actually, one of the claims from DuckDB's entourage is that "big data is dead": with the power in a developer's laptop today, you can do things at unprecedented scale faster than in the cloud. DuckDB is designed to run locally on a data scientist's machine rather than on a remote server (of course there is MotherDuck if you want to go remote, but then you are adding latency and a proprietary layer on top).
I also regularly use miller/mlr with files of 1M+ lines, and I have never had problems with the built-in processing (i.e. "verbs") nor with the DSL, which is known to be much less efficient than the verbs.
Both visidata and miller are essential tools for processing/viewing CSV/TSV files, and way better than LibreOffice or Excel in terms of performance on large files.
Repeated execution seems to be much slower for the empty executable than for /bin/true, as measured with the multitime tool (https://tratt.net/laurie/src/multitime/):
    $ multitime -n 10 ./empty
    ===> multitime results
    1: ./empty
                Mean        Std.Dev.    Min         Median      Max
    real        0.006       0.000       0.005       0.006       0.006
    user        0.001       0.001       0.000       0.002       0.002
    sys         0.000       0.001       0.000       0.000       0.002

    $ multitime -n 10 /bin/true
    ===> multitime results
    1: /bin/true
                Mean        Std.Dev.    Min         Median      Max
    real        0.002       0.000       0.001       0.002       0.003
    user        0.001       0.000       0.001       0.001       0.001
    sys         0.000       0.000       0.000       0.000       0.000
> What is also interesting is that while strace seems to report much less overhead in "./empty" compared to /bin/true
That's because it reports an error and does not actually execute the file.
> strace: exec: Exec format error
`strace` uses one of the exec()-style functions that don't automatically fall back to invoking a shell here. And if it did, it would have to start an entire shell (/bin/sh, typically), which might add more overhead.
----
You need to be very careful when measuring this, because what actually happens is that a shell executes the shebangless empty file.
If you launch it from e.g. a C program using execlp(), it will first have to start that shell.
If you launch it from e.g. bash, the shell might run it in-process, or at least after just a fork() or similar, skipping some of the shell startup.
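You can observe both behaviours from Python; a quick sketch, assuming an empty, executable file ./empty in the current directory:

    import subprocess

    # Direct exec-style launch: the kernel rejects a file that has neither
    # a shebang nor a known binary format (ENOEXEC).
    try:
        subprocess.run(["./empty"])
    except OSError as e:
        print("direct exec failed:", e)  # [Errno 8] Exec format error

    # Through a shell: /bin/sh falls back to running the file as a (here
    # empty) shell script, so it "succeeds" with exit code 0.
    result = subprocess.run("./empty", shell=True)
    print("via shell, exit code:", result.returncode)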
One can spot cargo-cult command-line use when two pairs of the commands overlap functionally, and another triplet is the same program under different names.
Also note that RIS is a bad idea; it has been discouraged since 1983, as I recently discovered, although DEC did not think to tell people outside DEC this for half a decade. Use DECSTR instead.
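For reference, a minimal sketch of the two control sequences as documented for VT/xterm-style terminals (RIS = ESC c, DECSTR = CSI ! p):

    import sys

    # RIS performs a hard reset to initial state; DECSTR is the soft terminal reset.
    RIS = "\x1bc"
    DECSTR = "\x1b[!p"

    # Prefer the soft reset:
    sys.stdout.write(DECSTR)
    sys.stdout.flush()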