Hacker News | adren123's comments

I wonder how ClickHouse would compare against DuckDB, given that the latter usually performs very well on OLAP benchmarks such as TPC-H.


Wikidata.org, the database behind Wikipedia, is a good alternative as a general database, including films and their related crew; for instance, Avatar (2009): https://www.wikidata.org/wiki/Q24871

It cannot be queried in SQL, though, only in SPARQL: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service
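As a minimal sketch, here is how such a query could be built and sent to the public endpoint from Python (the P161 "cast member" property and the exact query shape are my assumptions, not taken from the item page; only the URL construction is shown, since actually running it needs network access):

```python
from urllib.parse import urlencode

# SPARQL query listing cast members of Avatar (Q24871).
# P161 is assumed to be Wikidata's "cast member" property.
query = """
SELECT ?actor ?actorLabel WHERE {
  wd:Q24871 wdt:P161 ?actor .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# The public endpoint accepts the query as a GET parameter and can
# return JSON when asked via format=json.
url = "https://query.wikidata.org/sparql?" + urlencode(
    {"query": query, "format": "json"}
)
print(url[:60] + "...")
```

Fetching `url` with any HTTP client then returns the result bindings as JSON.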


Indeed, the 2013 Spike Jonze movie is the first thing that popped into my mind when I saw those videos. Amazing to see this movie, 10 years after its release, in the light of those "futuristic" tools (AI assistants and such).


For me it's kinda scary: a somewhat dystopian world where human interaction is very limited... and in the end the AI (spoiler)


If you want to win on both counts (no buggy pipes and coloring), and if you're certain of the order, you can use the following:

    rg 'nokogiri.*linux'


An initial import with DuckDB of all 15 files takes only 36 seconds on a regular (6-year-old) desktop computer with 32 GB of RAM, and 26 seconds (5 times quicker than QuestDB) on a Dell PowerEdge 450 with a 20-core Intel Xeon and 256 GB of RAM.

Here is the command to import the files:

    CREATE TABLE ecommerce_sample AS SELECT * FROM read_csv_auto('ecommerce_*.csv');


Thank you! (original article writer here)

DuckDB is awesome. A couple of comments here. First of all, this is totally my fault, as I didn't explain it properly.

I am trying to simulate performance for streaming ingestion, which is the typical use case for QuestDB, ClickHouse, and Timescale. All three can do batch processing as well, but they shine when data comes in at high throughput in real time. So, while the data is presented as CSV (for compatibility reasons), I am reading it line after line and sending it from a Python script using the streaming-ingestion API exposed by those databases.

I guess the equivalent in DuckDB would be writing the data via INSERTs in batches of 10K records, which I am sure would still be very performant!
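Such a batched-INSERT loop could be sketched like this (the file layout and the `insert_batch` callback are illustrative, not taken from the article; with DuckDB's Python client the callback would typically wrap an `executemany` call):

```python
import csv
import itertools

def batches(rows, size=10_000):
    """Group an iterator of rows into lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def stream_csv(path, insert_batch):
    """Read a CSV line by line and hand each 10K batch to the database.

    `insert_batch` is whatever the driver offers, e.g. something like
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", chunk).
    """
    with open(path, newline="") as f:
        for chunk in batches(csv.reader(f)):
            insert_batch(chunk)
```

The point is that rows still arrive one at a time, but the database only pays per-statement overhead once per 10K rows.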

The three databases in the article have more efficient methods for ingesting batch data (in QuestDB's case, the COPY keyword, or even importing directly from a Pandas dataframe using our Python client, would be faster than streaming ingestion). I know ClickHouse and Timescale can also ingest CSV data way faster than via streaming inserts.

But that's not the typical use case, or the point of the article, as removing duplicates in batch is way easier than in streaming. I should have made that clearer and will probably update the article, so thank you for your feedback.
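To illustrate why streaming dedup is harder: in batch it is a single GROUP BY over all the data, while a streaming ingester can only remember a bounded window of recently seen keys. A minimal sketch (the window size and the key choice are made up for illustration, not how any of these databases implement it):

```python
from collections import OrderedDict

class StreamingDeduper:
    """Drop rows whose key was seen within the last `window` distinct keys.

    Duplicates that arrive further apart than the window slip through,
    which is exactly the limitation batch dedup does not have.
    """
    def __init__(self, window=100_000):
        self.window = window
        self.seen = OrderedDict()

    def accept(self, key):
        if key in self.seen:
            self.seen.move_to_end(key)     # refresh recency
            return False                   # duplicate within the window
        self.seen[key] = None
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)  # evict the oldest key
        return True
```

Every accepted row costs an ordered-dict lookup and a possible eviction, so the window size is a direct memory/accuracy trade-off.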

Other than that, I ran the batch experiment you mention on the box used in the article (I had actually already done it in the past), and the performance I got was 37 seconds, which is slower than your numbers. The reason is that we are using a cloud instance with an EBS volume, and those are slower than a local SSD on your laptop.

You can use local drives on AWS and other cloud providers, but those are way more expensive and have no persistence or snapshots, so they're not ideal for a database (we use them sometimes for large one-off imports, to store the original CSV and speed up reads while writing to the EBS volume).

Actually, one of the claims from DuckDB's entourage is that "big data is dead": with the power in a developer's laptop today, you can do things faster than with the cloud, at unprecedented scale. DuckDB is designed to run locally on a data scientist's machine rather than on a remote server (of course there is MotherDuck if you want to go remote, but then you are adding latency and a proprietary layer on top).

Once again thank you for your feedback!


There is also git-bug [1], an offline bug tracker fully embedded in git, which can bridge to GitHub, GitLab, and Jira.

[1] https://github.com/MichaelMure/git-bug


I also regularly use miller/mlr with files of 1M+ lines, and I never had problems with the hardcoded processing (i.e. "verbs"), nor with the DSL, which is known to be much less efficient than the verbs.

Both visidata and miller are essential tools for processing/viewing CSV/TSV files, way better than LibreOffice or Excel in terms of performance on large files.


What is also interesting is that, while strace seems to report much less overhead for "./empty" than for /bin/true:

   $ strace -c ./empty 
   strace: exec: Exec format error
   % time     seconds  usecs/call     calls    errors syscall
   ------ ----------- ----------- --------- --------- ----------------
     0,00    0,000000           0         1         1 execve
   ------ ----------- ----------- --------- --------- ----------------
   100,00    0,000000           0         1         1 total


   $ strace -c /bin/true
   % time     seconds  usecs/call     calls    errors syscall
   ------ ----------- ----------- --------- --------- ----------------
    54,05    0,000060          15         4           mprotect
    24,32    0,000027          27         1           set_tid_address
    10,81    0,000012          12         1           set_robust_list
     8,11    0,000009           9         1           munmap
     2,70    0,000003           3         1           prlimit64
     0,00    0,000000           0         1           read
     0,00    0,000000           0         2           close
     0,00    0,000000           0         8           mmap
     0,00    0,000000           0         1           brk
     0,00    0,000000           0         4           pread64
     0,00    0,000000           0         1         1 access
     0,00    0,000000           0         1           execve
     0,00    0,000000           0         2         1 arch_prctl
     0,00    0,000000           0         2           openat
     0,00    0,000000           0         2           newfstatat
   ------ ----------- ----------- --------- --------- ----------------
   100,00    0,000111           3        32         2 total

repeated execution seems to be much slower with the empty executable than with /bin/true, as measured with the multitime tool https://tratt.net/laurie/src/multitime/

   $ multitime -n 10 ./empty
   ===> multitime results
   1: ./empty
               Mean        Std.Dev.    Min         Median      Max
   real        0.006       0.000       0.005       0.006       0.006       
   user        0.001       0.001       0.000       0.002       0.002       
   sys         0.000       0.001       0.000       0.000       0.002       
   
   $ multitime -n 10 /bin/true
   ===> multitime results
   1: /bin/true
               Mean        Std.Dev.    Min         Median      Max
   real        0.002       0.000       0.001       0.002       0.003       
   user        0.001       0.000       0.001       0.001       0.001       
   sys         0.000       0.000       0.000       0.000       0.000


> What is also interesting is that while strace seems to report much less overhead in "./empty" compared to /bin/true

That's because it reports an error and does not actually execute the file.

> strace: exec: Exec format error

`strace` uses one of the exec()-style functions that don't automatically fall back to a shell here. And if it did, it would have to start an entire shell (/bin/sh, typically), which might have even more overhead.

----

You need to be very careful when measuring this, because what happens is that a shell executes the shebangless empty file.

If you try launching it from e.g. a C program using execlp(), it will have to first start that shell.

If you try launching it from e.g. bash, it might directly run that in-process or at least after a fork() or similar, skipping some of the shell setup.

So the overhead might depend on context.
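The two behaviours described above can be reproduced from Python, where subprocess does a raw execve() in the child unless shell=True is used (the file name `empty_demo` is made up for the demo; Linux/POSIX semantics assumed):

```python
import os
import subprocess

# Create an empty but executable file: no shebang, no ELF header.
open("empty_demo", "w").close()
os.chmod("empty_demo", 0o755)

# Direct execve(): the kernel rejects it with ENOEXEC, which is the
# same "Exec format error" that strace reported.
try:
    subprocess.run(["./empty_demo"])
    direct_ok = True
except OSError as e:
    direct_ok = False
    print("direct exec failed:", e.strerror)

# Through a shell: on ENOEXEC, sh falls back to running the file as an
# (empty) script, which exits 0 -- but only after paying for the shell.
result = subprocess.run("./empty_demo", shell=True)
print("via shell, exit status:", result.returncode)

os.unlink("empty_demo")
```

So the "./empty runs" result always involves a shell somewhere, which is where the extra time in the multitime numbers goes.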


the full monty:

    alias c='echo -e "\033c" ; stty sane; setterm -reset; reset; tput reset; clear'

It doesn't only clear the screen, but also escapes any "broken" terminal configuration, such as the aftermath of dumping a binary's output to the tty.


One can spot cargo-cult command-line use when two pairs of the commands overlap functionally, and another triplet is the same program under different names.

Also note that RIS is a bad idea, discouraged since 1983 (as I recently discovered), although DEC did not think to tell people outside DEC for half a decade. Use DECSTR instead.


setterm does not work on macOS. I find reset(1) works fine; it already does clear, too.


One of these days, someone will take my portable setterm, which compiles fine on FreeBSD, and compile it on macOS. (-:

* http://jdebp.uk./Softwares/nosh/guide/commands/setterm.xml

