More

ks2048 · 2026-02-15T00:41:52 1771116112

People need to abandon the notion of "trust" being a single axes between trustworthy to untrustworthy.

Every source has it's biases, you should try to be aware of them and handle information accordingly.

sejje · 2026-02-15T02:44:56 1771123496

I prefer when the bias is "we don't run xyz story" vs "we run a slanted version of xyz story."

They're both a bias, of course, but one is more palatable.

ks2048 · 2026-02-13T22:34:45 1771022085

I'll recommend Jungles of Stone - the story of explorers Stephens and Catherwood - the first Europeans to document and explore the sites of the ancient Maya.

ks2048 · 2026-02-11T23:47:09 1770853629

This looks like a nice rundown of how to do this with Python's zstd module.

But, I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and different practical constraints).

Back in 2023, I wrote two blog-posts [0,1] that refused the results in the 2023 paper referenced here (bad implementation and bad data).

[0] https://kenschutte.com/gzip-knn-paper/

[1] https://kenschutte.com/gzip-knn-paper2/

duskwuff · 2026-02-12T02:03:24 1770861804

Concur. Zstandard is a good compressor, but it's not magical; comparing the compressed size of Zstd(A+B) to the common size of Zstd(A) + Zstd(B) is effectively just a complicated way of measuring how many words and phrases the two documents have in common. Which isn't entirely ineffective at judging whether they're about the same topic, but it's an unnecessarily complex and easily confused way of doing so.

andai · 2026-02-12T11:50:23 1770897023

If I'm reading this right, you're saying it's functionally equivalent to measuring the intersection of ngrams? That sounds very testable.

duskwuff · 2026-02-12T19:32:46 1770924766

Mostly. There's also confounding effects from factors like the length of the texts - e.g. when compressing Zstd(A+B), it's more expensive to encode a backreference in B to some content in A when the distance to that content is longer, so longer texts will appear less similar to each other than short texts.

srean · 2026-02-12T07:30:22 1770881422

I do not know inner details of Zstandard, but I would expect that it to least do suffix/prefix stats or word fragment stats, not just words and phrases.

Jaxan · 2026-02-12T10:59:01 1770893941

The thing is that two English texts on completely different topics will compress better than say and English and Spanish text on exactly the same topic. So compression really only looks at the form/shape of text and not meaning.

srean · 2026-02-12T11:53:06 1770897186

Yes of course, I don't think anyone will disagree with that. My comment had nothing to do with meaning but was about the mechanics of compression.

That said, lexical and syntactic patterns are often enough for classification and clustering in a scenario where the meaning-to-lexicons mapping is fixed.

The reason compression based classifiers trail a little behind classifiers built from first principles, even in this fixed mapping case, is a little subtle.

Optimal compression requires correct probability estimation. Correct probability estimation will yield optimal classifier. In other words, optimal compressors, equivalently correct probability estimators are sufficient.

They are however not necessary. One can obtain the theoretical best classifier without estimating the probabilities correctly.

So in the context of classification, compressors are solving a task that is much much harder than necessary.

duskwuff · 2026-02-12T09:01:25 1770886885

It's not specifically aware of the syntax - it'll match any repeated substrings. That just happens to usually end up meaning words and phrases in English text.

D-Machine · 2026-02-12T06:01:54 1770876114

Yup. Data compression ≠ semantic compression.

shoo · 2026-02-12T00:05:02 1770854702

Good on you for attempting to reproduce the results & writing it up, and reporting the issue to the authors.

> It turns out that the classification method used in their code looked at the test label as part of the decision method and thus led to an unfair comparison to the baseline results

Lemaxoxo · 2026-02-12T13:40:24 1770903624

Author here. Thank you very much for the comment. I will take a look. This is a great case of Cunningham's law!

ks2048 · 2026-02-11T17:01:37 1770829297

I've been trying different OCR models on what should be very simple - subtitles (these are simple machine-rendered text). While all models do very well (95+% accuracy), I haven't seen a model not occasionally make very obvious mistakes. Maybe it will take a different approach to get the last 1%...

TZubiri · 2026-02-12T00:45:06 1770857106

"95+% accuracy"

That doesn't sound great

ks2048 · 2026-02-12T03:46:32 1770867992

I don't have the numbers right here, but roughly 95% subtitles correct and 99% characters correct (but roughly all of those errors are obvious to human labeler).

ks2048 · 2026-02-10T18:31:51 1770748311

Should this work on a 16GB M3 MacBook Pro? It starts to load, but hangs or is too slow.

ks2048 · 2026-02-10T03:52:43 1770695563

I'm writing a Parquet (file format) viewer for macOS with Finder Preview. Will definitely release in the macOS app store before end of February.

ks2048 · 2026-02-06T06:22:51 1770358971

I wonder if jmail (https://www.jmail.world/) has worked on this?

I tried to find the message in this blog post, but couldn't. (don't see how to search by date).

ks2048 · 2026-02-05T20:45:04 1770324304

It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human written prompts (?).

First 10 commits, "git log --all --pretty=format:%s --reverse | head",

  Initial commit: empty repo structure
  Lock: initial compiler scaffold task
  Initial compiler scaffold: full pipeline for x86-64, AArch64, RISC-V
  Lock: implement array subscript and lvalue assignments
  Implement array subscript, lvalue assignments, and short-circuit evaluation
  Add idea: type-aware codegen for correct sized operations
  Lock: type-aware codegen for correct sized operations
  Implement type-aware codegen for correct sized operations
  Lock: implement global variable support
  Implement global variable support across all three backends

c-linkage · 2026-02-06T14:09:25 1770386965

That's crazy to me. At this point, I don't even know if the git commit log would be useful to me as a human.

Maybe it's just me, but I like to be able to do both incremental testing and integration testing as I develop. This means I would start with the lexer and parser and get them tested (separately and together) before moving on to generating and validating IR.

It looks like the AI is dumping an entire compiler in one commit. I'm not even sure where I would begin to look if I were doing a bug hunt.

YMMV. I've been a solo developer for too many years. Not that I avoided working on a team, but my teams have been so small that everything gets siloed pretty quickly. Maybe life is different when more than one person works on the same application.

ks2048 · 2026-02-05T19:17:12 1770319032

This is surely just the tip of the iceberg of what is going on in the CIA at the moment. Senator Ron Wyden just sent a mysterious public letter about concerns about what they are doing.

https://thehill.com/homenews/senate/5724300-ron-wyden-cia-le...

axus · 2026-02-05T19:46:38 1770320798

Whenever there's a mystery, apply the scientific method to investigate it. Form a hypothesis, an experiment or test , then record the results and check if they support.

Hypothesis: CIA is hacking reporters to determine their government sources.

If we start seeing more government sources exposed, we haven't proven it but it supports the hypothesis.

Hypothesis: State election systems are being compromised for federal monitoring and control.

If we start seeing more improbable results in one direction, that is support for the hypothesis.

octoberfranklin · 2026-02-06T03:11:30 1770347490

> apply the scientific method to investigate it.

Great where do I find a spare identical copy of the CIA to use as the control group?

tremon · 2026-02-05T20:56:53 1770325013

The CIA's primary remit is outside of their own country. If the CIA is turning their focus inward, that's actually good news for the remainder of the civilized world.

mmooss · 2026-02-05T19:37:44 1770320264

There's this from 2022, but there are probably many concerns from Wyden:

https://apnews.com/article/congress-cia-ron-wyden-martin-hei...

ks2048 · 2026-02-01T00:59:56 1769907596

I’m curious about SPM vs Cargo - are there basic design decisions that make Cargo better or just particularities of thier current states?

Imustaskforhelp · 2026-02-01T02:11:39 1769911899

I also wish to ask given that uv from Python learnt from cargo,npm and other things. Can it be possible for SPM to have an alternative similar to how uv got made for python too?

(I am not familiar with swift which is why I am asking this question if there's a really big difference between Swift package manager and cargo in the first place to warrant a uv like thing for swift ecosystem. I think someone recently posted a uv alternative for ruby)

Rv, a new kind of Ruby management tool: https://news.ycombinator.com/item?id=45023730

saagarjha · 2026-02-01T09:32:46 1769938366

Sure but not very many people care about SPM