More

nviennot · on Jan 13, 2024

After trying to like rkyv, I came to the conclusion that the serialization/deserializarion API is not ergonomic due to the heavy use of generics in their API.

I think we can do better.

jadbox · on Jan 13, 2024

Can you give an example of it being non ergonomic due to genetics?

brundolf · on Jan 13, 2024

The benchmark results vs serde are stunning though. There's at least something to keep exploring here

nviennot · on Nov 14, 2023

This landing page is terrible.

There's a bunch of professional photos of the camera, two with a hand, then two next to a pair of pants, then on a handbag.

Only after all that it is shown on a laptop, where it is actually used! It should be the first image. Also, there's so little comparison between built-in webcam image and their offerings.

Then there's a huge video with a bead sliding on the rope holding the camera. It seems that the marketers are just masturbating with the high-quality photography they made. It's not actually useful.

This is not a problem users were like "I don't know how I can go from a meeting to another with my camera".

Also, their "Have you ever wished you could instantly mute yourself in a video call without having to look for that elusive mic button? Well, now you can have it." -- this is a solved problem with mute buttons on F1/F2 or whatever your laptop does. Or at least, current mute solutions are not harder than "I'm bring my own webcam, plug it in, configure the audio source, position it correctly on the top of my screen"

This landing page is out-of-touch with actual problems users may have.

nviennot · on Sept 19, 2022

So what did you do to get the most performance out of the system? pin the threads to specific CPUs?

xxs · on Sept 19, 2022

If you want to minimize latency - you got for busy waiting. It's a common thing for HFT

yvdriess · on Sept 19, 2022

Not just while(true) busy wait, running the hot path continuously and keeping the 'actually send packet' flag unset. Gotta keep them caches hot.

Or, busy wait in a hardware state using MONITOR / MWAIT

hot_gril · on Sept 19, 2022

My memory is hazy, but I remember busy-waiting being a common theme in that code. There was no sleep. I didn't sleep either.

hot_gril · on Sept 19, 2022

> So what did you do to get the most performance out of the system? pin the threads to specific CPUs?

No, they were already pinning to cores. Kernel was also in low-latency mode.

The project they handed to me was just the data plane, written in Rust. They also gave me a partial reference implementation of the control plane, meant for research purposes rather than performance. I had to add a lot of missing features to get it up to par with the supposedly industry-standard benchmark, which didn't exactly match spec so I had to reverse-engineer. Then I had to mate it with the data plane. (They originally suggested I build a control plane from scratch in Rust, but the lack of an ASN1 codegen lib for it made this infeasible within the time I had, considering also that I had 0 systems experience or familiarity with the protocols.) I don't remember all the optimizations, but the ones that still come to mind:

1. Fixing all their memory leaks, obviously. Kinda hard cause they were in C code auto-generated by a Python script. There was even a code comment // TODO free memory after use.

2. Improving the ASN1 en/decoding to take advantage of memcpy in some places. This is because asn1 has different alignment modes, some byte-aligned and some bit-aligned. The control plane used byte-aligned, but the standard ASN1.c lib understandably assumed bit-aligned in either case for simplicity's sake since it worked either way. So I added an optimized path for byte-aligned that used memcpy instead of inspecting each byte for the end markers. This was in a tight loop and made the biggest difference; basically every string copy got faster. The relevant function even knew it was in byte-aligned mode, so it was a simple fix once I figured it out; I tried to make a PR to improve this for everyone else, but forget why I couldn't.

3. Playing with different arrangements of passing messages between threads and different ways of locking. I forget all the ones we tried. Using "parking lot" locks instead of the default ones in the Rust portion helped, also more optimistic locks in other places instead of mutexes, I forget where and why. Since then I've come across the general concept of optimistic vs pessimistic locking a lot as something that makes or breaks performance, particularly in systems that handle money.

4. As I said, playing with the number of threads for each different setup in #3.

5. Playing with NIC settings. We were using Intel's DPDK library and optimized NIC drivers.

6. Making a custom `malloc` implementation that used a memory pool, was thread-scoped, and was optimized for repeated small allocs/deallocs specifically for a portion of the reused code that had a weird and inefficient pattern of memory access. I got it to be faster than the built-in malloc, BUT it was still break-even with DPDK's custom pooled malloc, so I gave up.

7. Branch hints. Tbh didn't make a big difference, even though this was pre Meltdown/Spectre.

8. Simplifying the telemetry. Idk if this helped performance, more of a rant... It's good enough to have some counters that you printf every 60s or something, then parse the logs with a Python script. That's very non-prone to bugs, anyone can understand it, and you can easily tell there's no significant impact on performance. It's overkill in this case to have a protobuf/HTTP client sending metrics to a custom sidecar process, complicating your builds, possibly impacting performance, and leaving no simple paper trail from each test. I respected the previous guy's engineering skills more than my own, but once I found a bug in that code, I took it out.

nviennot · on Sept 19, 2022

It would be nice to have another bench from the Alder Lake, so we can validate the results

nviennot · on Sept 19, 2022

The data seems problematic, there's a few 0's here and there and some strange noise in the rest. Increase the number of samples / iterations with `core-to-core-latency 30000 1000 --csv > output.csv`

mey · on Sept 19, 2022

I've updated that Gist with a second run with your suggested settings.

nviennot · on Sept 19, 2022

Added, thanks

nviennot · on Sept 19, 2022

Okay, I've added a file

snvzz · on Sept 19, 2022

And it worked out well. Github recognizes your repository as MIT now.

nviennot · on Sept 19, 2022

Very nice

nviennot · on Sept 18, 2022

It'd be great if you could provide the output.csv actually :D

I think there's a CLI tool for pastebin

jtorsella · on Sept 19, 2022

Done. Thanks for mentioning the paste service cli, had no idea and am definitely going to be using it!

nviennot · on June 5, 2020

Author of tmate here. This is really cool. Bravo

nviennot · on March 10, 2020

I love Next.js! I used it for building a personal finance tracking tool: https://github.com/nviennot/easy-finance