You are right - there will be a big difference between these two workloads. As I am 100% sure @kragen already knows, this is arguably the first, smallest step toward realizing just how many possible workloads you might want to model. :-) And toward realizing why senior engineers say pithy things like "the best benchmark is your actual application".
While that is true (and I've said it myself), one reason I like the design of a little "workload interpreter shell" as in https://github.com/c-blake/bst/blob/main/test/bst-interact.c (or maybe better https://github.com/c-blake/adix/blob/master/tests/btshell.ni... which has a preprocessor) is that you can use them to run "isolated tests" as a kind of "half step" between a pure hot loop dumb thing and a full-on application with all its attendant noise and self-interaction/self-disturbance.
So, for example, in this allocation problem setting you could write a similar 30-line "shell/interpreter" that interprets a binary stream of "commands" or allocation requests. The origin of said binary stream could be a recording/log from some actual app you have doing something actually useful. Then you could replay that one recording in the "shell" under different configurations, as @kragen mentions, to study the allocator performance (both space & speed) in isolation. At that point you could start to say pretty meaningful things, from replaying logs many times, about this or that tradeoff on this|that workload.
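To make the "binary stream of commands" concrete, here is roughly what the log records and the recording hook might look like in C. The layout and the names (alog_record, alog_emit, the op encoding) are just illustrative assumptions, not anything taken from bst-interact.c or btshell.nim:

    /* Hypothetical 8-byte log record; this whole layout is made up for
       illustration. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t op;    /* 0 = alloc, 1 = free */
        uint32_t arg;   /* alloc: requested size; free: index of the earlier alloc */
    } alog_record;

    /* Append one record to the log. You would call this from the real
       application's allocation wrappers while it does its useful work. */
    static void alog_emit(FILE *log, uint32_t op, uint32_t arg) {
        alog_record r = { op, arg };
        fwrite(&r, sizeof r, 1, log);
    }

The replay "shell" is then just a loop that walks an array of these records and dispatches on op.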
EDIT: Of course, at that point you start to get into debates about "how representative" different workloads are, but at least you could potentially share your log with someone who had a different computer and have them run benchmarks, too, and at least the workload is not synthetic but relates to some actual possible program. Moving from synthetic benchmarks to log-based ones was a big thing in OS research on filesystems in the 1970s & 80s and it helped the "generalizability of results" (I think, anyway..I'm sure many still do only synthetic benchmarking due to its convenience).
EDIT2: I just thought of a cute name for the log replay half-step between synthetic microbenchmarks and a full application - "The millibenchmark". ;-) Maybe someone else has thought of this before, though.
Recording traces of allocation requests from actual applications for allocator-benchmarking purposes revolutionized allocator performance research in the late 90s, if I recall correctly, in addition to the earlier filesystem research you mentioned. Benchmarks based on those traces were the bedrock of Zorn's case that generational GCs were faster in practice than malloc().
But I think (though possibly this is motivated reasoning) that benchmarking some workload is good enough for many optimizations. A fixed-size-block allocator is already pretty restricted in what workloads you can run on it, after all, and its performance characteristics are a lot simpler than something like first-fit. Showing that an optimization makes some workload faster is infinitely better than not knowing if it makes any workload faster. Showing that it generalizes across many workloads is better, of course, but it's not nearly the same step in utility.
Well, you can use it for everything. I've used it for in-memory hash tables, trees, etc. To be clear, for those who may not understand, the "interpretation" in these "millibenchmarks" for something as simple as an allocator is not slow like Python or Bash. It could literally be "mmap a binary file of integers and alloc/free according to the binary numbers". So, yeah, there might be some CPU pipeline overhead in unpredictable branches that you might still want to try to measure, but beyond that there is basically no work, and any real workload might have one-ish hard-to-predict branch leading to allocator operations anyway. So, even unadjusted for overhead it's not so bad. And it keeps your measurements fast so you can get statistically many of them, even if there is an "inner min loop" to try to isolate the best case or something. (Well, at least if what you are measuring is fast.) The downside/tradeoff is just lack of fidelity to interaction effects in the full application like resource competition/branch predictor disruption/etc.
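As a sketch of just how little "interpretation" that is, here is a replay loop assuming the hypothetical record layout from the earlier sketch, with plain malloc/free standing in for whatever allocator is actually under test:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    typedef struct { uint32_t op, arg; } alog_record;   /* hypothetical, as above */

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s LOGFILE\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[1]); return 1; }
        alog_record *rec = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (rec == MAP_FAILED) { perror("mmap"); return 1; }
        size_t n = st.st_size / sizeof *rec;
        void **live = calloc(n, sizeof *live);   /* slot i = pointer from the i-th alloc */
        size_t slot = 0;
        for (size_t i = 0; i < n; i++)           /* this loop is the entire "interpreter" */
            if (rec[i].op == 0)
                live[slot++] = malloc(rec[i].arg);   /* swap in the allocator under test */
            else
                free(live[rec[i].arg]);              /* swap in the allocator under test */
        return 0;
    }

Any timing (including an "inner min loop") would just wrap that for loop.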
You could probably set up a https://github.com/c-blake/nio file for it (or some other format) if you wanted to be systematic. A problem with the "text-newline" part of the "unix philosophy" is that it incurs a lot of unneeded binary->ascii->binary cycles that are actually more programming work as well as more CPU work, and it's really fairly orthogonal to the rest of the Unix ideas.
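A tiny self-contained illustration of those cycles, again using the hypothetical record from above: the binary path round-trips the struct untouched, while the text-newline path pays a format plus a re-parse per record (and even the format strings are fussier):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t op, arg; } alog_record;   /* hypothetical, as above */

    int main(void) {
        alog_record r = { 0, 4096 }, back = { 0, 0 };

        FILE *b = tmpfile();                      /* binary: no conversion at all */
        fwrite(&r, sizeof r, 1, b);
        rewind(b);
        fread(&back, sizeof back, 1, b);

        FILE *t = tmpfile();                      /* text: binary -> ascii -> binary */
        fprintf(t, "%" PRIu32 " %" PRIu32 "\n", r.op, r.arg);
        rewind(t);
        fscanf(t, "%" SCNu32 " %" SCNu32, &back.op, &back.arg);

        printf("%" PRIu32 " %" PRIu32 "\n", back.op, back.arg);
        return 0;
    }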
EDIT reply to parent EDIT
> benchmarking some workload is good enough
Fair enough, except that "good enough" is sort of "famous last words" if anyone is reviewing your work. :-) :-) Some will complain you only did Intel not AMD or not ARM or not xyz or Apple Silicon or whatever other zillion variables. Something is always better than nothing, though. It should be thought of more like a Bayesian thing - "no surprise for untested situations" or maybe "confidence in proportion to evidence/experience".
FWIW, I thought your big advice to 8dcc was pretty decent.