I find it really interesting that it uses a Mamba/Transformer hybrid. Is it the only significant model right now using (at least partly) SSM layers? That must contribute to lower VRAM requirements, right? Does it impact how KV caching works?
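For rough intuition (back-of-the-envelope only; the layer counts and dimensions below are made up, not taken from any actual model): attention layers need a KV cache that grows linearly with context length, while an SSM/Mamba layer carries a fixed-size recurrent state, so a hybrid only pays the cache cost for its attention layers.

```python
# Illustrative memory math; all shapes are hypothetical.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values for every attention layer, every token: grows with seq_len.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_model, d_state, bytes_per_elem=2):
    # Each SSM layer keeps one (d_model x d_state) recurrent state: no growth.
    return n_layers * d_model * d_state * bytes_per_elem

# Hypothetical hybrid: 8 attention layers + 40 Mamba layers.
for seq_len in (4_096, 32_768, 262_144):
    attn = kv_cache_bytes(n_layers=8, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    ssm = ssm_state_bytes(n_layers=40, d_model=4096, d_state=128)
    print(f"{seq_len:>7} tokens: KV cache {attn / 1e9:.2f} GB, SSM state {ssm / 1e9:.3f} GB")
```

At long contexts the KV cache dominates, which is why swapping most attention layers for SSM layers cuts inference VRAM so much.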
Maybe what they should do in the future is just automatically provide AI reviews for all papers, and state that the reviewers' job is to correct any problems and fill in details that were missed. That would encourage manual review of the AI's work, and it would let authors predict, in a structured way, what kind of feedback they'll get. (E.g., if the standard prompt were made public, authors could optimize their submissions for the initial automatic review, forcing the human reviewer to fill in the gaps.)
OK, of course the human reviewers could still use AI here, but then so could the authors, ad infinitum...
A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.
In my case, I was until recently working on TTS, and this was a huge barrier for us. We used all the common signal-quality and MOS-prediction models that judge so-called "naturalness", "expressiveness", etc. But none of them really helped us decide when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on output quality. This made hyperparameter tuning, as well as commercial planning, extremely difficult, and we suffered greatly for it. (Notice my use of the past tense here...)
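To make "correlated poorly" concrete, here's the kind of sanity check I mean, with invented numbers: rank your checkpoints by human rating and by the automated metric, and see whether the rankings agree at all.

```python
# Hypothetical data for illustration; the point is the check, not the numbers.
from scipy.stats import spearmanr

human_mos    = [4.2, 3.8, 4.5, 3.1, 4.0, 3.6]        # team ratings per checkpoint
metric_score = [0.71, 0.74, 0.69, 0.70, 0.73, 0.72]  # automated "naturalness" score

rho, p = spearmanr(human_mos, metric_score)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
# A rho near zero (or negative) means the metric can't rank models the way
# humans do, so it's useless for deciding which checkpoint to ship.
```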
Having good metrics is just key, and I'm now at the point where I'd go so far as to say that if good metrics don't exist, it's almost not even worth working on something. (Almost.)
I would love to read more, but apart from not finding a lot of time lately, when I do read, it's fiction. Occasionally I'll read a textbook on a topic I'm really interested in, and I've read blogs and articles on various sciency themes, but when it comes to books, I've just never been very into non-fiction. I don't try often, but when I do, I get one or two chapters in and just... fail to pick it up again.
I know that non-fiction would be "good for me." Particularly reading more on topics I'm less knowledgeable about, like finance and business and politics. Personal growth. However, I do find that fiction helps expand my perspective and even, somehow, my knowledge, but it's different from non-fiction, less direct. I don't read for that explicitly, although I do like the effect. I read because... I guess because it's nice for my brain to be somewhere else. I don't know. But non-fiction has never done it for me... my mind just gets bored, I think, trying to absorb what someone else wants me to know. Even when I find the topic interesting.
I guess there are people who like non-fiction and people who like fiction, and they often cross over, but I think most people lean one way or the other. I can see positives and negatives on either side. People who read both equally must be rare? Or maybe that's just my impression.
I think this depends heavily on which non-fiction, particularly contrasted with which fiction, you're currently reading.
I don't think reading the same self-help books as a bunch of CEOs who see themselves as bold outsiders to the system will actually benefit you; it didn't make them self-aware.
Fiction contains information and ideas; it helps you expand your horizons, and that's generally a good thing. As long as you're not reading a very limited subset of fiction, it will be beneficial.
Reading science fiction has given me ideas that I would have never had before. I can comfortably say that it has expanded my narrow mind. Even pulp space-opera helped here!
Apart from that, taking the time to grok the architecture or top-rated issues of open source projects helps make you a better developer, or at least avoid obvious mistakes when coding some new feature of your own.
It is a strange phenomenon, though, these walls of text that LLMs output, when you consider that one thing they're really good at is summarization, and that if they're trained on bug-report data, you'd expect them to reproduce its style and conciseness.
Is it mainly post-training that causes this behaviour? They seem to do it for everything, like they're really biased towards super-verbose output these days. Maybe it's something to do with reasoning models being trained to produce longer outputs?
This sounds more correct to me. I've read somewhere that better generalization is usually associated with wider, smoother minima, and that this is why regularization matters: it has a smoothing effect on the loss landscape.
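Here's a toy way to see why flatness should matter for generalization (my own sketch, assuming test loss is roughly the training landscape with the minimum shifted a little): expand the loss to second order around the minimum and compare the penalty at different curvatures.

```python
# Near a minimum w*, L(w* + d) ≈ L(w*) + 0.5 * h * d**2, where h is the
# curvature (the 1D Hessian). If going from train to test shifts the optimal
# weights slightly, a flat minimum (small h) pays a much smaller penalty.
shift = 0.05  # hypothetical train -> test shift of the minimum's location
for h, label in [(1.0, "flat minimum"), (100.0, "sharp minimum")]:
    penalty = 0.5 * h * shift ** 2
    print(f"{label:>13} (h = {h:>5.1f}): extra test loss ≈ {penalty:.5f}")
```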
Yes. This is also not hard to see intuitively from scratch.
Say you have a smooth but highly flexible model y = f(x) and some data points you are fitting with a machine learning algorithm. For whatever reason, the algorithm decides it wants to reduce training error by interpolating some specific point (x0, y0) without negatively affecting training error on nearby points. The direct, guaranteed-successful way to do this is to adjust the model so that f(x0) = y0 exactly, by adding a Dirac delta at x0 and leaving the rest of f exactly as-is. But this cannot be done in a differentiable model, as it would create a discontinuity. The next best thing such a model can actually do is replace the Dirac delta with a smooth but very narrow bump (e.g. a Gaussian). But this narrow bump will inevitably have extremely high curvature at x0: the bump is flat at its peak, yet it has to merge with the neighborhood around x0 within a very short distance.
Think of driving: if you have to change lanes in a very short distance, you're going to have to steer hard. Steering is curvature.
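To put a number on that curvature (my own sketch of the argument above): a Gaussian bump of amplitude A and width sigma has second derivative -A/sigma^2 at its peak, so the curvature explodes as the bump narrows.

```python
import numpy as np

# A narrow Gaussian bump used to interpolate a single point (x0, y0)
# while leaving the rest of the model essentially untouched.
def bump(x, x0=0.0, amplitude=1.0, sigma=0.1):
    return amplitude * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

# The second derivative at the peak is -amplitude / sigma**2; verify it
# numerically with a central difference and watch it blow up as sigma shrinks.
x0, amplitude, eps = 0.0, 1.0, 1e-6
for sigma in (1.0, 0.1, 0.01):
    numeric = (bump(x0 - eps, sigma=sigma) - 2 * bump(x0, sigma=sigma)
               + bump(x0 + eps, sigma=sigma)) / eps ** 2
    print(f"sigma = {sigma:>4}: curvature at x0 ≈ {numeric:.1f} "
          f"(exact: {-amplitude / sigma ** 2:.1f})")
```

Halve the width and the curvature quadruples: exactly the "steer hard in a short distance" picture.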