
No, they're usually applied at each attention layer.


Do you know when this was introduced (or which paper)? AFAIK it's not done that way in the original transformer paper or in BERT/GPT-2.


All the Llamas have done it (well, 2 and 3, and I believe 1; I don't know about 4). I think they have a citation for it, though it might just be the RoPE paper (https://arxiv.org/abs/2104.09864).

I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer paper, and I haven't read the GPT-2 paper in a while, so I'm not sure about that one either).
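
For concreteness, here's a toy numpy sketch of the two patterns (my own illustration, not any of these models' actual code): absolute sinusoidal/learned embeddings get added to the input once before the first layer, while RoPE rotates the queries and keys inside every attention layer.

    import numpy as np

    SEQ, DIM = 16, 64
    rng = np.random.default_rng(0)

    def sinusoidal_table(seq_len, dim, base=10000.0):
        # "Attention Is All You Need"-style table, added to the token embeddings once.
        ang = np.arange(seq_len)[:, None] / base ** (np.arange(0, dim, 2) / dim)
        table = np.zeros((seq_len, dim))
        table[:, 0::2], table[:, 1::2] = np.sin(ang), np.cos(ang)
        return table

    def rope(x, base=10000.0):
        # RoPE: rotate each (even, odd) feature pair of q or k by a
        # position-dependent angle; this happens inside every attention layer.
        seq_len, dim = x.shape
        ang = np.arange(seq_len)[:, None] / base ** (np.arange(0, dim, 2) / dim)
        out = np.empty_like(x)
        out[:, 0::2] = x[:, 0::2] * np.cos(ang) - x[:, 1::2] * np.sin(ang)
        out[:, 1::2] = x[:, 0::2] * np.sin(ang) + x[:, 1::2] * np.cos(ang)
        return out

    # Original transformer / BERT / GPT-2 style: position info enters exactly once.
    tokens = rng.standard_normal((SEQ, DIM))
    h = tokens + sinusoidal_table(SEQ, DIM)   # BERT/GPT-2 use a learned table instead

    # Llama style: nothing added at the input; each layer rotates its own q and k.
    q, k = rng.standard_normal((SEQ, DIM)), rng.standard_normal((SEQ, DIM))
    q_rot, k_rot = rope(q), rope(k)           # repeated inside every attention layer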


Thanks! I'm not super up to date on all the ML stuff :)


Should be in the RoPE paper. The OG transformer used additive sinusoidal embeddings (added to the input once), while RoPE applies a pairwise rotation to the queries and keys in each attention layer.

There's also NoPE; I think SmolLM3 "uses NoPE" (i.e. doesn't apply any positional encoding) on every fourth layer.
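
If it helps, a rough self-contained sketch of that pattern (my guess at the idea, not SmolLM3's actual code): the rotary step is just skipped on some layers, so those layers see no positional signal at all.

    import numpy as np

    def rope(x, base=10000.0):
        # Pairwise (even, odd) rotation as in the sketch upthread,
        # repeated here so this snippet runs on its own.
        seq_len, dim = x.shape
        ang = np.arange(seq_len)[:, None] / base ** (np.arange(0, dim, 2) / dim)
        out = np.empty_like(x)
        out[:, 0::2] = x[:, 0::2] * np.cos(ang) - x[:, 1::2] * np.sin(ang)
        out[:, 1::2] = x[:, 0::2] * np.sin(ang) + x[:, 1::2] * np.cos(ang)
        return out

    def attention(q, k, v, use_rope):
        if use_rope:                          # NoPE layers skip the rotation entirely
            q, k = rope(q), rope(k)
        s = q @ k.T / np.sqrt(q.shape[-1])
        w = np.exp(s - s.max(-1, keepdims=True))
        return (w / w.sum(-1, keepdims=True)) @ v

    x = np.random.randn(16, 64)
    for layer_idx in range(12):
        # illustrative schedule: no positional encoding on every fourth layer
        x = attention(x, x, x, use_rope=(layer_idx % 4 != 3))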


This is normal. RoPE was introduced after BERT/GPT-2.



