Hacker News

> The blog seems to indicate it is using LoRA, so we should remove the backward param pass from the equation above. The backward pass on parameters only applies to the adapter weights.

The backward pass still runs on the non-adapter weights. But yeah, 10 TFLOPs/GPU, especially at tiny sequence lengths, is very bad compared to what you can get on Nvidia. And I believe the gap would be even larger at long sequence lengths.
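The distinction behind this subthread can be made concrete with the standard back-of-envelope FLOP accounting for transformer training. A rough sketch, with illustrative numbers that are assumptions and not from the thread (the `6P`-per-token rule splits into forward, backward-activation, and backward-weight passes):

```python
# Back-of-envelope training FLOPs per token, using the common 6P rule.
# P is an assumed parameter count for illustration, not from the thread.
P = 7e9  # assumed model size (7B parameters)

fwd        = 2 * P  # forward pass
bwd_act    = 2 * P  # backward w.r.t. activations (needed to propagate gradients)
bwd_weight = 2 * P  # backward w.r.t. weights (only needed for trainable params)

full_finetune = fwd + bwd_act + bwd_weight  # ~6P per token
# Under LoRA, the weight-gradient pass only runs on the tiny adapters,
# so it is negligible relative to the frozen base model:
lora = fwd + bwd_act                        # ~4P per token

print(full_finetune / lora)  # ratio of full fine-tuning to LoRA compute
```

On this accounting, skipping the weight-gradient pass on frozen weights saves roughly a third of the training compute (6P vs 4P per token); it does not remove the backward pass entirely, since activation gradients must still flow through the frozen layers.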



The backward pass for activations does run through them, but typically not the backward pass for weight gradients.

Why compute gradients with respect to weights that aren't going to be updated?
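This is exactly how autograd frameworks behave when base weights are frozen. A minimal PyTorch sketch (the layer sizes and structure are illustrative assumptions): gradients still propagate *through* the frozen layer to earlier trainable parameters, but no weight gradient is computed for the frozen layer itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

early = nn.Linear(16, 16)  # trainable layer upstream of the frozen one
base = nn.Linear(16, 16)   # stands in for a frozen pretrained weight
for p in base.parameters():
    p.requires_grad_(False)  # freeze: autograd skips its weight gradients

x = torch.randn(4, 16)
loss = base(early(x)).sum()
loss.backward()

# Weight-gradient pass skipped for the frozen layer:
print(base.weight.grad)              # None
# But activation gradients still flowed through it to the earlier layer:
print(early.weight.grad is not None) # True
```

So freezing removes the backward-weight FLOPs for those layers, while the backward-activation FLOPs remain, which is the distinction being drawn above.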



