I feel like both this comment and the parent comment highlight how RL has been g...

mistercheph · 2025-11-26T06:55:02 1764140102

care to correct the misunderstanding?

mountainriver · 2025-11-26T14:38:08 1764167888

I mean DPO, PPO, and GRPO all use losses that are not what’s used with SFT for one.

They also force exploration as a part of the algorithm.

They can be used for synthetic data generation once the reward model is good enough.

phyalow · 2025-11-26T09:41:02 1764150062

Its reductive, but also roughly correct.

singularity2001 · 2025-11-26T14:03:23 1764165803

While collecting data according to policy is part of RL, 'reductive' is an understatement. It's like saying algebra is all about scalar products. Well yes, 1%