An introduction to performing RLHF on LLMs from scratch, using PPO and DPO.
I go through the entire RLHF rabbit hole and learning curve: where it began, the popular techniques and the math behind them, its applications, and its general role in large language models.
I hope this notebook is helpful to you, and it also serves as my own reference notes.
Thanks! Feel free to contribute.