Efficient RLHF: Reducing the Memory Usage of PPO
paper page: huggingface.co/papers/2309.00…
Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible for most practitioners.
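For intuition on where that memory goes, here is a rough back-of-the-envelope sketch (my own illustration, not the paper's method): PPO keeps frozen reference and reward models resident alongside the trainable policy, whereas SFT trains a single model. The model size and per-parameter byte counts below are assumptions chosen only to make the arithmetic concrete.

```python
# Hypothetical weight-only memory accounting for SFT vs. PPO.
# Assumes fp16 weights and gradients plus fp32 Adam moments for trainable
# models; all sizes are illustrative, not measurements from the paper.

N_PARAMS = 7e9        # hypothetical 7B-parameter base model
BYTES_FP16 = 2        # bytes per fp16 weight or gradient
BYTES_ADAM = 8        # bytes per parameter for fp32 Adam moments (m and v)

def trainable_gb(params: float) -> float:
    """Memory for a model being trained: weights + gradients + Adam states."""
    return params * (BYTES_FP16 + BYTES_FP16 + BYTES_ADAM) / 1e9

def frozen_gb(params: float) -> float:
    """Memory for an inference-only model: weights alone."""
    return params * BYTES_FP16 / 1e9

sft = trainable_gb(N_PARAMS)                            # SFT: one trainable model
ppo = trainable_gb(N_PARAMS) + 2 * frozen_gb(N_PARAMS)  # PPO: + frozen ref & reward

print(f"SFT ~{sft:.0f} GB, PPO ~{ppo:.0f} GB ({ppo / sft:.1f}x before activations)")
```

Even this crude weight-only accounting shows a clear gap; in practice, rollout activations and generation KV caches widen it further, which is the overhead the paper's memory-saving techniques target.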