[Major Review] Deep Reinforcement Learning for Robot Manipulation - Zhihu
One amusing takeaway from doing RL in these massively parallelized sim environments is that reward engineering matters more than ever.
A small detail in the reward function could make a huge difference: with 10k+ parallel threads to explore in, the policy will exploit any caveat and find a shortcut to high rewards and no...
Mandi Zhao (x.com)
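A minimal sketch of the failure mode described above, assuming a vectorized reach task; the names (`compute_reward`, `ee_pos`, `target_pos`) are hypothetical and not from the thread. A dense shaping term alone is easy for a massively parallel policy to farm, so a sparse success bonus plus a small action penalty is one common way to close the loophole.

```python
import torch

def compute_reward(ee_pos, target_pos, actions, success_threshold=0.02):
    """Vectorized reward over N parallel envs.

    ee_pos, target_pos: (N, 3) tensors; actions: (N, A) tensor.
    """
    dist = torch.norm(ee_pos - target_pos, dim=-1)  # (N,)
    # Naive dense shaping term: reward for merely being near the target.
    # With 10k+ parallel envs, policies quickly find any loophole in this term.
    shaping = 1.0 / (1.0 + dist)
    # Sparse outcome term: only pays when the task is actually solved.
    success = (dist < success_threshold).float()
    # Small action penalty discourages jittering just inside the threshold.
    action_cost = 0.01 * torch.sum(actions ** 2, dim=-1)
    return shaping + 5.0 * success - action_cost
```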
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Shows that:
- RL generalizes in rule-based envs, esp. when trained with an outcome-based reward (sketched after this list)
- SFT tends to memorize the training data and struggles to generalize OOD https://t.co/RLY7qf3ZjX
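A minimal sketch of what an outcome-based reward can look like in a rule-based environment; this is an assumed setup for illustration, not the paper's code. The rollout is scored only on whether the final answer satisfies the rule, with no credit for intermediate steps.

```python
# Hypothetical outcome-based reward for a rule-based task: score only the
# final answer, not the trajectory that produced it.
def outcome_reward(predicted_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer is exactly right, else 0.0."""
    return 1.0 if predicted_answer.strip() == ground_truth.strip() else 0.0

# Example: score a batch of rollouts by their final answers only.
rollouts = ["24", "17", "24"]
rewards = [outcome_reward(a, "24") for a in rollouts]  # [1.0, 0.0, 1.0]
```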