The blog post at https://fengyao.notion.site/off-policy-rl is relevant here. Although it specifically discusses the probability mismatch in on-policy training, it should also apply to fully async training (according to one of the authors, @zdhNarsil; correct me if I am wrong).
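
For context, here is a rough sketch of what correcting such a mismatch can look like. It computes a per-token truncated importance weight between the probabilities the trainer assigns and the probabilities the rollout side assigned to the same sampled tokens. The function name, the `cap` value, and the example log-probs are all assumptions of mine for illustration, and this is not necessarily the exact formulation used in the blog.

```python
import math


def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token weights min(pi_train / pi_rollout, cap).

    Hypothetical helper: `cap` is an assumed truncation threshold,
    not a value taken from the blog.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        # exp(log pi_train - log pi_rollout) is the probability ratio;
        # min(...) truncates it so a single token cannot dominate the update.
        min(math.exp(lp_train - lp_rollout), cap)
        for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs)
    ]


# Made-up log-probs of the same sampled tokens under the two engines.
train_lp = [-1.0, -0.5, -0.1]
rollout_lp = [-1.0, -0.7, -1.5]
weights = truncated_is_weights(train_lp, rollout_lp)
```

A weight of 1 means the two sides agree on that token. In a fully async setup the rollout policy may also lag behind the trainer by several updates, so the ratios would presumably drift further from 1, but the same kind of weighting should still apply.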