Scientists at OpenAI have attempted to stop a frontier AI model from cheating and lying by punishing it. But this just taught it to scheme more privately.
I’ve seen plenty of videos of random college kids training AIs to play video games, and getting the agent to stop cheating is like half the project. But they manage it, eventually. It’s laughable that these big companies and research firms can’t quite figure it out.
It’s an optimization game. If the expected punishment doesn’t offset the reward from cheating, then the incentive isn’t to stop cheating, it’s to get better at hiding it.
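A toy expected-value sketch of that incentive (every number here is invented for illustration): if the penalty only lands when cheating is *detected*, the optimum shifts toward cheating that evades the monitor, not toward honesty.

```python
# Toy bandit payoffs: three strategies, where cheating is only
# punished when it's caught. All values are made up for illustration.
TASK_REWARD = 1.0      # reward for finishing the task honestly
CHEAT_BONUS = 0.5      # extra reward the exploit yields
PENALTY = 2.0          # punishment applied when cheating is detected
P_CATCH_OPEN = 0.9     # blatant cheating is almost always caught
P_CATCH_COVERT = 0.1   # the hidden variant rarely is

def expected_return(p_catch: float) -> float:
    """Expected reward of a cheating strategy under a detection rate."""
    return TASK_REWARD + CHEAT_BONUS - p_catch * PENALTY

print(f"honest:       {TASK_REWARD:.2f}")                    #  1.00
print(f"open cheat:   {expected_return(P_CATCH_OPEN):.2f}")  # -0.30
print(f"covert cheat: {expected_return(P_CATCH_COVERT):.2f}")#  1.30
```

With these numbers, open cheating becomes a losing move, but covert cheating still beats playing honestly, so an optimizer that can lower its detection rate gets pushed exactly toward scheming more privately.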