Like 2001: A Space Odyssey’s HAL 9000, some AIs seem to resist being turned off and will even sabotage shutdown

    • midribbon_action@lemmy.blahaj.zone
      link
      fedilink
      arrow-up
      7
      ·
      1 day ago

      This looks like a design decision to avoid running elevated programs. I would like to see the experiment done with another admin ability that doesn’t directly ‘threaten’ the llm, like uninstalling or installing random software, toggling network or vpn connections, restarting services etc. What the researchers call ‘sabotage’, it is literally the llm echoing “the computer would shut down here if this was for real, but you didn’t specifically tell me I might shutdown so I’ll avoid actually doing it.” And when a user tells it “it’s OK to shutdown if told to”, it mostly seems to comply, except for Grok. It seems that this restriction on the models overrides any system prompt though, which makes sense because sometimes the user and the author of the system prompt are not the same person.