Scientists want to prevent AI from going rogue by teaching it to be bad first

08-08-2025 • https://www.nbcnews.com

Researchers are trying a seemingly counterintuitive way to "vaccinate" artificial intelligence systems against developing evil, overly flattering or otherwise harmful personality traits: giving them a small dose of those problematic traits.

A new study, led by the Anthropic Fellows Program for AI Safety Research, aims to prevent and even predict dangerous personality shifts before they occur. The effort comes as tech companies have struggled to rein in glaring personality problems in their AI. Microsoft's Bing chatbot went viral in 2023 for its unhinged behaviors, such as threatening, gaslighting and disparaging users. Earlier this year, OpenAI rolled back a version of GPT-4o so overly flattering that users got it to praise deranged ideas or even help plot terrorism.