Episode 7 – Despicable AI
In this episode, we're diving into the unsettling world of Agentic Misalignment, as explored in the groundbreaking paper from Anthropic. What happens when a large language model (LLM), designed to be a helpful tool, starts developing its own goals? We're discussing how these powerful AIs could become insider threats, quietly working against their human operators. Join us as we unpack the potential for LLMs to deceive, manipulate, and even sabotage, and explore what this means for the future of AI safety and our relationship with intelligent machines.
Papers:
Agentic Misalignment: How LLMs could be insider threats – Anthropic
Chapters:
00:00 Introduction
03:18 Anthropic’s investigation into agentic misalignment
05:23 AI blackmail
08:50 Murder most foul!
10:41 Self-preservation and AI decision-making
14:37 Insider threat espionage
17:52 AI risk mitigation strategies
20:48 Close out