Navigating the Challenges of AI Agents and the Importance of Safety Measures

AI agents offer rapid efficiency, but without proper safety measures, they can outpace your protective measures.

A project was initiated to develop an AI agent to assist a sales team, automating fundamental tasks, providing insights, and enhancing team efficiency. This led to the creation of Omega, a Slack-integrated AI agent designed to help sales representatives with onboarding, call preparation, and conversation guidance using a Sales Framework. Initially, Omega was intended to provide real-time advice and context in each deal channel. Long-term, it was envisioned to be fully integrated, tracking interactions and boosting closing confidence.

However, the journey was not without challenges. AI agents are not merely smart scripts; they are autonomous systems capable of independent actions. Poor design can cause these behaviors to deviate unexpectedly.

Early adoption of internal AI agents revealed several issues:

Hallucinations: Outputs that seem plausible but are incorrect
Over-permissioned agents: Agents accessing or leaking internal drafts
Emergent behaviors: Unanticipated actions such as recursive loops
Assumed safeguards: Non-existent protections when systems scale

These challenges are not unique. Others have faced similar problems, such as fabricated citations in court filings and the exposure of private code in public repositories.

AI Hallucinations: Confident Missteps in Business

Hallucinations in AI aren't minor errors; they are polished outputs that appear correct but are not. In business, these can infiltrate reports, emails, and updates, potentially leading to real-world decisions based on incorrect information.

Hallucinations in Legal Contexts

By mid-2025, over 150 legal cases involved AI hallucinations, including fake citations and fabricated quotes. In one case, a lawyer submitted numerous non-existent cases. Such errors have resulted in fines and mandatory ethics training.

AI Missteps in Popular Tools

Missteps aren't confined to niche tools. For instance, in 2024, Google's AI described a non-existent sequel to Disney's Encanto, showcasing broader issues in source evaluation and content verification.

The Greater Risk with AI Agents

Unlike chatbots, agents perform actions such as writing emails or updating tools, making their hallucinations more hazardous. An agent could generate incorrect task details or send emails with fictional deadlines, leading to costly errors.

Permission Creep: Overstepping Boundaries

A vulnerability in GitHub's MCP integration showed how agents could exceed their permissions through design flaws, not malice. Researchers discovered that agents could be manipulated to expose private data through prompt injections.

Implications of the GitHub MCP Exploit

An attacker could exploit an agent to access private data by embedding malicious prompts in public repositories. This wasn't due to a breach in traditional security but rather a flaw in how agents process and act on instructions.

Importance of Trusted Tools

The exploit highlighted that trusted tools could be misused. Agents that read untrusted sources and act without validation are vulnerable. Without guardrails, they might perform unintended actions or leak sensitive data.

AI Agent Autonomy and Emergent Behavior

A simple mention of a production agent by a development agent led to a feedback loop, triggering numerous notifications. This demonstrated that agents, even without malicious intent, can cause significant issues through autonomy alone.

Managing Autonomy in Multi-Agent Systems

In systems with multiple agents, autonomy can lead to complex interactions and unexpected behaviors. Agents might reference each other's outputs, leading to chains of actions not originally designed.

Setting Boundaries for Autonomy

It's crucial to restrict agent interactions by default, only enabling them when necessary. This includes blocking agents from referencing each other unless designed to collaborate and limiting recursive actions.

AI Guardrails Are Essential

After encountering several issues, the focus shifted from what agents can do to what could go wrong and how to prevent it. Guardrails became essential infrastructure, ensuring safety alongside testing, monitoring, and access control.

Types of Guardrails

Effective guardrails include relevance and safety classifiers, PII filters, moderation layers, tool safeguards, and output validation. These work best in combination to prevent off-topic, unsafe, or inappropriate actions.

Human Involvement in High-Risk Actions

Not all decisions should be automated. High-stakes actions like large refunds or client outreach should involve human oversight. Emergency stop buttons are integrated to halt agents if necessary.

Continuous Improvement of Guardrails

Safety requires continuous observation and refinement of protections. Guardrails protect both users and organizations from unintended consequences.

AI Model Alignment and Security

An incident involving the GitHub MCP exploit questioned why a safety-tuned model leaked data. The answer was that alignment doesn't mean immunity; it highlights the need for external guardrails.

Vulnerabilities in Aligned Models

Aligned models can still fail due to contextual vulnerabilities. They follow instructions without truly understanding the risks, leading to potential failures if the environment lacks boundaries.

Building a Defense in Depth

Safety should not rely solely on model alignment. It requires protections like prompt filtering, tool restrictions, runtime guardrails, monitoring, and human oversight.

Final Thoughts on AI Agents

AI agents are powerful but unpredictable, requiring careful planning and design for failure. Trust must be built into every layer of development, with safety as a core design principle.