AI Agents in the Real World: From Prototypes to Production

Agentic AI system development is rapidly moving from theoretical prototypes to practical, production-grade tools. But what does it take to move past the hype and successfully deploy this technology at scale? 

Beetroot’s recent online webinar brought together product and tech leaders who are actively building and deploying these autonomous systems to uncover the technical hurdles, key security considerations, and essential mind shifts required for success. The panel featured Manuel Morales (Fractional CTO at NUA and Kwiziq), Kevin Östlin (Co-founder of Andsend), and Oliver King-Smith (Founder of smartR AI). 

The discussion was hosted by Sebastian Streiffert, Chief Growth Officer at Beetroot. Today’s article recaps the core insights from our event for those who missed the live session or need a structured refresher on the key takeaways.

What Is Agentic AI?

Unlike traditional smart chatbots that simply generate text to answer a user’s query, an agentic system is designed to “observe, decide and act across tools to pursue a goal or…the intent you give it,” as Kevin Östlin, Co-founder of Andsend, explained.

The key differentiation lies in the element of “agency”: agentic systems can make autonomous decisions and act on them, often by interacting with other tools rather than just providing information. 

Oliver King-Smith, Founder of smartR AI, pointed out that this capability is an important part of the true agentic AI experience. It creates value by unlocking “enormous potential for automation and for taking over functions, especially in the back-office,” when applied to business processes.

The shift to agentic capabilities is already common in our day-to-day lives. According to Manuel Morales, Fractional CTO at NUA and Kwiziq, even popular tools like ChatGPT or Gemini are increasingly acting as agents whenever they use an integrated tool, such as running a Python script or performing a web search. “There is a very natural shift in which most of the chatbots that we’re interacting with every day… are becoming agents,” he added.

Watch Out for “Agentic-Washing”

As the technology gains traction, the gap between an agent’s actual capabilities and what’s being marketed to consumers is a looming concern for the industry. This “marketing overhang” often promises more than the technology can deliver, making “agentic-washing” a trap to watch out for.

The reality is that agentic AI currently works best when making “very limited decisions,” as King-Smith shared. Organizations should focus on narrow, specific use cases where an agentic system can excel rather than broad-level ones.

A powerful feature of successful AI agents is their ability to recognize their own limitations. King-Smith gave the example of an agentic AI system used for tax compliance that could connect related documents with 99.9% accuracy. 

Most of the time, the system could operate successfully in the background, but a crucial part of its design is that it could “also understand that it doesn’t know the answer and flag that as ‘A human needs to go look at it’ basically.”

Moving From Prototype to Production: What Gets Missed?

While impressive demos are easy to create, moving a prototype agent to a reliable, production-grade system is a major challenge, especially for companies at the early prototype stage or organizations just starting to implement AI.

As Manuel Morales noted, the first straightforward MVP cases may work fine, but as soon as the agent is tested with edge cases or put in front of a customer, one quickly realizes how difficult it is to get it right. A few things to consider:

  • Experimentation is key. The core challenge lies in the nature of the technology itself: AI is not the same as traditional software development, and standard approaches won’t be as effective.

As an antidote, Morales advises organizations to embrace “experimentation and feedback loops to the extreme” to build the product right. Since this is an experimental technology, companies can no longer rely on senior team members to “nail it on the first attempt” with a system design. 

  • AI requires a shift in mindset. Oliver King-Smith argues that teams need to embrace an engineering mindset with AI: 

“Software is completely predictable, and you can tell if it’s not working right… And if your software… is not doing what you wanted it to do, you know you have a bug, and you have a problem. AI has got all these kinds of silent failure modes and unpredictability, and it’s inherently stochastic. It’s not reproducible on a given day.” So while getting a nice-looking initial demo might be easy, deploying it in the real world is a lot harder.

  • It’s both a technical and organizational shift. Kevin Östlin suggested that today’s bottleneck for scaled adoption is no longer the technology itself, but rather the human element and business structure.

Östlin explained: “It’s almost as if the technology is there, but we, the people and how we are organized as well in our organizations, are maybe not ready to adopt it. So it’s almost as if people have become the bottleneck… The really interesting question for leadership is how we can be quick enough… to reorganize the organization to adopt AI truly at scale.” 

This sentiment was echoed by others in the industry. King-Smith confirmed that major players like Oracle and IBM are seeing similar problems in client implementations, arguing that the organization itself needs to adapt to apply AI effectively. Implementing AI agents successfully means addressing the company’s internal culture and structure alongside the technical roadmap.

Where to Invest to Build Agentic AI Right

The panel agreed that early-stage organizations often misallocate resources when developing agentic systems. The common pitfall is over-investing in the agent’s central logic (prompts and models) and under-investing in the infrastructure, context, and validation required for real-world reliability.

Context Engineering Over Prompt Engineering

Kevin Östlin noted that a small startup like his went through cycles where they invested too much time trying to create “the perfect agents and prompts” and too little time on system integrations.

Agentic AI systems become exponentially more powerful when they have structured, curated access to data. If you simply point an AI agent at disparate sources, like a ticketing system, CRM, and knowledge base, it likely won’t produce good results.

Instead, teams should focus on context engineering, which involves creating robust data pipelines and processing the input data to make it easy for the agent to consume and act upon.

Östlin shared: “If you spend some time, just a little time even, to actually process the data and create a data pipeline where you do some data labeling and restructure the data in a way that makes it easier for the agents to read, it will also perform much cleaner actions and it will be much more…valuable and you will also need less human-in-the-loop in the end.”
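
To make this concrete, here’s a minimal Python sketch of the kind of pipeline Östlin describes: raw records from a hypothetical ticketing system are labeled, trimmed, and restructured into compact entries before the agent ever sees them. The field names, labels, and rule-based classifier are illustrative assumptions, not details from the panel.

```python
from dataclasses import dataclass

@dataclass
class ContextEntry:
    """A compact, labeled record the agent consumes instead of raw source data."""
    source: str      # e.g. "ticketing", "crm"
    record_id: str   # stable ID the agent can reference in its actions
    label: str       # coarse category added during preprocessing
    summary: str     # short, trimmed text instead of the full raw payload

def label_ticket(raw: dict) -> str:
    """Very simple rule-based labeling; a real pipeline might use a classifier."""
    text = (raw.get("subject", "") + " " + raw.get("body", "")).lower()
    if "refund" in text or "invoice" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"

def build_agent_context(raw_tickets: list[dict], max_chars: int = 280) -> list[ContextEntry]:
    """Turn raw ticketing-system exports into labeled, truncated entries."""
    entries = []
    for raw in raw_tickets:
        entries.append(
            ContextEntry(
                source="ticketing",
                record_id=str(raw["id"]),
                label=label_ticket(raw),
                summary=raw.get("body", "")[:max_chars],
            )
        )
    return entries

if __name__ == "__main__":
    tickets = [{"id": 101, "subject": "Refund request",
                "body": "Customer asks for a refund on invoice #554."}]
    for entry in build_agent_context(tickets):
        print(entry)
```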

Benchmarks and Use Case Definition

Perhaps the most critical area of under-investment is in rigorous upfront definition and validation. Oliver King-Smith stressed that teams often rush into building without “very rigorously defining their use case” and what success looks like.

This lack of definition leads to a failure to build a solid validation set — the critical benchmarks that measure performance. King-Smith explained: “Investing in those costs quite a bit of time and effort. It’s not that glamorous, but it’s really important because then you can actually measure how well you’re doing in your implementation.”

Without clear, measurable benchmarks, teams struggle to determine whether their system is improving. Development devolves into the subjective feeling that “it seems like it’s a bit better today or a bit worse today,” without any real sense of where the system is heading.

Manuel Morales further called the introduction of these pre-release and production benchmarks a “night and day” difference for development.
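
What such a benchmark can look like in its simplest form: a small, hand-curated validation set with expected outcomes, run against the system before every release. The cases and the agent_answer placeholder below are invented for illustration; the point is simply that the score is measured, not felt.

```python
# A tiny, hand-curated validation set: inputs paired with the outcome we expect.
# In practice this grows over time as real-world exceptions are discovered.
VALIDATION_SET = [
    {"input": "Invoice 554 is missing a VAT number", "expected_label": "billing"},
    {"input": "The app crashes when I upload a PDF", "expected_label": "technical"},
]

def agent_answer(text: str) -> str:
    """Placeholder for a call into the real agent; returns a predicted label."""
    return "billing" if "invoice" in text.lower() else "technical"

def run_benchmark() -> float:
    """Run every case and report the share that matched the expected outcome."""
    passed = sum(
        1 for case in VALIDATION_SET
        if agent_answer(case["input"]) == case["expected_label"]
    )
    score = passed / len(VALIDATION_SET)
    print(f"Benchmark: {passed}/{len(VALIDATION_SET)} cases passed ({score:.0%})")
    return score

if __name__ == "__main__":
    run_benchmark()
```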

The Challenge of Benchmarking AI Behavior

King-Smith advised that teams should build the initial benchmark based on the clearly defined use case, and then expand it over time in production to capture real-world exceptions. He also cautioned that for companies starting out, it is “better to produce an internal tool, as opposed to an external-facing tool,” as the former allows for controlled user behavior and easier testing.

Benchmarking the behavior of stochastic AI systems is inherently more challenging than traditional software testing:

  • The problem of variability: Manuel Morales noted that traditional testing relies on a clear “if this goes in, this goes out” expectation. With LLMs, each run can produce a differently worded output.
  • The solution: “LLMs as judges”: For text generation, a common (albeit not perfect) solution is to use one LLM as a judge to evaluate the output of another LLM (see the sketch below).

Morales further highlighted a critical difference in testing, which leads to a radically different approach to quality assurance:

  • Traditional software development aims for a 100% “green” test suite.
  • With AI, a “healthy test suite is 70% green,” because the output is so unpredictable.
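
As a rough sketch of how these two ideas combine, the snippet below uses a second model call as a “judge” to grade free-text output and treats the suite as healthy once the pass rate clears a bar well below 100%. The judging prompt, the injected call_llm client, and the 70% threshold are all illustrative assumptions, not a specific panelist’s setup.

```python
JUDGE_PROMPT = (
    "You are grading an AI assistant's reply.\n"
    "Question: {question}\nReply: {reply}\n"
    "Answer PASS if the reply is factually correct and on-topic, otherwise FAIL."
)

def judge_output(call_llm, question: str, reply: str) -> bool:
    """Ask a second LLM to grade a free-text reply; call_llm is whatever client you use."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, reply=reply))
    return verdict.strip().upper().startswith("PASS")

def suite_is_healthy(results: list[bool], threshold: float = 0.70) -> bool:
    """With stochastic outputs, a 'healthy' suite clears a pass-rate bar, not 100%."""
    pass_rate = sum(results) / len(results)
    print(f"Pass rate: {pass_rate:.0%} (threshold {threshold:.0%})")
    return pass_rate >= threshold

if __name__ == "__main__":
    # Stub LLM for demonstration; replace with a real judge-model call.
    fake_llm = lambda prompt: "PASS"
    results = [judge_output(fake_llm, "What is 2+2?", "4") for _ in range(10)]
    print("Healthy:", suite_is_healthy(results))
```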

Architecting Data Pipelines for Agent Security

A crucial question for any production-ready agent is: How do you allow it to act autonomously without exposing sensitive data or breaking user trust?

Kevin Östlin simplified this complex challenge by stating that user trust hinges on one rule: “Never surprise people with what the agent can see or do.” To achieve this, he outlined three essential layers for building a secure, customer-facing data pipeline.

1. Build Strong Boundaries (The Guardrail)

The first step is setting up strict data access boundaries. You should never give an agent raw access to your entire company database and hope it behaves. Instead, the agent must be allowed to interact with data only through secure, internal APIs or gateways. Describing one of his team’s projects, Östlin explained that these gateways act as the main guardrail: 

“Those APIs know which user workspace is making the request, and they enforce our scopes and permissions before anything touches the model.”

This ensures the agent can only perform actions and access data that is already defined as permissible within your existing system’s security rules.
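
As an illustration of the pattern (not Andsend’s actual implementation), here is a minimal sketch of an internal gateway that checks the requesting workspace’s scopes before any data is handed to the model. The scope names and data shapes are invented for the example.

```python
# Scopes each workspace is allowed to use; in a real system this comes from
# your existing permission store, not a hard-coded dict.
WORKSPACE_SCOPES = {
    "workspace-acme": {"crm:read", "tickets:read"},
}

class PermissionDenied(Exception):
    pass

def gateway_fetch(workspace_id: str, scope: str, record_id: str) -> dict:
    """The only path the agent has to data: scopes are enforced before the model sees anything."""
    allowed = WORKSPACE_SCOPES.get(workspace_id, set())
    if scope not in allowed:
        raise PermissionDenied(f"{workspace_id} lacks scope {scope}")
    # Fetch from the underlying system only after the check passes.
    return {"id": record_id, "summary": "Placeholder record from the internal API"}

if __name__ == "__main__":
    print(gateway_fetch("workspace-acme", "crm:read", "cust-42"))
    try:
        gateway_fetch("workspace-acme", "billing:write", "inv-554")
    except PermissionDenied as err:
        print("Blocked:", err)
```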

2. Minimize Data Exposure

The second layer is minimizing the agent’s potential “blast radius.” An agent rarely needs full, raw data — it only needs the essentials, like “summaries or IDs or labels.” One way is to organize agents accordingly:

  • Specialization: Splitting agents by domain (e.g., a Sales agent and a Support agent).
  • Safety: While these specialized agents might access the same core systems, isolating their functions makes it easier to track what data each one touches. Östlin noted this approach “reduces hallucinations and also the blast radius if something goes wrong.” (A minimal sketch of the “essentials only” idea follows this list.)
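
A minimal sketch of that “essentials only” idea, with invented field names: the view exposed to the agent strips each record down to an ID, a label, and a short summary, while the raw record never leaves the backend.

```python
def to_agent_view(record: dict, max_summary_chars: int = 200) -> dict:
    """Expose only the essentials (ID, label, short summary), never the raw record."""
    return {
        "id": record["id"],
        "label": record.get("status", "unknown"),
        "summary": record.get("notes", "")[:max_summary_chars],
    }

# The full CRM record stays behind the gateway; the agent only ever sees the slim view.
crm_record = {
    "id": "cust-42",
    "status": "active",
    "notes": "Long internal notes, contract details, personal data...",
    "credit_card_on_file": "**** **** **** 4242",  # never leaves the backend
}
print(to_agent_view(crm_record))
```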

3. Focus on Observability

The third, and often hardest, step is maintaining observability (full tracking) and propagating permissions across every single action the agent takes.

Since an agent’s task may involve multiple steps and tools, you need to track exactly what data it used, where that data came from, and what permissions were attached to it. Starting with a “very narrow use case, strong boundaries, and really good logging” allows you to build confidence and gradually move the human out of the loop.
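
At its simplest, that kind of logging can mean emitting one structured trace line per tool call, recording which data the step touched and which permissions were attached, so a multi-step run can be reconstructed afterwards. A hypothetical sketch:

```python
import json
import time
import uuid

def log_agent_step(run_id: str, tool: str, record_ids: list[str], scopes: set[str]) -> None:
    """Append one structured trace line per tool call; ship to a real log store in production."""
    print(json.dumps({
        "run_id": run_id,
        "timestamp": time.time(),
        "tool": tool,
        "record_ids": record_ids,   # exactly what data the step touched
        "scopes": sorted(scopes),   # the permissions attached to the request
    }))

if __name__ == "__main__":
    run_id = str(uuid.uuid4())  # one ID ties all steps of a multi-step task together
    log_agent_step(run_id, "crm.lookup", ["cust-42"], {"crm:read"})
    log_agent_step(run_id, "tickets.search", ["T-101", "T-102"], {"tickets:read"})
```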

Must-Have Security Aspects for Autonomous Agents

When building autonomous or semi-autonomous AI agents, security must be designed to protect the system from both internal misuse and external attacks. The panel highlighted two critical aspects: maintaining strict permission parity and guarding against prompt injection.

1. Never Grant the Agent More Permissions Than the User

The foundational security rule for an agent is that its permissions must be strictly tied to the person using it.

As Oliver King-Smith summarized: “Agents should never have more permissions than the user that’s using them.”

This is crucial because an agent acts as a user’s extension; if a user can only read specific files, the agent operating on their behalf should also be limited to reading only those files. Manuel Morales affirmed that ensuring the agent inherits and respects the user’s credentials (a concept known as permission propagation) is becoming the industry standard.

However, King-Smith pointed out that engineering this parity isn’t always straightforward. In some scenarios, an organization might want an agent to inform a user that sensitive information exists (so the user can request permission to view it), even if the agent is blocked from displaying the data itself. 
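
A rough sketch of permission propagation under these assumptions: every tool call is executed against the requesting user’s own permission set rather than a privileged service account, and a blocked action can still tell the user the resource exists so they can request access. The permission model below is entirely illustrative.

```python
# In a real system these come from your identity provider, not a dict.
USER_PERMISSIONS = {
    "alice": {"files:read"},
    "bob": {"files:read", "files:write"},
}

def agent_tool_call(user: str, action: str, path: str) -> str:
    """The agent always acts with the permissions of the user it serves, never more."""
    if action not in USER_PERMISSIONS.get(user, set()):
        # Optionally tell the user the resource exists so they can request access,
        # without letting the agent read or display the content itself.
        return f"'{path}' exists but requires '{action}' permission; ask an admin for access."
    return f"Performed {action} on {path} as {user}."

print(agent_tool_call("alice", "files:read", "/reports/q3.pdf"))
print(agent_tool_call("alice", "files:write", "/reports/q3.pdf"))
```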

2. Guard Against Prompt Injection

Another major security consideration is prompt injection, in which a user or an external source (e.g., a malicious website) passes hidden instructions to the agent that override its original, benign instructions.

Morales noted the parallel between prompt injection and classic SQL injection; however, while SQL has a “very strict syntax” that lets developers sanitize malicious input, natural-language prompts have no such structure to filter against, which makes it extremely difficult to guarantee that an LLM will ignore injected instructions.

A common defense mechanism seen in production-grade systems like coding agents is to transfer responsibility back to the user and place a human in the loop for key actions where the agent’s behavior could be compromised by unexpected or malicious input.
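
One simple way to keep that human in the loop is to gate the agent’s high-risk actions behind an explicit confirmation step, so an injected instruction can’t trigger them silently. The action list and prompt below are illustrative, not a prescription.

```python
# Actions where a compromised or manipulated agent could do real damage.
REQUIRES_CONFIRMATION = {"send_email", "delete_records", "execute_payment"}

def execute_action(action: str, details: str) -> str:
    """Low-risk actions run autonomously; high-risk ones wait for an explicit human yes."""
    if action in REQUIRES_CONFIRMATION:
        answer = input(f"Agent wants to '{action}' ({details}). Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return f"{action} cancelled by reviewer."
    return f"{action} executed: {details}"

if __name__ == "__main__":
    print(execute_action("summarize_ticket", "T-101"))
    print(execute_action("execute_payment", "invoice #554, 1,200 EUR"))
```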

Fundamental Constraints for Real-World Agentic Systems

When deciding which fundamental constraints must be built into an agent that will operate in a “noisy” real-world environment, it is important to focus on the definition of success itself. Because AI is stochastic, achieving 100% accuracy is virtually impossible. So instead of chasing a perfect F1 score, organizations must set constraints based on their risk tolerance for making a bad decision.

As Oliver King-Smith noted: “You can’t expect perfection. They don’t work like traditional software systems. You have to expect some level of randomness in them.”

This means that security (and deployment viability) is measured by whether the agent’s performance is acceptable for the specific use case, which calls for:

  1. Rigorous benchmarking: Building large, isolated validation sets to measure performance.
  2. Cost-benefit analysis: Recognizing that increasing performance has rapidly diminishing returns. King-Smith warned that going from a 90% to a 99% success rate with AI is significantly more expensive than the previous 10% gain (80% → 90%).

Ultimately, the constraint built into the agent is a decision made by the client: whether the agent is performing “as well as people do, or even better, potentially,” or whether the company is willing to live with a slightly lower, but cost-effective, acceptable risk.

The Future of Agentic AI Begins Now

The panel agreed that we’re going to see massive changes in agentic AI over the next few years, marked by possibilities that weren’t imaginable until recently and technology that is advancing faster than most organizations can put it to use. 

The future of automation is here, but it requires a fundamental shift in engineering discipline, a strong focus on security, and a willingness to embrace experimentation. There are many new, untouched business opportunities — “blank canvases” ready to be filled. However, the biggest obstacle to taking advantage of them is no longer the technology itself; we need to figure out how to best use the available tools and adapt our companies to be ready for AI at scale.

Ultimately, while the technical challenges are significant, the transition to a more automated workplace demands honest conversation and trust.

As one of our viewers powerfully commented, the focus must now turn to the people:

“The next big step for us is to start working with the people. Build trust around AI at our workplaces, gather teams, speak with them, learn what they’re afraid of and concerned with, and address these concerns.”

Successfully integrating AI agents into real-world workflows means acknowledging these challenges and working transparently to build solutions that augment and amplify human expertise.

If you are navigating the complexities of AI deployment and want to stay one step ahead, be sure to keep an eye on our upcoming events and stay tuned for more insights.
