Kais Kermani
Machine Learning Engineer
Remember J.A.R.V.I.S. from Iron Man? That intelligent assistant that seemed to have a solution for everything? While we’re not quite there yet, the rapid evolution of Large Language Models (LLMs) like GPT-4, Claude, and Gemini is bringing us closer than ever. Today’s LLMs are impressive. They can generate content, translate languages, and even write code. But let’s be real — they’re still pretty much glorified text processors.
So, how do we take the next step towards creating systems that can not only understand, but also act on the world around them? Enter the world of AI agents — your LLM’s ticket to superhero school.
In this post, we’ll cover:
- Why today’s LLMs, impressive as they are, remain limited
- What AI agents are and how they overcome those limitations
- The frameworks that add agent capabilities to LLMs
- A hands-on example: building a personal time-management agent with Phidata
- A glimpse at multi-agent teams
Excited to see how we can turn LLMs from talkers into doers? Let’s dive in!
Picture this: You ask your AI, “Where did I leave my keys?” A typical LLM might respond:
“I apologize, but as a language model, I don’t have access to your physical environment or personal memories. I can’t locate your keys or recall where you last placed them.”
Frustrating, right? This response highlights three limitations of LLMs:
- No memory: the model doesn’t retain personal context beyond its training data and the current conversation
- No real-world awareness: it can’t perceive your environment or query external systems
- No ability to act: it can generate text, but it can’t take actions on your behalf
So how can we elevate our trusty text processors into autonomous problem solvers capable of interacting with the world? Enter AI agents. But what exactly are AI agents? The term can be a bit nebulous in the AI community, with various definitions floating around. But for our purposes, we’ll explore a comprehensive view that includes three core features of AI agents:
- Tools: integrations that let the LLM query external systems and take real actions
- Memory: the ability to store and recall information across interactions
- Reasoning: the ability to assess a situation, plan, and decide autonomously which steps and tools are needed to reach a goal
Note that not all implementations labeled as “agents” necessarily include all three of these components. In the current AI landscape, many systems labeled as “agents” primarily focus on tool integration.
The key takeaway is that these components help overcome the intrinsic limitations of LLMs, giving them a new level of autonomous agency! Rather than simply responding to queries or following predefined paths, agents can assess the situation, choose the appropriate tools, and execute the actions necessary to achieve their goals. It’s this ability to make context-aware decisions and dynamically interact with the world around them that truly sets AI agents apart from traditional chatbots or simple query systems.
For businesses, this means AI agents can seamlessly integrate with existing systems like CRMs, ERP platforms, and even IoT devices. Sounds exciting, right?
Over the last year, several frameworks have emerged to help developers expand LLMs with agent capabilities. To name a few: LangChain, CrewAI, Semantic Kernel, Autogen, Phidata, … While each framework has its own unique flavor and features, they all have one thing in common: they all act as wrappers around your LLM.
What I mean by ‘wrapper’ here is that the framework doesn’t change the underlying LLM; rather, it gives the LLM ‘brain’ the body and tools it needs to turn knowledge into action.
At their core, agent frameworks typically offer:
- A wrapper around the LLM that handles prompting and parses its responses
- Tool / function-calling integration, so the model can trigger real actions
- Memory and knowledge management
- Orchestration features, such as teams of agents working together
For our example, we’ll be using Phidata, a user-friendly open-source framework that allows developers to create agents in a Pythonic way. However, the concepts we’ll cover are generally applicable across various agent frameworks.
Now that we understand the potential of AI agents, let’s see how we can put these concepts into practice. In the next section, we’ll walk through a real-world example of building an AI agent using the Phidata framework.
Picture this: It’s a typical day in your life as an ML engineer. You’re deep in the code, wrestling with neural networks and drowning in data. The world outside your IDE barely exists.
Your day might go something like this: you sit down to fix ‘one last bug’, look up three hours later, and suddenly remember the meeting on the other side of town that starts in twenty minutes.
Now, maybe you’re one of those mythical creatures who always arrives five minutes early. If so, I both admire you and slightly resent you. But for the rest of us, wouldn’t it be nice to have a superhero assistant telling you when to leave and how to get there on time?
What if I told you that AI could be that help? Not just any AI, mind you, but an AI agent! Let me introduce you to my smart assistant who will make sure you never miss an appointment again.
The first step in staying on top of your schedule is making sure all your appointments are in your calendar. Here’s how our AI agent handles that task.
In the GIF below, you can see how the agent calls the calendar tool to create a new event. Note how the agent correctly populates all the required fields for the event, even though we never explicitly specified them. Talk about intelligence…
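Under the hood, the ‘calendar tool’ is just a Python function wrapping the Google Calendar API (more on tools in a moment). Below is a simplified, illustrative sketch: the function name, fields, and time zone are placeholders, and the OAuth setup is glossed over.

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build


def create_calendar_event(title: str, start_time: str, end_time: str, location: str = "") -> str:
    """Create an event in the user's primary Google Calendar.

    Args:
        title: Name of the event, e.g. "Dentist appointment".
        start_time: Event start in ISO 8601 format, e.g. "2024-10-01T15:00:00".
        end_time: Event end in ISO 8601 format.
        location: Optional address of the event.

    Returns:
        A confirmation message containing a link to the created event.
    """
    # Load previously stored OAuth credentials (setup omitted for brevity)
    creds = Credentials.from_authorized_user_file("token.json")
    service = build("calendar", "v3", credentials=creds)

    event = {
        "summary": title,
        "location": location,
        "start": {"dateTime": start_time, "timeZone": "Europe/Brussels"},
        "end": {"dateTime": end_time, "timeZone": "Europe/Brussels"},
    }
    created = service.events().insert(calendarId="primary", body=event).execute()
    return f"Event created: {created.get('htmlLink')}"
```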
Once your events are in the calendar, the next step is figuring out when to leave. This is where the AI agent’s reasoning capabilities really shine.
In the GIF, when asked how to get to our plans on time, we can see that the agent retrieves the event details from the calendar, extracts the location, and then uses the route planner tool to calculate the best time to leave. Note that it factors in your current location, travel time, and even public transport delays. Pretty impressive, right?
Alright, let’s take a peek under the hood and explore how all of this works.
At the core of this project is the Assistant class from Phidata, the technical implementation of what we call an AI agent. This class acts as a wrapper around your favorite Large Language Model (LLM), providing it with tool-calling capabilities and other extensions. It’s the body to the LLM’s brain, remember?
Here’s how you can define an assistant using Phidata:
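The snippet below is a minimal, illustrative sketch: the model choice, descriptions, and the two custom tool functions (one shown above, one defined below) are placeholders for whatever fits your setup.

```python
from phi.assistant import Assistant
from phi.llm.openai import OpenAIChat

assistant = Assistant(
    llm=OpenAIChat(model="gpt-4o"),  # the LLM "brain"
    description="A personal assistant that manages your calendar and travel plans.",
    instructions=[
        "Use the calendar tool to create and look up events.",
        "Use the route planner tool to estimate travel times.",
    ],
    tools=[create_calendar_event, get_travel_time],  # our custom tool functions (see above and below)
    show_tool_calls=True,  # print which tools the agent decides to call
)

assistant.print_response("Add a dentist appointment tomorrow at 3pm at Korenmarkt 1, Ghent.")
```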
The Python snippet shows how you can define an assistant in the Phidata framework. Note that there are many more features, such as memory and knowledge, but those are out of scope for this demo.
Now let’s talk about the real driver of the magic of AI agents — tools. In Phidata, tools are simply Python functions that the agent can use to interact with external systems. No magic, just good ol’ Python.
But how does the agent know how to use these functions?
Glad you asked! Besides the informative function name, the AI agent understands how to use these functions based on their parameters and docstrings.
Phidata comes with a wide variety of pre-built toolkits, like DuckDuckGo for web search and Arxiv for accessing research papers. But for this example, we wrote our own integration with the Google Maps API, just like we did for Google Calendar.
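Here’s a simplified sketch of what our route-planner tool could look like, using the googlemaps Python client. The function name, parameters, and defaults are illustrative choices, not the one true implementation.

```python
import os
from datetime import datetime

import googlemaps

# Client authenticated with an API key read from the environment
gmaps = googlemaps.Client(key=os.environ["GOOGLE_MAPS_API_KEY"])


def get_travel_time(origin: str, destination: str, mode: str = "transit") -> str:
    """Estimate how long it takes to travel from origin to destination.

    Args:
        origin: Starting address, e.g. "Korenmarkt 1, Ghent".
        destination: Destination address, e.g. "Grote Markt 1, Brussels".
        mode: Travel mode: "driving", "walking", "bicycling" or "transit".

    Returns:
        A human-readable travel time estimate, e.g. "52 mins".
    """
    directions = gmaps.directions(
        origin,
        destination,
        mode=mode,
        departure_time=datetime.now(),  # factor in current traffic / transit schedules
    )
    if not directions:
        return "No route found."
    # Take the duration of the first leg of the first suggested route
    return directions[0]["legs"][0]["duration"]["text"]
```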
Note that for brevity, we’ve omitted some code details, such as the API key.
Great! With this setup, the AI agent only needs to know what parameters the function expects, and it can call the function whenever it deems it necessary.
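For example, once the tools are passed to the assistant, a single question is enough to set the whole chain in motion (the prompt is illustrative):

```python
# The agent works out on its own that the route-planner tool is needed
# and which arguments to pass to it.
assistant.print_response(
    "I need to be at Korenmarkt 1, Ghent by 3pm. When should I leave from the office?"
)
```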
While we’ve focused on a single AI agent in our example, frameworks like Phidata also allow for the implementation of more complex systems using teams of specialized agents.
In fact, our time management system could be expanded to use multiple agents working together, for example:
- A Calendar Agent that creates and looks up events
- A Route Planner Agent that estimates travel times
- A team leader that delegates tasks between them and composes the final answer
Here is a brief example of how this team might handle our route time calculation:
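What follows is a rough, illustrative sketch of how such a team could be wired up in Phidata; the agent names, roles, and the get_calendar_events helper are hypothetical.

```python
from phi.assistant import Assistant
from phi.llm.openai import OpenAIChat

# Specialist agent for everything calendar-related
calendar_agent = Assistant(
    name="Calendar Agent",
    llm=OpenAIChat(model="gpt-4o"),
    role="Look up and manage events in the user's calendar",
    tools=[create_calendar_event, get_calendar_events],  # get_calendar_events: hypothetical read helper
)

# Specialist agent for travel logistics
route_agent = Assistant(
    name="Route Planner Agent",
    llm=OpenAIChat(model="gpt-4o"),
    role="Estimate travel times between locations",
    tools=[get_travel_time],
)

# Team leader that delegates to the specialists and composes the answer
time_manager = Assistant(
    name="Time Manager",
    llm=OpenAIChat(model="gpt-4o"),
    team=[calendar_agent, route_agent],
    instructions=[
        "First ask the Calendar Agent for the event details.",
        "Then ask the Route Planner Agent how long the trip will take.",
        "Finally, tell the user when they need to leave.",
    ],
)

time_manager.print_response("When do I need to leave for my dentist appointment?")
```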
This team-based approach allows for more specialised handling of complex tasks. However, for many applications, a single well-designed agent can go a long way!
While we won’t delve deeper into multi-agent systems here, it’s an exciting area with great potential; more on that soon.
While LLMs are impressive in generating text and solving complex queries, they remain constrained by their lack of memory, real-world interaction, and ability to take action. By giving LLMs access to tools, memory, and the ability to interact with the world, we are unlocking new levels of autonomy and intelligence — AI agents! This opens up exciting possibilities, from managing your calendar to automating complex business processes, all with a level of responsiveness and context-awareness that static models just can’t achieve.
While we may not have created J.A.R.V.I.S. just yet, AI agents are a major leap toward truly intelligent assistants. And if you think that’s exciting, just wait until you hear what happens when they start working together… 😉 More on that soon in Part 2!