Overview of How to Build AI Agents
This reading note is a curated summary of papers and blog posts on AI agents. Much of the text is quoted or adapted directly from the original sources. The primary goal is to consolidate key resources on AI agents for easy reference and study; updates will be added as new developments arise.
Note: Touch-ups by ChatGPT.
Qinyuan Wu, last updated: 2025.01.15
What Are Agents?
What’s the difference between AI agents and AI models?
- Models: Trained systems, such as pre-trained or fine-tuned LLMs, whose parameters stay fixed once deployed in an environment. Within an agent, the model serves as the centralized decision-maker.
- Workflows: Systems where LLMs and tools are orchestrated through predefined code paths. Tools bridge the gap between foundational models and external data/services.
- Agents: Systems where LLMs dynamically manage their processes and tool usage, maintaining control over how they accomplish tasks.
Agents achieve their goals using cognitive architectures that process information iteratively, make informed decisions, and refine actions based on previous outputs.
Key Points:
- Enabling the LLM’s reasoning capabilities to make good decisions:
  - Task decomposition: Frameworks like Chain-of-Thought (CoT) and Tree of Thoughts (ToT) are helpful.
  - Self-reflection: Allow agents to iteratively refine decisions and correct mistakes.
- Ensuring the decision-making LLM uses the right tools.
- Providing feedback from the environment and determining when to stop iterating (a minimal loop sketch follows this list).
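Putting these pieces together, here is a minimal sketch of the reason-act-observe loop. All names, including `llm` (any chat-completion call) and the `calculator` tool, are illustrative stand-ins, not from any particular framework:

```python
def calculator(expression: str) -> str:
    """Illustrative tool: evaluate a basic arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_agent(llm, task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # 1. Reason: ask the model for the next action or a final answer.
        decision = llm("\n".join(history) +
                       "\nReply 'TOOL <name> <input>' or 'FINAL <answer>'.")
        if decision.startswith("FINAL"):
            return decision.removeprefix("FINAL").strip()
        # 2. Act: dispatch the chosen tool with the model's input.
        _, name, arg = decision.split(" ", 2)
        observation = TOOLS[name](arg)
        # 3. Observe: feed the result back for the next iteration.
        history.append(f"Action: {decision}\nObservation: {observation}")
    return "Stopped: step budget exhausted."
```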
Improving Reasoning and Planning Abilities
Task Decomposition
- Chain of Thought (CoT): A standard prompting technique instructing models to “think step by step,” breaking complex tasks into simpler ones. CoT improves performance by spending more test-time computation, and it makes the model’s reasoning process interpretable.
- Tree of Thoughts (ToT): Extends CoT by exploring multiple reasoning possibilities at each step. A problem is decomposed into thought steps, with several candidate thoughts generated per step, forming a tree. The tree can be searched with BFS (breadth-first search) or DFS (depth-first search), with states evaluated by a classifier or majority vote (see the sketch after this list).
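A sketch contrasting the two patterns: CoT is a single prompting call, while ToT wraps the model in an explicit search loop. `llm` (a text-generation call) and `score` (a state evaluator, e.g., a classifier or majority vote) are assumed stand-ins:

```python
# Chain of Thought: one call with a step-by-step instruction.
def cot_answer(llm, question: str) -> str:
    return llm(f"{question}\nLet's think step by step.")

# Tree of Thoughts: breadth-first search over partial reasoning states.
def tot_bfs(llm, score, question: str, depth: int = 3,
            branch: int = 3, beam: int = 2) -> str:
    frontier = [question]                      # each state = the prompt so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for _ in range(branch):            # sample several next thoughts
                thought = llm(f"{state}\nNext thought:")
                candidates.append(f"{state}\n{thought}")
        # Keep the `beam` best states; `score` may be a classifier or a vote.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]
```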
Self-Reflection
- ReAct: Combines reasoning and acting by expanding the action space to include both task-specific actions and natural-language reasoning traces. This lets the agent interact with its environment (e.g., calling APIs) while documenting its reasoning process (sketched after this list).
- Reflexion: A framework equipping agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion uses a reinforcement-learning setup in which the reward model provides a simple binary reward, while actions follow the ReAct structure of task-specific actions plus language-based reasoning.
- Chain of Hindsight (CoH): Fine-tunes the model on sequences of its past outputs annotated with feedback, encouraging it to produce improved outputs conditioned on that history.
- Algorithm Distillation (AD): Applies the same idea to reinforcement learning, training on cross-episode learning histories so that the model learns to improve its policy in context.
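A minimal sketch of the ReAct loop referenced above, assuming an `llm` completion call and a `tools` dictionary of callables; the trace format is simplified from the paper:

```python
import re

def react(llm, tools: dict, question: str, max_steps: int = 6) -> str:
    trace = (f"Question: {question}\n"
             "Use the format:\n"
             "Thought: <reasoning>\nAction: <tool>[<input>] or Finish[<answer>]\n")
    for _ in range(max_steps):
        step = llm(trace)                     # model emits a Thought + Action pair
        trace += step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match is None:
            continue                          # no parsable action; prompt again
        name, arg = match.groups()
        if name == "Finish":
            return arg
        observation = tools[name](arg)        # e.g., a search or lookup API
        trace += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted."
```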
Cognitive Architecture: Memory for Reasoning and Planning
Memory plays a crucial role in an agent’s reasoning process:
Human Brain vs. Agent Memory
Human cognitive architecture broadly includes:
- Sensory Memory: Learning embedding representations for raw inputs (e.g., text, images).
- Short-Term Memory: In-context learning, limited by the finite context window of transformers.
- Long-Term Memory: External vector stores for fast retrieval during query time.
Some researchers suggest aligning an agent’s cognitive architecture with the human brain’s.
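A minimal sketch of that mapping, assuming a hypothetical `vector_store` object with `add(text)` and `search(query, k)` methods: recent turns live in a bounded window (short-term memory), and evicted turns spill into the external store (long-term memory):

```python
from collections import deque

class AgentMemory:
    """Short-term: a bounded window of recent turns (the model's context).
    Long-term: evicted turns go to an external store for later retrieval."""

    def __init__(self, vector_store, window: int = 8):
        self.short_term = deque(maxlen=window)
        self.long_term = vector_store          # hypothetical add/search interface

    def remember(self, turn: str) -> None:
        if len(self.short_term) == self.short_term.maxlen:
            self.long_term.add(self.short_term[0])   # spill oldest turn to the store
        self.short_term.append(turn)

    def build_context(self, query: str, k: int = 3) -> str:
        recalled = self.long_term.search(query, k)   # fast retrieval at query time
        return "\n".join(recalled) + "\n" + "\n".join(self.short_term)
```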
Data Stores
Data stores convert incoming documents into embeddings held in a vector database, which the agent can query to extract the information it needs. For example, Retrieval-Augmented Generation (RAG) uses vector embeddings to retrieve contextually relevant information at query time.
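A toy illustration of the RAG pattern, with `embed` (any embedding model) and `llm` (any completion call) as assumed stand-ins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def build_index(embed, docs):
    """Embed every document once, up front."""
    return [(doc, embed(doc)) for doc in docs]

def rag_answer(llm, embed, index, question: str, k: int = 3) -> str:
    q_vec = embed(question)
    top = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)[:k]
    context = "\n---\n".join(doc for doc, _ in top)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```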
External APIs
- Tool Design: Tools should be clearly defined and well documented; tool descriptions deserve the same prompt-engineering care as the overall model prompts (an example definition follows this list). Recommendations from Building Effective Agents:
- Give the model enough tokens to “think” before committing to decisions.
- Use natural formats familiar to the model.
- Avoid excessive formatting overhead.
- Enhancing Model Performance: Discussed further in the next section.
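Returning to tool design: here is an example of a well-documented tool definition in the JSON-schema style used by most function-calling APIs. The tool (`get_weather`) and all field names are illustrative:

```python
# The description and parameter docs below are themselves prompt
# engineering: they are what the model reads when deciding whether
# and how to call the tool. All names are illustrative.
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Look up the current weather for a city. Use this whenever the "
        "user asks about weather conditions; do not answer from memory."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name in plain text, e.g. 'Zurich'.",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit; defaults to 'celsius'.",
            },
        },
        "required": ["city"],
    },
}
```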
Extensions
Extensions bridge the gap between agents and APIs by teaching the agent two things (sketched after this list):
- How to use API endpoints with examples.
- What arguments or parameters are required for successful API calls.
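A sketch of what such an extension amounts to in practice: a machine-readable endpoint description plus worked examples the agent can imitate. The endpoint, parameters, and example below are invented for illustration, not any vendor's actual format:

```python
# An "extension" boils down to: where the endpoint lives, which
# parameters a successful call requires, and worked examples the
# agent can imitate. Everything below is invented for illustration.
flights_extension = {
    "endpoint": "GET https://api.example.com/flights/search",
    "parameters": {
        "origin": "IATA code of the departure airport (required)",
        "destination": "IATA code of the arrival airport (required)",
        "date": "Departure date in YYYY-MM-DD format (required)",
    },
    "examples": [
        {
            "user_query": "Find me a flight from Zurich to London on March 3rd",
            "api_call": {"origin": "ZRH", "destination": "LHR", "date": "2025-03-03"},
        },
    ],
}
```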
Functions
Functions give developers more control over API execution and data flow. For instance, handling a user's request for ski-trip suggestions might proceed in stages:
(Figure: initial free-text output from the Ski Model.)
The model's output can be structured in JSON for easier parsing:
(Figure: Ski Model output in JSON format.)
This structured output then allows for reliable API usage:
(Figure: Ski Agent with external API integration.)
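Since the original figures are not reproduced here, a hedged reconstruction of the idea: the model emits a structured function call, and client-side code executes it. The function name and fields are assumptions, not the original example's exact schema:

```python
import json

# The model does not call the API itself: it emits a structured function
# call, and client-side code validates and executes it. The function name
# and fields are illustrative stand-ins for the original ski example.
model_output = """
{"function_call": {"name": "suggest_ski_resorts",
                   "arguments": {"region": "Alps", "skill_level": "beginner"}}}
"""

def suggest_ski_resorts(region: str, skill_level: str) -> list[str]:
    # In a real system this would query an external resort API.
    return [f"{region} resort suitable for {skill_level} skiers"]

REGISTRY = {"suggest_ski_resorts": suggest_ski_resorts}

call = json.loads(model_output)["function_call"]
result = REGISTRY[call["name"]](**call["arguments"])
print(result)  # ['Alps resort suitable for beginner skiers']
```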
Workflows
To integrate data stores and APIs into an agent system, developers build workflows that orchestrate LLMs and tools. Common patterns include:
- Prompt chaining with programmatic checks: decompose a task into a fixed sequence of LLM calls, with code-level gates validating intermediate outputs (sketched after this list).
- Routing: classify an input and direct it to a specialized downstream prompt, model, or tool.
- Parallelization: run several LLM calls at once, either by sectioning a task into independent subtasks or by voting over multiple attempts, then aggregate the results.
- Orchestrator-workers: a central LLM dynamically breaks a task down and delegates subtasks to worker LLMs.
- Evaluator-optimizer: one LLM generates a response while another evaluates it and feeds back critiques in a loop.
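A minimal sketch of the first pattern, prompt chaining with a programmatic check between steps (`llm` is a stand-in for any completion call):

```python
def chained_summary(llm, article: str) -> str:
    # Step 1: extract key points.
    points = llm(f"List the 3 key points of this article, one per line:\n{article}")
    # Gate: cheap programmatic check before the chain continues.
    lines = [line for line in points.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(f"Expected 3 key points, got {len(lines)}.")
    # Step 2: turn the verified points into a polished summary.
    return llm("Write a one-paragraph summary from these points:\n" + points)
```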
Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an open standard for connecting AI models to external systems such as content repositories, business tools, and development environments. In place of one-off integrations, it provides a single protocol for secure, two-way connections, giving models the context they need to produce relevant responses.
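For a concrete feel, a minimal MCP server sketch using the official Python SDK's `FastMCP` helper (the `mcp` package); the tool is a stub, and the exact API surface may differ across SDK versions:

```python
from mcp.server.fastmcp import FastMCP

# A server exposing one tool that connected models can discover and call.
mcp = FastMCP("docs-server")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search the internal documentation repository (illustrative stub)."""
    return f"Top result for: {query}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```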
Environments
Agents interact with their environments as “text games,” receiving textual observations and producing textual actions.
- Physical Environments: AI interacts with the physical world via perceptual inputs (e.g., vision, audio) converted into text and robotic planners executing commands.
- Dialogue Environments: Agents engage in linguistic interactions, assisting with tasks or collaborating with other agents in simulations, debates, or problem-solving.
- Digital Environments: AI operates in virtual platforms like APIs or websites, augmenting knowledge and computation in cost-effective, testable settings.
Strategies to improve model tool selection:
- In-Context Learning: Example-based inference (e.g., ReAct framework).
- Retrieval-Based In-Context Learning: Dynamically retrieve relevant examples or tool descriptions from external memory at query time (sketched after this list).
- Fine-Tuning: Further train the model on larger, task-specific datasets of tool-use examples for enhanced performance.
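A sketch of the retrieval-based strategy applied to tool selection: embed each tool's description, then place only the most relevant tools in the prompt instead of all of them. `embed` and `cosine` are assumed stand-ins (e.g., the `cosine` helper from the RAG sketch above):

```python
def select_tools(embed, cosine, tool_specs: list, query: str, k: int = 2) -> list:
    """Return the k tool specs whose descriptions best match the query."""
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(spec["description"])), spec)
              for spec in tool_specs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [spec for _, spec in scored[:k]]
```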
Case Studies
Resources
Research Papers
Blog Posts and Reports
Frameworks