Overview of How to Build AI Agents
This reading note is a curated summary of papers and blog posts on AI agents. Much of the text is quoted or adapted directly from the original sources. The primary goal is to consolidate key resources on AI agents for easy reference and study; updates will be added as new developments arise.
Note: Touch-ups by ChatGPT.
Qinyuan Wu, last updated: 2025.01.15
What Are Agents?
What’s the difference between AI agents and AI models?
- Models: Trained systems, such as pre-trained or fine-tuned LLMs, whose parameters stay fixed once deployed in an environment. Within an agent, the model serves as the centralized decision-maker.
- Workflows: Systems where LLMs and tools are orchestrated through predefined code paths. Tools bridge the gap between foundational models and external data/services.
- Agents: Systems where LLMs dynamically manage their processes and tool usage, maintaining control over how they accomplish tasks.
Agents achieve their goals using cognitive architectures that process information iteratively, make informed decisions, and refine actions based on previous outputs.
Key Points:
- Enabling the LLM’s reasoning capabilities to make good decisions:
  - Task decomposition: Frameworks like Chain-of-Thought (CoT) and Tree of Thoughts (ToT) are helpful.
  - Self-reflection: Allow agents to iteratively refine decisions and correct mistakes.
- Ensuring the decision-making LLM uses the right tools.
- Providing feedback from the environment and determining when to stop iterating (a minimal loop sketch follows this list).
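Putting these pieces together, here is a minimal sketch of the reason-act-observe loop. All names, including `llm` (any chat-completion call) and the `calculator` tool, are illustrative stand-ins, not from any particular framework:

```python
def calculator(expression: str) -> str:
    """Illustrative tool: evaluate a basic arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_agent(llm, task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # 1. Reason: ask the model for the next action or a final answer.
        decision = llm("\n".join(history) +
                       "\nReply 'TOOL <name> <input>' or 'FINAL <answer>'.")
        if decision.startswith("FINAL"):
            return decision.removeprefix("FINAL").strip()
        # 2. Act: dispatch the chosen tool with the model's input.
        _, name, arg = decision.split(" ", 2)
        observation = TOOLS[name](arg)
        # 3. Observe: feed the result back for the next iteration.
        history.append(f"Action: {decision}\nObservation: {observation}")
    return "Stopped: step budget exhausted."
```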
Improving Reasoning and Planning Abilities
Task Decomposition
- Chain of Thought (CoT): A standard prompting technique instructing models to “think step by step,” breaking complex tasks into simpler ones. CoT improves performance by spending more test-time computation, and it makes the model’s reasoning process interpretable.
- Tree of Thoughts (ToT): Extends CoT by exploring multiple reasoning possibilities at each step. A problem is decomposed into thought steps, with several candidate thoughts generated per step, forming a tree. The tree can be searched with BFS (breadth-first search) or DFS (depth-first search), with states evaluated by a classifier or majority vote (see the sketch after this list).
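A sketch contrasting the two patterns: CoT is a single prompting call, while ToT wraps the model in an explicit search loop. `llm` (a text-generation call) and `score` (a state evaluator, e.g., a classifier or majority vote) are assumed stand-ins:

```python
# Chain of Thought: one call with a step-by-step instruction.
def cot_answer(llm, question: str) -> str:
    return llm(f"{question}\nLet's think step by step.")

# Tree of Thoughts: breadth-first search over partial reasoning states.
def tot_bfs(llm, score, question: str, depth: int = 3,
            branch: int = 3, beam: int = 2) -> str:
    frontier = [question]                      # each state = the prompt so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for _ in range(branch):            # sample several next thoughts
                thought = llm(f"{state}\nNext thought:")
                candidates.append(f"{state}\n{thought}")
        # Keep the `beam` best states; `score` may be a classifier or a vote.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]
```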
Self-Reflection
- ReAct: Combines reasoning and acting by expanding the action space to include both task-specific actions and natural-language reasoning traces. This lets the agent interact with its environment (e.g., calling APIs) while documenting its reasoning process (sketched after this list).
- Reflexion: A framework equipping agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion uses a reinforcement-learning setup in which the reward model provides a simple binary reward, while actions follow the ReAct structure of task-specific actions plus language-based reasoning.
- Chain of Hindsight (CoH): Fine-tunes the model on sequences of its past outputs annotated with feedback, encouraging it to produce improved outputs conditioned on that history.
- Algorithm Distillation (AD): Applies the same idea to reinforcement learning, training on cross-episode learning histories so that the model learns to improve its policy in context.
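A minimal sketch of the ReAct loop referenced above, assuming an `llm` completion call and a `tools` dictionary of callables; the trace format is simplified from the paper:

```python
import re

def react(llm, tools: dict, question: str, max_steps: int = 6) -> str:
    trace = (f"Question: {question}\n"
             "Use the format:\n"
             "Thought: <reasoning>\nAction: <tool>[<input>] or Finish[<answer>]\n")
    for _ in range(max_steps):
        step = llm(trace)                     # model emits a Thought + Action pair
        trace += step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match is None:
            continue                          # no parsable action; prompt again
        name, arg = match.groups()
        if name == "Finish":
            return arg
        observation = tools[name](arg)        # e.g., a search or lookup API
        trace += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted."
```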
Cognitive Architecture: Memory for Reasoning and Planning
Memory plays a crucial role in an agent’s reasoning process:
Human Brain vs. Agent Memory
Human cognitive architecture broadly includes:
- Sensory Memory: Learning embedding representations for raw inputs (e.g., text, images).
- Short-Term Memory: In-context learning, limited by the finite context window of transformers.
- Long-Term Memory: External vector stores for fast retrieval during query time.
Some researchers suggest aligning an agent’s cognitive architecture with the human brain’s.
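A minimal sketch of that mapping, assuming a hypothetical `vector_store` object with `add(text)` and `search(query, k)` methods: recent turns live in a bounded window (short-term memory), and evicted turns spill into the external store (long-term memory):

```python
from collections import deque

class AgentMemory:
    """Short-term: a bounded window of recent turns (the model's context).
    Long-term: evicted turns go to an external store for later retrieval."""

    def __init__(self, vector_store, window: int = 8):
        self.short_term = deque(maxlen=window)
        self.long_term = vector_store          # hypothetical add/search interface

    def remember(self, turn: str) -> None:
        if len(self.short_term) == self.short_term.maxlen:
            self.long_term.add(self.short_term[0])   # spill oldest turn to the store
        self.short_term.append(turn)

    def build_context(self, query: str, k: int = 3) -> str:
        recalled = self.long_term.search(query, k)   # fast retrieval at query time
        return "\n".join(recalled) + "\n" + "\n".join(self.short_term)
```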
Data Stores
Data stores convert incoming documents into embeddings held in a vector database, which the agent can query to extract the information it needs. For example, Retrieval-Augmented Generation (RAG) uses vector embeddings to retrieve contextually relevant information at query time.
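A toy illustration of the RAG pattern, with `embed` (any embedding model) and `llm` (any completion call) as assumed stand-ins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def build_index(embed, docs):
    """Embed every document once, up front."""
    return [(doc, embed(doc)) for doc in docs]

def rag_answer(llm, embed, index, question: str, k: int = 3) -> str:
    q_vec = embed(question)
    top = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)[:k]
    context = "\n---\n".join(doc for doc, _ in top)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```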
External APIs
- Tool Design: Tools should be clearly defined and well documented; tool descriptions deserve the same prompt-engineering care as the overall model prompts (an example definition follows this list). Recommendations from Building Effective Agents:
- Give the model enough tokens to “think” before committing to decisions.
- Use natural formats familiar to the model.
- Avoid excessive formatting overhead.
- Enhancing Model Performance: Discussed further in the next section.
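Returning to tool design: here is an example of a well-documented tool definition in the JSON-schema style used by most function-calling APIs. The tool (`get_weather`) and all field names are illustrative:

```python
# The description and parameter docs below are themselves prompt
# engineering: they are what the model reads when deciding whether
# and how to call the tool. All names are illustrative.
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Look up the current weather for a city. Use this whenever the "
        "user asks about weather conditions; do not answer from memory."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name in plain text, e.g. 'Zurich'.",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit; defaults to 'celsius'.",
            },
        },
        "required": ["city"],
    },
}
```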
Extensions
Extensions bridge the gap between agents and APIs by teaching the agent two things (sketched after this list):
- How to use API endpoints with examples.
- What arguments or parameters are required for successful API calls.
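A sketch of what such an extension amounts to in practice: a machine-readable endpoint description plus worked examples the agent can imitate. The endpoint, parameters, and example below are invented for illustration, not any vendor's actual format:

```python
# An "extension" boils down to: where the endpoint lives, which
# parameters a successful call requires, and worked examples the
# agent can imitate. Everything below is invented for illustration.
flights_extension = {
    "endpoint": "GET https://api.example.com/flights/search",
    "parameters": {
        "origin": "IATA code of the departure airport (required)",
        "destination": "IATA code of the arrival airport (required)",
        "date": "Departure date in YYYY-MM-DD format (required)",
    },
    "examples": [
        {
            "user_query": "Find me a flight from Zurich to London on March 3rd",
            "api_call": {"origin": "ZRH", "destination": "LHR", "date": "2025-03-03"},
        },
    ],
}
```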
Functions
Functions give developers more control over API execution and data flow. For instance, handling a user's request for ski-trip suggestions might proceed in stages:
(Figure: initial free-text output from the Ski Model.)
The model's output can be structured in JSON for easier parsing:
(Figure: Ski Model output in JSON format.)
This structured output then allows for reliable API usage:
(Figure: Ski Agent with external API integration.)
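Since the original figures are not reproduced here, a hedged reconstruction of the idea: the model emits a structured function call, and client-side code executes it. The function name and fields are assumptions, not the original example's exact schema:

```python
import json

# The model does not call the API itself: it emits a structured function
# call, and client-side code validates and executes it. The function name
# and fields are illustrative stand-ins for the original ski example.
model_output = """
{"function_call": {"name": "suggest_ski_resorts",
                   "arguments": {"region": "Alps", "skill_level": "beginner"}}}
"""

def suggest_ski_resorts(region: str, skill_level: str) -> list[str]:
    # In a real system this would query an external resort API.
    return [f"{region} resort suitable for {skill_level} skiers"]

REGISTRY = {"suggest_ski_resorts": suggest_ski_resorts}

call = json.loads(model_output)["function_call"]
result = REGISTRY[call["name"]](**call["arguments"])
print(result)  # ['Alps resort suitable for beginner skiers']
```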
Workflows
To integrate data stores and APIs into an agent system, developers build workflows that orchestrate LLMs and tools. Common patterns include:
- Prompt chaining with programmatic checks: decompose a task into a fixed sequence of LLM calls, with code-level gates validating intermediate outputs (sketched after this list).
- Routing: classify an input and direct it to a specialized downstream prompt, model, or tool.
- Parallelization: run several LLM calls at once, either by sectioning a task into independent subtasks or by voting over multiple attempts, then aggregate the results.
- Orchestrator-workers: a central LLM dynamically breaks a task down and delegates subtasks to worker LLMs.
- Evaluator-optimizer: one LLM generates a response while another evaluates it and feeds back critiques in a loop.
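A minimal sketch of the first pattern, prompt chaining with a programmatic check between steps (`llm` is a stand-in for any completion call):

```python
def chained_summary(llm, article: str) -> str:
    # Step 1: extract key points.
    points = llm(f"List the 3 key points of this article, one per line:\n{article}")
    # Gate: cheap programmatic check before the chain continues.
    lines = [line for line in points.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(f"Expected 3 key points, got {len(lines)}.")
    # Step 2: turn the verified points into a polished summary.
    return llm("Write a one-paragraph summary from these points:\n" + points)
```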
Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an open standard for connecting AI models to external systems such as content repositories, business tools, and development environments. In place of one-off integrations, it provides a single protocol for secure, two-way connections, giving models the context they need to produce relevant responses.
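For a concrete feel, a minimal MCP server sketch using the official Python SDK's `FastMCP` helper (the `mcp` package); the tool is a stub, and the exact API surface may differ across SDK versions:

```python
from mcp.server.fastmcp import FastMCP

# A server exposing one tool that connected models can discover and call.
mcp = FastMCP("docs-server")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search the internal documentation repository (illustrative stub)."""
    return f"Top result for: {query}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```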
Environments
Agents interact with their environments as “text games,” receiving textual observations and producing textual actions.
- Physical Environments: AI interacts with the physical world via perceptual inputs (e.g., vision, audio) converted into text and robotic planners executing commands.
- Dialogue Environments: Agents engage in linguistic interactions, assisting with tasks or collaborating with other agents in simulations, debates, or problem-solving.
- Digital Environments: AI operates in virtual platforms like APIs or websites, augmenting knowledge and computation in cost-effective, testable settings.
Strategies to improve model tool selection:
- In-Context Learning: Example-based inference (e.g., ReAct framework).
- Retrieval-Based In-Context Learning: Dynamically retrieve relevant examples or tool descriptions from external memory at query time (sketched after this list).
- Fine-Tuning: Further train the model on larger, task-specific datasets of tool-use examples for enhanced performance.
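A sketch of the retrieval-based strategy applied to tool selection: embed each tool's description, then place only the most relevant tools in the prompt instead of all of them. `embed` and `cosine` are assumed stand-ins (e.g., the `cosine` helper from the RAG sketch above):

```python
def select_tools(embed, cosine, tool_specs: list, query: str, k: int = 2) -> list:
    """Return the k tool specs whose descriptions best match the query."""
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(spec["description"])), spec)
              for spec in tool_specs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [spec for _, spec in scored[:k]]
```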
Case Studies
Resources
Research Papers
Blog Posts and Reports
Frameworks