Overview of How to Build AI Agents

This reading note is a curated summary of papers and blog posts on AI agents. Many texts are directly quoted or adapted from the original sources. The primary goal is to consolidate and summarize key resources on AI agents for easy reference and study. Updates will be added as new developments arise.

Note: Touch-ups by ChatGPT.

Qinyuan Wu, last updated: 2025.01.15



What Are Agents?

What’s the difference between AI agents and AI models?

Overview of an AI agent, figure from Building Effective Agents

Unlike a standalone model, which is limited to a single inference over its training knowledge, an agent pairs a model with tools, memory, and an orchestration layer. Agents achieve their goals using cognitive architectures that process information iteratively, make informed decisions, and refine actions based on previous outputs.

Key Points:

  1. Eliciting the LLM’s reasoning capabilities so it makes good decisions.
  2. Ensuring the decision-making LLM uses the right tools.
  3. Providing feedback from the environment and determining when to stop iterating.

Improving Reasoning and Planning Abilities

Task Decomposition

  1. Chain of Thought (CoT): A standard prompting technique instructing models to “think step by step,” breaking complex tasks into simpler steps. CoT enhances performance by utilizing more test-time computation and making the model’s reasoning process interpretable.

  2. Tree of Thoughts (ToT): Extends CoT by exploring multiple reasoning possibilities at each step. Problems are decomposed into thought steps, generating multiple thoughts per step, forming a tree structure. Searches can use BFS (breadth-first search) or DFS (depth-first search), with states evaluated by classifiers or majority vote.
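
To make the search concrete, here is a minimal ToT-style breadth-first search sketch. The `llm` and `score` functions are hypothetical stubs standing in for a real model call and a real state evaluator; this illustrates the control flow, not the paper’s implementation.

```python
# Minimal Tree-of-Thoughts sketch: BFS over partial reasoning chains.
# `llm` and `score` are hypothetical stand-ins for a real model call
# and a real state evaluator (e.g., a classifier or majority vote).

def llm(prompt: str, n: int = 3) -> list[str]:
    """Stub: return n candidate next thoughts for the prompt."""
    return [f"thought {i} for: {prompt[:30]}..." for i in range(n)]

def score(state: list[str]) -> float:
    """Stub: rate how promising a partial chain of thoughts is."""
    return float(len(state))  # placeholder heuristic

def tot_bfs(task: str, depth: int = 3, beam: int = 2) -> list[str]:
    frontier = [[]]  # each state is the list of thoughts so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            prompt = task + "\n" + "\n".join(state)
            for thought in llm(prompt):          # expand each state
                candidates.append(state + [thought])
        # keep only the `beam` most promising partial chains
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

print(tot_bfs("Make 24 from the numbers 4, 9, 10, 13"))
```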

Self-Reflection

  1. ReAct: Combines reasoning and acting by expanding the action space to include both task-specific actions and natural-language reasoning traces. This enables interaction with the environment (e.g., calling APIs) while documenting the reasoning process; a compressed sketch of the loop follows the figure below.

Figure from ReAct: Synergizing Reasoning and Acting in Language Models
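
A compressed sketch of the ReAct loop, assuming a hypothetical `llm` stub and a toy tool registry (the parsing and tool names are illustrative, not the paper’s code):

```python
# ReAct-style loop: the model interleaves free-text "Thought" steps with
# "Action" steps that call tools; tool results come back as "Observation".
TOOLS = {"search": lambda q: f"(stub) top result for {q!r}"}

def llm(transcript: str) -> str:
    """Stub model: always searches once, then answers."""
    if "Observation" not in transcript:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: I have enough information.\nFinish[Paris]"

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += "\n" + step
        if "Finish[" in step:                       # terminal action
            return step.split("Finish[")[1].rstrip("]")
        if "Action:" in step:                       # tool call
            name, arg = step.split("Action:")[1].strip().rstrip("]").split("[")
            transcript += f"\nObservation: {TOOLS[name.strip()](arg)}"
    return "gave up"

print(react("What is the capital of France?"))
```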

  2. Reflexion: A framework equipping agents with dynamic memory and self-reflection capabilities to improve reasoning. Reflexion uses a reinforcement-learning setup in which the evaluator provides binary rewards and actions follow the ReAct format, combining task-specific actions with language-based reasoning; a minimal trial loop is sketched after the figure below.

Figure from Reflexion: Language Agents with Verbal Reinforcement Learning
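
Conceptually, Reflexion wraps an actor in repeated trials: after a failed trial the agent writes a verbal self-reflection, stores it in episodic memory, and retries with that memory in context. A minimal sketch with hypothetical `act`, `evaluate`, and `reflect` stubs:

```python
def act(task: str, reflections: list[str]) -> str:
    """Stub actor: a real version would run a ReAct-style loop
    with past reflections prepended to the prompt."""
    return "attempt informed by " + (reflections[-1] if reflections else "nothing")

def evaluate(trajectory: str) -> bool:
    """Stub evaluator: Reflexion assumes a binary success/failure reward."""
    return "informed by reflection" in trajectory

def reflect(task: str, trajectory: str) -> str:
    """Stub self-reflection: a real version asks the LLM what went wrong."""
    return "reflection: last attempt failed, try a different tool order"

def reflexion(task: str, max_trials: int = 3) -> str:
    memory: list[str] = []          # episodic memory of verbal reflections
    for _ in range(max_trials):
        trajectory = act(task, memory)
        if evaluate(trajectory):    # binary reward from the environment
            return trajectory
        memory.append(reflect(task, trajectory))
    return "failed after all trials"

print(reflexion("book a flight"))
```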

  3. Chain of Hindsight (CoH): Fine-tunes the model on sequences of its own past outputs annotated with feedback, encouraging it to produce improved outputs conditioned on that feedback history.
  4. Algorithm Distillation (AD): Applies the same idea to reinforcement-learning trajectories, training on cross-episode histories so the model distills the improvement process itself and keeps improving at inference time.

Cognitive Architecture: Memory for Reasoning and Planning

Memory plays a crucial role in an agent’s reasoning process:

Figure from Cognitive Architectures for Language Agents

Human Brain vs. Agent Memory

Human cognitive architecture broadly includes:

Figure from Lil’Log: LLM Powered Autonomous Agents

  1. Sensory Memory: Learning embedding representations for raw inputs (e.g., text, images).
  2. Short-Term Memory: In-context learning, limited by the finite context window of transformers.
  3. Long-Term Memory: External vector stores for fast retrieval during query time.
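
A minimal sketch of how the short-term/long-term split often appears in practice: a bounded message buffer for working context plus an embedding-indexed store for retrieval. The `embed` function here is a toy stand-in for a real embedding model.

```python
import math
from collections import deque

def embed(text: str) -> list[float]:
    """Stub embedding: a real agent would call an embedding model."""
    return [text.count(c) / (len(text) or 1) for c in "aeiou"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

short_term = deque(maxlen=10)                  # recent turns kept in the prompt
long_term: list[tuple[list[float], str]] = []  # (embedding, text) pairs

def remember(text: str) -> None:
    short_term.append(text)                    # stays in the working context
    long_term.append((embed(text), text))      # indexed for later retrieval

def recall(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(long_term, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

remember("User prefers morning flights.")
remember("User's home airport is SFO.")
print(recall("Which airport should I search from?"))
```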

Some researchers suggest aligning an agent’s cognitive architecture with the human brain’s.

Figure from Cognitive Architectures for Language Agents

Using Tools

Data Stores

Data stores convert incoming documents into embeddings stored in a vector database, allowing agents to retrieve the information they need at query time. For example, Retrieval-Augmented Generation (RAG) uses vector embeddings to fetch contextually relevant passages; a compact sketch of the pattern follows the figure below.

Figure from Google Whitepaper: Agents
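
A compact sketch of the retrieve-then-generate pattern; here a toy keyword-overlap retriever stands in for real vector search over a data store:

```python
# Toy RAG: in practice the scoring would be vector similarity over
# embeddings in a vector database, not keyword overlap.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Premium users get 24/7 phone support.",
    "The API rate limit is 100 requests per minute.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return prompt  # a real agent would send this prompt to the LLM

print(answer("How fast are refunds processed?"))
```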

External APIs

  1. Tool Design: Tools should be clearly defined and well-documented, with as much prompt-engineering effort as the overall prompts. Recommendations from Building Effective Agents include: give the model enough tokens to “think” before it writes itself into a corner; keep tool formats close to text the model has seen naturally on the internet; and avoid formatting overhead such as keeping accurate line counts or string-escaping code.
  2. Enhancing Model Performance: Discussed further in the next section.

Extensions

Extensions bridge the gap between agents and APIs by teaching the agent:

  1. How to use API endpoints with examples.
  2. What arguments or parameters are required for successful API calls (a hypothetical endpoint description is sketched after the figure below).
Figure from Google Whitepaper: Agents
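
In practice this usually means handing the model a machine-readable description of each endpoint plus worked examples. A hypothetical sketch of such a description (the `search_flights` endpoint and its fields are illustrative, not a real Google API):

```python
# Hypothetical endpoint description an extension might expose to the model.
flights_extension = {
    "name": "search_flights",
    "description": "Search for one-way flights between two airports.",
    "parameters": {
        "origin":      {"type": "string", "description": "IATA code, e.g. 'SFO'"},
        "destination": {"type": "string", "description": "IATA code, e.g. 'ZRH'"},
        "date":        {"type": "string", "description": "ISO date, e.g. '2025-02-01'"},
    },
    "examples": [  # worked examples teach the model how to fill the arguments
        {"user": "Find me a flight from Austin to Zurich next Friday",
         "call": {"origin": "AUS", "destination": "ZRH", "date": "2025-01-24"}},
    ],
}
print(flights_extension["name"])
```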

Functions

Functions give developers finer control over API execution and data flow than extensions: the model decides which function to call and with what arguments, but the application itself executes the call. For instance, a user requesting a ski trip suggestion might involve:

Initial output from Ski Model

The model’s output can be structured in JSON for easier parsing:

Ski Model Output in JSON Format
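
A sketch of that structured output and client-side dispatch; the `display_cities` function follows the whitepaper’s ski-trip example, while the dispatch code is an illustrative assumption:

```python
import json

# Structured output the model might emit instead of free-form prose.
model_output = json.loads("""
{
  "function_call": {
    "name": "display_cities",
    "args": {"cities": ["Crested Butte", "Whistler", "Zermatt"],
             "preferences": "skiing"}
  }
}
""")

def display_cities(cities: list[str], preferences: str) -> None:
    """Stub: the real implementation would call, e.g., a places API."""
    print(f"Top {preferences} destinations: {', '.join(cities)}")

# The application, not the model, executes the call -- that is the control
# developers gain with functions compared to extensions.
call = model_output["function_call"]
globals()[call["name"]](**call["args"])
```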

This allows for better API usage:

Ski Agent - External API Integration

Workflows

To integrate data stores and APIs into an agent system, developers compose LLM calls and tools into workflows. Examples from Building Effective Agents include:

  1. Prompt chaining: decompose a task into a fixed sequence of LLM calls, where each call processes the output of the previous one; programmatic checks (“gates”) between steps catch errors early (a minimal sketch follows this list).

  2. Routing: classify an input and direct it to a specialized downstream prompt, tool, or model.

  3. Parallelization: run multiple LLM calls simultaneously, either splitting a task into independent subtasks (sectioning) or sampling several answers and aggregating them (voting).

  4. Orchestrator-workers: a central LLM dynamically breaks the task down, delegates subtasks to worker LLMs, and synthesizes their results.

  5. Evaluator-optimizer: one LLM generates a response while another evaluates it and feeds back critiques in a loop until the output is acceptable.
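
As a concrete example of the first pattern, a minimal prompt-chaining sketch with a programmatic gate between steps (the `llm` stub stands in for a real model call):

```python
def llm(prompt: str) -> str:
    """Stub model call."""
    return "1. point one\n2. point two\n3. point three"

def chain(topic: str) -> str:
    # Step 1: draft an outline.
    outline = llm(f"Write a 3-point outline about {topic}.")
    # Programmatic gate: validate the intermediate output before continuing.
    if len(outline.splitlines()) < 3:
        raise ValueError("outline too short; stop or retry before step 2")
    # Step 2: expand the validated outline into prose.
    return llm(f"Expand this outline into a paragraph:\n{outline}")

print(chain("AI agents"))
```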

Model Context Protocol (MCP)

The Model Context Protocol (MCP) is an open standard for connecting AI systems to external sources such as content repositories, business tools, and development environments through secure, two-way connections, so models can ground their responses in relevant, up-to-date context.
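
MCP is built on JSON-RPC 2.0: a client asks a server which tools it exposes, then invokes one. A sketch of the request shapes, assuming the `tools/list` and `tools/call` methods from the spec (the tool name and arguments are hypothetical):

```python
# JSON-RPC request shapes an MCP client sends to a server (sketch).
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_documents",              # hypothetical tool name
        "arguments": {"query": "Q3 revenue"},    # hypothetical arguments
    },
}
print(list_tools["method"], call_tool["method"])
```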

Environments

Agents interact with their environments as “text games,” receiving textual observations and producing textual actions.

  1. Physical Environments: The agent interacts with the physical world via perceptual inputs (e.g., vision, audio) converted into text, with robotic planners translating textual actions into commands.
  2. Dialogue Environments: Agents engage in linguistic interactions, assisting with tasks or collaborating with other agents in simulations, debates, or problem-solving.
  3. Digital Environments: AI operates in virtual platforms like APIs or websites, augmenting knowledge and computation in cost-effective, testable settings.

Enhancing Model Performance with Targeted Learning

Strategies to improve model tool selection:

  1. In-Context Learning: Provide few-shot examples in the prompt at inference time (e.g., the ReAct framework).
  2. Retrieval-Based In-Context Learning: Dynamically retrieve the most relevant examples or tool descriptions from external memory and insert them into the prompt.
  3. Fine-Tuning: Train the model on a larger dataset of task-specific examples (e.g., tool-use demonstrations) before inference.
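
For the retrieval-based variant, a common pattern is to select few-shot tool-use demonstrations by similarity to the incoming query. A toy sketch, with keyword overlap standing in for embedding similarity and hypothetical tool names:

```python
# Library of past tool-use demonstrations.
EXAMPLES = [
    ("What's the weather in Paris?", "call get_weather(city='Paris')"),
    ("Book a table for two tonight", "call reserve_table(size=2)"),
    ("Convert 100 USD to EUR",       "call fx_convert(100, 'USD', 'EUR')"),
]

def similarity(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_prompt(query: str, k: int = 2) -> str:
    # Pick the k most similar demonstrations and prepend them as few-shot examples.
    shots = sorted(EXAMPLES, key=lambda ex: -similarity(query, ex[0]))[:k]
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in shots)
    return f"{demos}\nQ: {query}\nA:"

print(build_prompt("What's the weather in Berlin?"))
```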

Case Studies


Resources

Research Papers

  1. ReAct: Synergizing Reasoning and Acting in Language Models
  2. Reflexion: Language Agents with Verbal Reinforcement Learning
  3. Tree of Thoughts: Deliberate Problem Solving with Large Language Models
  4. Chain of Hindsight Aligns Language Models with Feedback
  5. In-context Reinforcement Learning with Algorithm Distillation
  6. Cognitive Architectures for Language Agents

Blog Posts and Reports

  1. Building Effective Agents (Anthropic)
  2. LLM Powered Autonomous Agents (Lil’Log)
  3. Agents (Google whitepaper)
  4. Introducing the Model Context Protocol (Anthropic)

Frameworks