lesson-14

18 MIN READ | UPDATED: 2026-05-07

🎯 Learning Objectives for This Session

Welcome, future AI architects, to Part 14 of the LangGraph Multi-Agent Masterclass! Remember the ambitious "Universal AI Content Agency" we built previously? It can already plan, research, write, and even edit. But now, we need to give it a superpower—a "time machine" that allows it to "travel back in time," or rather, resume exactly from where it left off.

By the end of this session, you will be able to:

  1. Thoroughly understand the core value of LangGraph Checkpoints: why they are the "killer feature" for building robust and efficient multi-agent systems, and where they apply in real-world projects.
  2. Master the configuration and usage of SqliteSaver and MemorySaver: Learn how to introduce these two storage solutions into your LangGraph applications and understand their respective pros and cons.
  3. Introduce persistent state storage to your "AI Content Agency": Ensure that even if the system crashes, the network disconnects, or the process restarts, our agents can seamlessly resume work from where they left off, avoiding duplicated effort and wasted resources.
  4. Master practical techniques for restoring workflows from historical state snapshots: Learn how to use the thread_id mechanism to accurately locate and restore the execution state of a specific session.

Ready? Let's embark on our LangGraph time machine journey!

📖 Concept Explanation

Picking Up Where You Left Off: Why Do We Need Checkpoints?

Imagine your "Universal AI Content Agency" is working around the clock to generate a 10,000-word article for you. The Planner has just finished a complex structural outline, the Researcher is navigating through massive amounts of data, and the Writer is halfway through drafting... Suddenly, the server goes offline! A power outage occurs! Or your program crashes due to a minor bug!

If you don't have Checkpoints, congratulations, all the hard work your agents previously did—those expensive LLM calls and time-consuming data processing—goes down the drain. You have to start over from scratch, wasting not only precious computing resources and LLM tokens (which cost real money!), but also a massive amount of time. It's an absolute nightmare!

This is the core value of LangGraph Checkpoints. They act like the auto-save feature in a video game, silently saving a complete copy of your "game progress" after each step of your multi-agent workflow. Whatever goes wrong, you can "load your save" and continue from where you left off instead of starting over.

The Core Mechanism of Checkpoints

LangGraph's checkpoint mechanism is built on top of its core State abstraction. In LangGraph, the execution state of the entire multi-agent workflow is encapsulated within a State object. Whenever an Agent (node) runs, it receives the current State, processes it according to its logic, and returns an update that LangGraph merges into the State.

Checkpoints work as follows: when a graph is compiled with a checkpointer, LangGraph automatically stores a snapshot of the State after each step of execution. Snapshots are grouped by session, and each session is identified by a thread_id that you pass in the run's config, for example {"configurable": {"thread_id": "article-42"}}. When your application needs to recover, it loads the latest State for that thread_id from storage, and the workflow resumes from there.

It's like shooting a movie. After every take (Agent execution), the director records the current scene, actor positions, and prop placements (State) by saving a snapshot. If filming is interrupted, the next time they start, they only need to reference the records to precisely restore the setup to exactly how it was when they stopped.
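
The save-and-reload cycle described above can be sketched in a few lines of plain Python. This is a toy model and not LangGraph's implementation: ToyCheckpointer, put, and latest are hypothetical names chosen for illustration. It only shows the idea that each thread_id owns an ordered history of snapshots and that recovery means reading back the most recent one.

```python
import copy
from collections import defaultdict


class ToyCheckpointer:
    """Hypothetical in-memory store: one ordered snapshot history per thread_id."""

    def __init__(self):
        self._history = defaultdict(list)

    def put(self, thread_id: str, state: dict) -> None:
        # Deep-copy so later mutations of `state` cannot alter a saved snapshot.
        self._history[thread_id].append(copy.deepcopy(state))

    def latest(self, thread_id: str):
        # Return the newest snapshot for this thread, or None if nothing was saved.
        snapshots = self._history.get(thread_id)
        return copy.deepcopy(snapshots[-1]) if snapshots else None


saver = ToyCheckpointer()
state = {"current_task": "outline", "agent_path": []}

# Simulate two node executions on thread "article-42", saving after each one.
for agent in ("planner", "researcher"):
    state["agent_path"].append(agent)
    saver.put("article-42", state)

resumed = saver.latest("article-42")        # what a restarted run would load
other = saver.latest("some-other-thread")   # unrelated sessions stay isolated
print(resumed["agent_path"], other)
```

Keying every snapshot by thread_id is what lets many independent sessions share one store without seeing each other's state.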

Two Common Savers: In-Memory and Persistent

LangGraph provides several implementations of BaseCheckpointSaver, among which the most commonly used and easiest to get started with are:

  1. MemorySaver (In-Memory Storage):
    • Characteristics: As the name suggests, it stores all state snapshots in the program's RAM.
    • Pros: Extremely fast and simple to configure. Ideal for development, debugging, testing, or short-lived tasks that don't require persistence.
    • Cons: Not persistent! Once the program process terminates, all stored states are lost. Just like a computer's RAM, it wipes clean when powered off.
  2. SqliteSaver (SQLite File Storage):
    • Characteristics: Stores state snapshots in a local SQLite database file. SQLite is a lightweight, embedded database that doesn't require a separate server process. In recent LangGraph releases, SqliteSaver ships in the separate langgraph-checkpoint-sqlite package.
    • Pros: Persistent! States are written to a disk file, so even if the program process terminates, it can recover from the file on the next startup. Configuration is relatively simple, and performance is sufficient for most small to medium-sized applications.
    • Cons: Not a good fit for ultra-large-scale, high-concurrency, or distributed deployments. SQLite allows only one writer at a time, and a local database file cannot easily be shared across machines, so write throughput and file I/O can become bottlenecks.

Additionally, LangGraph supports more robust persistence solutions like PostgresSaver and RedisSaver. These are suitable for production environments with higher demands for performance, scalability, and high availability. However, for our current "AI Content Agency" project, SqliteSaver is more than enough and serves perfectly to demonstrate the core concepts of persistent storage.
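
To make the difference between the two storage styles concrete, here is a minimal sketch that uses Python's built-in sqlite3 module directly. It does not use SqliteSaver or MemorySaver; the snapshots table and the helper names are assumptions made up for this example, and closing the connection stands in for a process exit. A file-backed database still holds the snapshot after "restarting", while an in-memory one comes back empty.

```python
import json
import os
import sqlite3
import tempfile


def ensure_schema(conn):
    # Hypothetical table layout, chosen only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "thread_id TEXT NOT NULL, state TEXT NOT NULL)"
    )


def save_snapshot(conn, thread_id, state):
    ensure_schema(conn)
    conn.execute(
        "INSERT INTO snapshots (thread_id, state) VALUES (?, ?)",
        (thread_id, json.dumps(state)),
    )
    conn.commit()


def load_latest(conn, thread_id):
    ensure_schema(conn)
    row = conn.execute(
        "SELECT state FROM snapshots WHERE thread_id = ? ORDER BY id DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None


def survives_restart(db_path):
    """Save a snapshot, close the connection, then reopen and try to read it back."""
    conn = sqlite3.connect(db_path)
    save_snapshot(conn, "t1", {"current_task": "draft"})
    conn.close()  # stand-in for the process exiting

    conn = sqlite3.connect(db_path)  # stand-in for the next startup
    try:
        return load_latest(conn, "t1")
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    file_result = survives_restart(os.path.join(tmp, "toy_checkpoints.sqlite"))
memory_result = survives_restart(":memory:")

print("file-backed:", file_result)
print("in-memory:  ", memory_result)
```

The in-memory case returns None because every new ":memory:" connection starts with an empty database, which mirrors why an in-memory saver cannot survive a process restart.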

Mermaid Diagram: Checkpoints Workflow

Let's use a Mermaid diagram to visually understand how Checkpoints work within our "AI Content Agency."

graph TD
    subgraph AI Content Agency Workflow
        A[Start Task e.g., Generate Social Media Content] --> B{Agent: Planner};
        B -- Update AgencyState --> C(Checkpoint: Save Current State Snapshot);
        C --> D{Agent: Researcher};
        D -- Update AgencyState --> E(Checkpoint: Save Current State Snapshot);
        E --> F{Agent: Writer};
        F -- Update AgencyState --> G(Checkpoint: Save Current State Snapshot);
        G --> H{Agent: Editor};
        H -- Update AgencyState --> I(Checkpoint: Save Current State Snapshot);
        I --> J[Task Completed];
    end

    subgraph Exception and Recovery Mechanism
        K[System Interruption/Error/Process Restart];
        K --> L{Restart Application};
        L -- Use the same thread_id --> M[Load AgencyState from Latest Checkpoint];
        M --> N[Resume Workflow Execution];
    end

    style C fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style E fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style G fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style I fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style M fill:#fcc,stroke:#333,stroke-width:2px,color:#000;
    style N fill:#afa,stroke:#333,stroke-width:2px,color:#000;

Diagram Explanation:

  • AI Content Agency Workflow: Our agents (Planner, Researcher, Writer, Editor) execute tasks sequentially. Each time an Agent finishes its work and updates the AgencyState, the checkpointer saves the current AgencyState in its entirety. The Checkpoint boxes in the diagram represent this automatic save; they are not nodes you add to the graph yourself.
  • Exception and Recovery Mechanism: If a system interruption or error occurs while an Agent is running (K), we restart the application (L) and pass in the same thread_id. LangGraph then loads (M) the latest AgencyState from the checkpoint store and resumes (N) from the last saved step. Agents that already finished are not re-run; only the Agent that was interrupted runs again, rather than the whole workflow starting from scratch.
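
The recovery path in the diagram can be simulated without LangGraph at all. The sketch below is a simplified stand-in under stated assumptions: the checkpoints dictionary plays the role of the checkpointer, the run function and its crash_at parameter are hypothetical, and each "agent" does nothing except record that it ran. It shows that a second run with the same thread_id skips the agents that completed before the crash.

```python
from collections import Counter

AGENTS = ["planner", "researcher", "writer", "editor"]
checkpoints = {}   # thread_id -> list of agents finished so far (toy checkpoint store)
calls = Counter()  # how many times each agent actually did its work


def run(thread_id, crash_at=None):
    # Load the latest snapshot for this thread, or start with nothing done.
    done = list(checkpoints.get(thread_id, []))
    for agent in AGENTS:
        if agent in done:
            continue  # finished before the interruption, so skip it
        if agent == crash_at:
            raise RuntimeError(f"simulated crash in {agent}")
        calls[agent] += 1
        done.append(agent)
        checkpoints[thread_id] = list(done)  # save a snapshot after every agent
    return done


try:
    run("job-1", crash_at="writer")
except RuntimeError as exc:
    print("interrupted:", exc)

print("saved before crash:", checkpoints["job-1"])
final = run("job-1")  # same thread_id: planner and researcher are not re-run
print("after resume:", final)
print("work done per agent:", dict(calls))
```

Each agent does its work exactly once across both runs, which is the saving in tokens and time that checkpoints are meant to provide.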

See? Checkpoints are like giving your multi-agent system an "immortality buff," greatly enhancing the system's resilience and reliability. For any complex, long-running AI application, this is an indispensable key feature!

💻 Practical Code Drill (Application in the Agency Project)

Now, let's apply these principles to our "Universal AI Content Agency" project. We will simulate the process of a Planner agent conducting content planning, simulate an interruption halfway through, and then use SqliteSaver to recover from the point of interruption.

1. Define AgencyState

First, we need an AgencyState capable of carrying the working state of our agency. To demonstrate Checkpoints, we'll simplify it to only include planning tasks and completed tasks.

import operator
from typing import Annotated, TypedDict, List
from langchain_core.messages import BaseMessage # Retained for AgencyState generality, though unused in this session
from langgraph.graph import StateGraph, END
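# The savers live in submodules. SqliteSaver requires the separate
# langgraph-checkpoint-sqlite package (pip install langgraph-checkpoint-sqlite).
from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver
import os
import time
import json # Not required by SqliteSaver, which serializes checkpoints itself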

# 1. Define the working state of our AI content agency.
# TypedDict defines the state structure. A field annotated with operator.add is
# append-only: each list a node returns is concatenated onto the existing one.
# A field without a reducer is overwritten by whatever the node returns.
class AgencyState(TypedDict):
    """
    Represents the overall working state of the AI content agency.
    """
    # Queue of tasks still waiting to be planned. No reducer here: the planner
    # returns the full remaining queue, so finished tasks can actually be removed.
    planning_tasks: List[str]
    # List of completed planning tasks (append-only)
    completed_plans: Annotated[List[str], operator.add]
    # Name of the task currently being processed
    current_task: str
    # Agent execution path (used for tracking and debugging, append-only)
    agent_path: Annotated[List[str], operator.add]

# Clean up any old database file so that every run starts fresh.
# This is for demonstration convenience only; never delete checkpoint data like this in production.
if os.path.exists("agency_checkpoints.sqlite"):
    os.remove("agency_checkpoints.sqlite")
    print("Removed the old agency_checkpoints.sqlite file.")

2. Define Agent Node (PlannerAgent)

We'll create a simplified PlannerAgent that simulates the execution of planning tasks and uses time.sleep() to simulate time-consuming operations, allowing us to "catch" the moment it gets interrupted.

# 2. Define a simplified Planner Agent node
class PlannerAgent:
    """
    Content Planner Agent, responsible for breaking down large content tasks into smaller planning steps.
    """
    def __init__(self, name: str):
        self.name = name

    def plan(self, state: AgencyState) -> dict:
        """
        Core logic of the Planner Agent: receives the current state, plans one task,
        and returns a partial state update.
        """
        task = state['current_task']
        # Copy the pending queue; never mutate the incoming state in place
        pending = list(state.get('planning_tasks') or [])
        print(f"\n[{self.name}] Processing task: '{task}'...")
        time.sleep(1.5) # Simulate thinking time during the planning process

        new_tasks = []
        # Simulate different planning outputs based on the current task.
        # Task names are matched as literal strings, so they are kept exactly as written;
        # English glosses are given in the trailing comments.
        if task == "生成社交媒体内容规划": # "Plan social media content"
            new_tasks = ["确定目标受众", "构思主题", "撰写初稿大纲"] # audience, topics, draft outline
        elif task == "确定目标受众": # "Identify the target audience"
            new_tasks = ["分析用户画像", "定义受众特征"] # analyze user personas, define audience traits
        elif task == "构思主题": # "Brainstorm topics"
            new_tasks = ["头脑风暴热门话题", "市场趋势分析"] # brainstorm trending topics, analyze market trends
        elif task == "撰写初稿大纲": # "Write the first-draft outline"
            new_tasks = ["确定文章结构", "分配章节任务"] # decide article structure, assign section tasks

        if new_tasks:
            # The task was broken down: its subtasks go to the front of the queue,
            # and the first subtask becomes the next current task.
            print(f"[{self.name}] Breaking down '{task}'...")
            queue = new_tasks + pending
            return {
                "planning_tasks": queue[1:], # Remaining queue (replaces the old one)
                "current_task": queue[0], # Update current task
                "agent_path": [self.name]
            }

        # The task has no further subtasks, so it is considered complete.
        # Move on to the next pending task, or clear current_task if none is left.
        print(f"[{self.name}] Completed task: '{task}'")
        return {
            "completed_plans": [task], # Appended to the completed list
            "planning_tasks": pending[1:], # Drop the task we are about to start
            "current_task": pending[0] if pending else "",
            "agent_path": [self.name]
        }
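
Because a node returns only a partial update, it helps to see how that update is combined with the existing state. The sketch below is a simplified model of the merge and not LangGraph's actual code: REDUCERS and apply_update are hypothetical names. It assumes the behaviour described by the state schema, namely that fields annotated with operator.add (such as agent_path and completed_plans) are concatenated, while fields without a reducer (such as current_task) are overwritten.

```python
import operator

# Hypothetical reducer table mirroring the Annotated[...] hints on a state schema:
# keys listed here are combined with their reducer; every other key is overwritten.
REDUCERS = {"agent_path": operator.add, "completed_plans": operator.add}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into a copy of the state."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)  # e.g. list + list
        else:
            merged[key] = value  # plain field: the last write wins
    return merged


state = {"current_task": "A", "agent_path": ["planner"], "completed_plans": []}
update = {"current_task": "B", "agent_path": ["planner"], "completed_plans": ["A"]}

new_state = apply_update(state, update)
print(new_state)

# Returning an empty list for an append-only field changes nothing:
unchanged = apply_update(new_state, {"completed_plans": []})
print(unchanged["completed_plans"])
```

The last two lines illustrate a common pitfall: with an additive reducer, returning an empty list does not clear the field, so a list that must shrink needs a field without that reducer.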

3. Build Graph and Integrate Checkpoints

We will create a simple single-node workflow in which the PlannerAgent runs in a loop until all planning tasks are completed. The key step is passing the checkpointer instance (here a SqliteSaver) to the checkpointer parameter of workflow.compile(); the StateGraph constructor itself only takes the state schema.

# 3. Build LangGraph workflow
def create_agency_workflow(checkpointer=None):
    """
    Create the AI content agency workflow and integrate Checkpointer as needed.
    """
    workflow = StateGraph(AgencyState)
    planner = PlannerAgent("Content Planner")

    # Add Planner node
    workflow.add_node("planner", planner.plan)

    # Define routing logic: if there is still work to do, hand back to the Planner; otherwise, END.
    def route_next_task(state: AgencyState):
        """
        Determine the next route based on the current state.
        """
        remaining_tasks = state.get('planning_tasks', [])
        current_task_val = state.get('current_task', '')

        # Continue while a task is in progress or the pending list is not empty
        if current_task_val or remaining_tasks:
            print("[Router] Tasks remain; handing back to the Planner.")
            return "planner"

        # No current task and nothing pending: the workflow is finished
        print("[Router] All planning tasks are complete; ending the workflow.")
        return END

    workflow.set_entry_point("planner") # Set entry point to planner
    workflow.add_conditional_edges(
        "planner",
        route_next_task,
        {"planner": "planner", END: END} # Route to planner or END
    )

    # Compile workflow and pass in checkpointer
    app = workflow.compile(checkpointer=checkpointer)
    return app
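
The loop formed by one node plus a conditional edge can be pictured as a small driver. The following is a toy sketch under stated assumptions, not how LangGraph executes graphs: run, planner, route, the END sentinel, and max_steps are all defined here for illustration only. It shows the role of the path map (router return value to next node) and why a step cap is useful when a router might never return END.

```python
END = "__end__"  # sentinel used by this sketch only


def planner(state):
    # Record the current task as visited, then take the next pending one.
    pending = state["pending"]
    return {
        "visited": state["visited"] + [state["current"]],
        "current": pending[0] if pending else "",  # empty string: nothing left
        "pending": pending[1:],
    }


def route(state):
    # Keep looping while there is a current task; otherwise finish.
    return "planner" if state["current"] else END


def run(state, nodes, router, path_map, entry, max_steps=25):
    node = entry
    for _ in range(max_steps):
        state = {**state, **nodes[node](state)}  # merge the node's update
        target = path_map[router(state)]         # map router output to next node
        if target == END:
            return state
        node = target
    raise RuntimeError("step limit reached without routing to END")


final = run(
    {"current": "a", "pending": ["b", "c"], "visited": []},
    nodes={"planner": planner},
    router=route,
    path_map={"planner": "planner", END: END},
    entry="planner",
)
print(final["visited"])
```

If the router could never return END, the loop would stop at max_steps with an error instead of running forever, which is a cheap safeguard for any self-looping workflow.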

4. Run and Restore Demo

Now, let's demonstrate how SqliteSaver achieves resumable execution, as well as the non-persistent nature of MemorySaver.

if __name__ == "__main__":
    # --- Demonstrate SqliteSaver (Persistent Storage) ---
    print("--- Demo: SqliteSaver (persistent storage) ---")
    db_file = "agency_checkpoints.sqlite"
    # Ensure each demo starts from a clean database file