The Era of Experience

Deep Dive: OaK Architecture & The Era of Experience

  • Speaker: Richard Sutton (Father of Reinforcement Learning)
  • Core Theme: Next-Gen Agent Architecture: OaK (Options and Knowledge)
  • Key Quote: “What we want is a machine that can learn from experience.” — Alan Turing

1. The Paradigm Shift: Why LLMs Are Not Enough

Sutton argues that we are transitioning between major eras of AI development. To understand why we need a new architecture, we must look at the timeline.

The evolution from simulation to static data, and finally to interactive experience.

```mermaid
graph LR
    %% Era 1
    subgraph Era1 ["2014 - 2018"]
        direction TB
        A["Era of Simulation<br>(Simulated Env)"]
        A1("Atari Games")
        A2("AlphaGo / AlphaZero")
        A --> A1
        A --> A2
    end

    %% Era 2
    subgraph Era2 ["2018 - 2023"]
        direction TB
        B["Era of Human Data<br>(Static Datasets)"]
        B1("GPT-3 / GPT-4")
        B2("ChatGPT / Claude")
        B --> B1
        B --> B2
        note1["Limitation: Imitation<br>Lack of Interaction"]
    end

    %% Era 3
    subgraph Era3 ["2024 Onwards"]
        direction TB
        C["Era of Experience<br>(First-person Interaction)"]
        C1("AlphaProof")
        C2("OaK Architecture")
        C --> C1
        C --> C2
        note2["Goal: Superhuman<br>Self-generated Data"]
    end

    %% Connections
    Era1 -.-> Era2
    Era2 ==>|Paradigm Shift| Era3

    %% Apply Styles
    class A,A1,A2 past;
    class B,B1,B2,note1 present;
    class C,C1,C2,note2 future;
```

Key Analysis of the Eras

  • Era of Human Data (Status Quo):
    • Source: Static, historical human text/code.
    • Limitation: Intelligence is capped by the quality of human data. It is essentially “Imitation Learning.”
  • Era of Experience (The Future):
    • Source: First-person Experience (Sensation + Action + Reward).
    • Advantage: Like AlphaGo’s “Move 37,” RL can discover strategies humans have never found through trial-and-error.

2. Core Architecture: OaK (Options and Knowledge)

To adapt to this “Era of Experience,” Sutton proposes the OaK Architecture. This is not just a policy network, but a complete cognitive system.

How the internal components connect to build a “Mind”.

```mermaid
graph TD
    %% Styles
    style Agent fill:#f4f6f7,stroke:#333,stroke-width:2px
    style World fill:#eafaf1,stroke:#333,stroke-width:2px

    subgraph World [External Environment]
        Observation(Observation / Sensation)
        Reward(Reward Signal)
    end

    subgraph Agent [The OaK Agent]
        direction TB
        P["Perception<br>Generate State Features"]
        subgraph Mind [Decision & Modeling]
            RP["Reactive Policy & Options<br>(Behaviors)"]
            VF["Value Functions<br>Eval (Main + Subtasks)"]
            TM["Transition Model (Knowledge)<br>(Prediction)"]
        end
        Planning["Planning<br>(Longer jumps with models)"]
    end

    Action(Action)

    %% Data Flow
    World -->|Obs + Reward| P
    P -->|State Features| RP
    P -->|State Features| VF
    P -->|State Features| TM
    TM <-->|Predict Future| Planning
    Planning -.->|Improve| RP
    RP -->|Output Option| Action
    Action -->|Act upon| World
```

Component Definitions

  1. O - Options (Skills):
    • The unit of behavior is not an atomic Action, but a temporally extended skill.
    • Definition: A pair (π, γ): an internal policy π plus a termination condition γ.
    • Example: “Open the door” (Sequence of moves) vs. “Move hand 1cm” (Atomic).
  2. K - Knowledge (World Model):
    • Specifically refers to the Transition Model.
    • Function: It predicts, “If I execute this Option, where will I end up?”
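The two components above can be sketched together in code. This is a minimal illustration, not Sutton's implementation: the `Option` fields mirror the (π, γ) pair from the definition, and all names (`pi`, `gamma`, `terminates_in`, the door example) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Option:
    """A temporally extended skill: the pair (pi, gamma) described above."""
    name: str
    pi: Callable[[Dict[str, Any]], str]      # internal policy: state -> action
    gamma: Callable[[Dict[str, Any]], float] # continuation probability in [0, 1]

    def terminates_in(self, state: Dict[str, Any]) -> bool:
        # The option ends when its continuation probability drops to zero.
        return self.gamma(state) == 0.0

# "Open the door": keep pulling the handle until the door is open.
open_door = Option(
    name="open-door",
    pi=lambda s: "pull-handle",
    gamma=lambda s: 0.0 if s.get("door_open") else 1.0,
)

print(open_door.terminates_in({"door_open": True}))   # True
print(open_door.pi({"door_open": False}))             # pull-handle
```

The Knowledge half would then be a transition model keyed by option, e.g. a mapping from (state, option) to the predicted resulting state, rather than a pixel-level next-frame predictor.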

3. Runtime Mechanism: The Dynamic Loop

OaK is not a static script; it is a set of processes that run in parallel at runtime, essentially “growing” a mind.

The cycle of curiosity and mastery.

```mermaid
graph TD
    %% === Node Definitions ===
    Start((Start))
    Step1["Generate State Features<br>(Perception)"]
    Step2["Create Subproblems<br>(For highly-ranked features)"]
    Step3["Learn Options<br>(Policy + Termination)"]
    Step4["Learn Knowledge<br>(Transition Models)"]
    Step5["Plan with Models<br>(Longer Jumps)"]

    %% === Main Flow ===
    Start --> Step1
    Step1 -->|Extract New Features| Step2
    Step2 -->|Set Internal Goals| Step3
    Step3 -->|Master Skills| Step4
    Step4 -->|Predict Consequences| Step5
    Step5 -->|Better Strategy| Step1

    %% === Side Notes (Cognitive Process) ===
    subgraph Meaning [Cognitive Evolution]
        direction TB
        L1["Curiosity: Finding new things"]
        L2["Practice: Learning control"]
        L3["Understanding: Building World Model"]
        L4["Wisdom: Long-term Planning"]
    end

    %% Connections
    Step2 -.-> L1
    Step3 -.-> L2
    Step4 -.-> L3
    Step5 -.-> L4

    %% === Highlighting ===
    style Step2 fill:#f9e79f,stroke:#f1c40f,stroke-width:2px
    style Step4 fill:#aed6f1,stroke:#3498db,stroke-width:2px
```

How it works:

  1. Feature Generation: The agent notices a new feature (e.g., “The door is open”).
  2. Subproblem Creation (Curiosity): It asks, “How can I make the door open?”
  3. Learn Options (Skill): It practices until it masters opening the door.
  4. Learn Knowledge (Understanding): It learns that “Opening the door leads to the hallway.”
  5. Planning: Now it can plan: “Open door -> Go to Hallway,” jumping over the micro-steps.
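The five steps above can be walked through with a toy door-and-hallway example. Everything here is illustrative scaffolding, not Sutton's algorithm: the feature names, the stubbed learning functions, and the state layout are all invented for this sketch.

```python
def generate_features(obs):
    # Step 1: Perception turns raw observation into binary state features.
    return {"door_open": obs["door_angle"] > 0.5}

def create_subproblem(feature_name):
    # Step 2: Curiosity turns a feature into an internal goal:
    # "make this feature true".
    return {"goal": feature_name}

def learn_option(subproblem):
    # Step 3: Practice until the goal feature is reliably attained
    # (stubbed here as a named skill).
    return {"skill": f"achieve:{subproblem['goal']}"}

def learn_model(option):
    # Step 4: Learn the option's consequence: where does executing it land us?
    return {option["skill"]: "hallway"}

def plan(model, skill):
    # Step 5: Plan at the skill level, jumping over the micro-steps.
    return [skill, f"go_to_{model[skill]}"]

features = generate_features({"door_angle": 0.9})
sub = create_subproblem("door_open")
opt = learn_option(sub)
model = learn_model(opt)
print(plan(model, opt["skill"]))
# ['achieve:door_open', 'go_to_hallway']
```

The point of the cycle is that the output of Step 5 feeds back into Step 1: acting on the plan exposes new features, which spawn new subproblems.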

4. Comparison: Standard RL vs. OaK

Why is OaK better for AGI?

| Feature | Standard RL Agent | OaK Agent (Sutton’s Vision) |
| --- | --- | --- |
| Goal | Single Main Goal | Multiple Goals (Main + Countless Subproblems) |
| Unit of Behavior | Atomic Action | Option / Skill |
| World Model | Pixel-level / Next-step | State-to-State / Consequence Prediction |
| Driver | Extrinsic Reward | Curiosity & Feature Attainment |
| Planning | Short-sighted or expensive | Long-jump Planning (Temporal Abstraction) |
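The “Planning” row can be made concrete with a small search over two hand-made transition models: an atomic planner must chain many one-step transitions, while an option-level planner reaches the same goal in a couple of long jumps. Both models are invented for this sketch.

```python
# Atomic model: one state per micro-movement.
atomic_model = {
    ("start", "step"): "s1",
    ("s1", "step"): "s2",
    ("s2", "step"): "s3",
    ("s3", "step"): "hallway",
}

# Option model: state-to-state consequences of whole skills.
option_model = {
    ("start", "open_door"): "door_open",
    ("door_open", "walk_through"): "hallway",
}

def plan_length(model, start, goal):
    # Breadth-first expansion over the model; each edge is one planning step.
    frontier, depth = [start], 0
    while goal not in frontier:
        frontier = [nxt for (s, _), nxt in model.items() if s in frontier]
        depth += 1
    return depth

print(plan_length(atomic_model, "start", "hallway"))  # 4 atomic steps
print(plan_length(option_model, "start", "hallway"))  # 2 option jumps
```

Same goal, same search procedure; temporal abstraction simply shortens the horizon the planner has to reason over.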

5. Key Takeaways for Review

  1. The Route to AGI: Sutton argues LLMs are great interfaces to knowledge, but RL is the core of intelligence, because only RL breaks the data ceiling through interaction.
  2. Revival of Hierarchical RL (HRL): OaK is essentially the ultimate form of HRL: automatic sub-goal discovery, automatic skill learning, and automatic high-level modeling.
  3. The Nature of Planning: In OaK, planning happens at the “Concept/Skill” level, not the “Pixel” level. This mirrors human reasoning (e.g., “Drive to airport” vs. “Move foot 1cm forward”).

The Era of Experience
https://yima-gu.github.io/2025/11/30/talks/The Era of Experience/
Author: Yima Gu
Published: December 1, 2025