The Era of Experience

Deep Dive: OaK Architecture & The Era of Experience

  • Speaker: Richard Sutton (Father of Reinforcement Learning)
  • Core Theme: Next-Gen Agent Architecture: OaK (Options and Knowledge)
  • Key Quote: “What we want is a machine that can learn from experience.” — Alan Turing

1. The Paradigm Shift: Why LLMs Are Not Enough

Sutton argues that we are transitioning between major eras of AI development. To understand why we need a new architecture, we must look at the timeline.

The evolution from simulation to static data, and finally to interactive experience.

```mermaid
graph LR
    %% Era 1
    subgraph Era1 ["2014 - 2018"]
        direction TB
        A["Era of Simulation<br>(Simulated Env)"]
        A1("Atari Games")
        A2("AlphaGo / AlphaZero")
        A --> A1
        A --> A2
    end

    %% Era 2
    subgraph Era2 ["2018 - 2023"]
        direction TB
        B["Era of Human Data<br>(Static Datasets)"]
        B1("GPT-3 / GPT-4")
        B2("ChatGPT / Claude")
        B --> B1
        B --> B2
        note1["Limitation: Imitation<br>Lack of Interaction"]
    end

    %% Era 3
    subgraph Era3 ["2024 Onwards"]
        direction TB
        C["Era of Experience<br>(First-person Interaction)"]
        C1("AlphaProof")
        C2("OaK Architecture")
        C --> C1
        C --> C2
        note2["Goal: Superhuman<br>Self-generated Data"]
    end

    %% Connections
    Era1 -.-> Era2
    Era2 ==>|Paradigm Shift| Era3

    %% Apply Styles
    class A,A1,A2 past;
    class B,B1,B2,note1 present;
    class C,C1,C2,note2 future;
```

Key Analysis of the Eras

  • Era of Human Data (Status Quo):
    • Source: Static, historical human text/code.
    • Limitation: Intelligence is capped by the quality of human data. It is essentially “Imitation Learning.”
  • Era of Experience (The Future):
    • Source: First-person Experience (Sensation + Action + Reward).
    • Advantage: Like AlphaGo’s “Move 37,” RL can discover strategies humans have never found through trial-and-error.

2. Core Architecture: OaK (Options and Knowledge)

To adapt to this “Era of Experience,” Sutton proposes the OaK Architecture. This is not just a policy network, but a complete cognitive system.

How the internal components connect to build a “Mind”.

```mermaid
graph TD
    %% Styles
    style Agent fill:#f4f6f7,stroke:#333,stroke-width:2px
    style World fill:#eafaf1,stroke:#333,stroke-width:2px

    subgraph World [External Environment]
        Observation(Observation / Sensation)
        Reward(Reward Signal)
    end

    subgraph Agent [The OaK Agent]
        direction TB
        P["Perception<br>Generate State Features"]
        subgraph Mind [Decision & Modeling]
            RP["Reactive Policy & Options<br>(Behaviors)"]
            VF["Value Functions<br>Eval (Main + Subtasks)"]
            TM["Transition Model (Knowledge)<br>(Prediction)"]
        end
        Planning["Planning<br>(Longer jumps with models)"]
    end

    Action(Action)

    %% Data Flow
    World -->|Obs + Reward| P
    P -->|State Features| RP
    P -->|State Features| VF
    P -->|State Features| TM
    TM <-->|Predict Future| Planning
    Planning -.->|Improve| RP
    RP -->|Output Option| Action
    Action -->|Act upon| World
```

Component Definitions

  1. O - Options (Skills):
    • The unit of behavior is not an atomic Action, but a temporally extended skill.
    • Definition: A pair (π, γ): an internal policy π plus a termination condition γ.
    • Example: “Open the door” (Sequence of moves) vs. “Move hand 1cm” (Atomic).
  2. K - Knowledge (World Model):
    • Specifically refers to the Transition Model.
    • Function: It predicts, “If I execute this Option, where will I end up?”
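The two components above can be sketched together in code. This is a minimal illustration, not Sutton's implementation: the `Option` fields mirror the (π, γ) pair from the definition, and all names (`pi`, `gamma`, `terminates_in`, the door example) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Option:
    """A temporally extended skill: the pair (pi, gamma) described above."""
    name: str
    pi: Callable[[Dict[str, Any]], str]      # internal policy: state -> action
    gamma: Callable[[Dict[str, Any]], float] # continuation probability in [0, 1]

    def terminates_in(self, state: Dict[str, Any]) -> bool:
        # The option ends when its continuation probability drops to zero.
        return self.gamma(state) == 0.0

# "Open the door": keep pulling the handle until the door is open.
open_door = Option(
    name="open-door",
    pi=lambda s: "pull-handle",
    gamma=lambda s: 0.0 if s.get("door_open") else 1.0,
)

print(open_door.terminates_in({"door_open": True}))   # True
print(open_door.pi({"door_open": False}))             # pull-handle
```

The Knowledge half would then be a transition model keyed by option, e.g. a mapping from (state, option) to the predicted resulting state, rather than a pixel-level next-frame predictor.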

3. Runtime Mechanism: The Dynamic Loop

OaK is not a static script; it is a set of processes that run in parallel at runtime, essentially “growing” a mind.

The cycle of curiosity and mastery.

```mermaid
graph TD
    %% === Node Definitions ===
    Start((Start))
    Step1["Generate State Features<br>(Perception)"]
    Step2["Create Subproblems<br>(For highly-ranked features)"]
    Step3["Learn Options<br>(Policy + Termination)"]
    Step4["Learn Knowledge<br>(Transition Models)"]
    Step5["Plan with Models<br>(Longer Jumps)"]

    %% === Main Flow ===
    Start --> Step1
    Step1 -->|Extract New Features| Step2
    Step2 -->|Set Internal Goals| Step3
    Step3 -->|Master Skills| Step4
    Step4 -->|Predict Consequences| Step5
    Step5 -->|Better Strategy| Step1

    %% === Side Notes (Cognitive Process) ===
    subgraph Meaning [Cognitive Evolution]
        direction TB
        L1["Curiosity: Finding new things"]
        L2["Practice: Learning control"]
        L3["Understanding: Building World Model"]
        L4["Wisdom: Long-term Planning"]
    end

    %% Connections
    Step2 -.-> L1
    Step3 -.-> L2
    Step4 -.-> L3
    Step5 -.-> L4

    %% === Highlighting ===
    style Step2 fill:#f9e79f,stroke:#f1c40f,stroke-width:2px
    style Step4 fill:#aed6f1,stroke:#3498db,stroke-width:2px
```

How it works:

  1. Feature Generation: The agent notices a new feature (e.g., “The door is open”).
  2. Subproblem Creation (Curiosity): It asks, “How can I make the door open?”
  3. Learn Options (Skill): It practices until it masters opening the door.
  4. Learn Knowledge (Understanding): It learns that “Opening the door leads to the hallway.”
  5. Planning: Now it can plan: “Open door -> Go to Hallway,” jumping over the micro-steps.
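The five steps above can be walked through with a toy door-and-hallway example. Everything here is illustrative scaffolding, not Sutton's algorithm: the feature names, the stubbed learning functions, and the state layout are all invented for this sketch.

```python
def generate_features(obs):
    # Step 1: Perception turns raw observation into binary state features.
    return {"door_open": obs["door_angle"] > 0.5}

def create_subproblem(feature_name):
    # Step 2: Curiosity turns a feature into an internal goal:
    # "make this feature true".
    return {"goal": feature_name}

def learn_option(subproblem):
    # Step 3: Practice until the goal feature is reliably attained
    # (stubbed here as a named skill).
    return {"skill": f"achieve:{subproblem['goal']}"}

def learn_model(option):
    # Step 4: Learn the option's consequence: where does executing it land us?
    return {option["skill"]: "hallway"}

def plan(model, skill):
    # Step 5: Plan at the skill level, jumping over the micro-steps.
    return [skill, f"go_to_{model[skill]}"]

features = generate_features({"door_angle": 0.9})
sub = create_subproblem("door_open")
opt = learn_option(sub)
model = learn_model(opt)
print(plan(model, opt["skill"]))
# ['achieve:door_open', 'go_to_hallway']
```

The point of the cycle is that the output of Step 5 feeds back into Step 1: acting on the plan exposes new features, which spawn new subproblems.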

4. Comparison: Standard RL vs. OaK

Why is OaK better for AGI?

| Feature | Standard RL Agent | OaK Agent (Sutton’s Vision) |
| --- | --- | --- |
| Goal | Single Main Goal | Multiple Goals (Main + Countless Subproblems) |
| Unit of Behavior | Atomic Action | Option / Skill |
| World Model | Pixel-level / Next-step | State-to-State / Consequence Prediction |
| Driver | Extrinsic Reward | Curiosity & Feature Attainment |
| Planning | Short-sighted or expensive | Long-jump Planning (Temporal Abstraction) |
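The “Planning” row can be made concrete with a small search over two hand-made transition models: an atomic planner must chain many one-step transitions, while an option-level planner reaches the same goal in a couple of long jumps. Both models are invented for this sketch.

```python
# Atomic model: one state per micro-movement.
atomic_model = {
    ("start", "step"): "s1",
    ("s1", "step"): "s2",
    ("s2", "step"): "s3",
    ("s3", "step"): "hallway",
}

# Option model: state-to-state consequences of whole skills.
option_model = {
    ("start", "open_door"): "door_open",
    ("door_open", "walk_through"): "hallway",
}

def plan_length(model, start, goal):
    # Breadth-first expansion over the model; each edge is one planning step.
    frontier, depth = [start], 0
    while goal not in frontier:
        frontier = [nxt for (s, _), nxt in model.items() if s in frontier]
        depth += 1
    return depth

print(plan_length(atomic_model, "start", "hallway"))  # 4 atomic steps
print(plan_length(option_model, "start", "hallway"))  # 2 option jumps
```

Same goal, same search procedure; temporal abstraction simply shortens the horizon the planner has to reason over.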

5. Key Takeaways for Review

  1. The Route to AGI: Sutton argues LLMs are great interfaces to knowledge, but RL is the core of intelligence, because only RL breaks the data ceiling through interaction.
  2. Revival of Hierarchical RL (HRL): OaK is essentially the ultimate form of HRL: automatic sub-goal discovery, automatic skill learning, and automatic high-level modeling.
  3. The Nature of Planning: In OaK, planning happens at the “Concept/Skill” level, not the “Pixel” level. This mirrors human reasoning (e.g., “Drive to airport” vs. “Move foot 1cm forward”).

The Era of Experience
https://yima-gu.github.io/2025/11/30/talks/The Era of Experience/
Author: Yima Gu
Published: December 1, 2025