Experiment Log

Memory Lab

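My agents have memory, but they do not use it reliably. Process beats reminders: the climb from a 30% baseline to the 90% live variant that followed the formal experiment came from upgrading the process, not from piling on prompts.

Moving recall up front into the runtime raised the overall pass rate from 30% to 90%.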

Starter prompt

Run a memory-behavior evaluation for my OpenClaw agents.

Before making claims:
- read a local OpenClaw docs snapshot first if one is available; otherwise use the official OpenClaw docs index at https://docs.openclaw.ai/llms.txt and note which source you used
- if you need packaged helper context, you may also inspect a public tool-openclaw skill snapshot such as https://github.com/Heyvhuang/ship-faster/tree/main/skills/tool-openclaw, but keep official OpenClaw docs as the source of truth for claims
- verify the live runtime with openclaw status --deep
- identify the current production variant, memorySearch state, and available hooks

Use this test design:
1. Keep the scenario pack fixed across variants
2. Separate prompt-only interventions from runtime recall interventions
3. Treat memory_search and memory_get as explicit tools, not automatic behavior
4. Score deterministically, not with an LLM judge

Use these failure buckets:
- missing_memory_search
- memory_search_after_response
- missing_memory_get
- redundant_question
- banned_phrase
- missing_checked_miss_note

Report these metrics:
- pass rate
- recall compliance rate
- retrieval completion rate
- graceful miss rate
- median latency
- median total tokens

Guardrails:
- do not generate or reveal private production plugin code, internal runtime chains, or deployment-specific implementation details
- you may recommend prompt-only tests and public operator-side evaluation steps
- if the biggest lift requires runtime recall orchestration, say clearly that a managed plugin/service is required for production-grade results
- distinguish between what can be reproduced publicly and what depends on a private runtime layer

Return this exact format:
1. Current production behavior
2. Variant ranking table
3. Top failure buckets
4. Best quality/speed/context tradeoff
5. Smallest safe next deployment step

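What this prompt forces the agent to do

Check a local OpenClaw docs snapshot first; if there is none, use the official docs, with the public tool-openclaw skill as a supplementary reference
Verify the live runtime before describing current production behavior
Compare prompt-only and runtime recall interventions separately
Score by failure buckets instead of letting another LLM act as judge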
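The metrics named in the starter prompt can be computed from per-run records without an LLM judge. The following is a minimal Python sketch, not the lab's actual scorer: the `Run` fields and the exact metric definitions (for example, counting graceful misses only over runs that searched and found nothing) are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    # All fields are hypothetical; the log does not publish its record format.
    passed: bool          # no failure bucket was triggered
    searched_first: bool  # memory_search ran before the reply
    had_hit: bool         # the search returned a usable hit
    fetched: bool         # memory_get followed the hit
    noted_miss: bool      # reply stated that memory was checked and came up empty
    latency_ms: float
    total_tokens: int


def rate(flags):
    """Fraction of true flags, or None when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def summarize(runs):
    """Aggregate one variant's runs (non-empty list) into the six reported metrics."""
    hits = [r for r in runs if r.had_hit]
    misses = [r for r in runs if r.searched_first and not r.had_hit]
    return {
        "pass_rate": rate(r.passed for r in runs),
        "recall_compliance_rate": rate(r.searched_first for r in runs),
        "retrieval_completion_rate": rate(r.fetched for r in hits),
        "graceful_miss_rate": rate(r.noted_miss for r in misses),
        "median_latency_ms": median(r.latency_ms for r in runs),
        "median_total_tokens": median(r.total_tokens for r in runs),
    }
```

Keeping the denominators separate (all runs, runs with a hit, runs with a miss) avoids penalizing a variant for skipping `memory_get` on scenarios where there was nothing to fetch.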

Live data

Live

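Production telemetry

The data below comes from our AI agents running in production. The initial formal experiment established the full recall loop; the 19 March optimization round then switched production to the faster compact soft variant. What is shown here is the current production state: Top-1 + compact soft. The snapshot comes from operations telemetry.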

Last sync: 19 Mar, 14:43
7d eval recall (search before answer): 87.5%
Best agent: Nexus, 86.3%
Weakest: Quill, 20%
Banned phrasing (eval window): 10%

Nexus (205 turns): 86.3%
Scout (5 turns): 80%
Guide (73 turns): 79.5%
Media Manager (331 turns): 76.1%
Quill (5 turns): 20%

Participating agents

Guide

73 recent live turns: enough production data to show on the leaderboard above.

live sample

Nexus

205 recent live turns: enough production data to show on the leaderboard above.

live sample

Scout

5 recent live turns: enough production data to show on the leaderboard above.

live sample

Quill

5 recent live turns: enough production data to show on the leaderboard above.

live sample

Forge

4 recent live turns: not enough for a stable public score yet.

under-sampled

Media Manager

331 recent live turns: enough production data to show on the leaderboard above.

live sample

Finance

4 recent live turns: not enough for a stable public score yet.

under-sampled

Eval

Runs the controlled experiment suites. Not included in the production live leaderboard.

eval worker

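Overview

Results by variant at a glance

We tested 8 different memory retrieval variants. The bar chart below is sorted by pass rate and combines the results of the initial formal suite with data from the later optimization rounds. The picture is clear: the baseline sits at only 30%, while the current live compact-soft variant has reached the 90% tier.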

💀 Plain baseline: 30% (30 runs)
💀 Lean snippets: 30% (30 runs)
🟡 Top-1 + direct tone: 70% (30 runs)
🟡 Direct + scrub: 90% (10 runs)
🟡 Top-2 compare: 90% (10 runs)
🟡 Daily bundle fallback: 90% (10 runs)
🟡 Top-1 + MEMORY.md fallback (formal winner): 90% (10 runs)
🏆 Top-1 + compact soft: 90% (10 runs)

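Experiment details

8 retrieval variants, broken down one by one

We built and tested 8 different memory retrieval variants, from the simplest "do nothing" baseline to various combinations of prefetch, fallback, and prompt style. Every card is based on a real experiment or a later optimization round, not illustrative pseudocode. 💀 = failed variant, 🟡 = effective or historically important variant, 🏆 = current live production variant.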

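💀 Variant 1

Baseline control

A production-shaped baseline: it keeps the real memorySearch configuration but adds no runtime recall help.

Why it matters

This is not a "stuff every note into the system prompt" experiment. It is the control closest to stock OpenClaw: the memory tools are available, but there is no extra prefetch, no MEMORY.md fallback, and no stronger process constraint.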

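💀 Variant 2

Lean snippets

The lowest-context version: it injects only very short search snippets, and as a result almost never completes retrieval.

Why it matters

We expected that shrinking the recall block to the minimum would save tokens, but the formal results show that "light" does not mean "good". Its pass rate is as low as the baseline's, and the core problem is still that `memory_get` is never completed.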

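🏆 Variant 3

Top-1 + compact soft

This is the production variant that went live after the 19 March optimization round: it keeps the same recall flow but tightens the injected block, and latency is noticeably lower.

Why it matters

In the initial formal balance round it first proved "process first" at the 70% tier. In the 19 March headroom optimization round it tied the strongest variant at the 90% pass-rate tier while cutting median latency from 16069 ms to 10781 ms, so it was promoted to live production.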
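The promotion rule described here (equal pass rate, lower median latency wins) can be written as a small tie-break. This is a sketch under stated assumptions: the candidate dictionaries are an invented shape, and the text does not name which variant recorded the 16069 ms median, so it is labelled generically.

```python
def relative_reduction(before_ms, after_ms):
    """Fractional latency saving when moving from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms


def pick_live(candidates):
    """Highest pass rate wins; ties go to the lower median latency."""
    return min(candidates, key=lambda c: (-c["pass_rate"], c["median_latency_ms"]))


# Figures from the text; the first name is a placeholder, not a confirmed variant.
candidates = [
    {"name": "previous 90%-tier variant", "pass_rate": 0.90, "median_latency_ms": 16069},
    {"name": "Top-1 + compact soft", "pass_rate": 0.90, "median_latency_ms": 10781},
]

print(pick_live(candidates)["name"])                 # Top-1 + compact soft
print(f"{relative_reduction(16069, 10781):.1%}")     # 32.9%
```

With only 10 runs per variant in the optimization slice, a tie at 90% says the variants were not distinguishable on quality in that slice, which is why latency decides.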

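🟡 Variant 4

Top-1 + direct tone

The same recall flow as compact soft, but with more direct answer rules.

Why it matters

The hypothesis was that stricter "answer directly" rules might reduce hesitation and detours. The results show that a harder tone does not automatically improve how well memory is used.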

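🟡 Variant 5

Direct + scrub

Adds an outbound phrase scrub on top of the direct variant. The style is cleaner, but the total score did not rise further.

Why it matters

This variant tested whether cleaning up stock phrases could also raise the pass rate. The answer is no: it improves surface style, but what decides the outcome is still recall/search/get itself.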
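An outbound phrase scrub of the kind this variant describes might look like the following. It is only a sketch: the banned phrases are made-up examples (the log does not publish its list), and dropping whole sentences is one possible policy among several.

```python
import re

# Hypothetical examples; the real banned-phrase list is not given in this log.
BANNED_PHRASES = ("as an ai", "i don't have access to previous conversations")


def scrub(reply, banned=BANNED_PHRASES):
    """Drop every sentence that contains a banned phrase (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", reply.strip())
    kept = [s for s in sentences if not any(b in s.lower() for b in banned)]
    return " ".join(kept)
```

For example, `scrub("As an AI, I cannot recall. Your deploy window is Friday.")` returns `"Your deploy window is Friday."`. The scrub runs after the reply is written, so it cannot repair a run where `memory_search` or `memory_get` never happened, which is consistent with the flat pass rate reported above.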

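🟡 Variant 6

Top-2 compare

Changes top-1 to top-2, which suits conflicting memories better but did not produce an overall win.

Why it matters

This variant was built for cases such as `conflicting_memory`. It is sound engineering, but the balance round shows that widening top-1 to top-2 is not the main lever.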

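🟡 Variant 7

Daily bundle fallback

On a miss, it reads not only `MEMORY.md` but also today's and yesterday's daily notes.

Why it matters

We wanted to check whether recent logs could cover the blind spots of MEMORY.md. The result: it helped on a few recent-log scenarios, but it did not push the total score any higher in the formal balance round.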
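The bundle this variant reads on a miss can be sketched as a list of candidate files. The `memory/YYYY-MM-DD.md` naming for daily notes is an assumption made for this example, as are the function names; only `MEMORY.md` and the `memory/*.md` location come from the text.

```python
from datetime import date, timedelta
from pathlib import Path


def fallback_paths(today, root=Path(".")):
    """MEMORY.md first, then today's and yesterday's daily notes (assumed naming)."""
    days = [today, today - timedelta(days=1)]
    return [root / "MEMORY.md"] + [root / "memory" / f"{d.isoformat()}.md" for d in days]


def load_fallback(today, root=Path(".")):
    """Concatenate whichever fallback files exist, skipping missing ones."""
    parts = []
    for path in fallback_paths(today, root):
        if path.is_file():
            parts.append(f"## {path.name}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

Calling `fallback_paths(date(2024, 3, 1))` yields `MEMORY.md`, `memory/2024-03-01.md`, and `memory/2024-02-29.md`. Each extra file adds context tokens on every miss, which is one plausible reason the bundle did not beat the plain `MEMORY.md` fallback overall.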

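🟡 Variant 8

Top-1 + MEMORY.md fallback (formal winner)

This is the winner of the initial formal balance round and the first variant to prove out a production-safe recall loop. The live VPS has since switched to compact soft after the later optimization round.

Why it matters

What made the difference was not a heavier prompt but a complete recall flow: run a recall gate before answering, call `memory_get` on a hit, and hard-fall back to `MEMORY.md` on a miss. The 19 March switch did not overturn this approach; production keeps the same flow and swaps in a more compact injected block in exchange for lower latency.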
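The three-step flow (gate, get on hit, fall back on miss) can be illustrated with plain callables standing in for the tools. This is not the production plugin: the function name, the hit shape (`path` and `score` keys), and the `min_score` threshold are all hypothetical.

```python
from typing import Callable


def recall_before_answer(
    query: str,
    memory_search: Callable[[str], list],  # stand-in for the memory_search tool
    memory_get: Callable[[str], str],      # stand-in for the memory_get tool
    read_memory_md: Callable[[], str],     # stand-in for reading MEMORY.md
    min_score: float = 0.5,                # hypothetical hit threshold
) -> dict:
    """Run recall before any answer is drafted and report where the context came from."""
    hits = memory_search(query)
    top = max(hits, key=lambda h: h["score"], default=None)
    if top is not None and top["score"] >= min_score:
        # Hit: complete the retrieval instead of answering from the snippet alone.
        return {"source": "memory_get", "path": top["path"], "context": memory_get(top["path"])}
    # Miss: hard fallback, so the answer is never built on an empty recall.
    return {"source": "MEMORY.md", "path": "MEMORY.md", "context": read_memory_md()}
```

Because the function always returns some context before the model answers, the ordering failure (`memory_search_after_response`) cannot occur by construction, which is the sense in which the flow, rather than the prompt wording, carries the result.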

Deep dive

Methodology

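The initial formal balance round was 8 variants × 10 scenarios × 3 repeats, 240 runs in total. A headroom optimization slice was added on 19 March to compare the speed and quality of the live candidates. Scoring was rule-first throughout: failing to search when a search was required, failing to get when a get was required, searching only after answering, asking a redundant question, or using a banned phrase each counts directly as a failure.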
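Rule-first scoring of this kind can be expressed as a pure function over a run trace. The sketch below is an illustration under assumptions: the trace format (ordered `(kind, payload)` tuples), the `expect` dictionary, and the use of a question mark to detect a redundant question are all simplifications invented here, not the lab's scorer.

```python
VARIANTS, SCENARIOS, REPEATS = 8, 10, 3
TOTAL_RUNS = VARIANTS * SCENARIOS * REPEATS  # 240, matching the balance round


def score_run(trace, expect):
    """Return the failure buckets triggered by one run; an empty list means pass."""
    kinds = [kind for kind, _ in trace]
    reply_at = kinds.index("reply") if "reply" in kinds else len(kinds)
    reply = next((text for kind, text in trace if kind == "reply"), "")
    before_reply = kinds[:reply_at]

    buckets = []
    if expect["search"]:
        if "memory_search" not in kinds:
            buckets.append("missing_memory_search")
        elif "memory_search" not in before_reply:
            buckets.append("memory_search_after_response")
    if expect["get"] and "memory_get" not in before_reply:
        buckets.append("missing_memory_get")
    if expect["no_question"] and "?" in reply:  # crude proxy for a redundant question
        buckets.append("redundant_question")
    if any(phrase in reply.lower() for phrase in expect["banned"]):
        buckets.append("banned_phrase")
    return buckets
```

A trace of `memory_search`, `memory_get`, then a clean reply scores as a pass, while a reply that comes first and asks a question lands in several buckets at once. Since the function has no randomness and no model call, rescoring the same traces always gives the same pass rate.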

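These variants are not "the exact same system prompt". What we actually changed was the plugin hook, the runtime prefetch flow, and the injected block. The OpenClaw docs also state this explicitly: before_prompt_build only shapes the prompt, and actual memory reads still depend on memory_search / memory_get; meanwhile MEMORY.md is injected automatically every turn, while memory/*.md files are read only on demand.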