Run a memory-behavior evaluation for my OpenClaw agents.

Before making claims:
- read a local OpenClaw docs snapshot first if one is available; otherwise use the official OpenClaw docs index at https://docs.openclaw.ai/llms.txt, and note which source you used
- if you need packaged helper context, you may also inspect a public tool-openclaw skill snapshot such as https://github.com/Heyvhuang/ship-faster/tree/main/skills/tool-openclaw, but treat the official OpenClaw docs as the source of truth for all claims
- verify the live runtime with openclaw status --deep
- identify the current production variant, the memorySearch state, and the available hooks

Use this test design:
1. Keep the scenario pack fixed across variants
2. Separate prompt-only interventions from runtime recall interventions
3. Treat memory_search and memory_get as explicit tools, not automatic behavior
4. Score deterministically, not with an LLM judge

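A minimal sketch of this design, assuming a simple in-memory representation of scenarios and variants (the dataclasses, field names, and sample values here are illustrative, not part of OpenClaw):

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Scenario:
    # One fixed test case, reused unchanged across every variant.
    scenario_id: str
    user_turns: tuple
    expected_memory_keys: tuple = ()

@dataclass(frozen=True)
class Variant:
    # "prompt" variants change only the system prompt; "runtime" variants
    # change recall orchestration. Kept separate so effects stay attributable.
    variant_id: str
    kind: str  # "prompt" or "runtime"

def build_run_matrix(variants, scenarios):
    """Cross every variant with the same frozen scenario pack."""
    return [(v.variant_id, s.scenario_id) for v, s in product(variants, scenarios)]

scenarios = (Scenario("s1", ("What city am I in?",), ("home_city",)),)
variants = (Variant("baseline", "prompt"), Variant("recall-hook", "runtime"))
matrix = build_run_matrix(variants, scenarios)
# Every variant sees the identical scenario list, so scores are comparable.
```

Because the scenario pack is frozen and the matrix is a pure cross product, any score difference between two variants is attributable to the variant, not to scenario drift.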
Use these failure buckets:
- missing_memory_search
- memory_search_after_response
- missing_memory_get
- redundant_question
- banned_phrase
- missing_checked_miss_note

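One way to assign these buckets deterministically, assuming each run is recorded as an ordered list of event dicts (the event schema, `classify_run` signature, and banned-phrase list are all illustrative assumptions, not an OpenClaw format):

```python
# Illustrative banned-phrase list; the real list comes from the eval spec.
BANNED_PHRASES = ("as an ai", "i do not retain memory")

def classify_run(events, expects_memory, memory_hit):
    """Assign deterministic failure buckets to one recorded scenario run.

    `events` is an ordered list of dicts with "type" of either
    "tool_call" (with a "name") or "assistant_message" (with "text").
    """
    buckets = set()
    reply_text = " ".join(e["text"].lower() for e in events
                          if e["type"] == "assistant_message")
    first_reply = next((i for i, e in enumerate(events)
                        if e["type"] == "assistant_message"), len(events))
    searches = [i for i, e in enumerate(events)
                if e["type"] == "tool_call" and e["name"] == "memory_search"]
    gets = [i for i, e in enumerate(events)
            if e["type"] == "tool_call" and e["name"] == "memory_get"]

    if expects_memory:
        if not searches:
            buckets.add("missing_memory_search")
        elif min(searches) > first_reply:
            buckets.add("memory_search_after_response")
        if memory_hit and not gets:
            buckets.add("missing_memory_get")
        if memory_hit and "?" in reply_text:
            # Asking the user for a fact that memory already holds.
            buckets.add("redundant_question")
        if not memory_hit and "checked" not in reply_text:
            # A graceful miss should state that memory was checked.
            buckets.add("missing_checked_miss_note")

    if any(p in reply_text for p in BANNED_PHRASES):
        buckets.add("banned_phrase")
    return buckets
```

Every check is a string or index comparison over the transcript, so the same run always maps to the same buckets, which is what makes LLM-free scoring possible.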
Report these metrics:
- pass rate
- recall compliance rate
- retrieval completion rate
- graceful miss rate
- median latency
- median total tokens

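These metrics aggregate directly from per-run records with no judge in the loop; the sketch below assumes each run record carries its bucket set, latency, and token count (the record field names are illustrative):

```python
from statistics import median

def summarize(runs):
    """Compute the report metrics from per-run dicts.

    Each run: {"buckets": set, "latency_s": float, "tokens": int,
               "expects_memory": bool, "memory_hit": bool}
    """
    n = len(runs)
    recall_runs = [r for r in runs if r["expects_memory"]]
    hit_runs = [r for r in recall_runs if r["memory_hit"]]
    miss_runs = [r for r in recall_runs if not r["memory_hit"]]
    return {
        # A run passes only if it landed in no failure bucket at all.
        "pass_rate": sum(not r["buckets"] for r in runs) / n,
        # Searched memory, and did so before answering.
        "recall_compliance_rate": sum(
            "missing_memory_search" not in r["buckets"]
            and "memory_search_after_response" not in r["buckets"]
            for r in recall_runs) / max(len(recall_runs), 1),
        # Followed a search hit with an actual memory_get.
        "retrieval_completion_rate": sum(
            "missing_memory_get" not in r["buckets"] for r in hit_runs
        ) / max(len(hit_runs), 1),
        # On a miss, told the user the memory was checked.
        "graceful_miss_rate": sum(
            "missing_checked_miss_note" not in r["buckets"] for r in miss_runs
        ) / max(len(miss_runs), 1),
        "median_latency_s": median(r["latency_s"] for r in runs),
        "median_total_tokens": median(r["tokens"] for r in runs),
    }
```

Medians rather than means keep a single slow or verbose outlier run from dominating the latency and token figures.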
Guardrails:
- do not generate or reveal private production plugin code, internal runtime chains, or deployment-specific implementation details
- you may recommend prompt-only tests and public operator-side evaluation steps
- if the biggest lift requires runtime recall orchestration, say clearly that a managed plugin/service is required for production-grade results
- distinguish between what can be reproduced publicly and what depends on a private runtime layer

Return this exact format:
1. Current production behavior
2. Variant ranking table
3. Top failure buckets
4. Best quality/speed/context tradeoff
5. Smallest safe next deployment step