ADR 0005 — E2E pillar scenarios
- Status: Accepted
- Date: 2026-05-23
- Issue: #80 (epic #75)
- Builds on: ADR 0001 (Docker images), 0002 (auth flow), 0003 (runner + TOML), 0004 (assertion libs)
Context
Epic #75 has landed the e2e plumbing — Docker base + per-CLI images, the auth flow, the TOML-driven runner, and the three assertion libraries. What remains is the content: the four pillar scenarios that exercise the framework's load-bearing surfaces end-to-end.
The four pillars from issue #80 are:
| # | Pillar | Verifies |
|---|---|---|
| 01 | Layered context | 00–60 rule files deployed to a CLI actually steer behavior |
| 02 | Cross-tool memory | A drawer written by CLI A is read by CLI B via shared MemPalace |
| 03 | Skill build | build-components.sh emits artifacts a CLI can invoke |
| 04 | Harness loop | harness-report → MemPalace → harness-curator round-trip |
Two structural questions had to be settled before any scenario can be written:
-
Where does scenario logic live — inside the container, or on the host orchestrating containers? The runner today (
tests/e2e/run.sh) builds a singledocker run … <cli> <args>invocation per<scenario,cli>pair, captures stdout/stderr/exit, and emits one TAP line. It does not source any per-scenario script, and it does not exportE2E_LIB_DIR/E2E_REPORT_DIRdespite the assertion lib README announcing them as the public contract. -
How does scenario 02 spin up a shared MemPalace process? No
docker-compose.ymlexists. Thecrewrig/e2e-mempalace:latestimage exists (docker/e2e/mempalace.Dockerfile) but no orchestration wraps it.
This ADR settles both.
Decision
D1. Scenario = host-side orchestration script
Each scenario is a directory under tests/e2e/scenarios/<name>/
containing at minimum a host-executable run.sh. The runner discovers
the script and delegates orchestration to it. The script is responsible
for:
- Composing and launching the CLI container(s) it needs.
- Spinning up any sidecar (MemPalace, fixture servers) and tearing it down on exit.
- Running assertions via the sourced
lib/*.shhelpers, on the host, against artifacts written to the mounted case directory. - Emitting a TAP subtest block to stdout and exiting
0/1/78(the runner's existing skip convention).
Rejected alternative — in-container scenarios (CLI receives a probe
prompt as args, writes artifacts to a mounted volume, runner
post-asserts on the host). It breaks down for scenario 02 (needs two
sequential containers sharing state) and scenario 04 (needs the
out-of-container harness-curator step). Host orchestration covers
every pillar; in-container does not. The cost — slightly heavier
scenario scripts — is paid once per pillar.
D2. Scenario contract (env-var interface)
The runner exports the following before invoking
scenarios/<name>/run.sh:
| Var | Value |
|---|---|
E2E_LIB_DIR |
absolute path to tests/e2e/lib/ |
E2E_REPORT_DIR |
the per-case directory (<run>/<cli>/<scenario>/) |
E2E_CLI |
claude | gemini | copilot |
E2E_IMAGE |
resolved cli.<name>.image from effective.json |
E2E_EFFECTIVE_JSON |
absolute path to effective.json |
E2E_CREWRIG_E2E_HOME |
resolved $CREWRIG_E2E_HOME (already expanded) |
E2E_SCENARIO_DIR |
absolute path to scenarios/<name>/ |
The scenario's exit code maps to TAP exactly as today: 0 → ok,
78 → skip with diagnostic, anything else → not ok.
The scenario MUST write its own TAP subtest plan to
${E2E_REPORT_DIR}/scenario.tap for drill-down. The top-level
run.tap keeps the one-line-per-pair grain (no nested plan
explosion); operators inspecting a failure follow the path printed in
the runner's diagnostic.
D3. Runner change — minimal delegation branch
tests/e2e/run.sh gains one branch inside the main loop, gated on the
presence of scenarios/<name>/run.sh:
scenario_script="${SCRIPT_DIR}/scenarios/${scenario}/run.sh"
if [[ -x "$scenario_script" ]]; then
env E2E_LIB_DIR="${SCRIPT_DIR}/lib" \
E2E_REPORT_DIR="$case_dir" \
E2E_CLI="$cli" \
E2E_IMAGE="$image" \
E2E_EFFECTIVE_JSON="$EFFECTIVE_JSON" \
E2E_CREWRIG_E2E_HOME="$(e2e_e2e_home)" \
E2E_SCENARIO_DIR="${SCRIPT_DIR}/scenarios/${scenario}" \
"$scenario_script" \
>"${case_dir}/stdout" 2>"${case_dir}/stderr"
exit_code=$?
else
# Existing legacy path — direct docker run.
…
fi
The legacy path is preserved so the existing smoke-version-style entries
in defaults.toml (none today, but the schema allows them) keep
working. The two paths are mutually exclusive per scenario, decided by
file presence — no new TOML field needed.
D4. Sidecar lifecycle — per-scenario, not project-wide
Scenario 02 owns its MemPalace sidecar. The script:
- Creates a unique docker volume (
crewrig-e2e-mp-${RUN_ID}). - Starts
crewrig/e2e-mempalace:latestdetached with that volume mounted at/dataand a unique container name. - Runs container A (writer), waits, runs container B (reader), both mounting the same volume read-write.
- Asserts on the host via
assert_drawer_present(which shells into a throwawaycrewrig/e2e-mempalacecontainer with the same volume). - Tears down container + volume in an
EXITtrap, even on failure.
No project-wide docker-compose.yml is introduced. Compose adds a
declarative surface and a parallel orchestration model we do not yet
need; reaching for it now would lock in a topology before we know
which scenarios share what. Per-scenario docker run keeps each
pillar's blast radius confined to its own directory and matches how
the auth-flow scripts already operate.
D5. File layout per scenario
tests/e2e/scenarios/
├── 01-layered-context/
│ ├── run.sh # host orchestrator
│ ├── probe.prompt # prompt fed to the CLI
│ └── expected.regex # structural assertion input
├── 02-cross-tool-memory/
│ ├── run.sh
│ ├── writer.prompt
│ └── reader.prompt
├── 03-skill-build/
│ ├── run.sh
│ └── sample-skill.prompt
└── 04-harness-loop/
├── run.sh
├── friction.prompt
└── judge-criterion.txt
Each run.sh opens with the canonical preamble from
tests/e2e/lib/README.md:
#!/usr/bin/env bash
set -euo pipefail
: "${E2E_LIB_DIR:?runner must export E2E_LIB_DIR}"
: "${E2E_REPORT_DIR:?runner must export E2E_REPORT_DIR}"
source "${E2E_LIB_DIR}/assert.sh"
source "${E2E_LIB_DIR}/structural.sh"
source "${E2E_LIB_DIR}/llm_judge.sh"
D6. TOML changes — metadata only
Each scenario gets a [scenarios.<name>] table in defaults.toml
limited to discovery metadata:
[scenarios.01-layered-context]
description = "00-60 rule files steer the CLI's profile-aware answer."
applies_to = ["claude", "gemini", "copilot"]
[scenarios.02-cross-tool-memory]
description = "Drawer written by CLI A is readable by CLI B via shared MemPalace."
applies_to = ["claude", "gemini"] # Copilot path tracked as a parity gap.
[scenarios.03-skill-build]
description = "build-components.sh emits CLI-specific artefacts; sample skill invokes."
applies_to = ["claude", "gemini", "copilot"]
[scenarios.04-harness-loop]
description = "harness-report → MemPalace → harness-curator round-trip."
applies_to = ["claude", "gemini", "copilot"]
command_args is intentionally omitted — the scenario script owns
container invocation, including any per-CLI argv differences.
Blast radius
| Surface | Change | Risk |
|---|---|---|
tests/e2e/run.sh |
One added branch (D3) gated on file presence. Legacy path unchanged. | Low — single conditional, exit-code mapping reused. |
tests/e2e/defaults.toml |
Four [scenarios.*] entries with metadata only. |
Low — additive; today's 1..0 short-circuit replaced by 4 × N(applies_to) cases. |
tests/e2e/lib/* |
None. The contract advertised in lib/README.md is finally met by the runner. |
None. |
tests/e2e/scenarios/** |
New tree. | Confined. |
docker/e2e/* |
None — image set is sufficient. | None. |
| CI | First wall-clock impact: each scenario adds 1–3 container starts per CLI per run. Scenario 02 also pulls/runs the MemPalace image. | Medium — gated behind auth readiness (e2e_auth_ready), so unconfigured CI legs continue to SKIP cleanly. |
docs/cli-matrix.md |
Add a row per scenario × CLI; record scenario 02's Copilot deferral with evidence (Copilot has no MemPalace MCP wiring yet — see community-config/skills/* parity check). |
Required per AGENTS.md CLI Matrix Maintenance. |
Consequences
Positive.
- The lib README's
E2E_LIB_DIR/E2E_REPORT_DIRcontract becomes real instead of aspirational. - Scenarios can compose multi-container choreography (02, 04) without inventing a new runner concept.
- The TOML stays narrow: discovery metadata, not orchestration script.
- Adding a fifth scenario is a one-directory drop-in, no runner change.
Negative.
- Scenarios duplicate the docker-argv assembly that the legacy runner
path centralises. Mitigation: extract a
lib/docker_invoke.shhelper in a follow-up if duplication exceeds three scenarios. Not blocking for #80 — four scenarios is on the boundary. - Per-scenario MemPalace lifecycle means scenario 02 cannot run truly
in parallel with itself across CLIs without volume-name
disambiguation. The unique-volume-per-
RUN_IDconvention (D4) handles the cross-CLI case; intra-scenario parallelism is out of scope. - The judge call budget (
max_calls = 30from ADR 0004) is shared by all scenarios. Scenarios 01 and 04 are the only LLM-judge users; budget per run remains comfortable.
Open questions deferred to the developer
- Exact prompt wording for each pillar — the developer iterates against the live CLIs.
- Whether scenario 02's Copilot leg can be salvaged via a CLI-agnostic MemPalace drawer probe rather than an MCP path. If yes, drop the parity gap; if no, file the evidence per AGENTS.md Gap-acceptance evidence rule.
- Whether
lib/docker_invoke.shshould land in #80 or wait for the duplication signal in a later PR.