ADR 0009 — Judge gemini backend (api_key + OAuth)
Status: Accepted (issue #128)
Context
ADR 0007 (issue #125) defined the pluggable driver layer for
tests/e2e/lib/llm_judge.sh: one file per backend at
tests/e2e/lib/llm_judge_drivers/<backend>.sh, exposing
_llm_judge_driver_<backend>_preflight and
_llm_judge_driver_<backend>_call. ADR 0008 (issue #126) extended that
contract with an auth_mode field and shipped claude-code.sh as the
template for an OAuth-capable driver — credentials read from a CLI
credential store on disk, fed to the upstream Messages API via
Authorization: Bearer.
Issue #128 closes the symmetry: the Gemini CLI is a first-class CrewRig
target, contributors already run task e2e:auth:gemini to mint Gemini
credentials, and there is no reason the LLM-judge oracle should require
a separate Google API key to use Gemini as the judge model. A gemini
backend that mirrors claude-code — api_key and oauth auth modes,
on-disk credential store override, soft-fail to UNCERTAIN when strict
is off — keeps the driver matrix consistent and unblocks contributors
who prefer Gemini for evaluation.
Two surfaces differ from the Anthropic path and shape the decision:
- Google's OAuth tokens are short-lived (~1h). The on-disk credentials
  file carries a refresh_token, not a long-lived access token, so the
  driver MUST exchange refresh-for-access at preflight time against the
  Google token endpoint. There is no expiresAt check to lean on; assume
  every cached access token is stale.
- Google APIs accept a per-request x-goog-user-project header that bills
  quota to a named GCP project. The judge config therefore needs an
  optional gcp_project field so contributors with a paid project can
  route quota correctly.
Decision
1. Driver: tests/e2e/lib/llm_judge_drivers/gemini.sh
Implements the ADR 0007 §1 contract.
_preflight
- If E2E_JUDGE_MOCK=1 → printf 'AUTH_TOKEN=mock\n'; return 0 (mirrors
  anthropic.sh and claude-code.sh).
- Read JUDGE_AUTH_MODE. Supported values: oauth (primary), api_key
  (fallback for users with GEMINI_JUDGE_API_KEY or equivalent). Any other
  value → rc=1 with _e2e_assert_diag.
- auth_mode = "api_key": identical pattern to
  _llm_judge_driver_anthropic_preflight. Read JUDGE_API_KEY_ENV via
  indirect expansion; empty → rc=2, otherwise emit AUTH_TOKEN=<key> and
  return 0. The token is the raw API key; the _call step appends it as
  ?key=<token> on the URL (Google's API-key transport, not a header).
- auth_mode = "oauth":
  - Credential path:
    ${GEMINI_CREDENTIALS_PATH:-$HOME/.crewrig-e2e/gemini/oauth_creds.json}.
    The env override lets the harness point at an alternate location
    without symlinking. Default matches the path written by
    task e2e:auth:gemini.
  - Missing or unreadable file → rc=2 (soft auth-missing; core maps to
    UNCERTAIN per ADR 0007 §3).
  - Refuse to read credential files whose POSIX mode is more permissive
    than 0600 (same guard as ADR 0008 §4a).
  - Read refresh_token, client_id, client_secret via:

        refresh="$(jq -r '.refreshToken // empty' "$path")"
        cid="$(jq -r '.clientId // empty' "$path")"
        csec="$(jq -r '.clientSecret // empty' "$path")"

    See §3: the schema is UNVERIFIED.
  - Any missing field → rc=2 (auth-missing, not hard failure).
  - Exchange refresh → access via Google's token endpoint:

        resp="$(curl -fsS --max-time 10 \
          -X POST https://oauth2.googleapis.com/token \
          -d "client_id=${cid}" \
          -d "client_secret=${csec}" \
          -d "refresh_token=${refresh}" \
          -d "grant_type=refresh_token")"

    - curl non-zero or non-2xx → rc=2 with a
      # WARN gemini judge: OAuth refresh failed — re-run task e2e:auth:gemini
      line on stderr. Soft fail; mapped to UNCERTAIN when strict=false.
    - access_token empty/null in response → rc=2, same warning shape.
  - On success: printf 'AUTH_TOKEN=%s\n' "$access_token"; return 0.
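Two of the preflight guards above can be sketched in plain bash. This is a minimal illustration under stated assumptions, not the contents of gemini.sh: the helper names (sketch_api_key_preflight, sketch_mode_is_private, sketch_file_mode) are hypothetical, and reading the mode through stat is only one way the 0600 check might be done.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from gemini.sh.

# api_key mode: read the key from the variable *named* by JUDGE_API_KEY_ENV.
# Prints AUTH_TOKEN=<key> and returns 0, or returns 2 (soft auth-missing).
sketch_api_key_preflight() {
  local var_name="${JUDGE_API_KEY_ENV:-}"
  [[ -n "$var_name" ]] || return 2
  local key="${!var_name:-}"   # indirect expansion
  [[ -n "$key" ]] || return 2
  printf 'AUTH_TOKEN=%s\n' "$key"
}

# oauth mode: succeeds when an octal mode string (e.g. "600") sets no
# permission bits outside owner read/write (0600).
sketch_mode_is_private() {
  local mode="$1"
  (( (8#$mode & 8#177) == 0 ))
}

# Reads a file's permission bits as octal digits.
# Assumption: GNU stat (-c) is tried first, then BSD/macOS stat (-f).
sketch_file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}
```

A caller would combine the last two as `sketch_mode_is_private "$(sketch_file_mode "$path")" || return 2`, with the choice of return code left to the ADR 0008 §4a guard.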
_call
Single positional api_key (semantically: the bearer token under
oauth, the raw API key under api_key). Branches on
JUDGE_AUTH_MODE:
- api_key: POST to
  https://generativelanguage.googleapis.com/v1beta/models/${JUDGE_MODEL}:generateContent?key=${api_key}
  with the prompt body. No Authorization header.
- oauth: POST to the same URL without the ?key= query parameter, with
  -H "Authorization: Bearer ${api_key}". When JUDGE_GCP_PROJECT is
  non-empty, additionally forward
  -H "x-goog-user-project: ${JUDGE_GCP_PROJECT}".
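The branching above can be sketched as a small argument builder. This is an illustration only: build_call_args and the CALL_ARGS array are hypothetical names, and the Content-Type header is an assumption added because curl's -d would otherwise send a form-encoded content type.

```shell
#!/usr/bin/env bash
# Sketch only: build_call_args is a hypothetical helper, not the driver's code.
# Fills the global array CALL_ARGS with the headers and URL for one request.
build_call_args() {
  local token="$1"
  local base="https://generativelanguage.googleapis.com/v1beta/models"
  local url="${base}/${JUDGE_MODEL}:generateContent"

  # Assumption: the JSON body needs an explicit content type.
  CALL_ARGS=(-H "Content-Type: application/json")

  case "${JUDGE_AUTH_MODE:-}" in
    api_key)
      # API-key transport: key travels in the query string, no auth header.
      CALL_ARGS+=("${url}?key=${token}")
      ;;
    oauth)
      CALL_ARGS+=(-H "Authorization: Bearer ${token}")
      # Omit the header entirely when no project is configured.
      if [[ -n "${JUDGE_GCP_PROJECT:-}" ]]; then
        CALL_ARGS+=(-H "x-goog-user-project: ${JUDGE_GCP_PROJECT}")
      fi
      CALL_ARGS+=("$url")
      ;;
    *)
      return 1
      ;;
  esac
}
```

The array would then be expanded into the request, for example `curl -fsS -X POST "${CALL_ARGS[@]}" -d "$body"`. Building the arguments in an array keeps header values with spaces intact and makes the "header omitted when empty" rule a single conditional.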
Request body uses the Gemini generateContent shape:

    {
      "contents": [{"parts": [{"text": "<rendered prompt>"}]}],
      "generationConfig": {"temperature": <JUDGE_TEMPERATURE>}
    }

Retry loop, counter increment, and verdict-extraction regex are
copy-equivalent to anthropic.sh — the duplication is deliberate per
ADR 0007 §1.
2. New optional field: [judge].gcp_project

    [judge]
    backend = "gemini"
    auth_mode = "oauth"
    model = "gemini-2.0-flash"
    # Optional: bill quota to a specific GCP project.
    # gcp_project = "my-eval-project"

Loader changes in _llm_judge_load_config (tests/e2e/lib/llm_judge.sh):
- Default literal: gcp_project="".
- Parsed via jq -r '.judge.gcp_project // ""' effective.json.
- Emitted as JUDGE_GCP_PROJECT=%q alongside the existing JUDGE_* lines.
- Declared local and exported in llm_judge() so the driver reads it via
  plain expansion.
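The %q emission and the export step can be illustrated with a short round trip. This is a sketch under assumptions: emit_gcp_project and load_gcp_project are hypothetical names, and the real loader emits many JUDGE_* lines rather than one.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical.

# Loader side: emit one shell-quoted assignment line.
emit_gcp_project() {
  local gcp_project="${1:-}"   # default literal is the empty string
  printf 'JUDGE_GCP_PROJECT=%q\n' "$gcp_project"
}

# Consumer side: evaluate the emitted line, then export it so a driver
# can read JUDGE_GCP_PROJECT via plain expansion.
load_gcp_project() {
  local line
  line="$(emit_gcp_project "$@")"
  eval "$line"
  export JUDGE_GCP_PROJECT
}
```

Because %q quotes the value for reuse as shell input, a project name containing spaces or shell metacharacters survives the eval unchanged, and an unset field round-trips to the empty string.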
The core never branches on JUDGE_GCP_PROJECT; the driver is solely
responsible for translating it into the x-goog-user-project header.
Empty string → header omitted entirely (do not emit
-H "x-goog-user-project: ", which some curl/libcurl combinations forward
as a literal empty value that confuses the API).
3. oauth_creds.json schema — UNVERIFIED
UNVERIFIED — verify before merge. The Gemini CLI's on-disk
credential schema is not documented in this repository. The driver's
best guess follows the google-auth-library authorized_user format (the
one the gcloud SDK and most Google Node/Python libraries serialize to
disk), but with camelCase keys instead of that format's snake_case:

    {
      "refreshToken": "1//0g…",
      "clientId": "…apps.googleusercontent.com",
      "clientSecret": "GOCSPX-…",
      "type": "authorized_user"
    }

The developer MUST run the following against a freshly-minted
~/.crewrig-e2e/gemini/oauth_creds.json (or the developer's own
~/.config/gemini/… path) before landing the driver:

    jq 'keys' ~/.crewrig-e2e/gemini/oauth_creds.json

If the observed schema differs (snake_case keys, nested under a
credentials object, alternate field names like refresh_token),
update the jq selectors in _preflight and amend this ADR in the
same PR. Do not guess. Two common variants the developer should be
ready for:
- snake_case top-level (refresh_token, client_id, client_secret):
  gcloud's application_default_credentials.json convention.
- nested under installed: the raw OAuth client JSON downloaded from
  Google Cloud Console.
Whichever form the Gemini CLI writes, the driver must select against
it directly with jq; no auto-detection across shapes.
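A pre-merge check along these lines could confirm whether a freshly minted file matches the camelCase guess. It is a sketch only: creds_has_expected_keys is a hypothetical helper, and the three key names are the same UNVERIFIED assumption stated above.

```shell
#!/usr/bin/env bash
# Sketch only: creds_has_expected_keys is a hypothetical helper.
# Returns 0 when all three camelCase keys hold non-empty values,
# non-zero otherwise (including when the file is not valid JSON).
creds_has_expected_keys() {
  local path="$1"
  jq -e '
    (.refreshToken // "") != ""
    and (.clientId // "") != ""
    and (.clientSecret // "") != ""
  ' "$path" >/dev/null
}
```

A failing check means the selectors in _preflight need to change to the observed shape; it deliberately checks a single shape rather than probing the alternatives.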
4. Backward compatibility
- gcp_project defaults to "". Existing configs see zero behavior change.
- No existing driver (anthropic, claude-code) is modified.
- New local.toml.example stanza (commented) documents both auth modes
  for the gemini backend:

      # Use Gemini as the LLM judge.
      #
      # OAuth mode: reuses the credentials minted by `task e2e:auth:gemini`:
      # [judge]
      # backend = "gemini"
      # auth_mode = "oauth"
      # model = "gemini-2.0-flash"
      # temperature = 0.0
      # strict = false
      # gcp_project = "my-eval-project"  # optional, forwards x-goog-user-project
      #
      # API-key mode: requires GEMINI_JUDGE_API_KEY in env:
      # [judge]
      # backend = "gemini"
      # auth_mode = "api_key"
      # api_key_env = "GEMINI_JUDGE_API_KEY"
      # model = "gemini-2.0-flash"

File list
| Path | Change |
|---|---|
| tests/e2e/lib/llm_judge.sh | Loader: parse + export JUDGE_GCP_PROJECT. No quorum or verdict changes. |
| tests/e2e/lib/llm_judge_drivers/gemini.sh | New. api_key + oauth preflight (with refresh exchange), generateContent call with optional x-goog-user-project. |
| tests/e2e/defaults.toml | Add gcp_project = "" under [judge] with a one-line comment. |
| tests/e2e/local.toml.example | Add commented [judge] stanzas demoing backend = "gemini" for both modes. |
| docs/adr/0009-judge-gemini-backend.md | This file. |
| scripts/tests/test-e2e-llm-judge-lib.sh | New cases: api_key happy-path, oauth happy-path (stubbed creds + stubbed token endpoint via JUDGE_ENDPOINT indirection), missing file → UNCERTAIN, missing refresh field → UNCERTAIN, token endpoint 4xx → UNCERTAIN, gcp_project header forwarded, empty gcp_project header omitted. |
| tests/e2e/lib/README.md | One-paragraph note pointing at ADR 0009. |
| docs/cli-matrix.md | Update row covering judge backends to show Gemini parity. |
Non-goals
- No change to anthropic or claude-code drivers. This ADR adds a third
  file; existing drivers are untouched.
- No access-token caching. The driver exchanges refresh-for-access on
  every preflight. A 1-hour cache would reduce token-endpoint traffic but
  adds a write surface (cache file, lock, expiry) that this ADR
  explicitly defers. The cost is one extra HTTPS round-trip per llm_judge
  invocation, which the existing retry budget absorbs.
- No Vertex AI backend. generativelanguage.googleapis.com only. A Vertex
  variant (regional endpoint, different auth scope, ADC pickup) is a
  separate driver if and when needed.
- No gcloud auth print-access-token fallback. The on-disk credentials
  file is the single source of truth; shelling out to gcloud would couple
  the e2e harness to an unrelated CLI.
- No bundled-component changes. Nothing under community-config/,
  .gemini/, or .claude/ is touched.
- No version bump. No community-config/skills/*/SKILL.md or
  community-config/agents/*/AGENT.md is modified.
Blast radius
- Files modified on main today: 5 (llm_judge.sh, defaults.toml,
  local.toml.example, test-e2e-llm-judge-lib.sh, cli-matrix.md).
- Files added: 2 (the driver and this ADR). tests/e2e/lib/README.md may
  be created if it does not yet exist.
- Public-contract additions:
  - New [judge].gcp_project field: additive, default "".
  - New gemini backend value: additive.
  - New GEMINI_CREDENTIALS_PATH env override: additive.
  - New outbound dependency: oauth2.googleapis.com/token (refresh
    exchange) and generativelanguage.googleapis.com (judge call). The
    latter is already a peer of api.anthropic.com from a network egress
    standpoint; the token endpoint is new.
- Risks:
  - Schema drift on oauth_creds.json. Mitigated by the UNVERIFIED flag in
    §3: the developer confirms the path empirically before merge.
  - Refresh failure surface. Google's token endpoint can return
    invalid_grant for revoked refresh tokens; the driver maps this to
    UNCERTAIN via rc=2, which is correct under strict=false but could
    mask a chronically broken setup. The # WARN stderr line is the
    operator's signal.
  - Refresh-token exfiltration via GEMINI_CREDENTIALS_PATH +
    attacker-controlled JUDGE_ENDPOINT. Same threat model as ADR 0008
    §4a: accepted, mitigated only by the 0600 permission check.
  - Quota routing. Forwarding x-goog-user-project against a project the
    calling identity does not have access to surfaces a 403 inside _call.
    Caller maps to UNCERTAIN via the existing malformed-output path;
    acceptable.
- Reversibility: Easy. The driver file can be deleted and the gcp_project
  field removed; default behavior is untouched.