CrewRig

Scripting Conventions

Bash and Python glue scripts under scripts/ and hooks/ follow these rules. The list is short on purpose: each rule comes from a real production incident where a script silently misbehaved for weeks. If you have a real reason to break a rule, mark the line with # acknowledged-exception: <reason> so the exception is greppable.


Rule 1 — No broad or bare except in Python glue

A try block must catch the narrowest exception that justifies the recovery path. Bare except:, except Exception:, and except BaseException: are banned because they silently swallow ImportError, KeyboardInterrupt, SystemExit, and unrelated runtime bugs.

Why

The session-transcript hook (hooks/mempalace-transcript.sh) silently dropped every entry for months after the MemPalace v3.x upgrade. A broad except was hiding an ImportError for a class that no longer existed (PalaceGraph). The fix landed in 1429cdb; this rule exists so the failure mode does not return.

Bad

try:
    from mempalace.mcp_server import tool_add_drawer
    result = tool_add_drawer(...)
except:
    pass  # never know what failed

Good

try:
    from mempalace.mcp_server import tool_add_drawer
except ImportError as e:
    print(f"IMPORT_ERROR: {e}", file=sys.stderr)
    sys.exit(2)

result = tool_add_drawer(...)
if not result.get("success"):
    print(f"ADD_FAILED: {result.get('error', 'unknown')}", file=sys.stderr)
    sys.exit(3)

If continuing past a failure is genuinely correct, surface it:

try:
    drawer_dt = datetime.strptime(date_str, "%Y-%m-%d")
except ValueError:
    stats["skipped_format"] += 1   # counted, not silent
    continue

If stats["skipped_format"] == total_scanned, escalate — that means the recovery path is masking a structural assumption error.


Rule 2 — No success-log-before-result

A log line that says "did X" must be guarded by the actual outcome of X. Logs that fire regardless of the underlying call's exit status produce months of "looks fine" with zero data flowing.

Why

The transcript hook printed mempalace-transcript: persisted ... to stderr on every invocation, including the ones where the underlying Python heredoc had crashed during import. The success line lived outside the return-code check that gated it. PR #31 / commit 1429cdb.

Bad

"$MEMPALACE_PYTHON" - <<PYEOF
... # may exit non-zero
PYEOF
echo "persisted entry"  # fires regardless

Good

STATUS=$("$MEMPALACE_PYTHON" - 2>&1 <<'PYEOF'
... # may exit non-zero
PYEOF
)
STATUS_RC=$?
if [ "$STATUS_RC" -eq 0 ]; then
  echo "persisted entry"
else
  echo "FAILED to persist (rc=$STATUS_RC): $STATUS" >&2
fi

The principle generalizes: the log line must be on the success branch of an if that read the actual result.


Rule 3 — No [ test ] && cmd && exit N under set -e

Use an explicit if … then … fi block. The chained-test pattern looks like a one-line guard, but under set -e it kills the script on the success path: when the test is false, the compound returns non-zero, and set -e aborts before the next intended statement. The script appears to fail silently on every successful run.

Why

Surfaced in commit 212318c:

The pattern [ -z "$FOUND" ] && echo "..." && exit 1 would kill the script via set -e whenever FOUND was set (i.e., on every successful install), because the negated test returns non-zero.

The trap repeats itself across files because the one-liner is shorter than the if block. It is shorter; it is also wrong.

Bad

set -e
[ ! -d "$REPO_DIR/extensions/$TIER/$EXT" ] && echo "Error: extension '$EXT' not found." && exit 1
do_install "$EXT"

Good

set -e
if [ ! -d "$REPO_DIR/extensions/$TIER/$EXT" ]; then
  echo "Error: extension '$EXT' not found." >&2
  exit 1
fi
do_install "$EXT"

The explicit if block makes the failure path unambiguous and the success path immune to the set -e trap. Send the error to >&2 while you are there.

A single-&& form ([[ ! cond ]] && exit 1) is the same family in lighter form and gets the same treatment.


Rule 4 — Defense-in-depth checks must verify their inputs

When a check fires only on a partial view of its input, the check passing is not the same as the underlying invariant being true. Make checks independently verifiable — surface the size of the input they actually saw, so the next reader can spot a wedge upstream.

Why

PR #39 (scripts/prune-transcripts.sh) shipped a pagination loop whose termination condition (count <= offset + len(drawers)) misread the API contract and exited after the first page. The script had a perfectly correct format-mismatch escalation that fired on the 100 drawers it saw, hiding the fact that 11 850 of 11 850 drawers in the wing had not been scanned. Two review cycles missed the upstream wedge because the downstream check kept producing an actionable error.

Bad

# Pagination loop: imagine it terminates after the first page (broken
# upstream — only 100 of 11 850 drawers scanned).
for drawer in drawers_seen_so_far:
    if not matches_format(drawer):
        skipped_format += 1

# This downstream check fires on the partial sample and looks correct.
# The user sees an actionable error message and concludes the script
# worked. The 11 750 unscanned drawers stay invisible — the wedge
# upstream is hidden because the downstream check refused to look past
# the slice it was handed.
if skipped_format == len(drawers_seen_so_far):
    sys.exit(3)  # silent on input size — silent on the wedge

Good

# Surface what was actually scanned, so a wedge upstream is visible.
# A reviewer comparing "Scanned: 100" against an expected wing size of
# ~10 000 catches the upstream pagination bug at a glance.
print(f"Scanned: {stats['total_scanned']} drawer(s)")
print(f"Format mismatch: {stats['skipped_format']}")
if stats["skipped_format"] == stats["total_scanned"] > 0:
    print("Error: 100% format mismatch on a sample of "
          f"{stats['total_scanned']} drawer(s).", file=sys.stderr)
    sys.exit(3)

A reviewer comparing "Scanned: 100" against an expected wing size of ~10 000 catches the upstream bug at a glance.


Acknowledged exceptions

If a real reason justifies breaking one of these rules (e.g. a third-party library that requires a bare except), tag the line:

# acknowledged-exception: vendored library raises bare strings, not exceptions
[ -f "$marker" ] && load_legacy "$marker" && exit 0
# acknowledged-exception: aiohttp 3.x raises CancelledError as bare except path
try:
    await client.fetch(url)
except:  # noqa: E722
    cleanup()
    raise

The tag is greppable, so the exception is auditable. New tags should be discussed in the PR.


CI grep check

.github/workflows/scripting-conventions.yml runs a small grep pass on scripts/ and hooks/ to flag new occurrences of the chained-test pattern and bare except. The check is intentionally lightweight; it does not try to be a full linter. False positives are handled with the acknowledged-exception tag.