# Diagnose & fix opencode async polling (round 2) ## Context The first round added the `model` field to `/prompt_async` (commit `1c07435`). On the dev server (`http://127.0.0.1:4242`, password `jenesuispasunhacker`) the user now sees two symptoms: 1. **First `ai ping` on a fresh session "crashes" the computer.** The session is created, the user message is posted, opencode visibly generates a pong reply — but `ai` returns an error path and `startup/servers.lua` runs `parallel.waitForAny(shell, eventloop)` followed by `os.shutdown()`. Any path where either coroutine returns shuts the machine down. So the visible "crash" is really: *some* exit in our coroutine + harness-style shutdown logic in production. 2. **Subsequent `ai` calls (including `ai sessions`) block.** The user reports the second `ai ping` is "blocked indefinitely" and even `ai sessions` doesn't return — which is odd because `ai.listSessions` does not poll, so this might either be a misreport or evidence of a wider hang. The most load-bearing hypothesis is that `findAssistantMessage` in `apis/libai.lua` (introduced by `c61254d`) never matches: it assumes our submitted `messageID` is the **user** message id and looks for the *next* assistant message after it. The opencode docs we wrote in the same commit say the submitted id is the **assistant** message id (`docs/opencode_api.md` line 105: "Get a message by ID. Opencode validates caller-provided message IDs; use IDs starting with `msg`."). If the docs are right, our user message has a server-generated id, our submitted id appears on the assistant message itself, `seenSubmitted` stays `false`, and we poll until `opencc.poll_timeout_seconds` expires. This explains both observations: - "First ping crashes" → poll timeout returns `false, 'delai depasse en attendant la reponse AI'`. `programs/ai.lua` prints the error and returns. Then the shell prompt gets shown again — *unless* something else in the program causes a Lua error in the call chain that makes either coroutine exit. We need a probe to confirm. - "Second ping blocks" → opencode keeps the previous assistant generation in-flight (we never acknowledged it) and queues new prompts behind it, OR the next poll loop is the same one, just on a session where opencode now refuses to generate a new turn. The user also asked for two ergonomics: - A `--verbose` flag on `programs/ai.lua` so headless probes can see polling progress. - A way to disable `os.shutdown()` outside the harness (out of scope for the bug fix, captured as a follow-up). ## Plan ### Phase A — confirm the hypothesis with probes (no code changes) Run these against `http://127.0.0.1:4242` with Basic auth `opencode:jenesuispasunhacker`. 1. Create a session and capture its id: ``` curl -s -u opencode:jenesuispasunhacker -X POST \ -H 'Content-Type: application/json' \ -d '{"title":"probe"}' \ http://127.0.0.1:4242/session ``` 2. POST a known messageID via prompt_async: ``` curl -s -u opencode:jenesuispasunhacker -X POST \ -H 'Content-Type: application/json' \ -d '{"messageID":"msg_probe_1","parts":[{"type":"text","text":"reply with exactly: pong"}],"model":{"providerID":"anthropic","modelID":"claude-opus-4-7"}}' \ http://127.0.0.1:4242/session//prompt_async ``` 3. Poll the list and inspect ids/roles: ``` curl -s -u opencode:jenesuispasunhacker \ http://127.0.0.1:4242/session//message | jq '.[] | {id: .info.id, role: .info.role, finish: .info.finish, completed: .info.time.completed}' ``` **Expected outcomes**: - If `msg_probe_1` appears with `role == "assistant"` → the doc-stated semantics are correct, `findAssistantMessage` is wrong, and we should poll for the message *whose own id matches* our submitted id. - If `msg_probe_1` appears with `role == "user"` → c61254d's reading is correct, the bug is elsewhere (model dispatch, message decoding, etc.). - If `msg_probe_1` doesn't appear at all → opencode is silently dropping it; investigate model or auth. Record which case is real. The fix branches on this. ### Phase B — code changes (drive both cases) Regardless of which probe outcome wins, `findAssistantMessage` is brittle (only handles one of two id-placement conventions and fails silently). Replace it with a more defensive lookup that handles both, in `apis/libai.lua` around lines 268–280: ```lua local function findAssistantMessage(messages, submittedMessageId) -- Case 1 (docs): our id is the assistant message id. for _, m in ipairs(messages) do if type(m) == 'table' and type(m.info) == 'table' and m.info.id == submittedMessageId and m.info.role == 'assistant' then return m; end end -- Case 2 (c61254d empirical): our id is the user message id; assistant follows. local seen = false; for _, m in ipairs(messages) do if type(m) == 'table' and type(m.info) == 'table' then if m.info.id == submittedMessageId then seen = true; elseif seen and m.info.role == 'assistant' then return m; end end end return nil; end ``` This keeps the existing `pollMessage` flow and `isMessageComplete` check unchanged; it just stops missing the message when opencode's id-placement matches the docs. ### Phase C — `--verbose` for `programs/ai.lua` Add a `--verbose` (and `-v`) flag, parsed before `command` is taken. When set: - Pass a `log` callback into `ai.ask` (libai already accepts `log` in `luaExec`; extend `api.ask` to call `options.log(message)` from `pollMessage` at each poll iteration with: `'poll attempt #N: msgs=K, found=' .. (decoded and decoded.info.id or 'nil') .. ', complete=' .. tostring(isComplete)`). - `programs/ai.lua`'s `askAndPrint` / `printSessions` / ping handler all print these via a `[ai]` prefix when `--verbose` is on. This makes the next round of headless probing useful without re-instrumenting the code. ### Phase D — harness verification With the dev server already running, drive the probes via the harness so the diagnostic loop is reproducible: ``` just trapos-exec ' settings.set("opencc.server_url","http://127.0.0.1:4242"); settings.set("opencc.password","jenesuispasunhacker"); settings.set("opencc.provider_id","anthropic"); settings.set("opencc.model_id","claude-opus-4-7"); settings.unset("opencc.session_id"); shell.run("/programs/ai.lua","--verbose","ping"); ' ``` Expected: `pong` printed; no timeout. Then re-run without `--verbose` to confirm the regression no longer reproduces. Also run `just check` and `just test`. Update the existing `tests/ai.lua` `'ask polls async message until completion'` test if the message-list shape changes to include the assistant-id-match case. Add a new test: `'ask finds reply when submitted id matches assistant message itself'` — message list contains a user message with a server id and an assistant message with `info.id == submitted` and `time.completed`. ### Files - `apis/libai.lua` — rewrite `findAssistantMessage`; thread an optional `log` callback through `api.ask` → `pollMessage`. - `programs/ai.lua` — add `--verbose`/`-v` flag, prefixing log lines with `[ai]`. - `tests/ai.lua` — new test for assistant-id-match; adjust any test that asserts the exact poll output if needed. - Version bumps per [ADR-0011](docs/adrs/adr-0011-repo-conventions.md): `packages/trapos-ai/ccpm.json` patch bump, `packages/trapos/ccpm.json` patch bump, mirror in `packages/index.json`, `manifest.json`. ## Verification 1. Probe outputs from Phase A captured (and pasted into the PR or commit message). 2. `just trapos-exec` invocation from Phase D returns `pong` on a fresh session id. 3. Repeat the same invocation — second ping also returns `pong` (i.e., second-call hang is gone). 4. `ai sessions` (added to the same probe script) returns the session list within a couple of seconds. 5. `just check` clean; `just test` passes; new test green. ## Out of scope (captured as follow-ups) - **Harness vs production shutdown.** `startup/servers.lua` ends with `os.shutdown()` so the headless harness exits cleanly. In production this is undesirable because *any* shell exit shuts the computer down. Follow-up: gate the shutdown on a setting (`trapos.harness_mode`) or an environment signal so production keeps the shell up. - **Auto-detect opencode default model** instead of requiring `opencc.provider_id` / `opencc.model_id` (kept from previous plan). - **Session auto-recovery on poll timeout.** Optional `opencc.reset_session_on_timeout` setting that calls `clearSession()` after a timeout so the next run starts on a fresh session. Likely unnecessary once `findAssistantMessage` is fixed, but worth revisiting if the second-ping hang persists.