Diagnose & fix opencode async polling (round 2)
Context
The first round added the model field to /prompt_async (commit 1c07435). On the dev server (http://127.0.0.1:4242, password jenesuispasunhacker) the user now sees two symptoms:
- First ai ping on a fresh session "crashes" the computer. The session is created, the user message is posted, and opencode visibly generates a pong reply, but ai returns through an error path and startup/servers.lua runs parallel.waitForAny(shell, eventloop) followed by os.shutdown(). Any path where either coroutine returns shuts the machine down. So the visible "crash" is really some exit in our coroutine plus harness-style shutdown logic in production.
- Subsequent ai calls (including ai sessions) block. The user reports the second ai ping is "blocked indefinitely" and even ai sessions doesn't return, which is odd because ai.listSessions does not poll, so this might be either a misreport or evidence of a wider hang.
The most load-bearing hypothesis is that findAssistantMessage in apis/libai.lua (introduced by c61254d) never matches: it assumes our submitted messageID is the user message id and looks for the next assistant message after it. The opencode docs we wrote in the same commit say the submitted id is the assistant message id (docs/opencode_api.md line 105: "Get a message by ID. Opencode validates caller-provided message IDs; use IDs starting with msg."). If the docs are right, our user message has a server-generated id, our submitted id appears on the assistant message itself, seenSubmitted stays false, and we poll until opencc.poll_timeout_seconds expires.
This explains both observations:
- "First ping crashes" → poll timeout returns false, 'delai depasse en attendant la reponse AI'. programs/ai.lua prints the error and returns. Then the shell prompt gets shown again, unless something else in the program causes a Lua error in the call chain that makes either coroutine exit. We need a probe to confirm.
- "Second ping blocks" → opencode keeps the previous assistant generation in-flight (we never acknowledged it) and queues new prompts behind it, OR the next poll loop is the same one, just on a session where opencode now refuses to generate a new turn.
The user also asked for two ergonomics:
- A --verbose flag on programs/ai.lua so headless probes can see polling progress.
- A way to disable os.shutdown() outside the harness (out of scope for the bug fix, captured as a follow-up).
Plan
Phase A — confirm the hypothesis with probes (no code changes)
Run these against http://127.0.0.1:4242 with Basic auth opencode:jenesuispasunhacker.
- Create a session and capture its id:
    curl -s -u opencode:jenesuispasunhacker -X POST \
      -H 'Content-Type: application/json' \
      -d '{"title":"probe"}' \
      http://127.0.0.1:4242/session
- POST a known messageID via prompt_async:
    curl -s -u opencode:jenesuispasunhacker -X POST \
      -H 'Content-Type: application/json' \
      -d '{"messageID":"msg_probe_1","parts":[{"type":"text","text":"reply with exactly: pong"}],"model":{"providerID":"anthropic","modelID":"claude-opus-4-7"}}' \
      http://127.0.0.1:4242/session/<SID>/prompt_async
- Poll the list and inspect ids/roles:
    curl -s -u opencode:jenesuispasunhacker \
      http://127.0.0.1:4242/session/<SID>/message \
      | jq '.[] | {id: .info.id, role: .info.role, finish: .info.finish, completed: .info.time.completed}'
Expected outcomes:
- If msg_probe_1 appears with role == "assistant" → the doc-stated semantics are correct, findAssistantMessage is wrong, and we should poll for the message whose own id matches our submitted id.
- If msg_probe_1 appears with role == "user" → c61254d's reading is correct, the bug is elsewhere (model dispatch, message decoding, etc.).
- If msg_probe_1 doesn't appear at all → opencode is silently dropping it; investigate model or auth.
Record which case is real. The fix branches on this.
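The three outcomes above can be told apart mechanically once the message list is decoded. The following is a minimal sketch, not part of the existing code: classifyProbe and its return labels are hypothetical names, and the sample table is made-up data shaped like the jq projection (an array of entries with info.id and info.role).

```lua
-- Hypothetical helper: classify a decoded GET /session/<SID>/message response.
-- `messages` is assumed to be an array of { info = { id = ..., role = ... } }.
local function classifyProbe(messages, submittedId)
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table'
        and m.info.id == submittedId then
      if m.info.role == 'assistant' then
        return 'assistant-id'; -- docs are right; findAssistantMessage is wrong
      elseif m.info.role == 'user' then
        return 'user-id'; -- c61254d's reading holds; the bug is elsewhere
      end
      return 'unknown-role';
    end
  end
  return 'absent'; -- the submitted id never shows up; check model or auth
end

-- Made-up sample data, not captured from the dev server.
local sample = {
  { info = { id = 'msg_server_1', role = 'user' } },
  { info = { id = 'msg_probe_1', role = 'assistant' } },
};
print(classifyProbe(sample, 'msg_probe_1')); -- prints "assistant-id"
```

Whatever label the real probe yields is the one to record in the PR or commit message.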
Phase B — code changes (drive both cases)
Regardless of which probe outcome wins, findAssistantMessage is brittle (only handles one of two id-placement conventions and fails silently). Replace it with a more defensive lookup that handles both, in apis/libai.lua around lines 268–280:
local function findAssistantMessage(messages, submittedMessageId)
  -- Case 1 (docs): our id is the assistant message id.
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table'
        and m.info.id == submittedMessageId and m.info.role == 'assistant' then
      return m;
    end
  end
  -- Case 2 (c61254d empirical): our id is the user message id; assistant follows.
  local seen = false;
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table' then
      if m.info.id == submittedMessageId then
        seen = true;
      elseif seen and m.info.role == 'assistant' then
        return m;
      end
    end
  end
  return nil;
end
This keeps the existing pollMessage flow and isMessageComplete check unchanged; it just stops missing the message when opencode's id-placement matches the docs.
Phase C — --verbose for programs/ai.lua
Add a --verbose (and -v) flag, parsed before command is taken. When set:
- Pass a log callback into ai.ask (libai already accepts log in luaExec; extend api.ask to call options.log(message) from pollMessage at each poll iteration with: 'poll attempt #N: msgs=K, found=' .. (decoded and decoded.info.id or 'nil') .. ', complete=' .. tostring(isComplete)).
- programs/ai.lua's askAndPrint / printSessions / ping handler all print these with an [ai] prefix when --verbose is on.
This makes the next round of headless probing useful without re-instrumenting the code.
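The flag handling and log line described in Phase C could look roughly like the sketch below. It is an illustration under assumptions, not the actual programs/ai.lua: parseArgs, makeLogger and formatPoll are hypothetical helper names, and the real code may wire the callback through api.ask differently.

```lua
-- Hypothetical sketch: strip --verbose / -v from the argument list before
-- the command is read, and build the log callback described above.
local function parseArgs(args)
  local verbose = false;
  local rest = {};
  for _, a in ipairs(args) do
    if a == '--verbose' or a == '-v' then
      verbose = true;
    else
      rest[#rest + 1] = a;
    end
  end
  return verbose, rest;
end

-- Returns a logger that prefixes lines with [ai], or a no-op when quiet.
local function makeLogger(verbose, write)
  if not verbose then
    return function() end;
  end
  return function(message)
    write('[ai] ' .. tostring(message));
  end
end

-- Formats one poll iteration in the shape the plan gives for the log line.
local function formatPoll(attempt, count, decoded, isComplete)
  local found = (type(decoded) == 'table' and type(decoded.info) == 'table'
    and decoded.info.id) or 'nil';
  return 'poll attempt #' .. attempt .. ': msgs=' .. count
    .. ', found=' .. tostring(found) .. ', complete=' .. tostring(isComplete);
end

local verbose, rest = parseArgs({ '--verbose', 'ping' });
local log = makeLogger(verbose, print);
log(formatPoll(1, 2, nil, false));
-- prints "[ai] poll attempt #1: msgs=2, found=nil, complete=false"
```

Returning a no-op logger when the flag is absent lets pollMessage call the callback unconditionally, which keeps the polling loop free of verbosity checks.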
Phase D — harness verification
With the dev server already running, drive the probes via the harness so the diagnostic loop is reproducible:
just trapos-exec '
settings.set("opencc.server_url","http://127.0.0.1:4242");
settings.set("opencc.password","jenesuispasunhacker");
settings.set("opencc.provider_id","anthropic");
settings.set("opencc.model_id","claude-opus-4-7");
settings.unset("opencc.session_id");
shell.run("/programs/ai.lua","--verbose","ping");
'
Expected: pong printed; no timeout. Then re-run without --verbose to confirm the regression no longer reproduces.
Also run just check and just test. Update the existing tests/ai.lua 'ask polls async message until completion' test if the message-list shape changes to include the assistant-id-match case. Add a new test: 'ask finds reply when submitted id matches assistant message itself' — message list contains a user message with a server id and an assistant message with info.id == submitted and time.completed.
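As a rough sketch of the new test case, the fixture below uses plain assert calls rather than whatever helpers tests/ai.lua actually provides (those are not shown here, so this is an assumption). The lookup is a compact copy of the Phase B function so the block runs on its own; the message ids are made up.

```lua
-- Compact copy of the Phase B lookup so the sketch runs on its own.
local function findAssistantMessage(messages, submittedMessageId)
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table'
        and m.info.id == submittedMessageId and m.info.role == 'assistant' then
      return m;
    end
  end
  local seen = false;
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table' then
      if m.info.id == submittedMessageId then
        seen = true;
      elseif seen and m.info.role == 'assistant' then
        return m;
      end
    end
  end
  return nil;
end

-- Fixture: user message has a server id; assistant carries the submitted id.
local submitted = 'msg_submitted_1';
local messages = {
  { info = { id = 'msg_server_user', role = 'user' } },
  { info = { id = submitted, role = 'assistant', time = { completed = 1 } } },
};

local reply = findAssistantMessage(messages, submitted);
assert(reply ~= nil, 'assistant message with the submitted id should be found');
assert(reply.info.role == 'assistant');
assert(reply.info.time.completed ~= nil);

-- The c61254d shape (submitted id on the user message) must keep working.
local legacy = {
  { info = { id = submitted, role = 'user' } },
  { info = { id = 'msg_server_assistant', role = 'assistant' } },
};
assert(findAssistantMessage(legacy, submitted).info.id == 'msg_server_assistant');
print('ok');
```

Covering both shapes in one test file guards against the defensive lookup regressing in either direction, whichever case Phase A confirms.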
Files
- apis/libai.lua: rewrite findAssistantMessage; thread an optional log callback through api.ask → pollMessage.
- programs/ai.lua: add --verbose / -v flag, prefixing log lines with [ai].
- tests/ai.lua: new test for assistant-id-match; adjust any test that asserts the exact poll output if needed.
- Version bumps per ADR-0011: packages/trapos-ai/ccpm.json patch bump, packages/trapos/ccpm.json patch bump, mirror in packages/index.json, manifest.json.
Verification
- Probe outputs from Phase A captured (and pasted into the PR or commit message).
- just trapos-exec invocation from Phase D returns pong on a fresh session id.
- Repeat the same invocation: second ping also returns pong (i.e., second-call hang is gone).
- ai sessions (added to the same probe script) returns the session list within a couple of seconds.
- just check clean; just test passes; new test green.
Out of scope (captured as follow-ups)
- Harness vs production shutdown. startup/servers.lua ends with os.shutdown() so the headless harness exits cleanly. In production this is undesirable because any shell exit shuts the computer down. Follow-up: gate the shutdown on a setting (trapos.harness_mode) or an environment signal so production keeps the shell up.
- Auto-detect opencode default model instead of requiring opencc.provider_id / opencc.model_id (kept from previous plan).
- Session auto-recovery on poll timeout. Optional opencc.reset_session_on_timeout setting that calls clearSession() after a timeout so the next run starts on a fresh session. Likely unnecessary once findAssistantMessage is fixed, but worth revisiting if the second-ping hang persists.
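For the shutdown follow-up, one possible shape of the gate is sketched below. Everything here is an assumption for illustration: trapos.harness_mode is only the setting name proposed above and does not exist yet, afterShellExit and its return labels are hypothetical, and getSetting stands in for ComputerCraft's settings.get so the sketch can run outside the emulator.

```lua
-- Hypothetical sketch of the decision startup/servers.lua could make after
-- parallel.waitForAny returns. `getSetting` mimics settings.get(name, default).
local function afterShellExit(getSetting)
  if getSetting('trapos.harness_mode', false) == true then
    return 'shutdown'; -- harness: the caller would invoke os.shutdown()
  end
  return 'keep-shell'; -- production: the caller would restart the shell
end

-- Stub settings store so the sketch runs outside ComputerCraft.
local function stubSettings(values)
  return function(name, default)
    local v = values[name];
    if v == nil then
      return default;
    end
    return v;
  end
end

print(afterShellExit(stubSettings({ ['trapos.harness_mode'] = true }))); -- prints "shutdown"
print(afterShellExit(stubSettings({}))); -- prints "keep-shell"
```

Defaulting to the non-shutdown branch means a machine with no setting configured keeps its shell, which matches the stated production goal; the harness would have to opt in explicitly.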