130 lines
8.5 KiB
Markdown
130 lines
8.5 KiB
Markdown
# Diagnose & fix opencode async polling (round 2)
|
||
|
||
## Context
|
||
|
||
The first round added the `model` field to `/prompt_async` (commit `1c07435`). On the dev server (`http://127.0.0.1:4242`, password `jenesuispasunhacker`) the user now sees two symptoms:
|
||
|
||
1. **First `ai ping` on a fresh session "crashes" the computer.** The session is created, the user message is posted, opencode visibly generates a pong reply — but `ai` returns an error path and `startup/servers.lua` runs `parallel.waitForAny(shell, eventloop)` followed by `os.shutdown()`. Any path where either coroutine returns shuts the machine down. So the visible "crash" is really: *some* exit in our coroutine + harness-style shutdown logic in production.
|
||
2. **Subsequent `ai` calls (including `ai sessions`) block.** The user reports the second `ai ping` is "blocked indefinitely" and even `ai sessions` doesn't return — which is odd because `ai.listSessions` does not poll, so this might either be a misreport or evidence of a wider hang.
|
||
|
||
The most load-bearing hypothesis is that `findAssistantMessage` in `apis/libai.lua` (introduced by `c61254d`) never matches: it assumes our submitted `messageID` is the **user** message id and looks for the *next* assistant message after it. The opencode docs we wrote in the same commit say the submitted id is the **assistant** message id (`docs/opencode_api.md` line 105: "Get a message by ID. Opencode validates caller-provided message IDs; use IDs starting with `msg`."). If the docs are right, our user message has a server-generated id, our submitted id appears on the assistant message itself, `seenSubmitted` stays `false`, and we poll until `opencc.poll_timeout_seconds` expires.
|
||
|
||
This explains both observations:
|
||
- "First ping crashes" → poll timeout returns `false, 'delai depasse en attendant la reponse AI'`. `programs/ai.lua` prints the error and returns. Then the shell prompt gets shown again — *unless* something else in the program causes a Lua error in the call chain that makes either coroutine exit. We need a probe to confirm.
|
||
- "Second ping blocks" → opencode keeps the previous assistant generation in-flight (we never acknowledged it) and queues new prompts behind it, OR the next poll loop is the same one, just on a session where opencode now refuses to generate a new turn.
|
||
|
||
The user also asked for two ergonomics:
|
||
- A `--verbose` flag on `programs/ai.lua` so headless probes can see polling progress.
|
||
- A way to disable `os.shutdown()` outside the harness (out of scope for the bug fix, captured as a follow-up).
|
||
|
||
## Plan
|
||
|
||
### Phase A — confirm the hypothesis with probes (no code changes)
|
||
|
||
Run these against `http://127.0.0.1:4242` with Basic auth `opencode:jenesuispasunhacker`.
|
||
|
||
1. Create a session and capture its id:
|
||
```
|
||
curl -s -u opencode:jenesuispasunhacker -X POST \
|
||
-H 'Content-Type: application/json' \
|
||
-d '{"title":"probe"}' \
|
||
http://127.0.0.1:4242/session
|
||
```
|
||
2. POST a known messageID via prompt_async:
|
||
```
|
||
curl -s -u opencode:jenesuispasunhacker -X POST \
|
||
-H 'Content-Type: application/json' \
|
||
-d '{"messageID":"msg_probe_1","parts":[{"type":"text","text":"reply with exactly: pong"}],"model":{"providerID":"anthropic","modelID":"claude-opus-4-7"}}' \
|
||
http://127.0.0.1:4242/session/<SID>/prompt_async
|
||
```
|
||
3. Poll the list and inspect ids/roles:
|
||
```
|
||
curl -s -u opencode:jenesuispasunhacker \
|
||
http://127.0.0.1:4242/session/<SID>/message | jq '.[] | {id: .info.id, role: .info.role, finish: .info.finish, completed: .info.time.completed}'
|
||
```
|
||
|
||
**Expected outcomes**:
|
||
- If `msg_probe_1` appears with `role == "assistant"` → the doc-stated semantics are correct, `findAssistantMessage` is wrong, and we should poll for the message *whose own id matches* our submitted id.
|
||
- If `msg_probe_1` appears with `role == "user"` → c61254d's reading is correct, the bug is elsewhere (model dispatch, message decoding, etc.).
|
||
- If `msg_probe_1` doesn't appear at all → opencode is silently dropping it; investigate model or auth.
|
||
|
||
Record which case is real. The fix branches on this.
|
||
|
||
### Phase B — code changes (drive both cases)
|
||
|
||
Regardless of which probe outcome wins, `findAssistantMessage` is brittle (only handles one of two id-placement conventions and fails silently). Replace it with a more defensive lookup that handles both, in `apis/libai.lua` around lines 268–280:
|
||
|
||
```lua
|
||
local function findAssistantMessage(messages, submittedMessageId)
|
||
-- Case 1 (docs): our id is the assistant message id.
|
||
for _, m in ipairs(messages) do
|
||
if type(m) == 'table' and type(m.info) == 'table'
|
||
and m.info.id == submittedMessageId and m.info.role == 'assistant' then
|
||
return m;
|
||
end
|
||
end
|
||
-- Case 2 (c61254d empirical): our id is the user message id; assistant follows.
|
||
local seen = false;
|
||
for _, m in ipairs(messages) do
|
||
if type(m) == 'table' and type(m.info) == 'table' then
|
||
if m.info.id == submittedMessageId then
|
||
seen = true;
|
||
elseif seen and m.info.role == 'assistant' then
|
||
return m;
|
||
end
|
||
end
|
||
end
|
||
return nil;
|
||
end
|
||
```
|
||
|
||
This keeps the existing `pollMessage` flow and `isMessageComplete` check unchanged; it just stops missing the message when opencode's id-placement matches the docs.
|
||
|
||
### Phase C — `--verbose` for `programs/ai.lua`
|
||
|
||
Add a `--verbose` (and `-v`) flag, parsed before `command` is taken. When set:
|
||
- Pass a `log` callback into `ai.ask` (libai already accepts `log` in `luaExec`; extend `api.ask` to call `options.log(message)` from `pollMessage` at each poll iteration with: `'poll attempt #N: msgs=K, found=' .. (decoded and decoded.info.id or 'nil') .. ', complete=' .. tostring(isComplete)`).
|
||
- `programs/ai.lua`'s `askAndPrint` / `printSessions` / ping handler all print these via a `[ai]` prefix when `--verbose` is on.
|
||
|
||
This makes the next round of headless probing useful without re-instrumenting the code.
|
||
|
||
### Phase D — harness verification
|
||
|
||
With the dev server already running, drive the probes via the harness so the diagnostic loop is reproducible:
|
||
|
||
```
|
||
just trapos-exec '
|
||
settings.set("opencc.server_url","http://127.0.0.1:4242");
|
||
settings.set("opencc.password","jenesuispasunhacker");
|
||
settings.set("opencc.provider_id","anthropic");
|
||
settings.set("opencc.model_id","claude-opus-4-7");
|
||
settings.unset("opencc.session_id");
|
||
shell.run("/programs/ai.lua","--verbose","ping");
|
||
'
|
||
```
|
||
|
||
Expected: `pong` printed; no timeout. Then re-run without `--verbose` to confirm the regression no longer reproduces.
|
||
|
||
Also run `just check` and `just test`. Update the existing `tests/ai.lua` `'ask polls async message until completion'` test if the message-list shape changes to include the assistant-id-match case. Add a new test: `'ask finds reply when submitted id matches assistant message itself'` — message list contains a user message with a server id and an assistant message with `info.id == submitted` and `time.completed`.
|
||
|
||
### Files
|
||
|
||
- `apis/libai.lua` — rewrite `findAssistantMessage`; thread an optional `log` callback through `api.ask` → `pollMessage`.
|
||
- `programs/ai.lua` — add `--verbose`/`-v` flag, prefixing log lines with `[ai]`.
|
||
- `tests/ai.lua` — new test for assistant-id-match; adjust any test that asserts the exact poll output if needed.
|
||
- Version bumps per [ADR-0011](docs/adrs/adr-0011-repo-conventions.md): `packages/trapos-ai/ccpm.json` patch bump, `packages/trapos/ccpm.json` patch bump, mirror in `packages/index.json`, `manifest.json`.
|
||
|
||
## Verification
|
||
|
||
1. Probe outputs from Phase A captured (and pasted into the PR or commit message).
|
||
2. `just trapos-exec` invocation from Phase D returns `pong` on a fresh session id.
|
||
3. Repeat the same invocation — second ping also returns `pong` (i.e., second-call hang is gone).
|
||
4. `ai sessions` (added to the same probe script) returns the session list within a couple of seconds.
|
||
5. `just check` clean; `just test` passes; new test green.
|
||
|
||
## Out of scope (captured as follow-ups)
|
||
|
||
- **Harness vs production shutdown.** `startup/servers.lua` ends with `os.shutdown()` so the headless harness exits cleanly. In production this is undesirable because *any* shell exit shuts the computer down. Follow-up: gate the shutdown on a setting (`trapos.harness_mode`) or an environment signal so production keeps the shell up.
|
||
- **Auto-detect opencode default model** instead of requiring `opencc.provider_id` / `opencc.model_id` (kept from previous plan).
|
||
- **Session auto-recovery on poll timeout.** Optional `opencc.reset_session_on_timeout` setting that calls `clearSession()` after a timeout so the next run starts on a fresh session. Likely unnecessary once `findAssistantMessage` is fixed, but worth revisiting if the second-ping hang persists.
|