cc-libs/NEXT_PLAN.md

# Diagnose & fix opencode async polling (round 2)

## Context

The first round added the `model` field to `/prompt_async` (commit `1c07435`). On the dev server (`http://127.0.0.1:4242`, password `jenesuispasunhacker`) the user now sees two symptoms:

1. **First `ai ping` on a fresh session "crashes" the computer.** The session is created, the user message is posted, opencode visibly generates a pong reply — but `ai` returns an error path and `startup/servers.lua` runs `parallel.waitForAny(shell, eventloop)` followed by `os.shutdown()`. Any path where either coroutine returns shuts the machine down. So the visible "crash" is really: *some* exit in our coroutine + harness-style shutdown logic in production.
2. **Subsequent `ai` calls (including `ai sessions`) block.** The user reports the second `ai ping` is "blocked indefinitely" and even `ai sessions` doesn't return — which is odd because `ai.listSessions` does not poll, so this might either be a misreport or evidence of a wider hang.

The most load-bearing hypothesis is that `findAssistantMessage` in `apis/libai.lua` (introduced by `c61254d`) never matches: it assumes our submitted `messageID` is the **user** message id and looks for the *next* assistant message after it. The opencode docs we wrote in the same commit say the submitted id is the **assistant** message id (`docs/opencode_api.md` line 105: "Get a message by ID. Opencode validates caller-provided message IDs; use IDs starting with `msg`."). If the docs are right, our user message has a server-generated id, our submitted id appears on the assistant message itself, `seenSubmitted` stays `false`, and we poll until `opencc.poll_timeout_seconds` expires.

This explains both observations:
- "First ping crashes" → poll timeout returns `false, 'delai depasse en attendant la reponse AI'`. `programs/ai.lua` prints the error and returns. Then the shell prompt gets shown again — *unless* something else in the program causes a Lua error in the call chain that makes either coroutine exit. We need a probe to confirm.
- "Second ping blocks" → opencode keeps the previous assistant generation in-flight (we never acknowledged it) and queues new prompts behind it, OR the next poll loop is the same one, just on a session where opencode now refuses to generate a new turn.

The user also asked for two ergonomics:
- A `--verbose` flag on `programs/ai.lua` so headless probes can see polling progress.
- A way to disable `os.shutdown()` outside the harness (out of scope for the bug fix, captured as a follow-up).

## Plan

### Phase A — confirm the hypothesis with probes (no code changes)

Run these against `http://127.0.0.1:4242` with Basic auth `opencode:jenesuispasunhacker`.

1. Create a session and capture its id:
   ```
   curl -s -u opencode:jenesuispasunhacker -X POST \
     -H 'Content-Type: application/json' \
     -d '{"title":"probe"}' \
     http://127.0.0.1:4242/session
   ```
2. POST a known messageID via prompt_async:
   ```
   curl -s -u opencode:jenesuispasunhacker -X POST \
     -H 'Content-Type: application/json' \
     -d '{"messageID":"msg_probe_1","parts":[{"type":"text","text":"reply with exactly: pong"}],"model":{"providerID":"anthropic","modelID":"claude-opus-4-7"}}' \
     http://127.0.0.1:4242/session/<SID>/prompt_async
   ```
3. Poll the list and inspect ids/roles:
   ```
   curl -s -u opencode:jenesuispasunhacker \
     http://127.0.0.1:4242/session/<SID>/message | jq '.[] | {id: .info.id, role: .info.role, finish: .info.finish, completed: .info.time.completed}'
   ```

**Expected outcomes**:
- If `msg_probe_1` appears with `role == "assistant"` → the doc-stated semantics are correct, `findAssistantMessage` is wrong, and we should poll for the message *whose own id matches* our submitted id.
- If `msg_probe_1` appears with `role == "user"` → c61254d's reading is correct, the bug is elsewhere (model dispatch, message decoding, etc.).
- If `msg_probe_1` doesn't appear at all → opencode is silently dropping it; investigate model or auth.

Record which case is real. The fix branches on this.

### Phase B — code changes (drive both cases)

Regardless of which probe outcome wins, `findAssistantMessage` is brittle (only handles one of two id-placement conventions and fails silently). Replace it with a more defensive lookup that handles both, in `apis/libai.lua` around lines 268–280:

```lua
local function findAssistantMessage(messages, submittedMessageId)
  -- Case 1 (docs): our id is the assistant message id.
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table'
        and m.info.id == submittedMessageId and m.info.role == 'assistant' then
      return m;
    end
  end
  -- Case 2 (c61254d empirical): our id is the user message id; assistant follows.
  local seen = false;
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table' then
      if m.info.id == submittedMessageId then
        seen = true;
      elseif seen and m.info.role == 'assistant' then
        return m;
      end
    end
  end
  return nil;
end
```

This keeps the existing `pollMessage` flow and `isMessageComplete` check unchanged; it just stops missing the message when opencode's id-placement matches the docs.

### Phase C — `--verbose` for `programs/ai.lua`

Add a `--verbose` (and `-v`) flag, parsed before `command` is taken. When set:
- Pass a `log` callback into `ai.ask` (libai already accepts `log` in `luaExec`; extend `api.ask` to call `options.log(message)` from `pollMessage` at each poll iteration with: `'poll attempt #N: msgs=K, found=' .. (decoded and decoded.info.id or 'nil') .. ', complete=' .. tostring(isComplete)`).
- `programs/ai.lua`'s `askAndPrint` / `printSessions` / ping handler all print these via a `[ai]` prefix when `--verbose` is on.

This makes the next round of headless probing useful without re-instrumenting the code.

### Phase D — harness verification

With the dev server already running, drive the probes via the harness so the diagnostic loop is reproducible:

```
just trapos-exec '
  settings.set("opencc.server_url","http://127.0.0.1:4242");
  settings.set("opencc.password","jenesuispasunhacker");
  settings.set("opencc.provider_id","anthropic");
  settings.set("opencc.model_id","claude-opus-4-7");
  settings.unset("opencc.session_id");
  shell.run("/programs/ai.lua","--verbose","ping");
'
```

Expected: `pong` printed; no timeout. Then re-run without `--verbose` to confirm the regression no longer reproduces.

Also run `just check` and `just test`. Update the existing `tests/ai.lua` `'ask polls async message until completion'` test if the message-list shape changes to include the assistant-id-match case. Add a new test: `'ask finds reply when submitted id matches assistant message itself'` — message list contains a user message with a server id and an assistant message with `info.id == submitted` and `time.completed`.

### Files

- `apis/libai.lua` — rewrite `findAssistantMessage`; thread an optional `log` callback through `api.ask` → `pollMessage`.
- `programs/ai.lua` — add `--verbose`/`-v` flag, prefixing log lines with `[ai]`.
- `tests/ai.lua` — new test for assistant-id-match; adjust any test that asserts the exact poll output if needed.
- Version bumps per [ADR-0011](docs/adrs/adr-0011-repo-conventions.md): `packages/trapos-ai/ccpm.json` patch bump, `packages/trapos/ccpm.json` patch bump, mirror in `packages/index.json`, `manifest.json`.

## Verification

1. Probe outputs from Phase A captured (and pasted into the PR or commit message).
2. `just trapos-exec` invocation from Phase D returns `pong` on a fresh session id.
3. Repeat the same invocation — second ping also returns `pong` (i.e., second-call hang is gone).
4. `ai sessions` (added to the same probe script) returns the session list within a couple of seconds.
5. `just check` clean; `just test` passes; new test green.

## Out of scope (captured as follow-ups)

- **Harness vs production shutdown.** `startup/servers.lua` ends with `os.shutdown()` so the headless harness exits cleanly. In production this is undesirable because *any* shell exit shuts the computer down. Follow-up: gate the shutdown on a setting (`trapos.harness_mode`) or an environment signal so production keeps the shell up.
- **Auto-detect opencode default model** instead of requiring `opencc.provider_id` / `opencc.model_id` (kept from previous plan).
- **Session auto-recovery on poll timeout.** Optional `opencc.reset_session_on_timeout` setting that calls `clearSession()` after a timeout so the next run starts on a fresh session. Likely unnecessary once `findAssistantMessage` is fixed, but worth revisiting if the second-ping hang persists.