cc-libs/NEXT_PLAN.md

8.5 KiB
Raw Blame History

Diagnose & fix opencode async polling (round 2)

Context

The first round added the model field to /prompt_async (commit 1c07435). On the dev server (http://127.0.0.1:4242, password jenesuispasunhacker) the user now sees two symptoms:

  1. First ai ping on a fresh session "crashes" the computer. The session is created, the user message is posted, opencode visibly generates a pong reply — but ai returns an error path and startup/servers.lua runs parallel.waitForAny(shell, eventloop) followed by os.shutdown(). Any path where either coroutine returns shuts the machine down. So the visible "crash" is really: some exit in our coroutine + harness-style shutdown logic in production.
  2. Subsequent ai calls (including ai sessions) block. The user reports the second ai ping is "blocked indefinitely" and even ai sessions doesn't return — which is odd because ai.listSessions does not poll, so this might either be a misreport or evidence of a wider hang.

The most load-bearing hypothesis is that findAssistantMessage in apis/libai.lua (introduced by c61254d) never matches: it assumes our submitted messageID is the user message id and looks for the next assistant message after it. The opencode docs we wrote in the same commit say the submitted id is the assistant message id (docs/opencode_api.md line 105: "Get a message by ID. Opencode validates caller-provided message IDs; use IDs starting with msg."). If the docs are right, our user message has a server-generated id, our submitted id appears on the assistant message itself, seenSubmitted stays false, and we poll until opencc.poll_timeout_seconds expires.

This explains both observations:

  • "First ping crashes" → poll timeout returns false, 'delai depasse en attendant la reponse AI'. programs/ai.lua prints the error and returns. Then the shell prompt gets shown again — unless something else in the program causes a Lua error in the call chain that makes either coroutine exit. We need a probe to confirm.
  • "Second ping blocks" → opencode keeps the previous assistant generation in-flight (we never acknowledged it) and queues new prompts behind it, OR the next poll loop is the same one, just on a session where opencode now refuses to generate a new turn.

The user also asked for two ergonomics:

  • A --verbose flag on programs/ai.lua so headless probes can see polling progress.
  • A way to disable os.shutdown() outside the harness (out of scope for the bug fix, captured as a follow-up).

Plan

Phase A — confirm the hypothesis with probes (no code changes)

Run these against http://127.0.0.1:4242 with Basic auth opencode:jenesuispasunhacker.

  1. Create a session and capture its id:
    curl -s -u opencode:jenesuispasunhacker -X POST \
      -H 'Content-Type: application/json' \
      -d '{"title":"probe"}' \
      http://127.0.0.1:4242/session
    
  2. POST a known messageID via prompt_async:
    curl -s -u opencode:jenesuispasunhacker -X POST \
      -H 'Content-Type: application/json' \
      -d '{"messageID":"msg_probe_1","parts":[{"type":"text","text":"reply with exactly: pong"}],"model":{"providerID":"anthropic","modelID":"claude-opus-4-7"}}' \
      http://127.0.0.1:4242/session/<SID>/prompt_async
    
  3. Poll the list and inspect ids/roles:
    curl -s -u opencode:jenesuispasunhacker \
      http://127.0.0.1:4242/session/<SID>/message | jq '.[] | {id: .info.id, role: .info.role, finish: .info.finish, completed: .info.time.completed}'
    

Expected outcomes:

  • If msg_probe_1 appears with role == "assistant" → the doc-stated semantics are correct, findAssistantMessage is wrong, and we should poll for the message whose own id matches our submitted id.
  • If msg_probe_1 appears with role == "user" → c61254d's reading is correct, the bug is elsewhere (model dispatch, message decoding, etc.).
  • If msg_probe_1 doesn't appear at all → opencode is silently dropping it; investigate model or auth.

Record which case is real. The fix branches on this.

Phase B — code changes (drive both cases)

Regardless of which probe outcome wins, findAssistantMessage is brittle (only handles one of two id-placement conventions and fails silently). Replace it with a more defensive lookup that handles both, in apis/libai.lua around lines 268280:

local function findAssistantMessage(messages, submittedMessageId)
  -- Case 1 (docs): our id is the assistant message id.
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table'
        and m.info.id == submittedMessageId and m.info.role == 'assistant' then
      return m;
    end
  end
  -- Case 2 (c61254d empirical): our id is the user message id; assistant follows.
  local seen = false;
  for _, m in ipairs(messages) do
    if type(m) == 'table' and type(m.info) == 'table' then
      if m.info.id == submittedMessageId then
        seen = true;
      elseif seen and m.info.role == 'assistant' then
        return m;
      end
    end
  end
  return nil;
end

This keeps the existing pollMessage flow and isMessageComplete check unchanged; it just stops missing the message when opencode's id-placement matches the docs.

Phase C — --verbose for programs/ai.lua

Add a --verbose (and -v) flag, parsed before command is taken. When set:

  • Pass a log callback into ai.ask (libai already accepts log in luaExec; extend api.ask to call options.log(message) from pollMessage at each poll iteration with: 'poll attempt #N: msgs=K, found=' .. (decoded and decoded.info.id or 'nil') .. ', complete=' .. tostring(isComplete)).
  • programs/ai.lua's askAndPrint / printSessions / ping handler all print these via a [ai] prefix when --verbose is on.

This makes the next round of headless probing useful without re-instrumenting the code.

Phase D — harness verification

With the dev server already running, drive the probes via the harness so the diagnostic loop is reproducible:

just trapos-exec '
  settings.set("opencc.server_url","http://127.0.0.1:4242");
  settings.set("opencc.password","jenesuispasunhacker");
  settings.set("opencc.provider_id","anthropic");
  settings.set("opencc.model_id","claude-opus-4-7");
  settings.unset("opencc.session_id");
  shell.run("/programs/ai.lua","--verbose","ping");
'

Expected: pong printed; no timeout. Then re-run without --verbose to confirm the regression no longer reproduces.

Also run just check and just test. Update the existing tests/ai.lua 'ask polls async message until completion' test if the message-list shape changes to include the assistant-id-match case. Add a new test: 'ask finds reply when submitted id matches assistant message itself' — message list contains a user message with a server id and an assistant message with info.id == submitted and time.completed.

Files

  • apis/libai.lua — rewrite findAssistantMessage; thread an optional log callback through api.askpollMessage.
  • programs/ai.lua — add --verbose/-v flag, prefixing log lines with [ai].
  • tests/ai.lua — new test for assistant-id-match; adjust any test that asserts the exact poll output if needed.
  • Version bumps per ADR-0011: packages/trapos-ai/ccpm.json patch bump, packages/trapos/ccpm.json patch bump, mirror in packages/index.json, manifest.json.

Verification

  1. Probe outputs from Phase A captured (and pasted into the PR or commit message).
  2. just trapos-exec invocation from Phase D returns pong on a fresh session id.
  3. Repeat the same invocation — second ping also returns pong (i.e., second-call hang is gone).
  4. ai sessions (added to the same probe script) returns the session list within a couple of seconds.
  5. just check clean; just test passes; new test green.

Out of scope (captured as follow-ups)

  • Harness vs production shutdown. startup/servers.lua ends with os.shutdown() so the headless harness exits cleanly. In production this is undesirable because any shell exit shuts the computer down. Follow-up: gate the shutdown on a setting (trapos.harness_mode) or an environment signal so production keeps the shell up.
  • Auto-detect opencode default model instead of requiring opencc.provider_id / opencc.model_id (kept from previous plan).
  • Session auto-recovery on poll timeout. Optional opencc.reset_session_on_timeout setting that calls clearSession() after a timeout so the next run starts on a fresh session. Likely unnecessary once findAssistantMessage is fixed, but worth revisiting if the second-ping hang persists.