ADR 0002: Eventloop Substrate, Service Bus, and Async Discipline

Status

Accepted

Date

2026-06-07

Context

ComputerCraft is event-driven. Direct os.pullEvent loops are easy to write but hard to compose when multiple things need to happen at the same time. Without a single substrate, the repo accumulated several distinct problems:

  • Each long-lived process owned a private event loop, including the router (programs/router.lua was a hand-rolled while true / os.pullEvent). With N autostart servers, parallel.waitForAll ran N coroutines each pumping an independent os.pullEventRaw. Events were broadcast to every coroutine but only one would have a relevant handler — wasteful and conceptually awkward.
  • _G.isRouterEnabled mutated send behavior across the codebase. apis/net.lua sendRaw switched its transmit path based on a global flag set by the router program, so the same function call meant different things depending on which machine ran it.
  • Channel numbers leaked into every client. servers/ping-server.lua and programs/ping.lua both duplicated a PING_CHANNEL constant; there was no service registry. Adding a new service meant picking a free integer and replicating it on both ends.
  • Label collision was a silent footgun. Two machines sharing the same label both accepted and rebroadcast packets addressed to that label, producing duplicate responses and duplicate retransmits.
  • os.sleep looked innocent but broke the substrate. Its CC:Tweaked implementation yields via os.pullEvent("timer"). While the sleep is in flight, the enclosing eventloop's os.pullEventRaw is paused; non-timer events are silently discarded; even eventloop.setTimeout callbacks scheduled before the sleep cannot fire until it returns. This bit apis/libai.lua pollMessage, which used a sleep-based throttle and froze the whole loop the moment a caller invoked it from inside a handler.
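The failure mode in the last bullet can be reproduced outside ComputerCraft with plain coroutines. The sketch below is a simulation under stated assumptions, not CC:Tweaked source: pullEvent, fakeSleep, handlerLoop, and run are hypothetical stand-ins that imitate a filtered pull dropping every non-matching event.

```lua
-- Simulation only: these helpers imitate os.pullEvent / sleep semantics
-- with plain coroutines; none of them are real CC:Tweaked functions.
local unpack = table.unpack or unpack

-- Yield until an event arrives; with a filter, non-matching events are dropped.
local function pullEvent(filter)
  while true do
    local ev = { coroutine.yield() }
    if filter == nil or ev[1] == filter then return unpack(ev) end
  end
end

-- Sleep-style wait: only the matching "timer" event ends it.
local function fakeSleep(timerId)
  repeat
    local _, id = pullEvent("timer")
  until id == timerId
end

local seen = {}

-- A handler loop that sleeps in the middle of dispatching.
local function handlerLoop()
  while true do
    local name = pullEvent()
    seen[#seen + 1] = name
    if name == "key" then fakeSleep(1) end -- blocks the loop until timer 1 fires
  end
end

-- Feed a scripted event queue into the coroutine, one event per resume.
local function run(events)
  local co = coroutine.create(handlerLoop)
  assert(coroutine.resume(co)) -- run to the first yield
  for _, ev in ipairs(events) do
    assert(coroutine.resume(co, unpack(ev)))
  end
end

run({
  { "key" },            -- handled, then the loop starts sleeping
  { "modem_message" },  -- arrives during the sleep: silently dropped
  { "timer", 1 },       -- ends the sleep
  { "modem_message" },  -- handled normally
})

print(table.concat(seen, ",")) -- key,modem_message
```

Only one of the two modem_message events reaches the handler loop; the one that arrived while the sleep was in flight is gone.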

Net's blast radius at the time of the bus rewrite was small (only ping consumed it), so a clean break was cheaper than incremental patching.

Decision

1. Eventloop is the async substrate

New async code uses apis/eventloop.lua. Event handlers, timers, server listeners, and UI behavior compose through the eventloop instead of each feature owning its own blocking loop.

  • Prefer eventloop.register, setTimeout, onStart, onStop, and startLoop for async behavior.
  • APIs that listen for events accept an existing event loop as a constructor argument, the way apis/net.lua does. Do not create a private loop inside a module.
  • Direct os.pullEvent loops should be rare and justified (CLI programs waiting for a single reply are the main exception).
  • A handler that returns api.STOP auto-unregisters.
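As a rough illustration of these rules, the sketch below builds a tiny stand-in eventloop and a module that accepts it as an argument. Everything here is hypothetical: createEventLoop, register, and STOP borrow names from this ADR, but the signatures are assumptions, and dispatch is an invented helper that injects events by hand instead of pulling them from the OS.

```lua
-- Hypothetical stand-in for apis/eventloop.lua; the real module's
-- signatures may differ. Events are injected by hand instead of pulled.
local STOP = {} -- unique sentinel, playing the role of api.STOP

local function createEventLoop()
  local handlers = {} -- event name -> list of handler functions
  local loop = {}

  function loop.register(name, fn)
    handlers[name] = handlers[name] or {}
    table.insert(handlers[name], fn)
  end

  -- Deliver one event; handlers that return STOP are unregistered.
  function loop.dispatch(name, ...)
    local list = handlers[name]
    if not list then return end
    local snapshot = {}
    for i, fn in ipairs(list) do snapshot[i] = fn end
    for _, fn in ipairs(snapshot) do
      if fn(...) == STOP then
        for i = #list, 1, -1 do
          if list[i] == fn then
            table.remove(list, i)
            break
          end
        end
      end
    end
  end

  return loop
end

local loop = createEventLoop()
local log = {}

-- The module accepts the caller's loop instead of creating its own.
local function newCounter(eventloop)
  local count = 0
  eventloop.register("tick", function()
    count = count + 1
    log[#log + 1] = "tick " .. count
    if count == 2 then return STOP end -- auto-unregister after two ticks
  end)
end

newCounter(loop)
loop.dispatch("tick")
loop.dispatch("tick")
loop.dispatch("tick") -- no handler left; nothing is logged
print(#log) -- 2
```

Passing the loop in keeps the module free of its own blocking loop, so several such modules can share one dispatcher.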

2. One boot eventloop and a service-name bus

startup/servers.lua creates a single createEventLoop() instance, stores it at _G.bootEventLoop, runs autostart server files (which register handlers and return without blocking), then runs parallel.waitForAny(shellFn, eventLoopFn). The shell and the eventloop are the only two coroutines.

apis/net.lua exposes a service-name bus on a single channel:

  • net.serve(name, handler) — register a server handler (server-side).
  • net.call(name, payload, opts) — request/response with timeout (client-side).
  • net.send(name, payload, opts) — fire-and-forget (client-side).
  • net.listen(name, handler) — passive listener.

All traffic flows on channel 10 and is demultiplexed inside the packet body via a service field. Channel numbers stop being a public concept. require('/apis/net')() returns a singleton bound to _G.bootEventLoop when present, otherwise an ephemeral instance. CLI programs stay standalone: net.call internally uses os.pullEvent with a timer, so programs do not need the boot eventloop to receive a response.
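The demultiplexing step can be pictured with a small dispatcher. This is a sketch under assumptions, not the apis/net.lua internals: the packet layout ({ service, payload }), createBus, and onMessage are hypothetical names chosen for illustration; only the channel number and the service field come from the text above.

```lua
-- Hypothetical sketch of service-name demultiplexing; the packet layout and
-- function names are assumptions, not the actual apis/net.lua internals.
local NET_CHANNEL = 10 -- the single shared channel named in the ADR

local function createBus()
  local services = {} -- service name -> handler
  local bus = {}

  function bus.serve(name, handler)
    assert(services[name] == nil, "service already registered: " .. name)
    services[name] = handler
  end

  -- Called for every message on NET_CHANNEL; routes on packet.service.
  function bus.onMessage(channel, packet)
    if channel ~= NET_CHANNEL then return nil end
    if type(packet) ~= "table" then return nil end
    local handler = services[packet.service]
    if not handler then return nil end -- unknown service: drop
    return handler(packet.payload, packet)
  end

  return bus
end

local bus = createBus()
bus.serve("ping", function(payload) return "pong:" .. tostring(payload) end)
bus.serve("time", function() return "12:00" end)

print(bus.onMessage(10, { service = "ping", payload = "hi" })) -- pong:hi
print(bus.onMessage(10, { service = "nope" }))                 -- nil
print(bus.onMessage(11, { service = "ping", payload = "hi" })) -- nil
```

Because routing keys are strings in the packet body, adding a service never requires picking a new channel number.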

programs/router.lua registers handlers on the same boot eventloop everything else uses. It owns a TTL-based label map extracted into apis/librouter.lua for testability. Machines with a label autostart servers/net-registrar.lua, which periodically broadcasts (id, label) so the router can resolve label-addressed packets. Duplicate label registrations are rejected with a printed warning. _G.isRouterEnabled is gone; the router service flips a local flag via net.setRouter(true) instead.
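A TTL-based label map with duplicate rejection might look roughly like the following. This is a hypothetical sketch, not the contents of apis/librouter.lua: createLabelMap, register, resolve, and the 30-second TTL are all assumptions. The clock is injected so the behavior can be exercised without real time passing.

```lua
-- Hypothetical sketch of a TTL label map; apis/librouter.lua may differ.
-- The clock is injected so tests can advance time deterministically.
local function createLabelMap(now, ttl)
  local entries = {} -- label -> { id = number, expires = number }
  local map = {}

  -- Returns true on success, or false plus a reason on a duplicate label.
  function map.register(id, label)
    local e = entries[label]
    if e and e.expires > now() and e.id ~= id then
      return false, "label '" .. label .. "' already held by id " .. e.id
    end
    entries[label] = { id = id, expires = now() + ttl }
    return true
  end

  -- Returns the id for a label, or nil if unknown or expired.
  function map.resolve(label)
    local e = entries[label]
    if not e then return nil end
    if e.expires <= now() then
      entries[label] = nil
      return nil
    end
    return e.id
  end

  return map
end

local t = 0
local map = createLabelMap(function() return t end, 30)

print(map.register(1, "base"))   -- true
print(map.register(2, "base"))   -- false, plus a warning string
t = 10
print(map.register(1, "base"))   -- true (same id refreshes the TTL)
t = 45
print(map.resolve("base"))       -- nil (entry expired at t = 40)
print(map.register(2, "base"))   -- true (label is free again)
```

Re-registration by the same id refreshes the entry, which matches the periodic broadcast described above; a different id is refused until the entry expires.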

3. os.sleep discipline

In library, server, and program code that may run inside an eventloop (directly or transitively), use eventloop.setTimeout for any waiting, throttling, polling, or retry-with-delay. Libraries that need to wait must take an eventloop factory through their constructor rather than hardcoding a sleep call. apis/net.lua sendRequest is the canonical private-eventloop pattern: create a private eventloop, schedule the wait through setTimeout, then runLoop until the work resolves. The call is synchronous from the caller's perspective, but the dispatcher stays alive internally so handlers can compose around it via parallel.waitForAll.

os.sleep remains acceptable only in narrow cases:

  1. One-shot programs that are purely sequential and register no event handlers — a programs/foo.lua that prints, sleeps, prints again, and exits.
  2. parallel.waitForAny(task, function() sleep(t); end) used as an isolated guard to bound an inner task (e.g. the AI Lua-exec sandbox in apis/libai.lua and the parallel.waitForAny-driven per-case timer in apis/libtest.lua). The guard sleep is private to its own coroutine group; it does not block anything external.
  3. Tests that are themselves driven by libtest's per-case timeout (see ADR-0007).

New code must not expose a sleep injection point on its constructor. If a wait is needed, accept an eventloop factory and schedule through setTimeout. Tests substitute a synchronous deterministic eventloop fake the same way they substitute http or settings.
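To make the factory-injection idea concrete, here is a minimal sketch of a synchronous eventloop fake with virtual time, plus a retry-with-delay function that schedules through it. All names and signatures are assumptions made for illustration: createFakeEventLoop, fetchWithRetry, the setTimeout(fn, delay) argument order, and runLoop returning the virtual time are hypothetical, not the real apis/eventloop.lua contract.

```lua
-- Hypothetical fake: a synchronous eventloop double with virtual time.
-- Method names follow the ADR; signatures are assumptions.
local function createFakeEventLoop()
  local now, timers, loop = 0, {}, {}

  function loop.setTimeout(fn, delay)
    timers[#timers + 1] = { at = now + delay, fn = fn }
  end

  -- Run timers in due-time order until none remain; no real waiting.
  function loop.runLoop()
    while #timers > 0 do
      local best = 1
      for i = 2, #timers do
        if timers[i].at < timers[best].at then best = i end
      end
      local timer = table.remove(timers, best)
      now = timer.at
      timer.fn()
    end
    return now
  end

  return loop
end

-- Library code: retry with delay, scheduled through the injected factory.
local function fetchWithRetry(newEventLoop, attempt, maxTries, delay)
  local loop, result, tries = newEventLoop(), nil, 0
  local function try()
    tries = tries + 1
    result = attempt(tries)
    if result == nil and tries < maxTries then
      loop.setTimeout(try, delay) -- wait without calling sleep
    end
  end
  loop.setTimeout(try, 0)
  local elapsed = loop.runLoop()
  return result, tries, elapsed
end

local result, tries, elapsed = fetchWithRetry(createFakeEventLoop, function(n)
  if n == 3 then return "ok" end -- succeed on the third attempt
end, 5, 2)
print(result, tries, elapsed) -- ok 3 4
```

The test double advances virtual time instantly, so a retry schedule that would take seconds in-game completes immediately and deterministically.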

Consequences

  • Adding a new networked service is now: write a servers/foo.lua that calls net.serve('foo', handler) and returns, then add it to a package's autostart. No channel allocation, no .start() blocking call.
  • The router program returns immediately instead of blocking the shell. Users type router once on the chosen machine and continue using the shell.
  • Label collisions are detected and rejected at registration time, with a clear warning, instead of causing silent duplicate delivery.
  • A router must still be running somewhere on the network for cross-machine label-addressed packets to be delivered; without one, non-router senders produce packets with routerId = nil and consumers drop them on receive.
  • Programs that need to wait for events can still call os.pullEvent directly. However, if a program registers a long-lived handler on _G.bootEventLoop and then exits, the handler keeps firing with a stale closure. Programs should therefore prefer call/send over serve/listen. This is documented in apis/net.lua but not enforced.
  • Tests for the router state machine live in tests/router.lua and exercise apis/librouter.lua with an injected clock. Tests for the net packet shape and dispatch live in tests/net.lua with a fake modem.
  • Slightly more ceremony in "synchronous-looking" library functions that wait: a private eventloop plus a small attempt/finish pair. The benefit is clean composition with any caller's eventloop.
  • Test fakes shift from a sleep stub to a synchronous eventloop double. Ergonomics are comparable; the eventloop fake additionally lets tests observe pending and stopped state, catching leaks the sleep stub would have missed.
  • Existing call sites are migrated opportunistically when they cause observable bugs. The first os.sleep migration is apis/libai.lua.

Out of Scope

  • Multi-router topologies. The single-router assumption stays; a network is expected to run router on exactly one machine.
  • Retry and acknowledgement primitives beyond the existing per-call timeout.
  • Unifying libtui, libai, and tuidemo eventloops. They remain private; they are presentation/AI concerns, not network plumbing.