ADR 0015: Unified Boot Eventloop and Service-Name Bus

Status

Accepted

Date

2026-06-09

Context

Before this change, the TrapOS networking and boot model had accumulated several issues that were hard to fix incrementally:

  • Label collision was a silent footgun. Two machines sharing the same label both accepted and rebroadcast packets addressed to that label, since programs/router.lua:40 and apis/net.lua:35 both matched os.getComputerLabel() independently. The sender received duplicate responses and the wire carried duplicate retransmits.
  • _G.isRouterEnabled mutated send behavior across the codebase. The sendRaw function in apis/net.lua switched its transmit path based on a global flag set by the router program. As a result, the same function call meant different things depending on which machine ran it.
  • Every autostart server ran its own private eventloop. Each server file called net.start(), which delegated to el.startLoop(). With N autostart entries, parallel.waitForAll ran N coroutines, each pumping an independent os.pullEventRaw. This was wasteful and conceptually awkward: every event was delivered to every coroutine, but only one would have a relevant handler.
  • The router lived outside the eventloop entirely. programs/router.lua was a hand-rolled while true / os.pullEvent loop, structurally different from every other long-lived process in the repo.
  • Channel numbers leaked into every client. servers/ping-server.lua and programs/ping.lua both duplicated a PING_CHANNEL = 9 constant; there was no service registry. Adding a new service meant picking a free integer and replicating it on both ends.

Net's blast radius was small at this point — only programs/ping.lua and servers/ping-server.lua consumed it — so a clean break was cheaper than incremental patching.

Decision

Adopt three coordinated changes:

  1. One boot eventloop per machine. startup/servers.lua creates a single createEventLoop() instance, stores it at _G.bootEventLoop, runs autostart server files (which register handlers and return without blocking), then runs parallel.waitForAny(shellFn, eventLoopFn). The shell and the eventloop are the only two coroutines.

  2. Service-name addressing on a single bus channel. apis/net.lua exposes net.serve(name, handler), net.call(name, payload, opts), net.send(name, payload, opts), and net.listen(name, handler). All traffic flows on channel 10 and is demultiplexed inside the packet body via a service field. Channel numbers stop being a public concept. require('/apis/net')() returns a singleton bound to _G.bootEventLoop when present, otherwise an ephemeral instance.

  3. Router as a service on the boot eventloop. programs/router.lua registers handlers on the same boot eventloop everything else uses. It owns a TTL-based label map (extracted into apis/librouter.lua for testability). Machines with a label autostart servers/net-registrar.lua, which periodically broadcasts (id, label) so the router can resolve label-addressed packets. Duplicate label registrations are rejected with a printed warning. _G.isRouterEnabled is gone; the router service flips a local flag via net.setRouter(true) instead.
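The first two decisions can be illustrated with a minimal sketch in plain Lua. It is not the code in apis/net.lua: createEventLoop here is a stand-in written for this example, and createBus, loop.on, loop.dispatch, and the packet fields service and payload are assumed names. Only the channel number 10 and the idea of a service field come from this ADR. The event arguments are modelled on ComputerCraft's modem_message event (side, channel, replyChannel, message).

```lua
-- Hypothetical sketch: names are illustrative, not the real apis/net.lua.
local BUS_CHANNEL = 10 -- the single bus channel named in this ADR

-- A tiny event loop: handlers are registered up front, one loop dispatches.
local function createEventLoop()
  local handlers = {}
  local loop = {}
  function loop.on(eventName, fn)
    handlers[eventName] = handlers[eventName] or {}
    table.insert(handlers[eventName], fn)
  end
  -- Deliver one event to every handler registered under its name.
  function loop.dispatch(eventName, ...)
    for _, fn in ipairs(handlers[eventName] or {}) do
      fn(...)
    end
  end
  return loop
end

-- Service-name bus: one modem_message handler, many named services.
local function createBus(loop)
  local services = {}
  local bus = {}
  function bus.serve(name, handler)
    services[name] = handler
  end
  loop.on("modem_message", function(side, channel, replyChannel, packet)
    -- Ignore traffic on other channels and malformed packets.
    if channel ~= BUS_CHANNEL or type(packet) ~= "table" then return end
    local handler = services[packet.service]
    if handler then handler(packet.payload, packet) end
  end)
  return bus
end

-- "Autostart" phase: each service registers a handler and returns.
local loop = createEventLoop()
local bus = createBus(loop)
local seen = {}
bus.serve("ping", function(payload) seen[#seen + 1] = "ping:" .. payload end)
bus.serve("echo", function(payload) seen[#seen + 1] = "echo:" .. payload end)

-- Simulated modem traffic; on a real machine the loop would pull these events.
loop.dispatch("modem_message", "top", BUS_CHANNEL, BUS_CHANNEL, { service = "echo", payload = "hi" })
loop.dispatch("modem_message", "top", BUS_CHANNEL, BUS_CHANNEL, { service = "ping", payload = "1" })
loop.dispatch("modem_message", "top", 9, 9, { service = "ping", payload = "ignored" })
```

In this sketch, adding a service never touches a channel number: the only routing key a caller or server sees is the service name, and the packet on channel 9 is dropped before any handler runs.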

CLI programs stay standalone: net.call internally uses os.pullEvent with a timer, so programs do not need the boot eventloop to receive a response.
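A blocking call with a timeout might look roughly like the following. This is a sketch under stated assumptions, not the body of net.call: the env table, the id and replyTo packet fields, and the "timeout" error string are all hypothetical. On a real machine env.pullEvent and env.startTimer would correspond to os.pullEvent and os.startTimer; here they are injected so the example runs in plain Lua.

```lua
-- Hypothetical sketch of a request/response call that gives up on a timer.
local function call(env, service, payload, timeoutSeconds)
  local requestId = env.nextId()
  local timerId = env.startTimer(timeoutSeconds)
  env.transmit({ service = service, id = requestId, payload = payload })
  while true do
    -- timer events carry (timerId); modem_message events carry
    -- (side, channel, replyChannel, message).
    local event, a, _, _, message = env.pullEvent()
    if event == "timer" and a == timerId then
      return nil, "timeout"
    elseif event == "modem_message" and type(message) == "table"
        and message.replyTo == requestId then
      return message.payload
    end
    -- Any other event is unrelated to this call and is skipped.
  end
end

-- Fake environment that replays a scripted list of events.
local function fakeEnv(events)
  local sent, cursor, ids = {}, 0, 0
  local unpackFn = table.unpack or unpack
  return {
    sent = sent,
    nextId = function() ids = ids + 1; return ids end,
    startTimer = function() return 42 end,
    transmit = function(packet) sent[#sent + 1] = packet end,
    pullEvent = function()
      cursor = cursor + 1
      return unpackFn(events[cursor])
    end,
  }
end

-- A reply arrives after an unrelated key event.
local okEnv = fakeEnv({
  { "key", 30 },
  { "modem_message", "top", 10, 10, { replyTo = 1, payload = "pong" } },
})
local reply = call(okEnv, "ping", "hello", 2)

-- No reply arrives; the timer fires first.
local lateEnv = fakeEnv({ { "timer", 42 } })
local none, err = call(lateEnv, "ping", "hello", 2)
```

Because the loop pulls events itself, a CLI program written this way needs no handler on the boot eventloop. One trade-off of this sketch is that events pulled while waiting are discarded rather than requeued.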

Consequences

  • Adding a new networked service is now: write a servers/foo.lua that calls net.serve('foo', handler) and returns, then add it to a package's autostart. No channel allocation, no .start() blocking call.
  • The router program returns immediately instead of blocking the shell. Users type router once on the chosen machine and continue using the shell.
  • Label collisions are detected and rejected at registration time, with a clear warning, instead of causing silent duplicate delivery.
  • The ping API surface changed (net.sendRequest → net.call, net.listenRequest → net.serve). Out-of-tree consumers, if any existed, would need to migrate. Inside the repo only ping needed migration.
  • Programs that need to wait for events can still call os.pullEvent directly. However, if a program registers a long-lived handler on _G.bootEventLoop and then exits, the handler keeps firing with a stale closure. Programs should prefer call/send over serve/listen. This is documented in apis/net.lua but not enforced.
  • Tests for the router state machine live in tests/router.lua and exercise apis/librouter.lua with an injected clock. Tests for the net packet shape and dispatch live in tests/net.lua with a fake modem.
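To show why an injected clock makes the label map easy to test, here is a minimal sketch of a TTL-based map with duplicate rejection. It is an assumption-laden illustration, not the contents of apis/librouter.lua: newLabelMap, register, resolve, the 30-second TTL, and the error string are all hypothetical.

```lua
-- Hypothetical sketch; the real apis/librouter.lua may differ in names and shape.
-- `now` is an injected clock function so tests can control time.
local function newLabelMap(now, ttlSeconds)
  local entries = {} -- label -> { id = number, expires = number }
  local map = {}

  -- Register or refresh a label. Returns true, or false plus a reason when a
  -- different machine already holds an unexpired registration for the label.
  function map.register(id, label)
    local entry = entries[label]
    if entry and entry.expires > now() and entry.id ~= id then
      return false, "duplicate label: " .. label
    end
    entries[label] = { id = id, expires = now() + ttlSeconds }
    return true
  end

  -- Resolve a label to a machine id, or nil when unknown or expired.
  function map.resolve(label)
    local entry = entries[label]
    if not entry then return nil end
    if entry.expires <= now() then
      entries[label] = nil
      return nil
    end
    return entry.id
  end

  return map
end

-- The test controls time by mutating `t`; no real timer is involved.
local t = 0
local map = newLabelMap(function() return t end, 30)

local first = map.register(1, "storage")          -- accepted
local second, reason = map.register(2, "storage") -- rejected: label is held by id 1
local whileFresh = map.resolve("storage")         -- still resolves to id 1
t = 31                                            -- advance past the TTL
local afterExpiry = map.resolve("storage")        -- entry has expired
local takeover = map.register(2, "storage")       -- label is free again
```

With the clock passed in as a function, expiry can be exercised by changing a variable instead of waiting, which is the property the router tests rely on.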

Out of Scope

  • Multi-router topologies. The single-router assumption stays; a network is expected to run router on exactly one machine.
  • Retry and acknowledgement primitives beyond the existing per-call timeout.
  • Unifying libtui, libai, and tuidemo eventloops. They remain private; they are presentation/AI concerns, not network plumbing.
  • The ccpm package manager. It is recent, tested, and not in pain.