# ADR 0015: Unified Boot Eventloop and Service-Name Bus

## Status

Accepted

## Date

2026-06-09

## Context

Before this change, the TrapOS networking and boot model had accumulated several issues that were hard to fix incrementally:

- **Label collision was a silent footgun.** Two machines sharing the same label both accepted and rebroadcast packets addressed to that label, since `programs/router.lua:40` and `apis/net.lua:35` both matched `os.getComputerLabel()` independently. The sender received duplicate responses and the wire carried duplicate retransmits.
- **`_G.isRouterEnabled` mutated send behavior across the codebase.** [`apis/net.lua`](../../apis/net.lua) `sendRaw` switched its transmit path based on a global flag set by the router program. This made the same function call mean different things depending on which machine ran it.
- **Every autostart server ran its own private eventloop.** Each server file called `net.start()`, which delegated to `el.startLoop()`. With N autostart entries, `parallel.waitForAll` ran N coroutines, each pumping an independent `os.pullEventRaw`. This was wasteful and conceptually awkward: every event was broadcast to every coroutine, but only one would have a relevant handler.
- **The router lived outside the eventloop entirely.** [`programs/router.lua`](../../programs/router.lua) was a hand-rolled `while true / os.pullEvent` loop, structurally different from every other long-lived process in the repo.
- **Channel numbers leaked into every client.** `servers/ping-server.lua` and `programs/ping.lua` each hard-coded the same `PING_CHANNEL = 9` constant, and there was no service registry. Adding a new service meant picking a free integer and replicating it on both ends.

Net's blast radius was small at this point: only `programs/ping.lua` and `servers/ping-server.lua` consumed it, so a clean break was cheaper than incremental patching.

## Decision

Adopt three coordinated changes:

1. **One boot eventloop per machine.** `startup/servers.lua` creates a single `createEventLoop()` instance, stores it at `_G.bootEventLoop`, runs autostart server files (which register handlers and return without blocking), then runs `parallel.waitForAny(shellFn, eventLoopFn)`. The shell and the eventloop are the only two coroutines.
2. **Service-name addressing on a single bus channel.** [`apis/net.lua`](../../apis/net.lua) exposes `net.serve(name, handler)`, `net.call(name, payload, opts)`, `net.send(name, payload, opts)`, and `net.listen(name, handler)`. All traffic flows on channel `10` and is demultiplexed inside the packet body via a `service` field. Channel numbers stop being a public concept. `require('/apis/net')()` returns a singleton bound to `_G.bootEventLoop` when present, otherwise an ephemeral instance.
3. **Router as a service on the boot eventloop.** [`programs/router.lua`](../../programs/router.lua) registers handlers on the same boot eventloop everything else uses. It owns a TTL-based label map (extracted into [`apis/librouter.lua`](../../apis/librouter.lua) for testability). Machines with a label autostart [`servers/net-registrar.lua`](../../servers/net-registrar.lua), which periodically broadcasts `(id, label)` so the router can resolve label-addressed packets. Duplicate label registrations are rejected with a printed warning. `_G.isRouterEnabled` is gone; the router service flips a local flag via `net.setRouter(true)` instead.

CLI programs stay standalone: `net.call` internally uses `os.pullEvent` with a timer, so programs do not need the boot eventloop to receive a response.
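
The service-name demultiplexing described in change 2 can be illustrated with a minimal pure-Lua sketch. It is not the code in `apis/net.lua`: the names `newBus`, `serve`, `dispatch`, and `BUS_CHANNEL` are hypothetical, and a plain table of fake events stands in for `os.pullEvent` so the example runs outside ComputerCraft. It only shows the idea that one channel carries all traffic and a `service` field in the packet body selects the handler.

```lua
-- Sketch only: a pure-Lua model of "one bus channel, many named services".
-- newBus, serve, dispatch, and BUS_CHANNEL are hypothetical names;
-- the real apis/net.lua may be structured differently.

local BUS_CHANNEL = 10 -- the single bus channel named in this ADR

local function newBus()
  local handlers = {} -- service name -> handler function
  local bus = {}

  -- Register a request handler under a service name.
  function bus.serve(name, handler)
    assert(handlers[name] == nil, "service already registered: " .. name)
    handlers[name] = handler
  end

  -- Route one incoming packet by its `service` field.
  -- Returns the handler's reply, or nil when no handler applies.
  function bus.dispatch(channel, packet)
    if channel ~= BUS_CHANNEL then return nil end
    if type(packet) ~= "table" then return nil end
    local handler = handlers[packet.service]
    if handler == nil then return nil end
    return handler(packet.payload)
  end

  return bus
end

-- Two services share one bus; neither picks a channel number.
local bus = newBus()
bus.serve("ping", function(payload) return "pong:" .. tostring(payload) end)
bus.serve("echo", function(payload) return payload end)

-- A fake event queue stands in for os.pullEvent so the sketch runs anywhere.
local events = {
  { channel = 10, packet = { service = "ping", payload = 1 } },
  { channel = 10, packet = { service = "echo", payload = "hi" } },
  { channel = 10, packet = { service = "unknown", payload = 0 } },
  { channel = 9,  packet = { service = "ping", payload = 2 } },
}

-- One loop drains every event and collects the replies that were produced.
local replies = {}
for _, ev in ipairs(events) do
  local reply = bus.dispatch(ev.channel, ev.packet)
  if reply ~= nil then replies[#replies + 1] = reply end
end
```

In this sketch, packets for an unregistered service and packets on any other channel are dropped silently; whether the real bus logs or replies with an error in those cases is not specified here.
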

## Consequences

- Adding a new networked service now takes two steps: write a `servers/foo.lua` that calls `net.serve('foo', handler)` and returns, then add it to a package's `autostart`. There is no channel allocation and no blocking `.start()` call.
- The router program returns immediately instead of blocking the shell. Users type `router` once on the chosen machine and continue using the shell.
- Label collisions are detected and rejected at registration time, with a clear warning, instead of causing silent duplicate delivery.
- The `net` API surface changed (`net.sendRequest` → `net.call`, `net.listenRequest` → `net.serve`). Out-of-tree consumers, if any existed, would need to migrate. Inside the repo, only ping needed migration.
- Programs that need to wait for events still work by calling `os.pullEvent` directly. However, if a program registers a long-lived handler on `_G.bootEventLoop` and then exits, the handler keeps firing with a stale closure. Programs should prefer `call`/`send` over `serve`/`listen`. This is documented in [`apis/net.lua`](../../apis/net.lua) but not enforced.
- Tests for the router state machine live in [`tests/router.lua`](../../tests/router.lua) and exercise [`apis/librouter.lua`](../../apis/librouter.lua) with an injected clock. Tests for the net packet shape and dispatch live in [`tests/net.lua`](../../tests/net.lua) with a fake modem.
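
To make the "TTL-based label map with an injected clock" idea concrete, here is a minimal pure-Lua sketch. It is an illustration under stated assumptions, not the contents of `apis/librouter.lua`: `newLabelMap`, `register`, `resolve`, and the 30-second TTL are all hypothetical. The clock is passed in as a function, so a test can advance time by hand instead of waiting.

```lua
-- Sketch only: a TTL label map with an injected clock.
-- newLabelMap, register, resolve, and the 30-second TTL are hypothetical;
-- the real apis/librouter.lua may differ.

local function newLabelMap(clock, ttl)
  local entries = {} -- label -> { id = number, expires = number }
  local map = {}

  -- Record that machine `id` owns `label`. Returns true on success, or
  -- false plus a reason when a different live machine already owns it.
  function map.register(id, label)
    local now = clock()
    local entry = entries[label]
    if entry and entry.expires > now and entry.id ~= id then
      return false, "duplicate label: " .. label
    end
    entries[label] = { id = id, expires = now + ttl }
    return true
  end

  -- Look up the machine id for a label, or nil if unknown or expired.
  function map.resolve(label)
    local entry = entries[label]
    if entry and entry.expires > clock() then
      return entry.id
    end
    return nil
  end

  return map
end

-- A fake clock that the example advances by hand.
local now = 0
local labels = newLabelMap(function() return now end, 30)

local ok1 = labels.register(1, "alpha")        -- first claim succeeds
local ok2, err2 = labels.register(2, "alpha")  -- live duplicate is rejected
now = 31                                       -- let the first entry expire
local expired = labels.resolve("alpha")        -- nil, because the TTL elapsed
local ok3 = labels.register(2, "alpha")        -- the label is free again
local owner = labels.resolve("alpha")          -- now owned by machine 2
```

Returning `false` plus a reason leaves the choice of how to warn (for example, printing) to the caller, which keeps the map itself free of I/O and easy to test.
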

## Out of Scope

- Multi-router topologies. The single-router assumption stays; a network is expected to run `router` on exactly one machine.
- Retry and acknowledgement primitives beyond the existing per-call `timeout`.
- Unifying `libtui`, `libai`, and `tuidemo` eventloops. They remain private; they are presentation/AI concerns, not network plumbing.
- The `ccpm` package manager. It is recent, tested, and not in pain.