# ADR 0002: Eventloop Substrate, Service Bus, and Async Discipline

## Status

Accepted

## Date

2026-06-07
## Context

ComputerCraft is event-driven. Direct `os.pullEvent` loops are easy to write but hard to compose when several things need to happen at the same time. Without a single substrate, the repo accumulated several distinct problems:

- Each long-lived process owned a private event loop, including the router (`programs/router.lua` was a hand-rolled `while true / os.pullEvent`). With N autostart servers, `parallel.waitForAll` ran N coroutines, each pumping an independent `os.pullEventRaw`. Events were broadcast to every coroutine, but only one would have a relevant handler, which was wasteful and conceptually awkward.
- `_G.isRouterEnabled` mutated send behavior across the codebase. [`apis/net.lua`](../../apis/net.lua) `sendRaw` switched its transmit path based on a global flag set by the router program, so the same function call meant different things depending on which machine ran it.
- Channel numbers leaked into every client. `servers/ping-server.lua` and `programs/ping.lua` both duplicated a `PING_CHANNEL` constant; there was no service registry. Adding a new service meant picking a free integer and replicating it on both ends.
- Label collision was a silent footgun. Two machines sharing the same label both accepted and rebroadcast packets addressed to that label, producing duplicate responses and duplicate retransmits.
- `os.sleep` looked innocent but broke the substrate. Its CC: Tweaked implementation yields via `os.pullEvent("timer")`. While the sleep is in flight, the enclosing eventloop's `os.pullEventRaw` is paused, non-`timer` events are silently discarded, and even `eventloop.setTimeout` callbacks scheduled before the sleep cannot fire until it returns. This bit `apis/libai.lua` `pollMessage`, which used a sleep-based throttle and froze the whole loop the moment a caller invoked it from inside a handler.

Net's blast radius at the time of the bus rewrite was small (only ping consumed it), so a clean break was cheaper than incremental patching.
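The `os.sleep` problem in the last bullet above can be reproduced without ComputerCraft. The following is a simplified model in plain Lua, not CC: Tweaked's actual code: a coroutine yields an event filter, and a hypothetical `deliver` scheduler only resumes it with events that pass the filter. The names `worker`, `deliver`, and `received` are invented for illustration.

```lua
-- Simplified model of a filtered pull dropping events; not CC: Tweaked's code.
local received = {}

-- Worker: records every event it sees. On "start" it performs a sleep-like
-- wait that only accepts "timer" events until one arrives.
local worker = coroutine.create(function()
  while true do
    local event, arg = coroutine.yield()      -- like os.pullEventRaw(): any event
    table.insert(received, event .. ":" .. tostring(arg))
    if event == "start" then
      repeat
        local name = coroutine.yield("timer") -- like os.pullEvent("timer")
      until name == "timer"
    end
  end
end)

-- Scheduler: resume the worker only when its current filter accepts the event.
local filter
local function deliver(event, arg)
  if filter == nil or filter == event then
    local _, nextFilter = coroutine.resume(worker, event, arg)
    filter = nextFilter
    return true
  end
  return false -- the worker never sees this event
end

-- Run the worker up to its first yield; it asks for any event (filter = nil).
local _, firstFilter = coroutine.resume(worker)
filter = firstFilter

local sawStart  = deliver("start", 1)          -- handled, worker starts "sleeping"
local sawFirst  = deliver("modem_message", "A") -- arrives mid-sleep: dropped
local sawTimer  = deliver("timer", 7)          -- ends the sleep
local sawSecond = deliver("modem_message", "B") -- handled normally again
```

In this model the worker ends up with only `start:1` and `modem_message:B` in `received`; message `A` is lost because it arrived while the sleep-like wait was filtering for `timer`.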
## Decision

### 1. Eventloop is the async substrate

New async code uses [`apis/eventloop.lua`](../../apis/eventloop.lua). Event handlers, timers, server listeners, and UI behavior compose through the eventloop instead of each feature owning its own blocking loop.

- Prefer `eventloop.register`, `setTimeout`, `onStart`, `onStop`, and `startLoop` for async behavior.
- APIs that listen for events accept an existing event loop as a constructor argument, the way [`apis/net.lua`](../../apis/net.lua) does. Do not create a private loop inside a module.
- Direct `os.pullEvent` loops should be rare and justified (CLI programs waiting for a single reply are the main exception).
- A handler that returns `api.STOP` auto-unregisters.
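The register-and-stop behavior in the list above can be sketched with a tiny stand-in written in plain Lua. This is not the real `apis/eventloop.lua`: `newLoop`, `dispatch`, and `count` are hypothetical names, the `register(event, fn)` argument order is an assumption, and the local `STOP` sentinel only plays the role of `api.STOP`.

```lua
-- Hypothetical stand-in for an eventloop; NOT the real apis/eventloop.lua.
local STOP = {} -- unique sentinel, playing the role of `api.STOP`

local function newLoop()
  local handlers = {} -- event name -> array of handler functions
  local loop = {}

  -- Register `fn` for `event` (assumed argument order).
  function loop.register(event, fn)
    handlers[event] = handlers[event] or {}
    table.insert(handlers[event], fn)
  end

  -- Deliver one event to every handler registered for it.
  -- Handlers that return STOP are dropped once this dispatch finishes.
  -- (Registering new handlers from inside a handler is not supported here.)
  function loop.dispatch(event, ...)
    local kept = {}
    for _, fn in ipairs(handlers[event] or {}) do
      if fn(...) ~= STOP then
        table.insert(kept, fn)
      end
    end
    handlers[event] = kept
  end

  -- Number of handlers currently registered for `event`.
  function loop.count(event)
    return #(handlers[event] or {})
  end

  return loop
end

local loop = newLoop()
local seen = {}

-- One-shot handler: reacts to the first "key" event, then unregisters itself.
loop.register("key", function(code)
  table.insert(seen, "once:" .. code)
  return STOP
end)

-- Long-lived handler: stays registered because it never returns STOP.
loop.register("key", function(code)
  table.insert(seen, "always:" .. code)
end)

loop.dispatch("key", 65)
loop.dispatch("key", 66)
```

After the two dispatches, the one-shot handler has run once and the long-lived handler twice, leaving a single handler registered for `key`.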
### 2. One boot eventloop and a service-name bus

`startup/servers.lua` creates a single `createEventLoop()` instance, stores it at `_G.bootEventLoop`, runs autostart server files (which register handlers and return without blocking), then runs `parallel.waitForAny(shellFn, eventLoopFn)`. The shell and the eventloop are the only two coroutines.

[`apis/net.lua`](../../apis/net.lua) exposes a service-name bus on a single channel:

- `net.serve(name, handler)`: register a server handler (server-side).
- `net.call(name, payload, opts)`: request/response with timeout (client-side).
- `net.send(name, payload, opts)`: fire-and-forget (client-side).
- `net.listen(name, handler)`: passive listener.

All traffic flows on channel `10` and is demultiplexed inside the packet body via a `service` field. Channel numbers stop being a public concept. `require('/apis/net')()` returns a singleton bound to `_G.bootEventLoop` when present, otherwise an ephemeral instance. CLI programs stay standalone: `net.call` internally uses `os.pullEvent` with a timer, so programs do not need the boot eventloop to receive a response.

[`programs/router.lua`](../../programs/router.lua) registers handlers on the same boot eventloop everything else uses. It owns a TTL-based label map extracted into [`apis/librouter.lua`](../../apis/librouter.lua) for testability. Machines with a label autostart [`servers/net-registrar.lua`](../../servers/net-registrar.lua), which periodically broadcasts `(id, label)` so the router can resolve label-addressed packets. Duplicate label registrations are rejected with a printed warning. `_G.isRouterEnabled` is gone; the router service flips a local flag via `net.setRouter(true)` instead.
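To make the single-channel demultiplexing concrete, here is a minimal in-memory model in plain Lua. It is not `apis/net.lua`: there is no modem, `newBus` and `transmit` are hypothetical, the handler receiving only the payload is an assumption, and returning `nil, "timeout"` for an unanswered call is also an assumption about how a failed `net.call` might look.

```lua
-- Hypothetical in-memory model of a service-name bus; NOT apis/net.lua.
local CHANNEL = 10 -- the single shared channel named in this ADR

local function newBus()
  local servers = {}        -- service name -> handler
  local bus = { log = {} }  -- log records "channel:service" per packet

  -- Mirrors the shape of `net.serve(name, handler)`.
  function bus.serve(name, handler)
    servers[name] = handler
  end

  -- Stand-in for the modem: every packet travels on CHANNEL and is routed
  -- to a handler by the `service` field in its body.
  local function transmit(packet)
    table.insert(bus.log, packet.channel .. ":" .. packet.service)
    local handler = servers[packet.service]
    if handler == nil then
      return nil -- nobody serves this name
    end
    return handler(packet.payload)
  end

  -- Mirrors the shape of `net.call(name, payload, opts)`; `opts` is unused
  -- because this model answers synchronously.
  function bus.call(name, payload, opts)
    local reply = transmit({ channel = CHANNEL, service = name, payload = payload })
    if reply == nil then
      return nil, "timeout" -- assumed failure convention
    end
    return reply
  end

  return bus
end

local bus = newBus()
bus.serve("ping", function(payload) return "pong:" .. payload end)
bus.serve("time", function() return 1234 end)

local pong = bus.call("ping", "hello")
local clock = bus.call("time")
local missing, err = bus.call("nope", nil, { timeout = 1 })
```

All three calls are logged on channel `10`; only the `service` field differs, which is why clients never need to know a channel number.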
### 3. `os.sleep` discipline

In library, server, and program code that may run inside an eventloop (directly or transitively), use `eventloop.setTimeout` for any waiting, throttling, polling, or retry-with-delay. Libraries that need to wait must take an eventloop factory through their constructor rather than hardcoding a sleep call. [`apis/net.lua`](../../apis/net.lua) `sendRequest` is the canonical private-eventloop pattern: create a private eventloop, schedule the wait through `setTimeout`, then `runLoop` until the work resolves. The call is synchronous from the caller's perspective, but the dispatcher stays alive internally, so handlers can compose around it via `parallel.waitForAll`.

`os.sleep` remains acceptable only in narrow cases:

1. One-shot programs that are purely sequential and register no event handlers, such as a `programs/foo.lua` that prints, sleeps, prints again, and exits.
2. `parallel.waitForAny(task, function() sleep(t); end)` used as an isolated guard to bound an inner task (e.g. the AI Lua-exec sandbox in `apis/libai.lua` and the `parallel.waitForAny`-driven per-case timer in `apis/libtest.lua`). The guard sleep is private to its own coroutine group; it does not block anything external.
3. Tests that are themselves driven by `libtest`'s per-case timeout (see [ADR-0007](adr-0007-test-framework.md)).

New code must not expose a `sleep` injection point on its constructor. If a wait is needed, accept an `eventloop` factory and schedule through `setTimeout`. Tests substitute a synchronous deterministic eventloop fake the same way they substitute `http` or `settings`.
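A rough sketch of this discipline, in plain Lua, pairs a poller that takes an eventloop factory with a synchronous fake that tests can step by hand. Everything here is hypothetical: `newFakeLoop`, `advance`, and `newPoller` are invented names, and the `setTimeout(fn, delay)` argument order is an assumption rather than the documented signature of `apis/eventloop.lua`.

```lua
-- Hypothetical synchronous eventloop double for tests; names are assumptions.
local function newFakeLoop()
  local loop = { now = 0, pending = {} }

  -- Assumed shape: setTimeout(fn, delaySeconds).
  function loop.setTimeout(fn, delay)
    table.insert(loop.pending, { at = loop.now + delay, fn = fn })
  end

  -- Advance virtual time, firing due timers in chronological order.
  function loop.advance(seconds)
    local deadline = loop.now + seconds
    while true do
      local best
      for i, timer in ipairs(loop.pending) do
        if timer.at <= deadline and (best == nil or timer.at < loop.pending[best].at) then
          best = i
        end
      end
      if best == nil then break end
      local timer = table.remove(loop.pending, best)
      loop.now = timer.at
      timer.fn()
    end
    loop.now = deadline
  end

  return loop
end

-- Library code: takes an eventloop factory instead of calling sleep.
local function newPoller(createEventLoop, check, interval, onDone)
  local eventloop = createEventLoop()
  local poller = { attempts = 0 }

  local function attempt()
    poller.attempts = poller.attempts + 1
    local result = check()
    if result ~= nil then
      onDone(result)
    else
      eventloop.setTimeout(attempt, interval) -- retry later without blocking
    end
  end

  function poller.start()
    eventloop.setTimeout(attempt, 0)
  end

  return poller
end

-- Test-style usage: the check succeeds on its third call.
local fake = newFakeLoop()
local calls, result = 0, nil
local function check()
  calls = calls + 1
  if calls >= 3 then return "ready" end
end

local poller = newPoller(function() return fake end, check, 0.5, function(v) result = v end)
poller.start()
fake.advance(0)                     -- first attempt runs at t = 0
local pendingAfterFirst = #fake.pending
fake.advance(1)                     -- retries at t = 0.5 and t = 1.0
```

Because the fake exposes `pending`, a test can also assert that no timers are left scheduled once the poller has finished, which a plain `sleep` stub cannot show.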
## Consequences

- Adding a new networked service is now: write a `servers/foo.lua` that calls `net.serve('foo', handler)` and returns, then add it to a package's `autostart`. No channel allocation, no `.start()` blocking call.
- The router program returns immediately instead of blocking the shell. Users type `router` once on the chosen machine and continue using the shell.
- Label collisions are detected and rejected at registration time, with a clear warning, instead of causing silent duplicate delivery.
- A router must still be running somewhere on the network for cross-machine label-addressed packets; without one, non-router senders produce packets with `routerId = nil` and consumers drop them on receive.
- Programs that need to wait for events still work by direct `os.pullEvent`, but if a program registers a long-lived handler on `_G.bootEventLoop` and exits, the handler keeps firing with a stale closure. Programs should prefer `call`/`send` over `serve`/`listen`. This is documented in [`apis/net.lua`](../../apis/net.lua) but not enforced.
- Tests for the router state machine live in [`tests/router.lua`](../../tests/router.lua) and exercise [`apis/librouter.lua`](../../apis/librouter.lua) with an injected clock. Tests for the net packet shape and dispatch live in [`tests/net.lua`](../../tests/net.lua) with a fake modem.
- Slightly more ceremony in "synchronous-looking" library functions that wait: a private eventloop plus a small `attempt`/`finish` pair. The benefit is clean composition with any caller's eventloop.
- Test fakes shift from a `sleep` stub to a synchronous eventloop double. Ergonomics are comparable; the eventloop fake additionally lets tests observe `pending` and `stopped` state, catching leaks the sleep stub would have missed.
- Existing call sites are migrated opportunistically when they cause observable bugs. The first `os.sleep` migration is `apis/libai.lua`.
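The label-collision and injected-clock consequences above can be illustrated with a small TTL map in plain Lua. This is a sketch under assumptions, not the contents of `apis/librouter.lua`: `newLabelMap`, `register`, and `resolve` are hypothetical names, the 30-second TTL is arbitrary, the rule that an expired label may be claimed by another machine is assumed, and the sketch returns a reason string where the real router is described as printing a warning.

```lua
-- Hypothetical TTL label map; NOT the real apis/librouter.lua.
local function newLabelMap(clock, ttl)
  local entries = {} -- label -> { id = number, expires = number }
  local map = {}

  -- Record that machine `id` owns `label`. Returns false plus a reason when
  -- a different machine holds the label and its entry has not expired.
  function map.register(id, label)
    local now = clock()
    local entry = entries[label]
    if entry and entry.id ~= id and entry.expires > now then
      return false, "label '" .. label .. "' already registered by " .. entry.id
    end
    entries[label] = { id = id, expires = now + ttl }
    return true
  end

  -- Resolve a label to a machine id, or nil when unknown or expired.
  function map.resolve(label)
    local entry = entries[label]
    if entry and entry.expires > clock() then
      return entry.id
    end
    return nil
  end

  return map
end

-- The clock is injected, so a test controls time by changing one variable.
local now = 0
local map = newLabelMap(function() return now end, 30)

local first = map.register(1, "turtle")
local second, reason = map.register(2, "turtle") -- collision: id 1 is still live
local owner = map.resolve("turtle")

now = 31                                         -- let the entry age out
local expired = map.resolve("turtle")
local third = map.register(2, "turtle")          -- label is free again
local newOwner = map.resolve("turtle")
```

Injecting the clock keeps the test deterministic: expiry is exercised by assigning to `now` rather than by waiting.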
## Out of Scope

- Multi-router topologies. The single-router assumption stays; a network is expected to run `router` on exactly one machine.
- Retry and acknowledgement primitives beyond the existing per-call `timeout`.
- Unifying `libtui`, `libai`, and `tuidemo` eventloops. They remain private; they are presentation/AI concerns, not network plumbing.