TALK
THU MAY 7 • 4:15 PM - 5:00 PM
NOVA (LEVEL 2)
From Nothing to Everything: One Single Alert to Managing Datacenters with AI
Every platform starts somewhere. At Crusoe, it started with a single Temporal workflow called GpuFellOffTheBus — a scrappy alert handler that fired when a GPU dropped off the PCIe bus, opened a JIRA ticket, and posted to Slack. It worked. So we built another one. Then another.
Years later, Crusoe operates the Lifecycle Workflow: a persistent, never-terminated workflow that supervises every server in our fleet from the moment it’s racked to the moment a customer workload lands on it. Lifecycle orchestrates dozens of functional child workflows — hardware validation, firmware updates, networking configuration, health checks, and AI-assisted remediation decisions — all as a single durable state engine that outlives every other process in the system.
This talk is the story of that evolution: what we built first, what broke, what we learned, and how we arrived at a model where a single alert can now trigger an autonomous chain that manages a datacenter at scale. If you’re early in your Temporal journey or wondering how far the rabbit hole goes, this is the map.