Jeff Ferland
TITLE:Staff Site Reliability Engineer
COMPANY:Crusoe
STATUS UPDATE
Jeff Ferland is a Staff Software Engineer at Crusoe. He has been managing petabytes and moving terabits for over a decade in the SF Bay Area. He now leads infrastructure automation at Crusoe.
TALK
From Nothing to Everything: One Single Alert to Managing Datacenters with AI
ABSTRACT
Every platform starts somewhere. At Crusoe, it started with a single Temporal workflow called GpuFellOffTheBus — a scrappy alert handler that fired when a GPU dropped off the PCIe bus, opened a JIRA ticket, and posted to Slack. It worked. So we built another one. Then another.
Years later, Crusoe operates the Lifecycle Workflow: a persistent, never-terminated workflow that supervises every server in our fleet from the moment it’s racked to the moment a customer workload lands on it. Lifecycle orchestrates dozens of functional child workflows — hardware validation, firmware updates, networking configuration, health checks, and AI-assisted remediation decisions — all as a single durable state engine that outlives every other process in the system.
This talk is the story of that evolution: what we built first, what broke, what we learned, and how we arrived at a model where a single alert can now trigger an autonomous chain that manages a datacenter at scale. If you’re early in your Temporal journey or wondering how far the rabbit hole goes, this is the map.