Resiliency in Asynchronous and Distributed Systems

Speakers

Jeremy Miller

I hate to break this to you, but your distributed systems are going to experience errors at runtime. External systems you depend on will be down, internal subsystems may be distressed, databases might be overloaded, and any operation that goes over a network is vulnerable to hiccups. You should also assume that your own system's code will occasionally fail on some unforeseen permutation of inputs or system state. A reasonable goal is a system that's resilient in the face of errors and takes the right error-handling actions to keep from losing in-flight work and ending up in an inconsistent state. Ideally, our systems can do this without human intervention or downtime. In this talk, we'll run through different types of runtime errors and match them with useful exception handling policies. We'll also look at when and where a transactional outbox and inbox make sense within your system architecture. I'll be using Wolverine as the messaging framework for the samples, but the conceptual approaches should transfer to any other robust messaging tooling.