Chaos Engineering your .NET applications using Simmy

Posted on Friday, 03 Jan 2020

One package I've been using with great success recently is Simmy, so much so that I feel it deserves its very own blog post.

What is Simmy and why should you use it?

Simmy is a fault-injection library that integrates with Polly, the popular .NET transient-fault-handling library. Its name comes from the Simian Army toolset, a suite of tools created by engineers at Netflix who recognised that designing a fault-tolerant architecture wasn't enough - you have to exercise it, normalising failure to ensure your system can handle it when it inevitably happens.

This idea isn't new. Experts in the resiliency engineering field such as Dr. Richard Cook have published multiple papers on this topic whilst studying high-risk sectors (such as traffic control and health care). He summarises failure quite nicely in his Working at the Center of the Cyclone talk, where he says:

"You build things differently when you expect them to fail. Failure is normal, the failed state is the normal state".

Dr. Richard Cook (amongst others in the resiliency engineering industry) proposes that in a complex system, failure is the normal state and it is humans who build mechanisms (sociological, organisational or technical) to ensure systems continue to operate. Without human intervention, failure will happen.

Dr. Richard Cook's paper How Complex Systems Fail goes further by saying:

"The high consequences of failure lead over time to the construction of multiple layers of defence against failure. These defences include obvious technical components (e.g. backup systems, ‘safety’ features of equipment) and human components (e.g. training, knowledge) but also a variety of organisational, institutional, and regulatory defences (e.g. policies and procedures, certification, work rules, team training). The effect of these measures is to provide a series of shields that normally divert operations away from accidents."

This shift in perspective is one of the tenets of Chaos Engineering and why Netflix built the Simian Army - to regularly introduce failure in a safe, controlled manner to force their engineers to consider and handle those failures as part of regular, everyday work. So when those failures do happen, they won't even notice.

With that in mind let's take a look at how we can use Simmy to regularly test our transient fault-handling mechanisms such as timeouts, circuit breakers and graceful degradations.

Simmy in Action

If you're familiar with Polly then Simmy won't take long to pick up as Simmy's fault injection behaviours are Polly policies, so everything fits together nicely and instantly feels familiar.

Let's take a look at Simmy's current failure modes.

Simmy's Failure Modes

At the time of writing Simmy offers the following types of failure injection policies:

Fault Policy

A fault policy can inject exceptions, or substitute results. This gives you the ability to control the type of result that can be returned.

For instance, the following example causes the chaos policy to throw SocketException with a probability of 5% when enabled.

var fault = new SocketException(errorCode: 10013);

var policy = MonkeyPolicy.InjectFault(
	fault, 
	injectionRate: 0.05, 
	enabled: () => isEnabled()  
	);
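To make the effect of a fault policy concrete, here is a minimal sketch of a chaos policy wrapped by a Polly fallback policy, so the injected SocketException is turned into a degraded response. It assumes the Polly and Polly.Contrib.Simmy NuGet packages; CallRemoteService and the "cached response" value are hypothetical stand-ins, and the injection rate is set to 1.0 purely so the outcome is deterministic.

```csharp
using System;
using System.Net.Sockets;
using Polly;
using Polly.Contrib.Simmy;

public static class FaultPolicyExample
{
    // Hypothetical stand-in for a real outbound call.
    private static string CallRemoteService() => "live response";

    public static string GetResponse()
    {
        // Always inject the fault so the example is deterministic;
        // a real configuration would use a small rate such as 0.05.
        var chaosPolicy = MonkeyPolicy.InjectFault(
            new SocketException(errorCode: 10013),
            injectionRate: 1.0,
            enabled: () => true);

        // Graceful degradation: substitute a fallback value when the call fails.
        var fallbackPolicy = Policy<string>
            .Handle<SocketException>()
            .Fallback("cached response");

        // The fallback is the outer policy and the chaos policy the inner one,
        // so the injected exception surfaces exactly where a real one would.
        return fallbackPolicy
            .Wrap(chaosPolicy)
            .Execute(() => CallRemoteService());
    }
}
```

With the fault always injected, GetResponse never reaches CallRemoteService and the fallback value is returned instead.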

Latency Policy

Like Netflix's Latency Monkey, a latency policy enables you to inject latency into executions such as remote calls before the calls are made.

var policy = MonkeyPolicy.InjectLatency(
	latency: TimeSpan.FromSeconds(5),
	injectionRate: 0.1,
	enabled: () => isEnabled()
	);

Behaviour Policy

Simmy's Behaviour Policy enables you to invoke any custom behaviour within your system (such as restarting a VM, or executing a custom call or script) before a call is placed.

For instance:

var policy = MonkeyPolicy.InjectBehaviour(
	behaviour: () => KillDockerContainer(), 
	injectionRate: 0.05,
	enabled: () => isEnabled()
	);

A trivial example

Once we've defined our chaos policy we can use the standard Polly APIs to wrap our existing Polly policies.

var latencyPolicy = MonkeyPolicy.InjectLatency(
	latency: TimeSpan.FromSeconds(5),
	injectionRate: 0.5,
	enabled: () => isEnabled()
	);

...

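// Policy.Wrap composes synchronous policies (outermost first); use Policy.WrapAsync for async policies.
PolicyWrap policies = Policy.Wrap(timeoutPolicy, latencyPolicy);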

In the example above we introduce 5 seconds of latency in 50% of our calls which, depending on our timeout policy, will force timeouts within our application, making us more aware of how our application handles them.

Registering the Simmy policy as the innermost policy means it'll be invoked just before the outbound call.
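For a fuller picture, the following sketch wires a latency policy inside a one-second timeout and checks that the call is rejected. It assumes the synchronous Polly and Simmy APIs and a pessimistic timeout strategy (my choice here, so the timeout fires without the delegate having to observe a cancellation token); the empty delegate stands in for the real outbound call, and the injection rate of 1.0 is only there to make the result predictable.

```csharp
using System;
using Polly;
using Polly.Contrib.Simmy;
using Polly.Timeout;
using Polly.Wrap;

public static class LatencyTimeoutExample
{
    public static bool CallTimesOut()
    {
        // Pessimistic strategy: Polly stops waiting after one second even
        // though the wrapped delegate is still running.
        var timeoutPolicy = Policy.Timeout(
            TimeSpan.FromSeconds(1),
            TimeoutStrategy.Pessimistic);

        // Always inject 5 seconds of latency for this demonstration.
        var latencyPolicy = MonkeyPolicy.InjectLatency(
            latency: TimeSpan.FromSeconds(5),
            injectionRate: 1.0,
            enabled: () => true);

        // Timeout is the outer policy, latency injection the innermost one.
        PolicyWrap policies = Policy.Wrap(timeoutPolicy, latencyPolicy);

        try
        {
            policies.Execute(() => { /* the real outbound call would go here */ });
            return false;
        }
        catch (TimeoutRejectedException)
        {
            // The injected latency exceeded the timeout.
            return true;
        }
    }
}
```

Because the injected delay is longer than the timeout, the wrapped call should surface a TimeoutRejectedException, which is the behaviour the application then has to cope with.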

Using Polly registries

When implementing Simmy's chaos policies you can wrap each policy by hand as shown above, but you'll probably want to introduce them with as little change to your existing policy code as possible. This is why I'd recommend using Polly's PolicyRegistry type: you configure your policies as usual, then easily wrap them with your Simmy policies:

// Startup.cs

var policyRegistry = services.AddPolicyRegistry();
policyRegistry["LatencyPolicy"] = GetLatencyPolicy();

...

if (env.IsDevelopment())
{
    // Wrap every policy in the policy registry in Simmy chaos injectors.
    // (AddChaosInjectors is assumed to be an extension method defined elsewhere in the application.)
    var registry = app.ApplicationServices.GetRequiredService<IPolicyRegistry<string>>();
    registry?.AddChaosInjectors();
}
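The body of such an extension method isn't shown here, so the following is only a sketch of one way it might look, not the actual AddChaosInjectors implementation. The name AddLatencyChaos, its parameters and the explicit list of keys are all assumptions made for illustration, and it only handles synchronous, non-generic policies.

```csharp
using System;
using Polly;
using Polly.Contrib.Simmy;
using Polly.Registry;

public static class ChaosRegistryExtensions
{
    // Hypothetical helper: wraps the named registry entries with a latency chaos policy.
    public static IPolicyRegistry<string> AddLatencyChaos(
        this IPolicyRegistry<string> registry,
        Func<bool> enabled,
        params string[] keys)
    {
        var chaosPolicy = MonkeyPolicy.InjectLatency(
            latency: TimeSpan.FromSeconds(5),
            injectionRate: 0.1,
            enabled: enabled);

        foreach (var key in keys)
        {
            // Skip unknown keys and anything that is not a synchronous, non-generic policy.
            if (registry.TryGet<IsPolicy>(key, out var existing)
                && existing is ISyncPolicy syncPolicy)
            {
                // Existing policy stays outermost; chaos runs just before the call.
                registry[key] = Policy.Wrap(syncPolicy, chaosPolicy);
            }
        }

        return registry;
    }
}
```

A call such as registry.AddLatencyChaos(() => isEnabled(), "LatencyPolicy") would then replace the registered policy with a wrapped version, leaving the code that resolves it from the registry untouched.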

Testing failure modes within integration or end-to-end style tests

Prior to learning about Simmy I used to test Polly policies via unit tests, and in a lot of cases this is fine. However, when you want to test these policies as part of an integration or end-to-end style test, to understand how your application handles slow-running requests or timeouts, things can get a little trickier, as it can be hard to replicate these conditions under automated test.

Using Simmy's chaos policies, we're now able to invoke those behaviours from the outside, which is where Simmy really shines.
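One way to drive this from a test is through the enabled delegate, which is checked on each execution. The sketch below is an assumption-laden illustration rather than a recipe: it uses a plain boolean flag where a real test would more likely flip a configuration value or feature toggle, and it injects a TimeoutException at a rate of 1.0 so both outcomes are predictable.

```csharp
using System;
using Polly;
using Polly.Contrib.Simmy;

public static class ChaosToggleExample
{
    public static (bool withoutChaos, bool withChaos) Run()
    {
        // Stand-in for a setting the test can flip from the outside.
        bool chaosEnabled = false;

        var chaosPolicy = MonkeyPolicy.InjectFault(
            new TimeoutException("injected by test"),
            injectionRate: 1.0,
            enabled: () => chaosEnabled);

        // Returns true when the wrapped call completes without the injected fault.
        bool CallSucceeds()
        {
            try
            {
                chaosPolicy.Execute(() => { /* the real call would go here */ });
                return true;
            }
            catch (TimeoutException)
            {
                return false;
            }
        }

        bool withoutChaos = CallSucceeds(); // chaos off: the call goes through
        chaosEnabled = true;
        bool withChaos = CallSucceeds();    // chaos on: the fault is injected

        return (withoutChaos, withChaos);
    }
}
```

The same policy instance behaves normally until the flag is switched, which lets a single test cover both the happy path and the failure path.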

Taking it further

Regularly and reliably testing some of the fault-tolerance patterns in a real world environment (such as staging, or dare I say it - Production!) can be challenging. If you're on a platform such as Kubernetes then introducing failure into your system can be trivial, as there are plenty of tools that can do this for you. For anyone else this can be a challenge. Simmy has opened the door to make this so much easier, so much so that at one point we had it enabled in our staging environment to introduce a healthy amount of failure into our application.

As Adrian Cockcroft said in his Managing Failure Modes in Microservice Architectures talk:

"If we change the name from chaos engineering to continuous resilience, will you let us do it all the time in production?"

Wrapping it up

Hopefully this post has demonstrated the value in Simmy. Chaos Engineering is a really interesting area and it's great to see tools like this appearing to support running experiments in your .NET applications. Hopefully over the coming years we'll start to see more tooling emerge, enabling us to introduce failure scenarios into our applications as part of regular work.
