CodeDeploy – How many instances do I need?

Following up on our previous discovery of the mind f*** of galactic proportions that is custom pass percentages for FLEET_PERCENT, the next question is: "can we compute it?"

I’ve come up with a simple C# script that does just that: for each percentage, it finds the smallest fleet size that still leaves one instance free to take offline.

// For each FLEET_PERCENT value, find the smallest fleet size where the
// rounded-up healthy-host requirement still leaves room to deploy.
for (var i = 1; i < 100; i++)
{
    var percent = i / 100.0;
    for (var x = 1; x <= 100; x++)
    {
        // Minimum healthy hosts CodeDeploy demands for a fleet of x instances.
        var roundUp = Math.Ceiling(percent * x);

        // If every instance must stay healthy, the deployment can never start.
        if (!(roundUp < x))
            continue;

        Console.WriteLine($"i : {i} \t criteria : {x - 1} \t min : {x}");
        break;
    }
}


Here’s the result:

% Range  | Success Criteria | Min Instances Required
1 – 50   | 1                | 2
51 – 66  | 2                | 3
67 – 75  | 3                | 4
76 – 80  | 4                | 5
81 – 83  | 5                | 6
84 – 85  | 6                | 7
86 – 87  | 7                | 8
88       | 8                | 9
89 – 90  | 9                | 10
91       | 11               | 12
92       | 12               | 13
93       | 14               | 15
94       | 16               | 17
95       | 19               | 20
96       | 24               | 25
97       | 33               | 34
98       | 49               | 50
99       | 99               | 100

What does this mean? If you have to use CodeDeploy and you have to use a custom deployment configuration, use an absolute value (HOST_COUNT) rather than FLEET_PERCENT.

It works a lot better, because you know exactly how many healthy hosts you are asking for.
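To make that concrete, here is a rough sketch of creating a custom deployment config with an absolute host count via the AWS SDK for .NET. The config name is made up, and the method and property names are written from memory, so double-check them against the SDK documentation before relying on this.

using Amazon.CodeDeploy;
using Amazon.CodeDeploy.Model;

// Sketch: a custom deployment config that keeps an absolute number of hosts
// healthy (HOST_COUNT) instead of a fleet percentage.
var client = new AmazonCodeDeployClient();

client.CreateDeploymentConfig(new CreateDeploymentConfigRequest
{
    DeploymentConfigName = "Custom.KeepTwoHealthy", // hypothetical name
    MinimumHealthyHosts = new MinimumHealthyHosts
    {
        Type = MinimumHealthyHostsType.HOST_COUNT,
        Value = 2 // at least 2 instances must stay healthy during a deployment
    }
});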


AWS CodeDeploy Deployment Configuration

Let’s talk AWS CodeDeploy Deployment Configuration. If you’ve ever tried to deploy to AWS using CodeDeploy, you’ll have noticed the option to change the “deployment configuration”. This attribute specifies how CodeDeploy rolls out your application and the pass criteria for the deployment.

The three built-in options are:

  • OneAtATime (default)
  • AllAtOnce
  • HalfAtATime

[Screenshot: the built-in AWS CodeDeploy deployment configurations]

In most cases, OneAtATime works just fine. It deploys your application to one server at a time and ensures that all* of your installations are successful. (*I haven’t tested this myself, but with, say, 100 instances it probably accepts that one instance can fail.)

However, most of us want to do deployments at scale, and speed becomes rather important. If you have 20 instances to install to in a blue/green approach, your deployment time grows linearly as n * t, where n is the number of servers and t is the time to deploy to a single instance (20 instances at, say, 10 minutes each is over three hours). In a nutshell, it’s slow as hell in an enterprise environment.

So, how about we deploy AllAtOnce? It solves the problem of scale: it triggers all deployments at once, so the total time is only as slow as the slowest instance installation. But a closer look shows the minimum required healthy hosts is 0. That means CodeDeploy will classify your deployment as successful as long as at least one instance installation succeeds. Worst case scenario, 999 of your 1,000 instances fail on installation and CodeDeploy still thinks the deployment is good.

This has many fatal consequences, chief among them that CodeDeploy keeps rolling out the last deployment that “passed”, which in this case means it will deploy a poisonous installation to every instance that scales up, causing even more havoc.

Okay, maybe let’s take the halfway house, where half the instances have to pass, with HalfAtATime? If it installs successfully on half of the instances it should be fairly good anyway. In theory this works well: in most cases your installation will either fail or pass on most instances, with a few rogue ones going south for unrelated reasons. At most you double your installation time, which is very much in the acceptable range.

Not so fast: what happens when you have only one instance? What is half of one? You’d think it’s simple, it should just need to pass on that one instance … right? You’d be wrong; it doesn’t even try.

Deployment config requires keeping a minimum of 1 hosts healthy, but only 1 hosts in deployment

So HalfAtATime blatantly does not work if you’re scaled down to one instance (perhaps around midnight when traffic is low). Again, unusable.

Helpfully, Amazon lets you specify your own deployment config. But hold on, once you look into the details of the request, you’ll find this:

{
  "value": integer,
  "type": "HOST_COUNT"|"FLEET_PERCENT"
}

Wait, you can only specify the minimum healthy host count or fleet percent as a parameter, and AWS then over-helpfully figures out the deployment strategy from it? Amazon always works from that figure, rounded down first (or so I thought; see the update below).

Okay, so what happens when you specify FLEET_PERCENT as 100?

A client error (InvalidMinimumHealthyHostValueException) occurred when calling the CreateDeploymentConfig operation: The value for the minimum healthy hosts with type of FLEET_PERCENT should be positive and less than 100

Damn, how about 99%?

It’ll work fine, until you deploy …

Deployment config requires keeping a minimum of 2 hosts healthy, but only 2 hosts in deployment.

This alternative is worse than HalfAtATime.

Oh, so there are NO good deployment configurations possible. Cheers, Amazon!

Is there a moral to this story? Consider all your options before diving into CodeDeploy. This is just one of a long list of flaws associated with it.

Update (12/01/2016):

I’ve gotten a definitive reply from AWS technical support and here’s what they said:

Hello,

Thanks for your patience, I have got clarification on how FLEET_PERCENT works and it is actually rounded up to the next integer, so you were right if FLEET_PERCENT is set to 99% then 100 instances are needed. However if you use 95% the deployment will succeed just with 20 and 90% with 10. Essentially values above 90% increase the minimum number of instances very fast, but with numbers up to 90% the minimum number of instances stays at relatively moderate numbers of instances.

I did miss this on my previous message but this is actually mentioned on our public documentation[1], please accept my apologies for the misleading comments I provided in my previous message.

Please let us know if there is anything else we can help you with.

References:

[1] http://docs.aws.amazon.com/codedeploy/latest/userguide/host-health.html#host-health-minimum

Sounds a bit iffy. I am not sure if I’d ever design a system with this sort of qualifier but there you go.
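For what it’s worth, plugging the reply’s numbers into the same ceiling rule does check out. A quick sketch, just for illustration (modern C# top-level statements):

using System;

// A fleet of n instances can deploy only if rounding up the healthy-host
// requirement still leaves at least one instance free to be taken offline.
bool CanDeploy(double percent, int n) => Math.Ceiling(percent * n) < n;

Console.WriteLine(CanDeploy(0.99, 100)); // True (99 instances would not be enough)
Console.WriteLine(CanDeploy(0.95, 20));  // True (19 would not be enough)
Console.WriteLine(CanDeploy(0.90, 10));  // True (9 would not be enough)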

update your aws sdk to v3 now

Another week, another AWS-related topic. We were looking for a recently released AWS API feature in our .NET library, and it turned out that the feature was published and documented under v3 of the SDK but not v2.

Further investigation revealed that v2 hasn’t been updated since September, which is rather worrying given that there hasn’t been any announcement about its deprecation.

The obvious problem with porting to another major version is, well, all the same problems you have with any major version upgrade. I’ve done a quick check and found that v2 and v3 are definitely not compatible with each other (all your dependencies have to be on one or the other, or you’ll get a lot of compilation confusion).

So, time to upgrade to v3 and deal with the breaking changes! If you’re wondering what to look out for, here’s Amazon’s official list.


update:

The Amazon .NET team has since come back with a very quick turnaround on Twitter, and we now have the much-needed feature on v2 as well (thanks, guys!).


more fun with multi tenancy

When deciding to go multi-tenant in the cloud, please do not put services that depend on each other in the same auto scaling group. It’s just a recipe for disaster.

So, if you’ve decided, against all my advice and reasoning, to go multi-tenant anyway, then for the love of God don’t put a service that relies on another in the same auto scaling group.

The reason for this is simple: CodeDeploy does not deploy applications in any particular order (nor should it) on a newly spun-up instance. Now, if you’ve done the sensible thing and connected them all via a load-balanced DNS name, then you’ll be safe(r).

Earlier this morning I was investigating an instance stuck in a perpetual recycle (scale up, kill itself, scale again), thinking it was the classic five-minute multi-tenancy timeout issue. Lo and behold, it was an auto scaling group of one (it’s a test environment), and as part of our installation script we check that the service can talk to all its dependencies. Of course, by that point the dependency had been destroyed and, by chance, had not spun up again “yet”. That’s several thousand pounds’ worth of lesson.

I’m anticipating questions about containers. Yes, containers would resolve a lot of the issues we’re facing. Unfortunately we’re a .NET shop, and containers there are still extremely immature. Furthermore, we’re talking about a mixture of new and very, very old components (some with a shelf life of over a decade and a half).

So kids, moral of the story? Don’t do multi-tenancy. If you must, then read all my whinging about it so you know what NOT to do.


NLog Target for Amazon SNS

I’ve recently been working on a small extension to NLog that publishes your log messages to Amazon SNS. The requirements were rather unique, as I needed it to resolve a log aggregation issue whilst getting around all the problems related to log shipping (memory!) and disconnected workers. The inspiration came about because the application that consumes this extension runs on an EC2 instance, which makes permission management extremely convenient.

The package supports various credential configurations, from stored profiles (for local development) to instance profiles (the preferred approach), which take advantage of the server’s IAM role to publish messages to Amazon SNS.


<target xsi:type="SNS"
        name="s"
        RegionEndpoint="eu-west-1"
        Topic="{your-topic}"
        AmazonCredentialType="Amazon.Runtime.InstanceProfileAWSCredentials, AWSSDK"
        layout="${message}" />
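
Once that target is wired up in your NLog.config (with a rule routing to it), usage from code is just standard NLog; nothing below is specific to this package, and the class is only an example:

using NLog;

public class Worker
{
    // Standard NLog logger; messages routed to the "s" target above end up
    // published to the configured SNS topic.
    private static readonly Logger Logger = LogManager.GetCurrentClassLogger();

    public void DoWork()
    {
        Logger.Info("Work item processed");
    }
}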

You have the option of specifying an account number should you want to do some cross-account magic, but if no account number is specified it assumes you want to publish to the topic in the same account.

You may also notice that the credential type looks a bit weird. It’s actually an assembly-qualified class name, and I use reflection to create the credential object from it. There’s an issue open to give it a short name and stop using reflection and activators, which may have been fixed by the time you read this!
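
For the curious, the reflection looks roughly like this; a simplified sketch rather than the package’s exact code:

using System;
using Amazon.Runtime;

// Resolve the credential type from its assembly-qualified name and let the
// activator build it, then hand it to the SNS client.
var typeName = "Amazon.Runtime.InstanceProfileAWSCredentials, AWSSDK";
var credentialType = Type.GetType(typeName, throwOnError: true);
var credentials = (AWSCredentials)Activator.CreateInstance(credentialType);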

A guide on how to consume this package can be found on the main GitHub page.

Any questions, comments or requests about this package, feel free to post in the issues section!


tl;dr;

I have a new project on GitHub that targets Amazon Simple Notification Service (SNS) using NLog. You can find the source code here. Alternatively, if you prefer, there’s the NuGet package:

PM> Install-Package NLog.Targets.SNS

AWS code deploy really does not like multi tenancy

We recently ran into a not-so-new but not-well-publicized problem with CodeDeploy involving auto scaling group hooks. In a cost-efficiency exercise, and partly due to our legacy data-centre-based infrastructure, we were mandated to run multi-tenant servers. Multi-tenancy has been debated to death and I am not going into it here, but suffice it to say it has its merits, especially when you’re running a high-volume, low-margin business (the Tesco model).

So, somewhere down the line each cluster (or product group, if you like) decided that their services should now be small microservices multi-tenanted on a semi-decent machine. As they are web applications, they also wanted blue and green slices on the same instance, thereby optimising cost and performance. We managed to deploy a total of 6 applications with their associated colour slices onto the auto scaling group, bringing the application count to 12.

All of this went fine with CodeDeploy; with the soft limits removed we were able to update all 12 applications simultaneously at any given time. Then we decided to scale it up. The instances went off into an infinite loop of creation and destruction. Looking at the activity history, we came across this line:

Launching a new EC2 instance: i-********. Status Reason: Instance failed to complete user’s Lifecycle Action: Lifecycle Action with token *****-****-****-**** was abandoned: Heartbeat Timeout

The key phrase there is heartbeat timeout. So, what’s happening? Because the lifecycle hooks work on an event-based model, it seems that with the sheer number of applications attached to the hook, some of the hooks time out, presuming the event is now lost in the ether.

A bit of good old web searching, and I found this response on the AWS forums:

Simultaneous application deployments– Deploying multiple applications to the same instance at the same time can fail if one of the deployments has scripts that run for more than a few minutes. Right now, the agent won’t accept any new commands while it is executing one. Outside of Auto Scaling, this is rarely done so it’s mostly been noticed here. We currently don’t have any workarounds for this.

That is simply not fun. After a chat with an Amazon support rep, it seems that whilst it is permissible to attach multiple CodeDeploy hooks to a single ASG, the CodeDeploy developers really don’t like what we did there (their terminology was “bad practice”). We of course have our reasons for doing so, and it seems a perfectly reasonable thing to want (you could use userdata or Octopus Deploy for this instead).

It boils down to this: AWS auto scaling really doesn’t like you treating CodeDeploy like a farm of multiple non-container “containers”.

So, a lesson for all here: microservices, .NET, multi-tenancy and CodeDeploy are not a match made in heaven.