CategoriesInfrastructureServerless

AWS Step Function Versions and Aliases

Up until last week, when you deployed a new version of your State Machine in AWS Step Functions, the old version was gone and the ability to test or rollback was limited by your ability to re-push a previous commit. However, AWS has rolled out Step Function Versions and Aliases so that you can accomplish just those tasks. Creating a unique combination of a version and ASL gives you the ability to use things like Deployment Preferences to accomplish Canary or Linear-type deployments. In the below article, I’m going to walk you through Step Function Versions and Aliases.

Disclaimer

First up, I’m going to be using SAM to build the infrastructure. I think this is the first SAM-based deployment article I’ve written. I know this makes Allen Helton super happy. Second, I know the article from AWS says support for SAM and CDK, but they haven’t rolled this in as of the writing of this article. However, I’m using the SAM Nightly Builds and it does include the Transforms to make this happen. I could spend another few articles describing SAM and perhaps I will dig deeper later, but for now, here’s the AWS Docs on transforms.

Step Function Versions and Aliases

For starters, these two concepts are very similar to how they work with Lambda Functions. A version is nothing more than a unique package of your workflow with a number associated with it. That version can be run via the console or the API just like you’d run the latest tag that’s associated with a Lambda. Then you have an alias. This is nothing more than a pointer to a version that can also include routing type information. This is helpful when managing traffic that is supplied to your state machine. AWS did a nice job of visualizing these concepts with the images below

Alias

Alias

Version

Version

Sample Code

With this sample, I’m going to build a basic State Machine that has a choice and then based on the value, will either Succeed or Fail the machine. Then I’ll show you how using Step Function Versions and Aliases can help prevent bad code from your environment./

Choice Workflow

What I’m going to demonstrate is how you can use DeploymentPreferences to protect your State Machines from a faulty definition by changing the choice selections to fail when the same input is supplied due to an error in the code.

The ASL for the workflow will start like this

{
    "StartAt": "Basic Choice",
    "States": {
        "Basic Choice": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.type",
                    "StringEquals": "B",
                    "Next": "Passed"
                },
                {
                    "Variable": "$.type",
                    "StringEquals": "C",
                    "Next": "Passed"
                },
                {
                    "Variable": "$.type",
                    "StringEquals": "D",
                    "Next": "Passed"
                },
                {
                    "Variable": "$.type",
                    "StringEquals": "A",
                    "Next": "Failed"
                }
            ]
        },
        "Passed": {
            "Type": "Succeed"
        },
        "Failed": {
            "Type": "Fail"
        }
    },
    "TimeoutSeconds": 30
}

Now let’s walk through the SAM Template to build and deploy the infrastructure.

SAM Template

State Machine

AliasStateMachine:
    Type: AWS::Serverless::StateMachine
    DependsOn: StateMachineLogGroup
    Properties:
        Type: EXPRESS
        AutoPublishAlias: "main"
        Name: "SampleAliasMachine"
        DefinitionUri: statemachine/alias.asl.json
        Role:
            Fn::GetAtt: [StatesExecutionRole, Arn]
        Logging:
            Destinations:
                - CloudWatchLogsLogGroup:
                      LogGroupArn: !GetAtt StateMachineLogGroup.Arn
            IncludeExecutionData: True
            Level: ALL
        DeploymentPreference:
            Alarms:
                - !Ref AliasStateMachineFailureAlarm
            Interval: 2
            Percentage: 50
            Type: LINEAR

So if you’ve built a State Machine before with SAM, some of this will look pretty standard. Bonus points if you’ve built a Lambda with SAM because the DeploymentPreference section should also look familiar. The things to take note of though are this.

  • AutoPublishAlias: this will automatically create the alias for you and keep it up to date with the latest version
  • DeploymentPreference: if you view the spec, it requires a Version, but when using it in SAM like this, the version will get added for you in the transform so no worries.

Running sam deploy will produce the following State Machine Alias setup.

Version 1

Deployment Alarm

Looking further at the DeploymentPreference, there is an Alarm defined in there. That alarm is watching for the FailedExecutions metric. This could be a metric used for other purposes, but in this article’s case, it’s to manage when to roll back. The definition takes this shape.

AliasStateMachineFailureAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
        AlarmDescription: Invocation alarm for Alias State Machine
        Namespace: AWS/States
        MetricName: ExecutionsFailed
        Dimensions:
            - Name: StateMachineArn
              Value: !Ref AliasStateMachine
            - Name: Alias
              Value: "main"
        Statistic: Sum
        ComparisonOperator: GreaterThanOrEqualToThreshold
        Threshold: 1
        EvaluationPeriods: 1
        Period: 300
        TreatMissingData: notBreaching

Building a CloudWatch alarm is about defining what metric to watch for, which dimensions or criteria are you paying attention to, what period to evaluate, how to evaluate missing data, what is the calculation and what’s the comparator. With this release from AWS, they’ve also rolled out CloudWatch Step Function dimensions for Version and Alias. The alarm defined above does this:

  • Sum the number of FailedExecutions over 5 minutes
  • Filter those by the ARN of the State Machine and by the Alias supplied
  • Treat no data as “OK”
  • Compare that SUM that if it’s >= 1 then ALARM

Testing the Workflow

With the State Machine deployed at Version 1, I can now test that my workflow is as I intend and then show you how to use Step Function Versions and Aliases when the workflow is not good.

aws stepfunctions start-execution --state-machine-arn arn:aws:states:::stateMachine:SampleAliasMachine:main --input "{\"type\": \"B\"}"
Choice Pass

This is going to run through the Choice step and trigger a Succeed because “B” is a passing type.

But let’s for the sake of testing the deployment say that we change the workflow to the following defect and now “B” triggers a Fail. We could have executions and customer impacts because the State Machine that was working is now failing. This is where the DeploymentPreference and the Alarm come into play.

Choice Pass
{
    "Variable": "$.type",
    "StringEquals": "B",
    "Next": "Failed"
}

This could be bad, but thankfully when deploying, in this example, I’m using a LINEAR style deployment where traffic is weighted 50/50 for 2 minutes to monitor for issues and then the rest of the traffic is switched over to the new version should everything look good. This article will explain more of the options.

The Routing will look like this when it’s happening.

Routing

And if you run the failing choice while the deployment is happening, it will be automatically rolled back.

Rollback

Wrap Up

My first thoughts when using this capability were how have I been using Step Functions without Versions and Aliases the whole time? But then as a stepped back, Lambda went through the same type of transition where you just always overwrote the existing function and didn’t do much canary or linear type rollouts. I have a feeling that I’m not going to use this technique for everything but for mission-critical workflows, it’s going to be a requirement.

If you are going to invest in leveraging Step Functions and ultimately in a deployment pipeline to manage releases, the automated verification before releasing to your customers is a critical step for zero downtime deployment. It is also a critical piece of continuing beyond just CI and moving towards a CD-style world. Step Function Versions and Aliases will help you get there.

Feel free to clone the repository which is a full working sample of what I’ve shown above. Hopefully, it gets you started on this journey. And as always, I’d love to hear feedback!

Published by Benjamen Pyle

Benjamen is a genuine and resourceful technology creator with over 20 years of hands-on software development, team building and leadership experience. His passion is enabling technology teams to be their best by bridging modern technical design with outstanding business problem-solving. Recognized as an AWS Community leader in the areas of Event-Driven and Serverless Architecture, he brings multiple years of pragmatic experience designing and operating modern cloud-native and containerized solutions. When Benjamen doesn't have his head in the clouds, he's either playing golf with his wife and 2 boys or they are outside with their 12 paws.