Infrastructure as actual Code (IaaC)


Setting the Scene

My first introduction to Infrastructure as Code (IaC) must have been back in 2016. We were exploring AWS for the first time in my team, and we were using CloudFormation with its JSON-formatted templates. I'm pretty sure this was before the YAML-formatted templates were even a thing, since YAML support didn't launch until September 2016.

Anyway, as a developer, I'm used to writing code. I follow "best practices", such as Don't Repeat Yourself (DRY), by using variables, methods, and abstracting where necessary. I also try to keep it as easy to read, understand, and maintain as possible. Sometimes, that includes comments in my code, usually to explain the rationale behind certain choices or highlight potential gotchas.

Yes, yes, not everything should be abstracted, You Aren't Gonna Need It (YAGNI), and all that.

Well, you see... The problem with most IaC solutions is that they are actually configuration disguised as code. They sprinkle a narrow set of functions and utilities on top of the configuration language and call it a Domain Specific Language (DSL).

What happens next is a tale as old as time. The functions and utilities are not enough, and the creators have to keep band-aiding this horribly thought-out mess they've created. This is how you end up with a badly designed mess of a DSL, with some resemblance to a programming language. This is true for CloudFormation, Terraform (HCL), ARM, and Bicep.

There must be a better way, right? Well, I think there is, but first, let's take a look at some of the problems with the current state of affairs.

JSON as a Configuration Language

JSON is not a configuration language; it is a data-interchange format. While it is convenient, widespread, well-known, and available in just about all languages, it simply is not a good fit for writing configuration.

My first issue with JSON as a configuration language is that it lacks comments. Configuration files often need them, especially to explain why a default value was overridden. Also, when working on a remote machine, quickly commenting out a line and restarting the service beats having to delete that line.

It isn't well-suited for humans either; it involves a lot of braces, commas, and quotes, making it hard to read and write by hand at scale. For example, when defining an array, you might leave a trailing comma, which is invalid JSON.
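To make both complaints concrete, here is a minimal sketch using nothing but the standard JSON.parse. The isValidJson helper and the sample strings are made up for illustration.

```typescript
// Returns true when `text` is valid JSON, false when JSON.parse rejects it.
function isValidJson(text: string): boolean {
  try {
    JSON.parse(text)
    return true
  } catch {
    return false
  }
}

const clean = '{"ports": [80, 443]}'
const trailingComma = '{"ports": [80, 443,]}'
const withComment = '{"port": 8080 // overrides the default\n}'

console.log(isValidJson(clean)) // true
console.log(isValidJson(trailingComma)) // false
console.log(isValidJson(withComment)) // false
```

One stray comma or one helpful comment, and the whole file is rejected.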

CloudFormation

As I wrote earlier, my first introduction to CloudFormation was with the JSON syntax. I don't think I'll ever forget the first time I had to create a string from the output of two resources. It required figuring out how to use a combination of Fn::Join, Fn::Split, Fn::Select, and Ref.

It was way more of a challenge than manipulating a string has any right to be. Also, it isn't like you're going to be able to use a debugger when it goes wrong. If those hellish-looking functions aren't familiar to you, here is the documentation for CloudFormation's so-called intrinsic functions.

Take a look at this snippet below, taken from AWS's examples repo.

{
  "OriginAccessControl": {
    "Type": "AWS::CloudFront::OriginAccessControl",
    "Properties": {
      "OriginAccessControlConfig": {
        "Name": {
          "Fn::Join": [
            "",
            [
              "rain-build-website-",
              {
                "Fn::Select": [
                  2,
                  {
                    "Fn::Split": [
                      "/",
                      {
                        "Ref": "AWS::StackId"
                      }
                    ]
                  }
                ]
              }
            ]
          ]
        }
      }
    }
  }
}

I mean... just woah, that is so horrible. The amount of indentation and brackets is simply astounding. Those brackets are simply not helpful for anyone but computers.
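For reference, this is roughly what all of that nesting computes, written as plain TypeScript. The stack ID below is a made-up value in the ARN format CloudFormation uses for AWS::StackId, so treat it as an assumption for illustration.

```typescript
// Hypothetical stack ID in the ARN format CloudFormation uses:
// arn:aws:cloudformation:<region>:<account>:stack/<stack-name>/<guid>
const stackId =
  "arn:aws:cloudformation:eu-west-1:123456789012:stack/my-stack/51af3dc0-da77-11e4-872e-1234567db123"

// Fn::Split on "/" followed by Fn::Select with index 2 picks the trailing GUID
const guid = stackId.split("/")[2]

// Fn::Join with an empty delimiter is plain string concatenation
const name = `rain-build-website-${guid}`

console.log(name) // rain-build-website-51af3dc0-da77-11e4-872e-1234567db123
```

Three short statements, each of which you could step through in a debugger.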

This was my first experience with IaC, so to say that I was excited about alternatives is an understatement. At this point, we haven't even talked about actually deploying CloudFormation, which is a whole other nightmarish experience.

To be fair, it is considerably more readable in YAML, but that is not saying much. Here is the same snippet in YAML. But look at those double dashes...

OriginAccessControl:
  Type: AWS::CloudFront::OriginAccessControl
  Properties:
    OriginAccessControlConfig:
      Name: !Join
        - ""
        - - rain-build-website-
          - !Select
            - 2
            - !Split
              - /
              - !Ref AWS::StackId

The fundamental issues with CloudFormation run way deeper than just syntax, and YAML can't fix that. This whole thread on Hacker News is such a throwback to the days when I was using CloudFormation myself. Who doesn't miss a little ROLLBACK_FAILED in their lives?

Just to make something clear before we proceed: I'm not saying that YAML is good, just that it is better than JSON.

Constraints won't set you free

When using a DSL like CloudFormation, I always reach a point where I get utterly frustrated. I want to do something simple that takes a few lines of code in ANY real language, but with the DSL it feels like wearing a straitjacket, especially because it feels like you're going crazy at the same time.

You have to visit the documentation again and again, and for the most part, you end up copying/pasting samples. You try to make as few adjustments as possible, as you fear the dreaded feedback loop that comes with vendor IaC tools (Bicep, CloudFormation), where the decision engine is owned by the cloud provider.

It seems to me that the people who are sold on these kinds of DSLs are the people who can't code. They don't want to make an investment as big as learning to code, so they fool themselves into thinking that using a DSL is less work. The only thing they end up doing is painting themselves into a corner.

They then have to cling on for dear life, hoping that it has staying power, as everything they've learned is useless if it doesn't.

Case in point: the Puppet language.

It is time to stop the cope.

The case for using a programming language

Using a DSL with a configuration language usually means you won't have any proper tooling like you usually would with a programming language.

This means that your IDE can't help you unless someone has implemented a language server (via the Language Server Protocol, LSP) or something similar. Without proper tooling, you won't have autocomplete, syntax highlighting, auto-formatting, refactoring, code navigation, easy separation of units, debugging, or any such things.

If only something already existed that had all those wonderful features, something people could use to express complex instructions. Perhaps it could even be Turing-complete. I don't know, something like a PROGRAMMING LANGUAGE?

It really makes you wonder whether people realize why TypeScript has gained so much popularity. It is all about the tooling. It actually makes it pleasant to write JavaScript (for the most part). You avoid a lot of errors simply by having static typing.
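Here is a small sketch of what that buys you when describing infrastructure. The BucketSpec shape and the bucketsFor helper are hypothetical and not tied to any IaC tool. They only show a typed definition, a loop, and a conditional working together.

```typescript
// Hypothetical shape for a bucket definition; not tied to any IaC tool.
interface BucketSpec {
  name: string
  versioning: boolean
}

// One function replaces three near-identical copy-pasted blocks.
function bucketsFor(project: string, environments: string[]): BucketSpec[] {
  return environments.map((env) => ({
    name: `${project}-${env}-assets`,
    // A plain conditional: only production keeps object versions.
    versioning: env === "prod",
  }))
}

const buckets = bucketsFor("shop", ["dev", "staging", "prod"])
console.log(buckets.map((b) => b.name).join(", "))
// shop-dev-assets, shop-staging-assets, shop-prod-assets

// A typo such as `versionning: true` in a BucketSpec literal is rejected by
// the compiler before anything gets deployed.
```

The editor autocompletes the property names, and renaming one is a single refactoring action instead of a search-and-replace across templates.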

Terraform (HCL)

Terraform uses its own custom configuration language, HCL. Luckily for HashiCorp, HCL became popular enough that tooling is now widespread, but a 2.5-star rating for their VS Code plugin... Yikes!

A problem remains, though: HCL will not be familiar to anyone who has not written Terraform before, because no one else uses it, and no one else is going to, let's be real. Funnily enough, GitHub used it early on for GitHub Actions but retired the HCL syntax and switched to YAML.

Personally, I think HCL works well for Terraform. It is a lot better than JSON, but it still gets hairy when you start introducing loops and conditionals. Also, you're still at the mercy of built-in functions and utilities.

It manages our little string manipulation pretty smoothly, though. This is what the snippet from the previous section looks like in HCL.

resource "aws_cloudfront_origin_access_control" "oac" {
  name = join("", ["rain-build-website-", split("/", var.stack_id)[2]])
}

But for me, it all falls apart so easily. I don't feel like I need to give many arguments when you can just look at the ecosystem surrounding Terraform. Would you really need the likes of Terragrunt if it wasn't for the limiting DSL?

Let me put their headline from their website here:

"DRY and maintainable OpenTofu/Terraform code." - Terragrunt

Implying that Terraform is neither DRY nor maintainable without it? Their words! Not mine! Okay, okay, I jest. Maintainable, sure, but DRY?

Let's move on to deeper issues.

Imagine you need to create a UUID. You quickly check the documentation and find the built-in uuid function, but wait, what does it say?

"The function generates a well-understood string representation of a 128-bit value, but the output is not RFC-compliant."

What does that mean? Oh no... It generates a random string formatted as a UUID, but it is not RFC-compliant. What were they thinking?

The "best" way to fix that is to write your own provider or find a provider that works properly, but let us be honest, perhaps executing a shell script is easier?

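data "external" "generate_uuid" {
  # The external data source expects a JSON object on stdout, so wrap uuidgen's output
  program = ["sh", "-c", "printf '{\"uuid\":\"%s\"}' \"$(uuidgen)\""]
}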

I'm being facetious here, trying to get a point across.

Using actual code

One of the primary advantages of using a programming language is that you can leverage that language's ecosystem, be it tooling, libraries, or community.
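Take the UUID problem from before. In TypeScript on Node.js, the built-in crypto module already generates random RFC 4122 version 4 UUIDs, so there is nothing to install and nothing to shell out to. The regular expression below is only my own sanity check of the version and variant digits.

```typescript
import { randomUUID } from "node:crypto"

// randomUUID() returns a random RFC 4122 version 4 UUID as a string
const id = randomUUID()

// Version digit "4", followed by a variant digit of 8, 9, a or b
const looksLikeV4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/.test(id)

console.log(id, looksLikeV4)
```

Keep in mind that this produces a new value on every run. For a value that has to stay stable across deployments, you need something that stores it in state.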

When you have an IaC tool that is designed to be used from a programming language, it can unlock new possibilities. It is time to introduce Pulumi. It fulfills everything I've been talking about this whole time. Did you perhaps see it coming?

First things first, here is what our mundane snippet from the previous section looks like in Pulumi using TypeScript.

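import * as aws from "@pulumi/aws"
import * as pulumi from "@pulumi/pulumi"

// pulumi.getStack() returns the plain stack name (e.g. "dev"), so no splitting is needed
const oac = new aws.cloudfront.OriginAccessControl("oac", {
  name: `rain-build-website-${pulumi.getStack()}`,
  // Required arguments; the values below are assumed for an S3 origin
  originAccessControlOriginType: "s3",
  signingBehavior: "always",
  signingProtocol: "sigv4",
})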

Quite simple, easy to read, and easy to understand. Discoverability is also great because of the common dot-notation, allowing you to easily discover what members are available on objects.

Here are some of the unique features that Pulumi offers:

  • Dynamic providers
  • Function serialization
  • Automation API
  • Avoiding plaintext secrets in your state

A dynamic provider is a provider defined directly in your code. It doesn't have to be packaged, and it doesn't even need to live in a separate file. All you have to do is implement a create and a delete, or if you want to go the whole way, you can implement full CRUD for a fully managed lifecycle.
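To give a feel for the shape, here is a rough sketch of a provider that mints a UUID once on create. To keep it runnable without Pulumi installed, it declares a local stand-in type that I'm assuming mirrors the `{ id, outs }` result Pulumi expects from create. The uuidProvider name is made up, and the real wiring through pulumi.dynamic.Resource is only described in a comment.

```typescript
import { randomUUID } from "node:crypto"

// Local stand-in for the result of a dynamic provider's `create`
// (assumed to mirror Pulumi's `{ id, outs }`), so the sketch runs without Pulumi.
interface CreateResult {
  id: string
  outs: Record<string, unknown>
}

// Hypothetical provider: generates a UUID once and returns it as resource state.
const uuidProvider = {
  async create(_inputs: Record<string, unknown>): Promise<CreateResult> {
    const uuid = randomUUID()
    return { id: uuid, outs: { uuid } }
  },
  async delete(_id: string, _props: Record<string, unknown>): Promise<void> {
    // Nothing exists outside the state, so there is nothing to clean up.
  },
}

// In a real program this object would be handed to a class extending
// pulumi.dynamic.Resource; here it is called directly to show the contract.
uuidProvider.create({}).then((result) => console.log(result.outs))
```

Because the value comes back from create as state, it would only be generated when the resource is first created, not on every run.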

For more detailed information, check out the documentation for dynamic providers here.

You also have function serialization. I'm still not entirely sure whether it is a good idea to use in production, as it involves a lot of moving parts, with a lot of caveats and possible footguns, but it is a very impressive feature nevertheless. Basically, you can write an inline Lambda function, and Pulumi will take care of the packaging.

Perhaps the biggest killer feature of Pulumi is the Automation API, which allows you to use Pulumi programmatically to provision resources. With Terraform, for example, you'd end up calling the CLI via a shell script or some other means. With Pulumi, you can do it in code, which makes it ideal for building tooling that provisions infrastructure, building a full-blown Internal Developer Portal, or some other tool that needs to manage infrastructure.

Pulumi has amazing support for handling secrets and encrypting them in the state file. It is honestly mind-boggling that Terraform doesn't support it, and no, encrypting your state file at rest is not the frigging same. A recent development in Terraform is what they call ephemeral values, which can alleviate some of the issues with secrets in state, but encrypting secrets like Pulumi does is still far superior.

Why not Terraform CDK, AWS CDK, etc.?

The fundamental issue with Terraform CDK, AWS CDK, and other similar tools is that they are not actually built as Infrastructure as actual Code. They are merely abstractions that translate to their original data-interchange format. This means that they are not actually innovating at all, merely playing an impossible game of catch-up.

In the case of AWS CDK and the future Azure CDK, they will never even solve the fundamental problem of the cloud provider owning the actual decision engine. They are simply a hack or workaround to solve the issue of picking a configuration language to begin with.

Conclusion

I'm not saying that Pulumi is the objectively best IaC tool out there, but it is really powerful if you know how to code. Even if you don't need the superpowers it can provide, it is still much better at reducing duplicate code, and extracting common patterns, than any configuration language.

I urge you to try it out and see for yourself. Even if you don't know how to code, perhaps you'll find out that it isn't so difficult after all, especially with the help of good tooling.

Forget about the cloud-specific IaC tools. There is simply no good reason to use them.

With the cloud provider owning the decision engine, you're never actually in control. If you're blocked by a bug, you can't contribute to fixing it. Perhaps if you or your company pays the cloud provider enough money annually, they will actually get back to you.

Third-party IaC tools ship support for new resources really fast. There is no advantage in relying on the first-party tool for early support of new resources and features.

And most importantly, eventually, you'll want to manage additional infrastructure that is not vendor-specific. This could be DNS records in another registry, or some configuration in a SaaS platform.

If you're still not convinced, you can always just stick to HCL and hope that it lasts. Hey, at least it is better than JSON / YAML, right?

Closing remark

If you're still not convinced, I could probably be goaded into doing a face-off between your choice of IaC, and Pulumi. You make an implementation in your tool, and I'll make an implementation in Pulumi, and we'll see which one comes out on top.

Anyways, stay tuned, it is very possible I'll publish more posts about Pulumi, and how I use it.