Automated remediation of cloud misconfigurations remains a prominent theme in cloud security. One of the most significant challenges customers face is putting automation into practice rather than just talking about it.
When enterprises evaluate Cloud Security Posture Management (CSPM) solutions, automated remediation is frequently the end goal. As with any enterprise system, it is critical to learn, plan, and prototype your automation capabilities until you fully understand what they can do. Our challenge is to help clients who are starting from a blank slate take a “crawl, walk, run” approach. Running aggressive automated remediation from day one risks causing more issues than it solves, and a poor initial implementation is likely to leave your team averse to automation going forward.
Automation can range from basic notification and logging to fully automated remediation (the most advanced type of automation). You don’t need to start with 100% automated remediation from day one. In fact, most organizations benefit greatly from working their way through the levels of automation to explore which approaches suit their environment best. In this paper, we’ll examine the different steps and levels of automation so that, by the end, you’ll be able to choose the appropriate level of automation for your environment.
First, let’s start by reviewing the benefits of automated remediation:
Notification is the foundational building block of automated remediation, and something you will use on an ongoing basis. Because you’ll configure notifications to send reports of remediated events, this is a great place to start for testing.
During the initial rollout, using only notifications for the first tests is critical because it allows you to audit exactly what would be remediated without making changes. This is a great way to ensure any actions that would be invoked will do exactly what you want.
The old saying of “measure twice, cut once” applies here. Do not move out of the notification stage until you’re able to consistently validate that you are only receiving notifications for the defined resources.
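The notification-only stage described above can be sketched as a simple switch in front of the remediation action. This is a minimal illustration, not any particular CSPM product's API: the `Finding` and `Remediator` names, the `notify_only` flag, and the sample resource ID are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Finding:
    """A hypothetical misconfiguration reported by a CSPM tool."""
    resource_id: str
    rule: str


@dataclass
class Remediator:
    """Runs an action per finding, or only reports it when notify_only is True."""
    action: Callable[[Finding], None]
    notify_only: bool = True  # start in the notification stage
    notifications: List[str] = field(default_factory=list)

    def handle(self, finding: Finding) -> None:
        if self.notify_only:
            # Audit what would be changed without touching the resource.
            self.notifications.append(
                f"WOULD remediate {finding.resource_id} ({finding.rule})"
            )
            return
        # Remediation enabled: act, then keep the notification for audit.
        self.action(finding)
        self.notifications.append(
            f"Remediated {finding.resource_id} ({finding.rule})"
        )


# Stand-in action that records which resources it would have changed.
changed: List[str] = []
remediator = Remediator(action=lambda f: changed.append(f.resource_id))
remediator.handle(Finding("sg-0123", "ssh-open-to-world"))
```

While `notify_only` is left at its default, `changed` stays empty and only the "WOULD remediate" message is recorded, which is the list you would audit before switching the flag off.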
When you’re ready:
Tip: After initial testing and the first stage of remediation has been performed, it is important to keep notifications turned on so that the ongoing results of your remediation are logged for audit.
This is critical for several reasons. If things are breaking and being fixed automatically, how will you know when something bad is happening? Also, if there’s someone who is making unintentionally incorrect or insecure changes, they won’t have any way of being notified that they need to change what they’re doing.
It’s important that notifications don’t become “noise” or “spam” to the recipients. To that end, notifications should include as much contextual information as possible.
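As a rough illustration of what "contextual information" might mean in practice, the sketch below builds a notification payload that says what was found, where, who made the change, and what was done about it. The field names and sample values are assumptions chosen for illustration, not a schema from any specific tool.

```python
import json
from datetime import datetime, timezone


def build_notification(resource_id, rule, account, region, actor, action_taken):
    """Return a JSON string carrying the context a recipient needs to act."""
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "account": account,            # which cloud account is affected
        "region": region,              # where the resource lives
        "resource_id": resource_id,    # what is out of compliance
        "rule": rule,                  # which check it failed
        "changed_by": actor,           # who to follow up with
        "action_taken": action_taken,  # what automation did, if anything
    }
    return json.dumps(payload, sort_keys=True)


# Hypothetical sample values.
message = build_notification(
    resource_id="sg-0123",
    rule="ssh-open-to-world",
    account="dev",
    region="us-east-1",
    actor="jdoe",
    action_taken="none (notification only)",
)
```

A structured payload like this can be forwarded unchanged to a chat channel, a ticketing system, or a log store, so each recipient sees the same context.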
Sample notification, ticketing, and logging targets include:
The next step in moving towards automated remediation should focus on locking down your account fundamentals. There are several initial configurations that every new cloud account should have, and most of them can be controlled with automation.
Sample automation for AWS can include:
These fundamentals will save valuable time and increase the security of your environment. None of these configurations should have a negative impact on your day-to-day operations or users.
The automation recommendations in levels 1 and 2 line up closely with the CIS AWS Foundations Benchmark. These are account fundamentals that any organization can employ to improve the security and overall hygiene of their cloud accounts.
The difference between levels 1 and 2 is that the level 2 automation for account fundamentals takes some planning to ensure it won’t have an impact on your users. Even so, the level 2 automation will be easier to roll out than the remediation automation in levels 3 and 4.
Examples of AWS housekeeping automated remediation:
Things get a bit more free-form in level 3. The goal here is to make this automation your own and add actions that bring your company the most value while affecting day-to-day operations as little as possible.
There are several use-cases that may be employed, and you’ll have to explore which of these best fits your company:
It’s not until level 4 that you’ll begin performing the kind of automated remediation most people think of when discussing the topic. That’s because, while these actions are the most exciting form of automation and give you the most control, they can also do the most damage. Additionally, when you start rolling out these kinds of actions, you need to make sure that the organization is completely on board. If these types of actions are run in a vacuum, you’re going to have a lot of confused people and potentially a lot of broken systems.
Examples of level 4 automated remediations in AWS:
It’s important to mention the concept of timing for the actions that are in levels 3 and 4. It’s appealing to initially roll out actions that have a lag between notification and remediation built into them, but that might not be the best approach.
For example, if you wanted to lock down SSH exposure in your development environment, you might design your remediation to send a Slack notification that an instance is out of compliance because it has SSH exposed and that it’ll be terminated in two hours if not fixed. Two hours later, the instance can be terminated.
It may seem counterintuitive, but this is actually a more disruptive workflow than terminating the instance as soon as the issue is detected. In the first scenario, if a developer spins up a non-compliant instance and it goes away after two hours, they will have spent two hours of work on that instance. The code will have been loaded, the app might be running, and when the instance disappears, that developer time is wasted (and they’ll probably be upset!). Instead, if the instance is torn down as soon as it’s seen, the developer has no opportunity to waste time on it. It’ll go away a minute or two after it is created, and the supporting notifications will give them the context they need to avoid the same mistake again.
Different scenarios require different timing and you’ll always have to balance the risk of security exposure with the operational impact it will have on the organization.
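The two timing options discussed above (remediate immediately, or warn first and remediate after a delay) can be expressed as one configurable grace period. The sketch below is a minimal illustration under assumed names: `decide_action` and its `grace_period` parameter are hypothetical and not part of any remediation product.

```python
from datetime import datetime, timedelta, timezone


def decide_action(created_at: datetime, now: datetime,
                  grace_period: timedelta = timedelta(0)) -> str:
    """Return 'remediate' once the grace period has elapsed, else 'warn'."""
    age = now - created_at
    return "remediate" if age >= grace_period else "warn"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
just_created = now - timedelta(minutes=1)

# Zero grace period: the instance is remediated as soon as it is seen.
immediate = decide_action(just_created, now)

# Two-hour grace period: the same instance only triggers a warning for now.
delayed = decide_action(just_created, now, grace_period=timedelta(hours=2))
```

Keeping the grace period as a per-rule setting lets you choose immediate action where exposure risk is high and a longer delay where operational impact is the bigger concern.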
Every organization will have a unique journey when implementing automation. For some, there’s no appetite for fully automated remediation, and using automation only for notifications will be enough. In other organizations, everything will be automated and completely locked down. Whatever level your organization strives to reach, working through these four levels lets you roll out automation gradually, work toward fully automated remediation, and get the most value out of your actions with the least shock to the organization.