Cloud Cost Control - Early Detection is Key

Don't get caught off guard by inflated monthly cloud costs

The key to managing cloud spend effectively is early detection.

Without it, companies incur extra costs for weeks or even months while they try to isolate the cause of an anomaly. Here is why:

  1. Companies are usually alerted to the increase in cost only when the accounting department receives the monthly bill
  2. The accounting department then asks the project teams to look into the cause
  3. The assigned team investigates to isolate the cause
  4. The process can take weeks or even months while the Cloud Spend continues to mount

How Cloud Spend Abnormality Happens

Unfortunately, there is no easy way to isolate the cause of these abnormalities.

A surge in Cloud Spend generally happens due to changes such as:

  • Life-cycle policies for storage that kick in after 1 month
  • Erroneous execution of Change Requests

to name a few

These events are plentiful and difficult to isolate in a production environment.

Nimbus Stream is the Answer

Nimbus Stream solves this problem by allowing:

  • Early detection of abnormal costs
  • Early isolation of the root cause of the abnormal costs (RCA)


Cost Management - Effective Tagging as the Foundation

Background

At the end of the month, we receive an AWS bill that shows all the costs of the account broken down by service. From there, we generally see the dreaded monthly increase, and sometimes increases we did not foresee. On those occasions, we start digging deeper into the statistics. For this, we have a tool called the Cost Explorer.

Cost Explorer

The Cost Explorer is a BI tool provided by the cloud provider that lets you slice and dice the data until you find the discrepancies. It is an amazing tool, if... and only if...

All our data is tagged correctly...

Only then can you slice and dice your data by the tags!

Nimbus Stream Possibilities

Nimbus Stream recommends a standardised tagging policy of the following tags:

  • Environment - Used to differentiate between the different environments, i.e. development, staging, production
  • Project - Used to differentiate between the different projects
  • ServerGroups - Used to differentiate between the different types of servers you have, e.g. Network Devices, Load Balancers

Customers are advised to add as much granularity as possible in their tagging strategy.

Nimbus Stream - Identifying Errant Resources

Truth be told, getting your resources tagged properly is not easy. As Shakespeare once said:

I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Ensuring that all our Developers and Engineers abide by all these rules is close to impossible. We need a little help... and that's where Nimbus Stream comes in.

Having first identified all the necessary tags that need to be configured for the project, Nimbus Stream will:

  • Call APIs to run an initial check on all resources. Any resources that have not been tagged properly will be highlighted.
  • Listen to CloudTrail events for future resource creation / updates. It will again highlight any resources that have not been tagged properly.
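The tag check described above can be sketched in a few lines of Python. This is only an illustration, not Nimbus Stream's actual implementation: the inventory format, the resource ids and the helper names `missing_tags` and `find_errant_resources` are all assumptions made for this example. Only the three tag keys come from the recommended policy.

```python
# Required tag keys, taken from the tagging policy recommended earlier.
REQUIRED_TAGS = ("Environment", "Project", "ServerGroups")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on one resource."""
    return [key for key in required
            if not (resource_tags.get(key) or "").strip()]


def find_errant_resources(resources):
    """Map each improperly tagged resource id to its list of missing tags.

    `resources` is a hypothetical inventory: {resource_id: {tag_key: tag_value}}.
    """
    errant = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            errant[resource_id] = missing
    return errant


# Made-up inventory, for illustration only.
inventory = {
    "i-0001": {"Environment": "production", "Project": "A",
               "ServerGroups": "LoadBalancer"},
    "i-0002": {"Environment": "development"},
    "vol-0003": {"Environment": "staging", "Project": "",
                 "ServerGroups": "Storage"},
}

for resource_id, missing in find_errant_resources(inventory).items():
    print(resource_id, "is missing", ", ".join(missing))
```

The same check could be run once against the full inventory and then again for every resource that appears in a later creation or update event.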

Nimbus Stream - Identifying the User

Once you have found all these errant resources, who should rectify them? How do we find that naughty Engineer / Developer? This is where Nimbus Stream can help again.

Nimbus Stream is always actively listening to the CloudTrail logs. As such, we KNOW who created these resources.

Not only will we identify the errant resources, we will also highlight the relevant culprit!
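As a rough sketch of how the creator can be read from a log record: CloudTrail event records carry a `userIdentity` element alongside `eventName` and `eventTime`. The helper name `creator_of` and the sample record below are made up for illustration, and how Nimbus Stream does this internally is not described here.

```python
def creator_of(event):
    """Return (who, action, when) for one CloudTrail-style event record."""
    identity = event.get("userIdentity", {})
    # Fall back through the identity fields; not every record carries an ARN.
    who = (identity.get("arn")
           or identity.get("userName")
           or identity.get("principalId")
           or "unknown")
    return who, event.get("eventName", "unknown"), event.get("eventTime", "unknown")


# Made-up event record, trimmed to the fields used above.
sample_event = {
    "eventTime": "2024-01-15T08:30:00Z",
    "eventName": "RunInstances",
    "userIdentity": {
        "type": "IAMUser",
        "arn": "arn:aws:iam::123456789012:user/alice",
        "userName": "alice",
    },
}

who, action, when = creator_of(sample_event)
print(f"{who} called {action} at {when}")
```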

With Nimbus Stream, getting the tags correct will be as easy as ABC. Truth is, tagging is the cornerstone of Nimbus Stream.

Tagging - Constant Monitoring for Statistical Anomaly

Working with the Cost Explorer

Most of you who use AWS will certainly use its Cost Explorer. You will spend a lot of time tagging, as you need the segmentation to make sense of your costs. Azure and Google Cloud have similar tools.

At the end of every month, when AWS surprises you with an increase of a few thousand dollars, you are in a good position to slice and dice the data and identify the exact Service, Environment, Project and ServerGroups type causing the problem. This is because you have all the tags in place.

Still, by the time we use the Cost Explorer, we have already been hit with thousands of dollars in charges. We start asking ourselves,

Is there a better way to handle this?

Banks and Credit Cards

Most of us use credit cards.

Did you know that card-based payment systems worldwide generated gross fraud losses of 6.86¢ for every $100 of total volume in 2018?

If the banks did absolutely nothing, we would all get a rude shock at the end of the month.

Banks cope with credit card fraud by:

  • Limiting credit spending per card
  • Implementing complex fraud detection analytics

The credit card systems are built with fraud detection analytics that alert compliance officers to suspicious transactions.

Maybe we can start learning from the banks...

Limiting Cloud Costs

Now the truth is ...

There is no way to have the cloud service providers (CSPs) cap our Cloud Spend.

The cloud providers charge for whatever you consume. If your serverless algorithms go wrong, you might be slapped with an astronomically high bill.

Yes, you can set certain alert limits. But should an errant algorithm start incurring astronomical costs, you will find yourself racing to figure out the root cause while shutting down the production system is not an option.

Nimbus Stream - Statistical Anomaly

So this is where Nimbus Stream comes in. Once you have the data tagged correctly, Nimbus Stream will analyze the data for statistical anomalies. It does this in the following ways:

  • Costs Segmentation
  • Statistical Anomaly

Cost Segmentation

This concept is borrowed from the well-known marketing principle of Market Segmentation.

Market segmentation is the research that determines how your organization divides its customers or cohort into smaller groups based on characteristics such as age, income, personality traits or behavior. These segments can later be used to optimize products and advertising for different customers.

In Nimbus Stream, we define cost segmentation as:

The concept of dividing our costs by characteristics such as Services, Environment, Projects and other tags. These segments can later be used to identify statistical anomalies in costs!

This means that if:

All the EC2s (services) in the Development Environment (tag:Environment) for Project A (tag:Project) running as Public Load Balancers (tag:PublicLoadBalancer) cost you $1000 in January and $2000 in February, we can narrow the root cause down to exactly that segment.
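A minimal sketch of cost segmentation, assuming the billing data is available as a list of line items already labelled with service and tags. The record layout, field names and figures are invented for this example; a real cost export will look different.

```python
from collections import defaultdict


def segment_costs(records):
    """Sum cost per (segment, month); a segment is one combination of service and tags."""
    totals = defaultdict(float)
    for r in records:
        segment = (r["service"], r["environment"], r["project"], r["server_group"])
        totals[(segment, r["month"])] += r["cost"]
    return totals


def month_over_month(totals, previous, current):
    """Return {segment: (previous_cost, current_cost)} for segments whose cost changed."""
    segments = {segment for (segment, _month) in totals}
    changes = {}
    for segment in segments:
        before = totals.get((segment, previous), 0.0)
        after = totals.get((segment, current), 0.0)
        if after != before:
            changes[segment] = (before, after)
    return changes


def line(service, env, project, group, month, cost):
    """Build one made-up billing line item."""
    return {"service": service, "environment": env, "project": project,
            "server_group": group, "month": month, "cost": cost}


records = [
    line("EC2", "Development", "A", "PublicLoadBalancer", "2024-01", 600.0),
    line("EC2", "Development", "A", "PublicLoadBalancer", "2024-01", 400.0),
    line("EC2", "Development", "A", "PublicLoadBalancer", "2024-02", 2000.0),
    line("S3", "Production", "B", "Storage", "2024-01", 300.0),
    line("S3", "Production", "B", "Storage", "2024-02", 300.0),
]

changes = month_over_month(segment_costs(records), "2024-01", "2024-02")
for segment, (before, after) in changes.items():
    print(segment, before, "->", after)
```

Only the EC2 / Development / Project A / PublicLoadBalancer segment is reported, because the S3 segment cost the same in both months.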

Statistical Anomaly

But how do we know how much deviation is too much? Also, isn't cloud cost expected to increase over time (increasing data costs, etc.)?

To address this, Nimbus Stream is built on these two concepts:

  • Linear Regression
  • 2 Standard Deviations (more than 95% certainty)

Linear Regression

Cloud costs generally increase over time. By performing a linear regression on past consumption, we can estimate next month's cloud bill.

This forms the basis of our predicted cloud spend.
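To make the idea concrete, here is a small sketch of an ordinary least-squares fit of monthly cost against the month index, extrapolated one month ahead. The function names and the cost history are assumptions for illustration; the exact model Nimbus Stream fits is not specified here.

```python
def fit_line(costs):
    """Least-squares fit of cost against month index 0, 1, 2, ...

    Returns (slope, intercept).
    """
    n = len(costs)
    if n < 2:
        raise ValueError("need at least two months of history")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(costs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, costs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_next(costs):
    """Extrapolate the fitted line to the month after the last observation."""
    slope, intercept = fit_line(costs)
    return intercept + slope * len(costs)


# Made-up monthly bills, growing by $100 a month.
history = [1000.0, 1100.0, 1200.0, 1300.0]
print(predict_next(history))
```

With this perfectly linear history, the fitted slope is $100 per month, so the prediction for the fifth month is $1400.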

Normal Distribution (2 Standard Deviations)

Cloud costs always vary to some degree from what is expected. If we assume that the variance of Cloud Spend follows a normal distribution, any deviation of more than 2 standard deviations from the prediction is abnormal with about 95% certainty.

Nimbus Stream will flag these abnormalities out so that you can investigate.
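The flagging rule can be sketched as follows, assuming we keep the past differences between actual and predicted spend (the residuals) and estimate their standard deviation from them. The function name, the threshold parameter and all numbers are hypothetical.

```python
import statistics


def is_anomalous(actual, predicted, past_residuals, threshold=2.0):
    """Flag a cost whose gap from the prediction exceeds `threshold` standard deviations.

    `past_residuals` are earlier (actual - predicted) differences for the same segment.
    """
    sd = statistics.stdev(past_residuals)
    if sd == 0:
        # No historical variation at all: any gap is unexpected.
        return actual != predicted
    return abs(actual - predicted) > threshold * sd


# Made-up residuals from six earlier months; their standard deviation is about 26.6.
past_residuals = [-40.0, 25.0, 10.0, -15.0, 30.0, -10.0]

print(is_anomalous(1430.0, 1400.0, past_residuals))  # gap of 30, inside 2 standard deviations
print(is_anomalous(1500.0, 1400.0, past_residuals))  # gap of 100, outside 2 standard deviations
```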


Figuring out the Outliers / Anomalies

To address what is out of the norm, we rely on 2 concepts.

Linear Regression

We know that costs tend to trend up in the cloud. A lot of it is natural and expected. As our platforms mature, we tend to store more data for backup, archival, etc. This inevitably leads to an increase in costs. To take this into consideration, we leverage the concept of Linear Regression.

Linear regression is commonly used to quantify the relationship between two or more variables.

In this case, our 2 variables are:

  • Cloud Costs
  • Time

This will allow us to figure out the expected mean in the current month.

Standard Deviations

Having an expectation of the cloud spend is one thing. Knowing how much of a deviation is acceptable is another. For this, we rely on a mathematical concept called the Standard Deviation. For a normal distribution, it is generally understood that anything that falls more than 2 standard deviations away from the mean is a rare event.

The chance of a cost deviating by more than 2 standard deviations in any normal distribution is less than 5%

As such, it definitely warrants an investigation!
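The "less than 5%" figure can be checked directly from the normal distribution using the complementary error function in Python's standard library. The helper name `two_sided_tail` is just a label chosen for this check.

```python
import math


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


print(round(two_sided_tail(2.0), 4))  # 0.0455
```

So roughly 4.6% of observations from a normal distribution fall more than 2 standard deviations from the mean, counting both tails. The figure only holds to the extent that the deviations really are normally distributed.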

Cloud Costs Complexity - Reserved Instances / Savings Plan

Lower Cloud Costs by committing Cloud Usage

Over time, all companies will try to lower their Cloud Costs by committing usage to the CSP. The 2 common avenues are Reserved Instances and Savings Plans.

  • A Reserved Instance is a reservation of resources and capacity, for either one or three years
  • With a Savings Plan, customers simply commit to a consistent amount of usage (e.g. $10/hour) over 1 or 3 years, and in exchange they receive a discount for that usage

This creates a new problem for people who are trying to analyse the Cloud Costs from their monthly bills. Yes, the cloud costs seem to be going down, but is this due to the commitments in Reserved Instances or Savings Plans?

Reserved Instances

This gets even more complicated because Reserved Instances can be set to a sharing mode. Once in sharing mode, the Reserved Instance savings can be applied to any account in the organization, i.e.

The account that originally purchased the Reserved Instance receives the discount first. If the purchasing account doesn't have any instances that match the terms of the Reserved Instance, the discount for the Reserved Instance is assigned to any matching usage on another account in the organization.

https://aws.amazon.com/premiumsupport/knowledge-center/ec2-ri-consolidated-billing/

True Costs - Utilization Analysis

If this is the case, how can analyzing costs alone detect anomalies, when such committed pricing mechanisms themselves distort the costs?

Nimbus Stream solves this problem by analyzing utilization as well. Instead of just looking at the costs per service, we look into the quantity utilised for each service, i.e.

  • For EC2, we look at the hours used
  • For S3, we look at the storage consumed
  • ....

Our dimensions for analysis are not limited to costs but also include the quantity of cloud services utilized.
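A small sketch of why the usage dimension matters, using invented figures: if a new commitment discount lowers the EC2 bill while the hours used jump, looking at cost alone would hide the increase. The data layout and the helper names are assumptions for this example.

```python
def percent_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


def usage_vs_cost(previous, current):
    """Return {service: (cost_change_pct, usage_change_pct)}.

    Both inputs map a service name to (cost in dollars, quantity used).
    """
    report = {}
    for service, (cost_now, usage_now) in current.items():
        cost_before, usage_before = previous[service]
        report[service] = (percent_change(cost_before, cost_now),
                           percent_change(usage_before, usage_now))
    return report


# Made-up figures: EC2 usage is in instance hours, S3 usage in GB stored.
january = {"EC2": (1000.0, 2000.0), "S3": (200.0, 5000.0)}
february = {"EC2": (800.0, 3000.0), "S3": (200.0, 5000.0)}

for service, (cost_pct, usage_pct) in usage_vs_cost(january, february).items():
    print(f"{service}: cost {cost_pct:+.0f}%, usage {usage_pct:+.0f}%")
```

Here the EC2 bill falls by 20% while EC2 hours rise by 50%, which is exactly the kind of change that an analysis of cost alone would miss.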

Nimbus Stream - Automated and Fast Detection of Cost Anomalies

Concept in Short

Cloud usage is on demand and scalable. With that comes unpredictable cost.

Nimbus Stream gives you back the predictability by alerting you to abnormalities in your cloud cost at the point when they happen.

Nimbus Stream relies on:

Statistical Analysis of Cost Segmentation for detection of Abnormal Cloud Costs

Automated and Fast

Doing the above manually is painful, time-consuming and unsustainable even on a monthly basis. Nimbus Stream automates the above analysis and runs it at high frequency (daily).

This allows very early detection of Abnormal Cloud Costs!