This ETL works from the raw AWS Cost & Usage Report (CUR). It is implemented as a series of SQL statements with no out-of-band processing. See aws/dag-full-load.txt for the order of queries.
See also this very long article about cloud cost control in general and how to use Cloudstats on AWS in particular.
The hope is that Cloudstats gives you a clean basis for your capacity planning and efficiency work. This work is based on experience with large AWS accounts over many years. Its accuracy is not guaranteed, but there are built-in ways to check its completeness and consistency.
There are two main output tables:
- `cloudstats` is a per-day aggregation and cleanup of only the AWS usage lineitems, with fully-amortized, accrual-basis cost measurements.
- `cloudstats_bill` is a per-month, cash-basis aggregation of all lineitems (including taxes, credits, and so on), reverse-engineered from the observed behavior of official AWS invoices. (See the code for more details.) It is meant to predict the official invoice to within 1%. It also provides much more detail, e.g. explanations of individual Taxes and Credits, and is updated daily. A rough reconciliation of the two tables is sketched just after this list.
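To check how well the two tables agree, you can roll `cloudstats` up to months and compare it against `cloudstats_bill`. This is only a sketch: the column names (`usage_date`, `amortized_cost`, `invoice_month`, `invoiced_cost`) are assumptions, and the two will never match exactly, since `cloudstats` excludes taxes and credits; the gap should, however, be stable month to month.

```sql
-- Hypothetical reconciliation: accrual-basis daily usage vs. the
-- cash-basis monthly bill. All column names are illustrative only.
WITH usage_by_month AS (
    SELECT date_trunc('month', usage_date) AS month,
           SUM(amortized_cost)             AS accrual_cost
    FROM cloudstats
    GROUP BY 1
),
bill_by_month AS (
    SELECT invoice_month      AS month,
           SUM(invoiced_cost) AS cash_cost
    FROM cloudstats_bill
    GROUP BY 1
)
SELECT u.month,
       u.accrual_cost,
       b.cash_cost,
       u.accrual_cost - b.cash_cost AS difference
FROM usage_by_month u
JOIN bill_by_month b USING (month)
ORDER BY u.month;
```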
There are several views laid on top of these main tables for specialized analyses:
- `cloudstats_bill_iceberg` does further attribution of discounts and credits, for example backing out the separate effects of EDP, private rates, and reserved instance usage.
- `cloudstats_cpu_ratecard` renders the historical per-CPU-hour rates you paid, aggregated on instance type and `pricing_regime`. This includes percentile bands for real-world Spot prices, amortized RIs and Savings Plans, and so on. (An example query follows this list.)
- `cloudstats_buying_efficiency` attempts to show new opportunities for using RI, SP, and Spot based on historical OnDemand usage.
- `cloudstats_wow_cost_movers` does week-over-week comparisons of spend, broken out by a few AWS and client-specific dimensions, to find interesting gainers and losers.
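As an illustration of how the ratecard view might be read, the query below compares effective per-CPU-hour rates across pricing regimes for one instance family. The column names (`month`, `pricing_regime`, `instance_spec`, `usd_per_cpu_hour_p50`) are guesses, not the view's actual schema.

```sql
-- Illustrative read of the ratecard: median per-CPU-hour rate paid, by
-- month and pricing regime, for the c5 family. Column names are guesses.
SELECT month,
       pricing_regime,
       instance_spec,
       usd_per_cpu_hour_p50
FROM cloudstats_cpu_ratecard
WHERE instance_spec LIKE 'c5.%'
ORDER BY month, pricing_regime, instance_spec;
```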
There are several dimension tables for client-specific needs and to fix up hilarious data bugs in the raw table:
- `cloudstats_dim_account_names.sql` is a list of your internal account aliases.
- `cloudstats_dim_pricing_buckets.sql` maps the cleaned-up `pricing_unit` field into a handful of buckets like 'Compute', 'Storage', and so on.
- `cloudstats_dim_instance_specs.sql` provides machine specs based on an artificial field called `instance_spec`. The CUR's original `instance_type` field has lots of junk in it, and many records in the CUR have missing values for important fields like physical processor. For example, an `instance_type` of "ml.c5.xlarge-Training" will have an `instance_spec` of "c5.xlarge". (A sketch of this normalization follows this list.) This table is compiled from public sources and should be updated from time to time.
- `cloudstats_dim_aws_products.sql` maps `product_code` to the official product name as it appears in the AWS bill. Not all of the CUR's lineitems have this information (!). This is also hand-compiled from public sources and may be incomplete.
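The normalization behind `instance_spec` looks roughly like the following. The real rules live in the ETL and the dim table; the prefixes and suffixes handled here, and the `raw_cur` table name, are placeholders for illustration.

```sql
-- Rough sketch of deriving instance_spec from the raw instance_type, per
-- the "ml.c5.xlarge-Training" -> "c5.xlarge" example above. The actual
-- prefix/suffix list is longer than shown here.
SELECT DISTINCT
       instance_type,
       regexp_replace(
           regexp_replace(instance_type, '^(ml\.|db\.|cache\.)', ''),  -- strip service prefixes
           '-(Training|Hosting|Notebook).*$', ''                       -- strip usage-type suffixes
       ) AS instance_spec
FROM raw_cur;
```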
- The incremental load re-writes the previous 65 days of data, each day. This is because the CUR is a running account, not an immutable log. The source records in your raw CUR table may be updated or added weeks after the fact. For example, the `invoice_id` field is null until after the month's accounting close, usually 3 days after the end of the calendar month. Once the invoice is generated by AWS, all of those records are updated with the invoice ID. Credits and monthly charges are often added post-hoc. (A sketch of the rolling rewrite follows this list.)
- The first two stages of the dag, `cloudstats_00` and `cloudstats_01`, are implemented as SQL views. There are no staging tables. This was for speed of development. This ETL was originally written for columnstore databases with partitioning & predicate pushdown (e.g., Trino or Athena), so the views were not much of a performance problem. On Redshift this pattern is arguably a bad idea. See the `where` clause at the bottom of `cloudstats_00` for more notes.
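A sketch of the rolling rewrite, under the assumption that the final table is keyed by a `usage_date` column and that `cloudstats_01` exposes the same columns; the real statements (and the exact dialect for the date arithmetic) live in the dag.

```sql
-- Drop and rebuild the trailing 65 days, since late-arriving CUR updates
-- (invoice IDs, credits, monthly fees) land weeks after the fact.
-- Names and the Trino-style interval syntax are illustrative.
DELETE FROM cloudstats
WHERE usage_date >= current_date - INTERVAL '65' DAY;

INSERT INTO cloudstats
SELECT *
FROM cloudstats_01
WHERE usage_date >= current_date - INTERVAL '65' DAY;
```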
My general opinion is that you can't (or shouldn't) do automated anomaly detection until you have a very good idea of what is normal. A simple WoW comparator will very likely trigger tons of false positives. Simplistic thresholds will also generate alarm fatigue. Who cares if the foobar service increased by 212% (ZOMG!) when its cost basis is five bucks?
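One cheap guard against that kind of noise is to require both a relative move and an absolute dollar floor before alerting. A minimal sketch, assuming `cloudstats` has `usage_date`, `service`, and `amortized_cost` columns (it may not), with arbitrary thresholds:

```sql
-- Week-over-week comparator with both a percentage and a dollar floor,
-- to avoid "212% of five bucks" alerts. Column names are assumptions.
WITH weekly AS (
    SELECT date_trunc('week', usage_date) AS week,
           service,
           SUM(amortized_cost) AS cost
    FROM cloudstats
    GROUP BY 1, 2
),
paired AS (
    SELECT week,
           service,
           cost,
           lag(cost) OVER (PARTITION BY service ORDER BY week) AS prior_cost
    FROM weekly
)
SELECT week, service, cost, prior_cost
FROM paired
WHERE prior_cost IS NOT NULL
  AND abs(cost - prior_cost) >= 500                            -- dollar floor
  AND abs(cost - prior_cost) / nullif(prior_cost, 0) >= 0.25   -- 25% move
ORDER BY abs(cost - prior_cost) DESC;
```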
My favorite example is storage cost. You will find exhaustive complaints about AWS's 31-day pricing logic in the comments. The main cloudstats table works hard to smooth it out.
But storage also comes in tiers, which are charged in descending order during the month. If you have a PB of data in S3, the first day of the month will show a little spike in total cost and effective rate paid. The following few days will show a lower rate, then the rest of the month will plateau at the lowest rate. Next month the cycle starts again. Even a z-score can be fooled by that.
Even worse, the automatic processes that move data to and from cheaper tiers can happen at any time. Storage anomaly detection should be run against usage & storage tier (e.g. `storage_total_tb` and `storage_class`) and not cost. And even then it needs to be smart enough to not pull the fire alarm for "normal" tier shifts.
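A sketch of that approach, tracking terabytes by tier rather than dollars; the `storage_total_tb` and `storage_class` names follow the text above, but the rest of the shape is assumed:

```sql
-- Watch stored TB per tier and its day-over-day change, instead of cost.
-- A "normal" lifecycle transition shows up as offsetting deltas across tiers.
SELECT usage_date,
       storage_class,
       SUM(storage_total_tb) AS tb_stored,
       SUM(storage_total_tb)
         - lag(SUM(storage_total_tb)) OVER (PARTITION BY storage_class
                                            ORDER BY usage_date) AS tb_delta
FROM cloudstats
GROUP BY usage_date, storage_class
ORDER BY usage_date, storage_class;
```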
Another pitfall is changing tags. Engineers often rename their clusters & services on their own schedule and for their own reasons. A simplistic detector grouped by service name will panic twice: when the foobar service suddenly drops to zero and when the new foobar_v2 service pops out of nowhere with significant cost.
There is no general solution to anomaly detection over changing infrastructure. All I can recommend is to put a human in the loop before you email the world.