Businesses around the world have been rapidly adopting cloud services, making it clear that the cloud is becoming a strategic necessity. A recent McKinsey report projects that cloud computing could generate up to USD 3 trillion in global value by 2030. However, this growth brings a paradox. While cloud services aim to streamline operations, they often lead to chaotic and unpredictable spending patterns, which manifest as spending anomalies. Studies show that as much as 32% of total cloud expenditure is wasted due to inefficiencies and spending anomalies.
Cloud costs vary based on usage, subscriptions, location, reservations, and other factors. Additionally, resource-specific attributes such as instance family, platform, and tenancy play a pivotal role in defining cloud costs. This brings too many variables into play. Consequently, many spending anomalies go undetected, and business teams lack the control and foresight to manage and optimize their cloud spend.
Various tools are available to analyze cloud spend, but most of them offer, at best, dashboards and basic statistical views of spend distributions. They lack sophisticated solutions to detect spending anomalies, assess their impact, and recommend alternate cost-effective solutions.
Spend leakage in cloud resources can manifest in many ways. Hence, the detection of spending anomalies also requires a multi-pronged approach. We propose to analyze cloud spending from three different perspectives: (1) analysis of resource spend, (2) analysis of resource utilization, and (3) analysis of resource pricing models.
Optimization opportunities by analyzing cloud resource spend
Cloud environments often exhibit situations where similar resources incur different costs, leading to cost inefficiencies. These anomalies can stem from various factors, such as cloud service provider, location, subscription plan, and so on. For example, in a multi-cloud setup, a business may unknowingly pay significantly more for virtual machines with similar specifications in one location compared to another.
We address such scenarios by first grouping homogeneous resources, that is, resources that share similar attributes such as resource type, specifications, environment, and other characteristics. This allows us to identify clusters of resources that should exhibit comparable costs. We consider topological constraints as well as user constraints and preferences while creating these homogeneous sets, to ensure that each comparison set is relevant. The sets are defined so that resources within the same set are expected to have similar effective prices. Thus, any significant deviation in effective price within a set can be flagged as a spending anomaly. We apply multivariate clustering algorithms and anomaly detection algorithms to create these homogeneous sets and detect anomalies.
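The idea of comparing effective prices within a homogeneous set can be illustrated with a minimal sketch. It is not the clustering and anomaly detection pipeline described above: it assumes that exact matches on a few attributes are enough to form a set, and it swaps in a simple robust z-score (based on the median absolute deviation) as the outlier test. The field names (id, effective_price), the thresholds, and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import median


def find_price_anomalies(resources, group_keys, threshold=3.5,
                         rel_tol=0.05, min_group_size=3):
    """Flag resources whose effective price deviates from their peer group.

    Assumption: each resource is a dict with "id", "effective_price", and
    the attributes named in group_keys. Grouping is by exact attribute
    match, a crude stand-in for multivariate clustering.
    """
    groups = defaultdict(list)
    for r in resources:
        groups[tuple(r[k] for k in group_keys)].append(r)

    anomalies = []
    for members in groups.values():
        if len(members) < min_group_size:
            continue  # too few peers for a meaningful comparison
        prices = [m["effective_price"] for m in members]
        med = median(prices)
        mad = median(abs(p - med) for p in prices)
        for m in members:
            deviation = abs(m["effective_price"] - med)
            if mad > 0:
                # Robust z-score; 0.6745 rescales the MAD so the score is
                # comparable to a standard deviation for normal data.
                is_outlier = 0.6745 * deviation / mad > threshold
            else:
                # At least half the group shares the median price, so fall
                # back to a relative tolerance around that price.
                is_outlier = deviation > rel_tol * med
            if is_outlier:
                anomalies.append(m["id"])
    return anomalies
```

A caller would pass the attributes that define a homogeneous set, for example find_price_anomalies(resources, ("type", "vcpus")). In practice, the choice of grouping attributes and thresholds would depend on the constraints and preferences mentioned above.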
Once the anomalies are identified, the next step is to generate recommendations. The basic idea is to compare and contrast the attributes of resources that belong to the same homogeneous set yet have different effective prices. To generate these recommendations, we adopt a graph-theoretic approach and use graph mining algorithms to pinpoint the root causes of cost disparities and to provide actionable recommendations for cost reduction. Often, the recommendations involve adjusting configurations, selecting a different provider, or modifying resource attributes to align with the lower-cost group.
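The compare-and-contrast step can be sketched in a much simpler form than graph mining. The example below is only a plain attribute comparison, under the assumption that the lower-priced peers of an anomalous resource are already known: it reports every candidate attribute on which the anomalous resource differs from the most common value among those peers. The function name and attribute names are hypothetical.

```python
from collections import Counter


def explain_price_gap(anomaly, cheaper_peers, candidate_attrs):
    """Return {attribute: (anomaly_value, typical_peer_value)} for every
    attribute where the anomalous resource differs from its cheaper peers.

    Assumption: cheaper_peers is a non-empty list of dicts that carry the
    same attributes as the anomalous resource.
    """
    differences = {}
    for attr in candidate_attrs:
        # The most frequent value among cheaper peers is treated as the
        # "typical" low-cost configuration for this attribute.
        typical, _count = Counter(p[attr] for p in cheaper_peers).most_common(1)[0]
        if anomaly[attr] != typical:
            differences[attr] = (anomaly[attr], typical)
    return differences
```

Each reported difference is a candidate root cause and a possible recommendation, such as moving the resource to the region used by the lower-cost group. A difference alone does not prove that the attribute causes the price gap, so the output is best treated as a shortlist for review.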
Optimization opportunities by analyzing cloud resource utilization
Analyzing resource utilization helps businesses address inefficiencies caused by mismanagement of cloud resources. Over-provisioning arises for various reasons: some resources are over-provisioned from the start, while the utilization of others diminishes over time due to application upgrades or changes in business logic. Detecting these inefficiencies can provide opportunities for cost optimization.
Various approaches can be applied to detect under-utilized resources. Simplistic approaches that estimate the available resource headroom using simple statistical functions, such as an average or a peak, are often misleading. A better understanding of under-utilized resources can be obtained by capturing behavioral changes, understanding steady states, mining temporal patterns, and projecting trends.
Once the under-utilized resources are identified, different types of recommendations can be generated.
- Resources showing consistently high headroom (that is, consistently low utilization) can be recommended for a scale-down.
- Resources that sit idle at recurring intervals can be recommended for auto-shutdown.
- Resources demonstrating recurring intervals of high and low utilization can be recommended for auto-scaling.
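The three recommendation types above can be sketched as a small rule-based classifier. This is a deliberately crude illustration: it assumes hourly CPU utilization (in percent) is available as one 24-value profile per day, averages the profiles across days to approximate recurring patterns, and uses hypothetical thresholds. It does not capture behavioral changes, steady states, or trends.

```python
from statistics import mean


def recommend_action(daily_profiles, idle=2.0, low=20.0, high=70.0):
    """Map a utilization history to a recommendation type.

    Assumption: daily_profiles is a non-empty list of days, each a list of
    24 hourly utilization percentages. Thresholds are illustrative only.
    """
    # Average each hour of the day across all days to expose recurring
    # daily patterns while smoothing one-off spikes.
    hourly = [mean(day[h] for day in daily_profiles) for h in range(24)]
    peak, trough = max(hourly), min(hourly)

    if peak < low:
        return "scale-down"      # utilization is low around the clock
    if trough <= idle:
        return "auto-shutdown"   # some hours are recurrently idle
    if peak >= high and trough < low:
        return "auto-scaling"    # recurring swings between high and low
    return "no-change"
```

For example, a resource that runs at about 60% during business hours and sits idle overnight would be classified as an auto-shutdown candidate, while one that alternates between 85% and 10% every day would be classified as an auto-scaling candidate.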
Resource consolidation is another key strategy, one that merges under-utilized resources. Instead of allowing multiple low-utilization resources to function independently, we propose consolidating them into fewer, optimally utilized resources. We have developed a novel multi-dimensional bin-packing algorithm that generates resource consolidation recommendations by considering resource utilization, its temporal patterns, and preferences regarding location, vendor, subscription, and so on.
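To make the consolidation idea concrete, the following sketch applies a textbook first-fit-decreasing heuristic to two-dimensional demands (CPU and memory). It is not the novel algorithm mentioned above: it ignores temporal patterns and preferences about location, vendor, and subscription, and it assumes every target host has the same capacity. All names and units are hypothetical.

```python
def consolidate(demands, capacity):
    """Pack workloads onto as few identical hosts as the heuristic finds.

    demands:  {workload_name: (cpu, memory)} peak demand per workload
    capacity: (cpu, memory) of one target host
    Returns a list of hosts, each a list of workload names.
    """
    def size(demand):
        # Largest fraction of host capacity used in any single dimension.
        return max(d / c for d, c in zip(demand, capacity))

    hosts = []  # each host tracks its remaining capacity and its members
    # Place the largest workloads first (first-fit decreasing).
    for name, demand in sorted(demands.items(),
                               key=lambda kv: size(kv[1]), reverse=True):
        if size(demand) > 1:
            raise ValueError(f"{name} does not fit on a single host")
        for host in hosts:
            if all(d <= f for d, f in zip(demand, host["free"])):
                host["free"] = [f - d for f, d in zip(host["free"], demand)]
                host["members"].append(name)
                break
        else:
            # No existing host has room in every dimension: open a new one.
            hosts.append({
                "free": [c - d for c, d in zip(capacity, demand)],
                "members": [name],
            })
    return [host["members"] for host in hosts]
```

First-fit decreasing is a heuristic and does not guarantee the minimum number of hosts. A production approach would also need to check that workloads placed together do not peak at the same time, which is where temporal patterns come in.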
Optimization opportunities by analyzing cloud resource pricing models
Many businesses rely heavily on on-demand pricing, which offers flexibility but comes at a higher cost. For businesses with consistent and predictable workloads, analyzing and selecting the right pricing model, such as Reserved Instances (RIs) or savings plans, can lead to significant savings. For instance, a business running multiple VMs on an on-demand basis for sustained workloads may unknowingly pay a premium, despite having predictable usage that could benefit from reserved pricing.
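The trade-off between on-demand and reserved pricing comes down to simple arithmetic, sketched below. The sketch assumes a reservation is billed for every hour of the term whether or not the resource runs, and that both prices are expressed as effective hourly rates. The prices in the usage note are made up for illustration and are not real provider rates.

```python
def break_even_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of the term a resource must run for a reservation to
    cost the same as paying on demand."""
    return reserved_hourly / on_demand_hourly


def reservation_savings(on_demand_hourly, reserved_hourly,
                        hours_used, hours_in_term=8760):
    """Savings (positive) or loss (negative) from reserving for one term.

    Assumption: on-demand cost accrues only for hours_used, whereas the
    reservation is paid for all hours_in_term (8760 hours is one year).
    """
    on_demand_cost = on_demand_hourly * hours_used
    reserved_cost = reserved_hourly * hours_in_term
    return on_demand_cost - reserved_cost
```

With a hypothetical on-demand rate of 0.10 per hour and a reserved rate of 0.06 per hour, the break-even point is 60% utilization: a VM that runs around the clock saves money under the reservation, while one that runs only 40% of the time loses money.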
Our approach focuses on identifying resources with consistent utilization patterns. Traditional statistical approaches, such as analyzing the interquartile range, do not work well on noisy real-world data with skewed distributions. We propose a novel algorithm that analyzes historical behavior and identifies resources with consistent and predictable utilization, making them suitable candidates for Reserved Instances (RIs). As a next step, we apply constraints such as tenancy, platform, instance type, location, and instance size to match these resources with the best possible RI options.
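As a rough illustration of what "consistent and predictable utilization" could mean in code, the sketch below checks what fraction of historical samples falls within a band around the median. This is a simple stand-in and not the novel algorithm referred to above. The band width, the minimum fraction, and the minimum utilization level are hypothetical parameters.

```python
from statistics import median


def is_ri_candidate(utilization, band=0.25, min_fraction=0.9, min_level=10.0):
    """Return True if utilization looks steady enough to consider reserving.

    utilization: non-empty list of historical utilization samples (percent).
    A resource qualifies when its median utilization is at least min_level
    and at least min_fraction of samples lie within +/- band * median of
    the median. Occasional spikes are tolerated by design.
    """
    med = median(utilization)
    if med < min_level:
        return False  # mostly idle: a scale-down candidate, not an RI one
    within = sum(1 for u in utilization if abs(u - med) <= band * med)
    return within / len(utilization) >= min_fraction
```

A resource that hovers around 50% with a single spike would pass this check, while one that alternates between near-idle and near-saturated would not. Resources that pass would then be matched against RI options using the constraints listed above.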
Addressing real-world constraints for cloud spending
When recommending cloud cost optimization strategies, various constraints such as performance, security, and business continuity must be carefully considered to ensure that cost savings do not compromise service quality or regulatory compliance. For instance, the spend on a resource may be reduced by changing its location, but the move could increase latency or violate regional data compliance laws. Similarly, consolidating resources across different regions could affect compliance with local data protection regulations, and auto-scaling during peak periods could disrupt critical services. Furthermore, when committing to Reserved Instances (RIs), it is important to account for the commitment period, as misjudging workload predictability can result in over-committing to unnecessary resources, leading to potential losses.
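One way to picture how such constraints interact with recommendations is as a filter applied before anything is shown to the user. The sketch below assumes two illustrative constraints, a data-residency allow-list and a latency budget, and a hypothetical recommendation record with resource_id, target_region, and expected_latency_ms fields. Real constraints would be richer and specific to each business.

```python
def filter_recommendations(recommendations, allowed_regions, max_latency_ms):
    """Split relocation recommendations into accepted and rejected.

    Returns (accepted_ids, rejected), where rejected is a list of
    (resource_id, [reasons]) so that each rejection can be explained.
    """
    accepted, rejected = [], []
    for rec in recommendations:
        reasons = []
        if rec["target_region"] not in allowed_regions:
            reasons.append("target region violates data-residency policy")
        if rec["expected_latency_ms"] > max_latency_ms:
            reasons.append("expected latency exceeds budget")
        if reasons:
            rejected.append((rec["resource_id"], reasons))
        else:
            accepted.append(rec["resource_id"])
    return accepted, rejected
```

Keeping the reasons for each rejection makes it possible to show users why a cheaper option was not recommended, rather than silently dropping it.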
Conclusion
Spending anomalies in cloud operations manifest in different ways. They often go unnoticed by simple data analysis techniques and require comprehensive solutions. Sophisticated analytical solutions do not just optimize the current cloud spend; they also offer better foresight and control of cloud costs.