The universe and the cloud alike have a natural propensity to become disordered, and such disorder leads to fragility. Energy, in the form of optimization and resource management, must be put back into the system to resolve the chaos. We identify three laws that define how cloud scalability impacts enterprise deployments.


The 451 Take

Picture a field of Japanese knotweed that represents the sprawl of an application. Even if you spray weed killer on every part of the infested field, it takes only one tiny root to bring it all back within a few years. No matter how thoroughly you think you've eradicated it, it is bound to return. The only way to fight it is to keep track of it and cut it back regularly – it's not a one-off task. If you leave it, the fightback will be harder than ever as those roots wrap around and interact with each other. And if you don't keep track of its spread, all the surrounding land is under threat and left fragile. Cloud applications are the same: Accept that sprawl will happen, but address it regularly, monitor its progress and don't let it get out of hand.


The second law of thermodynamics essentially says that a system tends to flow from a state of order into a state of disorder. Entropy is the property that describes this disorder. At the microscopic level, this occurs because energy flows, which creates waste, which creates disorder. To resolve this disorder, energy must be spent recreating the order. The same holds at the macroscopic level – arranging a house of cards takes time and effort, but knocking it into a disordered mess takes only a slight gust of wind. Over a long enough time frame, the house of cards is certain to fall down eventually; over that same time frame, the probability that it will spontaneously organize itself into a structure is near zero. This law offers reassurance to parents who reach the end of the day and wonder how their once pristine home became a disaster zone of Lego bricks and semi-digested bananas.

In IT, this entropy has existed for decades. A newly provisioned computer becomes slower over time as new applications are installed, orphaned processes run in the background and disk drives become fragmented. The solution is to clean up the disorder, which requires effort – sometimes as simple as rebuilding the computer from scratch or defragmenting the hard drive. But because of its on-demand nature, cloud creates new challenges, which we have summed up in three laws that we expect will fundamentally impact cloud adoption and growth:
  • Law of Cloud Scalability: A cloud-native application left unmanaged will tend toward greater resource consumption over time.
  • Law of Cloud Entropy: Increasing scale left unmanaged tends toward increasing disorder, complexity and fragility.
  • Law of Cloud Complexity: The longer the period between resolution of disorder, the more effort required to resolve that disorder.

Law of Cloud Scalability: A Cloud-Native Application Left Unmanaged Will Tend Toward Greater Resource Consumption Over Time

Picture a cloud application that can scale up and down. In this initial case, scaling is performed manually by an administrator. In a period of demand, the administrator scales up the application, which takes a small amount of time and effort. The administrator tracks the demand, and scales back the application when that demand is no longer present, which (again) takes a small amount of time.

In the above scenario, the administrator must expend twice as much time to keep control of the application's resources compared with leaving the application to grow. Let's say this occurs time and time again, hundreds or thousands of times. For the application to remain at its original scale, the administrator must remember to scale back every single time. But for the application to have net growth, the administrator only needs to forget once.

The effort needed to scale up and then scale back is greater than the effort needed to scale up alone. Over many cycles, the probability of scaling back every single time is therefore lower than the probability of missing at least one scale-back. As a result, the cloud-native application experiences net growth over time.
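To make the arithmetic concrete, here is a minimal simulation sketch – our own illustration, in which the 1,000 demand cycles and the 1% chance of forgetting a scale-back are assumed values, not figures from the report:

```python
import random

def simulate_sprawl(cycles=1000, miss_rate=0.01, seed=42):
    """Simulate scale-up/scale-down cycles where each scale-back is
    occasionally forgotten. Returns the net growth in resource units."""
    random.seed(seed)
    net_resources = 0
    for _ in range(cycles):
        net_resources += 1              # demand arrives: scale up by one unit
        if random.random() > miss_rate:
            net_resources -= 1          # the administrator remembers to scale back
        # else: the scale-back is forgotten and the resource lingers

    return net_resources

# With a 1% miss rate, roughly 10 units are orphaned on average after 1,000
# cycles; the chance of ending back at zero is about 0.99**1000, i.e. ~0.004%.
print(simulate_sprawl())
```

Even a near-flawless process ends the year with sprawl; only the size of the leftover changes.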

This problem is even more pronounced with multiple administrators, each with their own agenda and personality, and more pronounced still across multiple clouds. The problem also exists with auto-scaling – an automated script can scale up and down, but, like the human administrator, it only needs to fail at scaling back once for there to be a net gain in resources. Even a well-written script or tool is going to fail now and then.

The same applies to object storage, databases and the like. These platforms naturally attract more data, and it is easier to add data and leave it than to add it and also remove or optimize it. Thus, storage platforms also tend toward disorder, and energy must be expended to keep them ordered and to remove orphaned or disordered resources.

The problem is caused by cloud's greatest asset – the ability to scale. In a fixed-capacity system, the infrastructure remains broadly the same. But in cloud, the ability to scale means it will scale, if it can.

What does this mean in practical terms? On-demand cloud resources have a natural propensity to grow in volume. Even if you tightly control who can spin up resources, how much budget they have and how long those workloads live, over a long enough timeframe there will still be some sprawl – these measures only slow it down. To prevent sprawl from the outset, processes would have to be perfect and never fail, since a single failure to scale back results in net growth. Because nothing is perfect and errors always occur, sprawl must happen. The only way to keep it in check is to clean it up regularly. More on this in a moment.

 

Law of Cloud Entropy: Increasing Scale Left Unmanaged Tends Toward Increasing Disorder, Complexity and Fragility

With each new resource, the number of interactions between resources grows far faster than the number of resources itself, and the sprawl increases complexity. Let's say three resources communicate with each other, so there are three interactions.

With each additional resource, the number of potential interactions increases sharply: four resources have six possible interactions, five have 10, and 100 have nearly 5,000. So even a small amount of waste drives a large increase in interactions. These interactions make the application more complex, since effort must be expended to understand them, and it becomes more difficult to resolve issues. The more resources and interactions there are to keep track of, the more disordered things become, unless energy is expended in tracking and resolving this complexity. This can increase fragility.
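As a rough illustration – assuming, for simplicity, that every resource can potentially interact with every other – the pairwise interaction count can be computed directly:

```python
def pairwise_interactions(n):
    """Number of potential pairwise interactions among n resources (n choose 2)."""
    return n * (n - 1) // 2

for n in (3, 4, 5, 10, 100, 1000):
    print(f"{n:>5} resources -> {pairwise_interactions(n):>7} potential interactions")
# 3 -> 3, 4 -> 6, 5 -> 10, 10 -> 45, 100 -> 4,950, 1000 -> 499,500
```

Doubling the number of resources roughly quadruples the number of interactions that must be understood and tracked.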

'But the resources are resilient and duplicated, so it's not fragile,' I hear you cry. However, if the application is a disordered mess, how can the administrator be sure it remains resilient – how can the admin be sure that failures haven't created a single point of failure that is about to break? If there is a one-in-10 chance of any resource failing, then the greater the number of resources, the more failures there will be. As complexity increases, so does the risk that a problem has gone unnoticed. Effort must be made to track this complexity, and the application must be built to be anti-fragile. It's fine to say, 'We started with two load balancers, so we know we're resilient,' but will that still be true months after deployment? If not, who will know? With a complex application of hundreds of thousands of interactions, how can anyone ascertain the effect of one of those interactions becoming bottlenecked? Will it go into failover and be irrelevant, or will it bring down the whole application? The study of this emergent, unpredictable behavior is known as complexity theory. As an analogy, which is more fragile: a house of cards built of three cards, or one built of two layers of one hundred cards? And which is easier to track and rebuild?

 

Law of Cloud Complexity: The Longer the Period between Resolution of Disorder, the More Effort Required to Resolve that Disorder

Let's say we clean up a resource every time we notice it is orphaned. It takes one minute to identify that it is orphaned and remove it, which includes checking that none of its interactions are still needed (accidentally removing a live resource could bring down the application). If this occurs 100 times, and we clean up after each occurrence, the total effort is 100 minutes. Each time, we must check the resource and its interactions to ensure that removing it does not leave behind a fragile single point of failure.

If we instead wait 100 turns before cleaning up, it will take nearly 500 minutes, as a result of the increasing number of interactions. Essentially, things become so tangled that it takes far longer to untangle them. As anyone who has brushed their daughter's hair after four days of rough and tumble can tell you: you wish you had made the effort to brush it a little every day.
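The roughly fivefold increase can be reproduced with a simple model – a sketch only, in which the 0.08-minute cross-check overhead per pair of orphans is an assumed parameter – where each deferred orphan must be checked against every other orphan that has accumulated:

```python
def cleanup_minutes(orphans, base_minutes=1.0, cross_check_minutes=0.08):
    """Total effort to clean up a batch of orphaned resources when each one
    must also be cross-checked against every other orphan in the batch."""
    pairwise_checks = orphans * (orphans - 1) // 2
    return orphans * base_minutes + pairwise_checks * cross_check_minutes

print(cleanup_minutes(1) * 100)   # clean up immediately, 100 times: 100 minutes
print(cleanup_minutes(100))       # defer for 100 turns, clean up once: ~496 minutes
```

The exact numbers depend on the assumed overhead, but the shape of the curve does not: deferred cleanup grows with the square of the backlog, not linearly.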


Practical Advice

Policy-based control of who can spin up resources can limit the scalability-complexity-fragility combination. But even with such controls, there is a natural tendency to grow. As a result, optimization must take place on an ongoing basis – the more regularly, the better. Optimization limits costs by containing sprawl, and it can also reduce fragility by making it easier to identify single points of failure.

Use role-based access to slow down sprawl. AWS, Google, Azure, IBM, Oracle, Alibaba and nearly all other cloud providers allow provisioning of resources to be restricted to specific users through role-based access controls.

Use third-party tools to further restrict provisioning based on policy or to define blueprints, again with a view to slowing down sprawl. Examples include Terraform, AWS CloudFormation, SaltStack and Ansible Tower. As we've shown, this will reduce sprawl, but won't stop it altogether.

When migrating a legacy application, ensure that you understand interactions between resources and clean them up. Use tools such as those provided by Cloudamize, Movere, Turbonomic, PlateSpin, Bitnami or RISC Networks. This can reduce complexity from the outset.

Use proactive tools to monitor, alert and clean up waste. AWS has its Trusted Advisor, and Microsoft recently acquired Cloudyn, but there are several third-party tools, including OpsRamp, VMware's vRealize and CloudHealth, CloudCheckr, Cloudability, ParkMyCloud, DivvyCloud, GorillaStack, Skeddly, CloudMGR, Yotascale, Spotinst, and FittedCloud.
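As a sketch of what proactive waste detection might look like in practice – our own example using boto3, and not an implication that any of the tools above work this way – the following lists unattached EBS volumes, one common form of orphaned resource:

```python
import boto3

def find_unattached_volumes(region="us-east-1"):
    """Return EBS volumes in the 'available' state, i.e. not attached to any
    instance and therefore candidates for review and cleanup."""
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [(v["VolumeId"], v["Size"]) for v in response["Volumes"]]

if __name__ == "__main__":
    for volume_id, size_gb in find_unattached_volumes():
        print(f"Orphaned volume {volume_id}: {size_gb} GiB")
```

In keeping with the laws above, an actual cleanup process would still need to verify that each flagged resource's interactions are genuinely gone before deleting it.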

Where complexity must occur due to the nature of applications, use appropriate tools to orchestrate, monitor, and resolve resources and their interactions. There are many open source projects expressly designed to manage cloud-scalable projects that evolve into being complex applications, including Kubernetes, Prometheus, OpenTracing, Fluentd, gRPC, containerd, rkt, CNI, Envoy, Jaeger, Notary, TUF, Vitess, CoreDNS, NATS and Linkerd.

Many of the aforementioned tools and vendors span several of these areas, and some cloud management platforms can orchestrate across them. Such platforms include HPE's OneSphere, Red Hat's OpenShift, VMware's vRealize, Micro Focus's Hybrid Cloud Management, IBM MultiCloud Manager and Google Cloud Services Platform, among others.
Owen Rogers
Research Director, Digital Economics Unit

As Research Director, Owen Rogers leads the firm's Digital Economics Unit, which helps customers understand the economics behind digital and cloud technologies so they can make informed choices when costing and pricing their own products and services, as well as those of their vendors, suppliers and competitors. He also architected the Cloud Price Index.

Al Sadowski
Research Vice President - Voice of the Service Providers

Al is responsible for 451 Research’s Voice of the Service Provider offering. He focuses on tracking and analyzing service provider adoption of emerging infrastructure, spanning compute, storage, networking and software-defined infrastructure.
Jean Atelsek
Analyst, Cloud Price Index

Jean Atelsek is an analyst for 451 Research’s Digital Economics Unit, focusing on cloud pricing in the US and Europe. Prior to joining 451 Research, she was an editor at Ovum, spiffing up reports, forecasts and data tools covering telecoms and service providers, fixed and wireless networks, and consumer technology among other topics. 
