The rapid, global shift to remote work, along with surges in online learning, gaming, and video streaming, is generating record-level internet traffic and congestion. Organizations must deliver consistent connectivity and performance to ensure systems and applications remain functional, and business moves forward, during this challenging time. System resilience has never been more critical to success, and many organizations are taking a closer look at their approach for this and future crises that may arise.

While business continuity considerations are not new, technology has evolved from even a few years ago. Enterprise architecture has become increasingly complex and distributed. Where IT teams once primarily provisioned backup data centers for failover and recovery, there are now many layers and points of leverage to consider to manage dynamic and distributed infrastructure footprints and access patterns. When approached strategically, each layer offers powerful opportunities to build in resilience.

Diversify cloud providers

Elastic cloud resources empower organizations to quickly spin up new services and capacity to support surges in users and application traffic, such as intermittent spikes from specific events or sustained heavy workloads created by a suddenly remote, highly distributed user base. While some may be tempted to go “all in” with a single cloud provider, this approach can result in costly downtime if the provider goes offline or experiences other performance issues. This is especially true in times of crisis. Companies that diversify cloud infrastructure by using two or more providers with distributed footprints can also significantly reduce latency by bringing content and processing closer to users. And if one provider experiences problems, automated failover systems can minimize the impact on users.
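To make the failover idea concrete, here is a minimal sketch in Python. It assumes each provider exposes some kind of health probe; the `Provider` class, the provider names, and the endpoints are hypothetical and stand in for whatever checks an organization already runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Provider:
    name: str                       # hypothetical label, e.g. "cloud-a"
    endpoint: str                   # hypothetical entry point for this provider
    is_healthy: Callable[[], bool]  # health probe supplied by the caller


def pick_provider(providers: Sequence[Provider]) -> Optional[Provider]:
    """Return the first provider, in priority order, whose probe passes."""
    for provider in providers:
        try:
            if provider.is_healthy():
                return provider
        except Exception:
            # A probe that raises is treated the same as an unhealthy provider.
            continue
    return None  # every provider failed its probe


if __name__ == "__main__":
    providers = [
        Provider("cloud-a", "https://a.example.com", lambda: False),  # simulated outage
        Provider("cloud-b", "https://b.example.com", lambda: True),
    ]
    chosen = pick_provider(providers)
    print(chosen.name if chosen else "no healthy provider")
```

In practice the probe would be a real check (an HTTP request, a synthetic transaction) and the decision would usually live in a load balancer or DNS layer rather than in application code.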

Build in resiliency at the DNS layer

As the first stop for all application and internet traffic, the domain name system (DNS) layer is a critical place to build in resiliency. Similar to the cloud strategy, companies should implement redundancy with an always-on, secondary DNS that does not share the same infrastructure. That way, if the primary DNS fails under duress, the redundant DNS picks up the load so queries do not go unanswered. Using an anycast routing network also helps ensure that DNS requests are dynamically diverted to an available server when there are global connectivity issues. Companies with modern computing environments should also employ DNS with the speed and flexibility to scale with infrastructure in response to demand, and automate DNS management to reduce manual errors and improve resiliency under rapidly evolving conditions.
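The value of a secondary DNS can be illustrated with a simplified model. The Python sketch below is not a real resolver: each "nameserver" is just a hypothetical callable that returns addresses or raises, and `resolve` is an illustrative helper that tries them in turn, roughly as a resolver does when a zone is served by two independent providers.

```python
from typing import Callable, List, Sequence

# A "nameserver" here is any callable that returns addresses or raises.
Nameserver = Callable[[str], List[str]]


def resolve(hostname: str, nameservers: Sequence[Nameserver]) -> List[str]:
    """Ask each nameserver in turn and return the first non-empty answer."""
    failures = 0
    for nameserver in nameservers:
        try:
            answer = nameserver(hostname)
        except Exception:
            failures += 1  # timeout, refusal, etc.: move on to the next server
            continue
        if answer:
            return answer
    raise LookupError(
        f"no nameserver answered for {hostname} ({failures} raised errors)"
    )


if __name__ == "__main__":
    def primary(hostname: str) -> List[str]:
        raise TimeoutError("primary DNS is overloaded")  # simulated failure

    def secondary(hostname: str) -> List[str]:
        return ["192.0.2.10"]  # address from the documentation range

    print(resolve("app.example.com", [primary, secondary]))
```

The query still gets an answer because the secondary does not depend on the failed primary, which is the point of keeping the two on separate infrastructure.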

Build flexible, scalable applications with microservices and containers

The emergence of microservices and containers puts resiliency front and center for application developers because they must determine early on how systems interact with each other. The componentized nature makes applications more resilient. Outages tend to affect individual services rather than an entire application, and since these containers and services can be programmatically replicated or decommissioned within minutes, problems can be quickly remediated. Because deployment is programmable and quick, it is easy to spin up or deactivate instances in response to demand and, as a result, rapid auto-scaling capabilities become an intrinsic part of business applications.
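A simple way to picture auto-scaling is as a sizing rule that runs on a schedule. The Python sketch below assumes a hypothetical service where each replica can handle a known number of requests per second; the function name, parameters, and numbers are illustrative and not taken from any particular orchestrator.

```python
import math


def desired_replicas(
    current_rps: float,
    rps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Size the service for the current load, within fixed bounds."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    needed = math.ceil(current_rps / rps_per_replica)
    # Clamp so the service never scales to zero or beyond its budget.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # 950 requests/s at 100 requests/s per replica needs 10 replicas.
    print(desired_replicas(950, 100, min_replicas=2, max_replicas=20))
```

The lower bound keeps some redundancy during quiet periods, and the upper bound caps cost during a spike.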

Additional best practices

In addition to the strategies above, here are a few additional techniques that enterprises can use to proactively improve resilience in distributed systems.

Start with new technology

Enterprises should introduce resilience in new applications or services first and use a progressive approach to test functionality. Assessing new resiliency measures on a non-business-critical application or service is less risky and allows for some hiccups without impacting users. Once proven, IT teams can apply their learnings to other, more critical systems and services.

Use traffic steering to dynamically route around problems

Internet infrastructure can be unpredictable, especially when world events are driving unprecedented traffic and network congestion. Companies can minimize risk of downtime and latency by implementing traffic management strategies that incorporate real-time data about network conditions and resource availability with real user measurement data. This enables IT teams to deploy new infrastructure and manage the use of resources to route around problems or accommodate unexpected traffic spikes. For example, enterprises can tie traffic steering capabilities to VPN access to ensure users are always directed to a nearby VPN node with sufficient capacity. As a result, users are shielded from outages and localized network events that would otherwise interrupt business operations. Traffic steering can also be used to rapidly spin up new cloud instances to increase capacity in strategic geographic locations where internet conditions are chronically slow or unpredictable. As a bonus, teams can set up controls to steer traffic to low-cost resources during a traffic spike or cost-effectively balance workloads between resources during periods of sustained heavy usage.
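The VPN example can be sketched as a small selection rule. The Python code below assumes latency figures come from real user measurements and that each node reports its session count; the `VpnNode` fields, the node names, and the 80 percent utilization cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class VpnNode:
    name: str             # hypothetical node label
    latency_ms: float     # e.g. from real user measurement data
    active_sessions: int
    max_sessions: int


def steer(nodes: Sequence[VpnNode], max_utilization: float = 0.8) -> Optional[VpnNode]:
    """Pick the lowest-latency node that still has spare capacity."""
    candidates = [
        node
        for node in nodes
        if node.max_sessions > 0
        and node.active_sessions / node.max_sessions < max_utilization
    ]
    if not candidates:
        return None  # nothing has headroom; the caller may add capacity
    return min(candidates, key=lambda node: node.latency_ms)


if __name__ == "__main__":
    nodes = [
        VpnNode("fra", latency_ms=12, active_sessions=95, max_sessions=100),  # nearly full
        VpnNode("ams", latency_ms=18, active_sessions=40, max_sessions=100),
        VpnNode("nyc", latency_ms=85, active_sessions=10, max_sessions=100),
    ]
    chosen = steer(nodes)
    print(chosen.name if chosen else "no capacity")
```

Here the closest node is skipped because it is nearly full, so the user lands on the next-nearest node with headroom. A `None` result is a natural trigger for spinning up additional instances.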

Monitor system performance continuously

Tracking the health and response times of every part of an application is an essential aspect of system resilience. Measuring how long an application’s API call takes or the response time of a core database, for example, can provide early indications of what’s to come and allow IT teams to get in front of these obstacles. Companies should define metrics for system uptime and performance, and then continuously measure against these to ensure system resilience.
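As one illustration of measuring against a defined metric, the Python sketch below times a call and checks a set of latency samples against a percentile target. The helper names, the 95th percentile, and the 300 ms threshold are assumptions chosen for the example, and the percentile uses the simple nearest-rank method.

```python
import math
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return its result with the elapsed time in milliseconds."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of the samples (pct between 1 and 100)."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def breaches_target(samples_ms: Sequence[float], pct: int, threshold_ms: float) -> bool:
    """True when the chosen percentile is slower than the target."""
    return percentile(samples_ms, pct) > threshold_ms


if __name__ == "__main__":
    _, elapsed_ms = timed(lambda: sum(range(100_000)))  # stand-in for an API call
    samples = [100.0] * 18 + [900.0, 900.0]
    print(f"call took {elapsed_ms:.2f} ms")
    print("p95 over target:", breaches_target(samples, 95, 300.0))
```

Percentiles are often more useful than averages here, because a handful of very slow responses can hide behind a healthy-looking mean.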

Stress test systems with chaos engineering

Chaos engineering, the practice of intentionally introducing problems to identify points of failure in systems, has become an important component in delivering high-performing, resilient enterprise applications. Intentionally injecting “chaos” into controlled production environments can reveal system weaknesses and enable engineering teams to better predict and proactively mitigate problems before they present a significant business impact. Conducting planned chaos engineering experiments can provide the intelligence enterprises need to make strategic investments in system resiliency.
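At its smallest scale, a chaos experiment can be a wrapper that makes a dependency fail on purpose so teams can confirm the fallback path works. The Python sketch below is a toy illustration under that assumption; `with_chaos`, `InjectedFault`, and `call_with_fallback` are hypothetical names, and real experiments typically target infrastructure rather than a single function.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised by the wrapper to simulate a dependency failure."""


def with_chaos(
    fn: Callable[[], T],
    failure_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[[], T]:
    """Wrap fn so that a fraction of calls fail with InjectedFault."""
    source = rng or random.Random()

    def wrapper() -> T:
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always fails.
        if source.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return fn()

    return wrapper


def call_with_fallback(fn: Callable[[], T], fallback: T) -> T:
    """Return fn's result, or the fallback if a fault was injected."""
    try:
        return fn()
    except InjectedFault:
        return fallback


if __name__ == "__main__":
    flaky_lookup = with_chaos(lambda: "live data", failure_rate=1.0)
    print(call_with_fallback(flaky_lookup, "cached data"))
```

With the failure rate set to 1.0 every call fails, so the caller should always fall back to cached data. If it does not, the experiment has found a weakness before a real outage did.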

Network impact from the current pandemic highlights the continued need for investment in resilience. Because this crisis may have a lasting impact on the way businesses operate, forward-looking organizations should take this opportunity to evaluate how they are building best practices for resilience into each layer of infrastructure. By acting now, they can maintain continuity throughout this unprecedented event and be better prepared to endure future events with minimal impact to the business.

Kris Beevers is CEO and co-founder of NS1