The Definitive Guide to Performance Testing in Production

Wednesday, March 29, 2023

Testing in Production, also known as TIP, is a shift-right testing methodology where new code, features, and releases are tested in the production environment. With today’s demanding digital applications and websites deployed over complex infrastructure, testing and validating production environments has become a best practice. TIP goes beyond traditional staging environments, which can never fully replicate the conditions that exist in production. This is even more important when it comes to Performance and Load Testing. This guide will take you through the various aspects of Performance Testing in Production.

Introduction to Testing in Production

Traditional performance testing is performed prior to moving to production and often even much earlier with the “Shift Left” mindset to be aligned with common DevOps procedures. However, recent developments are showing that “Shifting Right” and executing performance testing in production have other benefits that help elevate quality and customer satisfaction.

The main reason for this is that it’s not easy to replicate production conditions in Dev-Test-Stage environments. So the best way to ensure that your production environment is up to the challenge is to test it in production! Also, the increasing velocity of releases due to DevOps and CI/CD practices often leaves room for only periodic, component-based load testing in staging, while thorough system-wide performance testing complemented by long runs can realistically be executed only in production.

In addition to these factors, the only way to verify readiness for an important event or an unexpected traffic surge is to thoroughly test it in production. Today it’s common practice for websites and infrastructure to be scaled up and down multiple times, often on a daily basis, which makes performance testing in production more than just a recommendation.

With competition growing, organizations cannot afford a bad user experience or poor code integrity, which can quickly escalate into brand damage, higher customer churn, and financial losses. It doesn’t matter if you are a small business or a big company: you need to make sure there are no outages, sluggish performance, or malfunctions in production.

There are multiple scenarios where organizations are facing roadblocks with their infrastructure and website performance testing. Here are a few examples:

  • Integrating Third-Party Applications – More and more companies are embracing SaaS sales and marketing applications like analytics tools, chatbots, checkout functionality, and more to improve their bottom line. Unfortunately, integrating this external code can often have a direct impact on performance, even if everything seems fine in staging.
  • Special Events – The holiday season is something every performance engineer dreads. This is because it is extremely difficult to simulate usage patterns, activity volumes, and network parameters in the staging environments. The same applies to pandemics, wars, or political instability that can have a sudden effect on user behavior.
  • Other Hidden Issues – Testing professionals and engineers in traditional setups have a hard time locating performance issues in time, before the move to production. These issues can involve load balancers, application code, web servers, and the database itself. More often than not, pre-production testing is concluded before the problems are detected due to time-to-market constraints and the need to increase release velocity.

Performance testing in production is extremely beneficial as it helps organizations analyze their products’ stability, scalability, and robustness under varying levels of traffic, server loads, and network bandwidth parameters. These are just a few of many real-life variables that simply can’t be fully duplicated in a staging environment. This is why shifting right is no longer an option but a necessity.

There are many flavors of performance testing you should also run in production. Here are just a few:

  • Load Testing – This is arguably the most significant type of non-functional software testing. It checks how your infrastructure or website performs under various user loads and uncovers response times under different scenarios. This kind of testing helps detect performance bottlenecks before going live. As we’ll explain later in this article, load testing can be even more effective when you Shift Right to testing in production.
  • Capacity Testing – Some companies live and die by their performance SLAs, which is where capacity testing takes center stage. This type of non-functional software testing allows you to make sure that the app and environment can smoothly handle the maximum number of users or transactions according to the pre-defined SLA or performance requirements. This procedure is aimed at testing the maximum capacity of your system in terms of traffic, while still being able to deliver optimal user experience.
  • Soak Testing – This uncovers performance issues that surface only after the system has run under load for a long period of time. Quite often, the ‘standard’ load testing performed in pre-production environments can only run for a limited time and therefore does not uncover all performance problems. Because of that, there are a lot of advantages to running Soak Testing in production, as there are no such time limitations.
  • Volume Testing – Not enough attention is being given to databases and data when it comes to performance testing. Data characteristics and volume in staging environments can differ from those in production. Volume testing addresses this by testing a software application with a certain amount of data, ideally replicating the production environment. Hence, it can often be more effective to perform volume testing in production and gain actionable insights to improve quality.
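
To make the idea of load testing more concrete, here is a minimal sketch in Python of a load generator that runs several virtual users at the same time and summarizes the response times. It is an illustration only, not how any particular tool works: `run_load`, `summarize`, and `fake_endpoint` are hypothetical names, and `fake_endpoint` is a stand-in that sleeps for about a millisecond instead of calling a real service.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(target, virtual_users, requests_per_user):
    """Call `target` from several concurrent virtual users; return latencies in seconds."""

    def one_user(_user_index):
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            target()
            latencies.append(time.perf_counter() - start)
        return latencies

    # One worker thread per virtual user, so the users run side by side.
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        per_user = list(pool.map(one_user, range(virtual_users)))
    return [latency for user in per_user for latency in user]


def summarize(latencies):
    """Return the median, 95th percentile, and maximum latency."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": statistics.median(latencies), "p95": cuts[94], "max": max(latencies)}


def fake_endpoint():
    # Hypothetical stand-in for a real request to the system under test.
    time.sleep(0.001)


if __name__ == "__main__":
    results = run_load(fake_endpoint, virtual_users=5, requests_per_user=10)
    print(summarize(results))
```

The same skeleton covers the other flavors by changing the parameters: raise `virtual_users` until the SLA is breached for capacity testing, or keep a moderate load running for hours for soak testing.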

Testing in production is one of the most important aspects of performance analysis and planning today, as we shall explain in detail in this guide.

RUM vs STM vs Performance Testing in Production

What is Real User Monitoring (RUM)?

Real User Monitoring (RUM) is a form of background monitoring that collects and analyzes real user interactions with a website or application.

RUM is described as a form of passive monitoring as it observes data in the background. After injecting a small JavaScript snippet into each page, developers can continuously monitor user data and make data-driven decisions about their product. RUM tools gather information in one location and build charts, graphs, and other visuals to create a simple overview of performance.

RUM can help you mitigate the consequences of rapid traffic spikes that cause longer page loads. Widely cited industry research suggests that a large share of users abandon websites that take longer than 3 seconds to load. Tech leads can also use RUM to evaluate the health of the application and detect potential issues that were not revealed during the pre-deployment testing phase.

How does RUM differ from Performance Testing in Production? 

The major disadvantage of RUM is that the issues are found only after they affect real users. With Performance Testing in Production and with STM, as you will see below, you can identify and mitigate issues before they affect real users and before they become full blown incidents. This is why RUM is often regarded as the last line of defence in today’s dynamic setups.

What is Synthetic Monitoring? 

Synthetic Monitoring (STM), or Proactive Monitoring, is a “Testing in Production” variant. It’s based on the simulation of real users and traffic, while running various user scenarios that help establish the health of the app or website. The traffic is not coming from your actual user visits, but from simulated ones. This helps highlight the system’s weaknesses and avoid potential issues.

In other words, STM lets you create simulated clients who use your application or website the way real customers would in various scenarios like peak times or slow times. You can run synthetic monitoring inside the firewall to assess the state of your machines or outside of the firewall to test availability and performance. It’s a very popular approach today due to its flexibility.

How does STM differ from Performance Testing in Production?

Unlike Performance and Load Testing in Production, STM focuses on functional testing and monitoring in production. In addition to that, it mostly simulates only one user at a time as with traditional functional testing, unlike Load Testing that focuses on simulating many simultaneous users. This limits its effectiveness as a performance testing methodology.
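
The difference is easy to see in code. The sketch below is a minimal, hypothetical synthetic check: it walks one simulated user through a scripted scenario, step by step, and stops at the first failure. The name `run_synthetic_check` and the step names in the usage example are assumptions made for illustration, and each step is just a callable that returns `True` on success.

```python
import time


def run_synthetic_check(steps):
    """Run scripted steps in order as one simulated user; stop at the first failure."""
    report = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            ok = bool(action())
        except Exception:
            # Any exception raised by a step counts as a failed step.
            ok = False
        report.append({"step": name, "ok": ok, "seconds": time.perf_counter() - start})
        if not ok:
            break
    return report


if __name__ == "__main__":
    # Hypothetical scenario; real steps would issue requests against the site.
    scenario = [
        ("open home page", lambda: True),
        ("log in", lambda: True),
        ("add item to cart", lambda: False),
        ("check out", lambda: True),
    ]
    for entry in run_synthetic_check(scenario):
        print(entry)
```

Note that there is a single user and no concurrency here, which is exactly why a check like this says little about behavior under load.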

Summing it Up

RUM identifies issues that have already begun to affect real users. STM can detect issues much earlier and before they have become full blown incidents. That being said, STM is focused on functional testing and simulating one user at a time, which makes it far from ideal. It can’t replace the effectiveness and usefulness of Load and Performance Testing in production.

Canary Releases vs Blue Green vs Feature Flags

There are several strategies for testing in production to reduce the risk of deploying new features. Let’s review them and then examine if they can be used for conducting performance and load testing in production.

What are Canary Releases?

Before we delve into the technical and business value of canary releases, let’s establish where this name comes from. Canary birds were used in coal mines as an early warning system to quickly inform miners about dangerous levels of toxic gases. Before reaching the workers, the poisonous gases would first kill the birds, thus giving a warning to the miners to evacuate.

Much like the poor birds, a small group of users gets to use the new version first, and any performance issues they run into “warn” the developers before everyone else is affected.

The canary release is a common testing in production approach when introducing new features. Instead of releasing a new version to everyone, you give access to a small group of users to test its feasibility. As opposed to a feature flag release that we will cover later, the canary release launches the entire updated application to a small subgroup, not just a single feature.
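
One common way to pick that small subgroup is to hash each user ID into a bucket, so the same user always lands on the same version. The following is a minimal sketch under that assumption. The function names and the 0 to 99 bucketing scheme are illustrative, not taken from any specific router or load balancer.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in the canary group based on a hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in the range 0..99
    return bucket < canary_percent


def route(user_id: str, canary_percent: int) -> str:
    """Return which version of the application should serve this user."""
    return "canary" if in_canary(user_id, canary_percent) else "stable"


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    canary_count = sum(route(user, 5) == "canary" for user in users)
    print(f"{canary_count} of {len(users)} users routed to the canary")
```

Increasing `canary_percent` step by step widens the rollout, but until it reaches 100 the new version never sees full production load.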

When is the best time to implement canary releases?

  • You are releasing changes to a multitude of services and need verification of those in a realistic environment.
  • New features pose a great operational risk that can be mitigated by allowing smaller traffic loads initially.
  • You are integrating a third-party service that cannot be pre-validated. Canary releases will help you measure the success of the integration.

Canary releases have been utilized by a number of big players like Netflix, Google, and Facebook due to the impressive results. Canary releases are great for incremental releases and testing. They instantly decrease the cost and severity of potential operational issues. Just like with DevOps, incremental testing also limits the potential negative impact on your customer base.

Although the tips above are only pointers, not dogmas, there are certain cases where canary releases won’t be good for you.

For example, when it comes to running performance and load testing, this approach is clearly limited since it exposes the new version to only a small subset of users. Performance testing in a real production environment means testing load and capacity at the full scale of production and user load, which yields more actionable insights and more accurate findings.

What is Blue-Green Deployment?

In a nutshell, Blue-Green Deployment is a deployment method that reduces deployment risk by running two almost identical production environments called Blue and Green. The blue environment is the live environment, where the old version is deployed, and the green one is the standby environment, where the new version is deployed. By maintaining these two production environments side by side, you can easily switch from the old version of the application to the new one. This methodology has gained popularity due to its “binary characteristics” and the clarity it creates.

As it is a side-by-side deployment, not a simultaneous one, only one environment can be active at one given time. Once you have run performance tests on the new version and made sure that it works smoothly, you can switch the router and direct the user traffic to the new environment. Rollbacks also become significantly easier when you have two solid versions in reach.
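
The switch-and-rollback logic can be sketched in a few lines of Python. `BlueGreenRouter` is a hypothetical class written for this article. In a real setup the "router" would be a load balancer, DNS entry, or similar, and the test result would come from your performance test run.

```python
class BlueGreenRouter:
    """Tracks which of the two environments currently receives live traffic."""

    def __init__(self):
        self.active = "blue"  # the old version starts out live

    @property
    def standby(self):
        return "green" if self.active == "blue" else "blue"

    def switch(self, standby_passed_tests: bool) -> str:
        """Send traffic to the standby environment only if its tests passed."""
        if standby_passed_tests:
            self.active = self.standby
        return self.active

    def rollback(self) -> str:
        """Return traffic to the environment that was live before the last switch."""
        self.active = self.standby
        return self.active


if __name__ == "__main__":
    router = BlueGreenRouter()
    print(router.switch(standby_passed_tests=False))  # stays on blue
    print(router.switch(standby_passed_tests=True))   # moves to green
    print(router.rollback())                          # back to blue
```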

The three biggest advantages of this methodology include:

  • Flexible Deployment – Instead of planning far in advance and considering maintenance windows in the process, you can deploy at any time.
  • Simple Rollbacks – In case the version is not performing the way it should or users do not like it, you can quickly revert to the original version.
  • Risk Mitigation – Since you have two environments, in case one data center goes offline, you can quickly switch to another environment.

Like with any methodology or philosophy, there are also negative factors you will need to consider before making the switch (pun intended) to Blue-Green. The main downside is the elevated cost of building two environments, although cloud technology is helping with that. Additionally, it is worth mentioning that a Blue-Green deployment can deliver slower performance right after the switch, for example while caches in the newly activated environment are still cold.

Taking everything into account, blue-green deployment is a good application release model, also when it comes to executing load and performance testing in production. This is because the tests are conducted on the very environment that will go live soon after they pass. This is almost as good as testing in the actual production environment.

However, due to the small gaps between the green and blue environments, blue-green methodology cannot completely replace testing in (live) production.

What are Feature Flags?

Feature flags, also known as feature toggles or switches, are a DevOps technique that enables devs to turn a certain feature on or off without deploying new code.

The main principle of feature flags is building conditional features that can be switched on and off and made available to a certain group of users on-demand. This way, your team can work on new features within the source code and enable them only when ready. When it comes to feature flags, the focus is on two main aspects: longevity and dynamism.

There are four types of feature flags that can be applied to your code logic:

  • Release flags – These have a relatively short life and essentially serve as switches to enable or disable features.
  • Operational flags – These are short-lived toggles that control the backend of the application, such as algorithm changes.
  • Experimental flags – These have a slightly longer lifecycle and are implemented for A/B testing to gather user testing data.
  • Permission flags – These flags are used to control which users are allowed to access certain features, typical for sensitive environments.
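
Whatever the type, a flag boils down to a runtime lookup that decides whether a given user sees a feature. Here is a minimal in-memory sketch of that idea. `FeatureFlag` and `FlagStore` are hypothetical names, and a production system would typically read flags from a configuration service rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    enabled: bool = False  # global on/off switch
    allowed_users: set = field(default_factory=set)  # empty set means everyone


class FlagStore:
    """In-memory flag registry that can be changed without redeploying code."""

    def __init__(self):
        self._flags = {}

    def define(self, flag):
        self._flags[flag.name] = flag

    def set_enabled(self, name, enabled):
        self._flags[name].enabled = enabled

    def is_on(self, name, user_id):
        flag = self._flags.get(name)
        if flag is None or not flag.enabled:
            return False  # unknown or disabled flags default to off
        return not flag.allowed_users or user_id in flag.allowed_users


if __name__ == "__main__":
    store = FlagStore()
    store.define(FeatureFlag("new-checkout", enabled=True, allowed_users={"beta-1"}))
    print(store.is_on("new-checkout", "beta-1"))   # permitted user
    print(store.is_on("new-checkout", "someone"))  # everyone else
```

In this sketch a release flag is simply a flag with an empty `allowed_users` set, while a permission flag is one with a populated set.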

The main use case scenarios for feature flags include:

  • To Enhance Release Management – For early releases such as canary releases, where access is given to a small group of users. 
  • To Test in Production Phase – Without the risk of poor release due to the rollback feature, you can test your applications even in production.
  • To Enable Continuous Development – By frequently building and deploying your product, you will have more data to test on-the-go.

How effective is it to leverage feature flags for Performance Testing in Production? The main principle of feature flags is building conditional features that can be switched on and off and made available only to a small group of users for a short time prior to full rollout. By definition, this means that full-blown load testing (e.g. capacity testing, soak testing, etc.) cannot feasibly be conducted this way.

Canary Release vs Blue Green Deployment vs. Feature Flags

Canary releases are perfect for releasing one new feature to a small group of users to measure its success and impact on the app performance. Feature flags are an ideal solution when launching several features at the same time. On the other hand, Blue Green is a technique used to test your software in a near-production environment when you need a rapid rollback.

Conclusion: Good, But Not Enough for Performance Testing in Production

Canary releases, blue-green and toggling features allow you to roll out new features in a controlled way and reduce risk when introducing a new version into production. However, none of these methods can fully replace the benefits of actually executing Load and Performance Testing in Production. This is because each one has its inherent limitations and shortcomings.

  • Canary testing tests the new feature(s) with only a small subset of users.
  • Blue-Green is very expensive to implement, since you are creating two different environments. Also, the environment gaps can create issues.
  • With Feature Flags you can run tests on only a small subset of users and for a limited time. Because of that, running full-blown performance and load tests such as capacity or soak testing is not really a viable option.

Yes, all of the three aforementioned methodologies are applicable for certain use cases and show great results when implemented correctly and wisely, but the fact is that none can fully replicate the conditions that exist in production. The conclusion is that none of the above can fully replace the execution of performance and load testing in the production environment.

Chaos Engineering and Performance Testing in Production 

What is Chaos Engineering? 

Chaos Engineering, a subset of testing in production, is an approach to test your system’s capability to withstand unexpected or unstable conditions in production.

This testing essentially involves the careful introduction of “organized chaos” into the systems to gauge their response and behavior changes. By doing so, organizations can predict and prepare for downtimes, outages, and other performance issues.

The four main principles of chaos engineering are:

  • Define the steady state of your system as a measure of normalcy.
  • Use a control and experimental group and build a hypothesis.
  • Introduce real-life events, such as server crashes or malfunctions.
  • Check the difference between the control and experimental groups to prove or disprove your hypothesis.

You can and should include scenarios from historical outages as well as create disruptive real-world events that might actually occur. After the experiment, your system should be able to return to its steady state within the predefined tolerance range. If it fails to do so, it is a red flag that should be investigated and fixed as soon as possible to avoid future performance bottlenecks.
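
The steady-state check described above can be sketched as a small experiment harness. Everything here is an assumption made for illustration: `measure`, `inject_fault`, and `remove_fault` are callables you would supply, the metrics are plain dictionaries, and the tolerance is expressed as a fraction of the baseline value.

```python
def within_steady_state(baseline, observed, tolerance):
    """True if every observed metric is within +/- tolerance (a fraction) of its baseline."""
    return all(
        abs(observed[name] - value) <= tolerance * abs(value)
        for name, value in baseline.items()
    )


def run_experiment(measure, inject_fault, remove_fault, tolerance=0.10):
    """Measure the steady state, inject a fault, then verify that the system recovers."""
    baseline = measure()
    inject_fault()
    during = measure()
    remove_fault()
    after = measure()
    return {
        "held_during_fault": within_steady_state(baseline, during, tolerance),
        "recovered": within_steady_state(baseline, after, tolerance),
    }


if __name__ == "__main__":
    # Toy system: the 95th percentile latency doubles while the fault is active.
    state = {"faulty": False}

    def measure():
        return {"p95_ms": 400.0 if state["faulty"] else 200.0}

    print(run_experiment(
        measure,
        inject_fault=lambda: state.update(faulty=True),
        remove_fault=lambda: state.update(faulty=False),
    ))
```

A result where `recovered` is `False` corresponds to the red flag mentioned above: the system did not return to its steady state within the tolerance range.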

Top three benefits of implementing chaos engineering in production systems:

  1. The first benefit refers to the business itself. By reducing the possibility of having lengthy outages, your company saves a lot of money. Additionally, you can quickly scale up or down without disrupting any services.
  2. The second benefit revolves around the software development team. Data generated from chaos engineering helps devs get valuable insights into the system’s dependencies to create a more resilient application.
  3. The third benefit is all about the customers. Fewer disruptions and outages ensure better availability and durability of the system which enhances user experience and eventually brand loyalty.

Performance and Load Testing as part of Chaos Engineering

Among the various options for systematically injecting harm into the production environment is leveraging load testing to simulate how the system reacts to unplanned and unexpected spikes and surges in traffic. This is a classic example of Performance and Load Testing being used in production, and possibly the most effective way to find out how the system behaves under such conditions.
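
A traffic surge of this kind can be described as a simple load profile: the number of concurrent virtual users the test should aim for at each second. The helper below is a hypothetical sketch of such a profile with a single rectangular spike. Real tools usually offer richer shapes such as gradual ramps.

```python
def spike_profile(baseline_users, spike_users, total_seconds, spike_start, spike_seconds):
    """Return the target number of concurrent users for each second of the test."""
    spike_end = spike_start + spike_seconds
    return [
        spike_users if spike_start <= second < spike_end else baseline_users
        for second in range(total_seconds)
    ]


if __name__ == "__main__":
    # 50 users normally, jumping to 500 users for 2 seconds starting at second 3.
    print(spike_profile(50, 500, total_seconds=10, spike_start=3, spike_seconds=2))
```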

The Three Phases of Testing in Production

Testing in production basically consists of three phases:

  1. The Deployment Phase – Here, load tests, integration tests, and shadowing are typically executed as per pre-defined protocols.
  2. The Release Phase – This is the stage where developers apply feature flagging, canary releases, and exception tracking.
  3. The Post-Release Phase – Here, chaos engineering, A/B tests, real user monitoring and other testing methods are usually implemented.

Since the last phase is the only one that truly has the conditions that exist in the live production environment, it is also the most important phase to implement Performance and Load Testing in Production.

Minimizing the Risks of Performance Testing in Production

As mentioned earlier, testing in production is extremely beneficial, but also comes with a lot of risk when not done properly with the right safeguards.

Before diving into the specifics of minimizing risks, the key message is that Timing is Everything. You need to plan when to run performance tests in production. Schedule load tests for times when real user activity is low (for example, after midnight), and make sure they don’t clash with maintenance jobs such as restarts or indexing.
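
This kind of timing rule is easy to automate before a test is allowed to start. The sketch below checks a proposed test window against assumed quiet hours and a list of maintenance windows. The function names and the default quiet period of 00:00 to 05:00 are illustrative assumptions, so substitute the low-traffic hours your own monitoring shows.

```python
from datetime import datetime, timedelta


def overlaps(start_a, end_a, start_b, end_b):
    """True if the two time ranges share any moment."""
    return start_a < end_b and start_b < end_a


def is_safe_window(start, duration, maintenance_windows, quiet_start_hour=0, quiet_end_hour=5):
    """True if the test fits inside the quiet hours and avoids all maintenance windows."""
    end = start + duration
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    quiet_start = midnight + timedelta(hours=quiet_start_hour)
    quiet_end = midnight + timedelta(hours=quiet_end_hour)
    if not (quiet_start <= start and end <= quiet_end):
        return False
    return not any(
        overlaps(start, end, m_start, m_end) for m_start, m_end in maintenance_windows
    )


if __name__ == "__main__":
    # Hypothetical nightly indexing job from 03:00 to 04:00.
    maintenance = [(datetime(2023, 3, 29, 3, 0), datetime(2023, 3, 29, 4, 0))]
    print(is_safe_window(datetime(2023, 3, 29, 1, 0), timedelta(hours=1), maintenance))
    print(is_safe_window(datetime(2023, 3, 29, 2, 30), timedelta(hours=1), maintenance))
```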

  • Don’t Skip Testing in Staging – Don’t skip pre-production testing if you have decided to test in production. Although it’s being done in quite a few companies, testing in production does not mean deploying untested code. Your team SHOULD test the system during the staging phase and try to catch as many bugs as possible. However, as explained above, you can’t possibly fully replicate all production and real-life conditions in your staging environment.
  • Rollback Capabilities – Rollbacks are arguably the most effective tools when it comes to mitigating production data risks. When something breaks in production, you can revert to the last functional version with just a few clicks. This backup plan is becoming a mandatory requirement for organizations looking to test performance in production.

Why are rollbacks so important?

Having a rollback strategy removes a significant part of the pressure from development teams as they know that there is a Plan B in case the deployment is not ideal from the performance standpoint. In case you see a bug or unusual behavior that you cannot identify, you can hit the rollback button and deal with it without completely disrupting the system or experiencing downtime.

  • Compliance – Data leaks are becoming more and more common today, and regulations such as the GDPR, HIPAA, and CCPA are getting stricter, with massive monetary and business implications for violations. You need to make sure that you are not deleting or misplacing Personally Identifiable Information (PII) or Protected Health Information (PHI) while getting rid of your test data.
  • Data Masking (Obfuscation) – Security and compliance go hand in hand. For starters, limit access to sensitive data, a principle also known as “least privilege access”, and mask or obfuscate sensitive fields wherever tests touch production data. You also need to make sure you are preventing the unauthorised and unwanted use of sensitive data. It’s also recommended to automate the creation of data-access history so that it is available when the need arises.
  • Safe Chaos Engineering – Chaos engineering is great for testing in production, but you need to do it correctly to avoid issues. You need to make sure that you are monitoring everything closely, while also implementing a good open-source or commercial reporting solution to keep all involved stakeholders on the same page for quick disaster recovery.
  • Work in a Cross-Functional Team – The most common risk that companies fear when going into testing in production is security. You can minimize the risks by merging your QA team with the operations team. When they work together and are aware of the testing steps, the risk is manageable.
  • Monitor Continuously – Another important thing you’ll need to do is monitor, monitor, and monitor. None of the aforementioned steps will be of much use if you are not feeling the pulse at all times to react quickly to issues as they arise. Besides performance metrics, you should also track infrastructure info (server utilization, network usage) and other application data (memory utilization, system information).
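
Continuous monitoring pays off most when it is wired to an automatic stop condition for the test itself. The following is a minimal sketch of such a guard. The function name and the default thresholds (5% errors, 2 seconds at the 95th percentile) are assumptions for illustration and should be replaced by limits derived from your own SLAs.

```python
import math


def should_abort(samples, max_error_rate=0.05, max_p95_seconds=2.0):
    """Decide whether to stop a load test, given recent (latency_seconds, succeeded) samples."""
    if not samples:
        return False
    errors = sum(1 for _, succeeded in samples if not succeeded)
    error_rate = errors / len(samples)
    latencies = sorted(latency for latency, _ in samples)
    # Nearest-rank 95th percentile of the observed latencies.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return error_rate > max_error_rate or latencies[p95_index] > max_p95_seconds


if __name__ == "__main__":
    healthy = [(0.2, True)] * 100
    degraded = [(0.2, True)] * 90 + [(3.0, True)] * 10
    print(should_abort(healthy), should_abort(degraded))
```

Calling a check like this on every monitoring interval lets the load generator back off before real users notice a problem.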

Shift Right Load Testing for Optimal User Experience

Testing in Production is all about shifting right to run performance tests in the production environment.

When it is done correctly and systematically, your DevOps teams will appreciate the fast detection of performance issues and other real-time information like usage pattern changes. Load testing is already crucial for analyzing performance, identifying issues, and resolving bottlenecks, but doing so in production is a real boon. It can help elevate performance and business metrics and make your system more robust and less sensitive to turbulence or unexpected conditions, leading to a better user experience.

It must be noted that selecting the right tool for your load testing in production is equally important for getting the best results. Due to the inherent risks involved with this methodology, you will need to make sure that you can schedule your tests, automate testing processes in the staging environment, get information from APM tools through out-of-the-box integration capabilities, and enjoy user-friendly real-time reporting to keep all involved stakeholders in the loop.

This post was originally published in
