
Best WAF solutions in 2023 - real-world comparison

Introduction

This article describes how we tested the efficacy of several leading WAF solutions in real-world conditions.


As many WAF solutions on the market are ModSecurity-based engines using OWASP Core Rule Set signatures, we assumed the results would be similar. To our surprise, there are significant differences between solutions, and we are glad to share these results with the community.


The two most important parameters when selecting a Web Application Firewall are:

  • Security Quality (True Positive Rate) - the WAF's ability to correctly identify and block malicious requests is crucial in today's threat landscape. It must preemptively block zero-day attacks as well as effectively tackle known attack techniques utilized by hackers.

  • Detection Quality (False Positive Rate) – the WAF's ability to correctly allow legitimate requests is also critical because any interference with these valid requests could lead to significant business disruption and an increased workload for administrators.

We decided to conduct an in-depth yet straightforward test: sending both malicious and legitimate web requests to different WAFs and measuring the results.


A very comprehensive data set was used to test the products:

  • 973,964 legitimate HTTP requests from 185 real websites in 12 categories

  • 73,924 malicious payloads from a broad spectrum of commonly experienced attack vectors

True to the spirit of open source, we provide all the details of the testing methodology, the testing datasets, and the open-source tools required to validate and reproduce this test in this GitHub repository, and we welcome community feedback.

The test was conducted in July 2023 and compared the following popular WAF solutions:

  • Microsoft Azure WAFv2 – OWASP CRS 3.2 ruleset

  • AWS WAF – AWS managed ruleset

  • AWS WAF – AWS managed ruleset and F5 Ruleset

  • CloudFlare WAF – Managed and OWASP Core Rulesets

  • F5 NGINX App Protect WAF – Default profile

  • F5 NGINX App Protect WAF – Strict profile

  • NGINX ModSecurity – OWASP CRS 3.3.4

  • open-appsec / CloudGuard AppSec – Default configuration (High Confidence)

  • open-appsec / CloudGuard AppSec – Critical Confidence configuration


The two charts below summarize the main findings.


Security Quality and Detection Quality are often a tradeoff with security products. The first chart shows visually how different products perform in each category.


The test reveals a significant difference in product performance. For example, CloudFlare WAF provides near-perfect Detection Quality (99.945%) but the lowest Security Quality (69.297%) of all products tested. Azure WAF provides very high Security Quality (98.547%), but also an extremely high false positive rate (38.346%). These kinds of results point to either a significant security risk or a need for very heavy tuning, both initially and ongoing, before the product can be used in real-world environments.


To provide security with minimal administration overhead, the optimal WAF solution should strike a balance, exhibiting high performance on both Security Quality and Detection Quality. This is aptly represented by a measurement called Balanced Accuracy - the arithmetic mean of the True Positive and True Negative rates, i.e., BA = (TPR + TNR) / 2.


The Balanced Accuracy results of all tested products can be seen in the following chart:





open-appsec / CloudGuard AppSec using the Default configuration provides the best Balanced Accuracy (97.32%), followed by the same product with the Critical Confidence configuration (BA of 96.8%) and then NGINX AppProtect using its Strict profile (BA of 92.52%).



Methodology


Datasets

Each WAF solution was tested against two large data sets: Legitimate and Malicious.


Legitimate Requests Dataset

The Legitimate Requests Dataset is carefully designed to test WAF behavior in real-world scenarios. To attain this, it includes 973,964 different HTTP requests from 185 real websites in 12 categories. The dataset was recorded by browsing real-world websites and performing various operations on each site (for example, signing up, selecting products and placing them in a cart, searching, and uploading files), ensuring that it contains 100% legitimate requests.


The selection of real-world websites of different types is essential because it is important for WAFs to examine all components of an HTTP request - headers, URL, and body - as well as complex request structures such as large JSON payloads or other complex body types. This allows for accurate testing, as these elements can be a source of False Positives in real-world applications, and synthetic datasets often overlook some of them.
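To make this concrete, here is a minimal Python sketch (the endpoint, header name, and payload are invented for this illustration) that sends the same suspicious value through the query string, a header, and a JSON body - three of the request locations a WAF must inspect:

import requests

# Hypothetical endpoint and payload, for illustration only.
TARGET = "http://waf-under-test.example.com/search"
PAYLOAD = "' OR '1'='1"

# The same value can reach the application through the query string,
# a header, or a JSON body - a WAF must inspect all of them.
checks = {
    "query string": requests.get(TARGET, params={"q": PAYLOAD}, timeout=10),
    "header":       requests.get(TARGET, headers={"X-Search": PAYLOAD}, timeout=10),
    "JSON body":    requests.post(TARGET, json={"query": PAYLOAD}, timeout=10),
}

for location, resp in checks.items():
    print(f"{location}: HTTP {resp.status_code}")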


The dataset in this test allows us to challenge the WAF systems by examining their responses to a range of website functionalities. For example, a significant number of HTTP requests are traffic to e-commerce websites. These websites often employ more intricate logic, making them an ideal ground for rigorous testing. Features such as user login processes, complex inventory systems equipped with search and filter functionalities, dynamic cart management systems, and comprehensive checkout processes are common in e-commerce sites. The dataset also includes file uploads and many other types of web operations. Incorporating these features allows us to simulate a wide range of scenarios, enabling an exhaustive evaluation of the efficiency and reliability of WAF systems under diverse conditions.


The distribution of site categories in the dataset is as follows:

Category            Websites   Examples
E-Commerce          109        eBay, Ikea
Information         28         Wikipedia, Daily Mail
Travel              14         Booking, Airbnb
Search Engines      7          DuckDuckGo, Bing
Food                6          Wolt, Burger King
File uploads        5          Adobe, Shutterfly
Social media        5          Facebook, Instagram
Content creation    4          Office, Atlassian
Videos              3          YouTube, Twitch
File downloads      2          Google
Games               1          Roblox
Technology          1          Microsoft
Total               185

The Legitimate Requests Dataset, including all HTTP requests, is available here. We think it is an important resource for both users and the industry, and we plan to update it every year.


Malicious Requests Dataset

The Malicious Requests Dataset includes 73,924 malicious payloads from a broad spectrum of commonly experienced attack vectors:

  • SQL Injection

  • Cross-Site Scripting (XSS)

  • XML External Entity (XXE)

  • Path Traversal

  • Command Execution

  • Log4Shell

  • Shellshock

The malicious payloads were sourced from the WAF Payload Collection GitHub page that was assembled by mgm security partners GmbH from Germany. This repository serves as a valuable resource, providing payloads specifically created for testing Web Application Firewall rules.


As explained on the GitHub page, mgm collected the payloads from many sources, such as SecLists, Foospidy's Payloads, PayloadsAllTheThings, Awesome-WAF, WAF Efficacy Framework, WAF community bypasses, GoTestWAF, and Payloadbox, among others. It even includes the Log4Shell Payloads from Ox4Shell and Tishna. Each of these sources offers a wealth of real-world, effective payloads and provides a holistic approach to testing WAF solutions.


For an in-depth view of each malicious payload utilized in this study, including specific parameters and corresponding attack types, refer to this link.


Combined, the Legitimate and Malicious Requests datasets present a detailed perspective on how each WAF solution handles real-world traffic, thereby providing valuable insights into its efficacy and Detection Quality.


Tooling


As with the datasets, to ensure transparency and reproducibility, the testing tool is made available to the public here.


During the initial phase, the tool conducts a dual-layer health check for each WAF. This process first validates connectivity to each WAF, ensuring system communication. It then checks that each WAF is set to prevention mode, confirming its ability to actively block malicious requests.
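
As a rough illustration of this dual-layer check - not the tool's actual code; the URL is hypothetical, and we assume a block is anything other than HTTP 200 - the logic amounts to:

import requests

def health_check(waf_url: str) -> None:
    """Dual-layer health check: connectivity first, then prevention mode."""
    # Layer 1: a plainly legitimate request must pass through to the backend.
    benign = requests.get(waf_url, params={"q": "hello"}, timeout=10)
    assert benign.status_code == 200, "connectivity check failed"

    # Layer 2: an unambiguous attack must be actively blocked, confirming
    # the WAF runs in prevention mode rather than detection-only mode.
    attack = requests.get(waf_url, params={"q": "<script>alert(1)</script>"}, timeout=10)
    assert attack.status_code != 200, "WAF did not block a known-bad request"

health_check("http://waf-under-test.example.com/")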

The responses from each request sent by the test tool to the WAFs were systematically logged in a dedicated database for further analysis.
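
Conceptually, the core test loop can be sketched as follows. The table and column names are invented for illustration; the real schema and implementation are in the repository:

import requests
import sqlalchemy

# The connection string would normally come from the tool's configuration
# (config.py); any SQL database supported by SQLAlchemy would work.
engine = sqlalchemy.create_engine("postgresql://user:pass@db-host/waf_test")

def run_dataset(waf_url: str, dataset: list[dict], is_malicious: bool) -> None:
    """Replay each recorded request against the WAF and log its verdict."""
    with engine.begin() as conn:
        for req in dataset:
            resp = requests.request(
                req["method"],
                waf_url + req["path"],
                headers=req.get("headers"),
                data=req.get("body"),
                timeout=10,
            )
            conn.execute(
                sqlalchemy.text(
                    "INSERT INTO results (path, is_malicious, blocked) "
                    "VALUES (:path, :mal, :blocked)"
                ),
                {
                    "path": req["path"],
                    "mal": is_malicious,
                    "blocked": resp.status_code != 200,
                },
            )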


The database used for this test was an AWS RDS instance running PostgreSQL. However, the test tool is designed to be flexible. Readers can configure it to work with any SQL database of their preference by adjusting the settings in the config.py file.


Following the data collection phase, the performance metrics, including False Positive rates, False Negative rates, and Balanced Accuracy results, were calculated. This was done by executing specific SQL queries against the data in the database.
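
For example, against the hypothetical results table sketched above, the rates could be derived with a query along these lines (the actual queries used in the test are available in the repository):

import sqlalchemy

# Same hypothetical database as in the sketch above.
engine = sqlalchemy.create_engine("postgresql://user:pass@db-host/waf_test")

metrics_sql = sqlalchemy.text("""
    SELECT
      100.0 * SUM(CASE WHEN is_malicious AND blocked THEN 1 ELSE 0 END)
            / SUM(CASE WHEN is_malicious THEN 1 ELSE 0 END)     AS tpr,
      100.0 * SUM(CASE WHEN NOT is_malicious AND blocked THEN 1 ELSE 0 END)
            / SUM(CASE WHEN NOT is_malicious THEN 1 ELSE 0 END) AS fpr
    FROM results
""")

with engine.connect() as conn:
    tpr, fpr = conn.execute(metrics_sql).one()
    # Balanced Accuracy is the mean of TPR and TNR (where TNR = 100 - FPR).
    print(f"TPR={tpr:.3f}%  FPR={fpr:.3f}%  BA={(tpr + (100.0 - fpr)) / 2:.2f}%")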


Comparison Metrics


To quantify the efficacy of each WAF, we use statistical measures. These include Security Quality (also known as Sensitivity or True Positive Rate), Detection Quality (also known as Specificity or True Negative Rate), and Balanced Accuracy.


Security Quality, also known as the true positive rate (TPR), measures the proportion of actual positives correctly identified. In other words, it assesses the WAF's ability to correctly detect and block malicious requests.


Detection Quality, or the true negative rate (TNR), quantifies the proportion of actual negatives correctly identified. This pertains to the WAF's capacity to correctly allow legitimate traffic to pass.


Balanced Accuracy (BA), an especially crucial metric in this study, provides a balanced measurement by considering both metrics. It is calculated as the arithmetic mean of TPR and TNR. In other words, it provides a more balanced measure between True Positives and True Negatives, irrespective of their proportions in the data sets.
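
As a concrete check of the arithmetic, this short snippet reproduces the Azure WAF score reported in the Findings section from its TPR and FPR:

def balanced_accuracy(tpr: float, fpr: float) -> float:
    """Arithmetic mean of the True Positive and True Negative rates (percent)."""
    tnr = 100.0 - fpr  # Detection Quality expressed as a True Negative Rate
    return (tpr + tnr) / 2.0

# Azure WAF figures from the Findings section below:
print(balanced_accuracy(tpr=98.547, fpr=38.346))  # 80.1005 -> reported as 80.1%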


This choice of metrics is fundamental, as we aim to assess not just the WAF's ability to block malicious traffic but also to allow legitimate traffic. Most importantly, we want to evaluate the overall balance between these two abilities, given that both are critical for a real-world production system.


Thus, we not only examine the number of attacks each WAF can correctly identify and block but also scrutinize the number of legitimate requests it correctly allows. A WAF with high TPR but low TNR might block most attacks but at the cost of blocking too many legitimate requests, leading to a poor user experience. Conversely, a WAF with high TNR but low TPR might allow most legitimate requests but fail to block a significant number of attacks, compromising the security of the system. Therefore, the optimal WAF solution should strike a balance, exhibiting high performance on both TPR and TNR, which is aptly represented by Balanced Accuracy.


Test Environment


The test includes both products that are deployed as standard software and products available as SaaS. The standard software products were staged within Amazon Web Services (AWS). The main testing apparatus was an AWS EC2 instance, housed in a separate VPC. This facilitated the simulation of a real-world production environment while keeping the testing isolated from external influences.


In order to maintain the integrity of the test and ensure that performance wasn't a distorting factor, all embedded WAF solutions were hosted on AWS t3.xlarge instances. These instances are equipped with 4 virtual CPUs and 16GB of RAM, providing ample computational power and memory resources well beyond what is typically required for standard operations. This configuration was deliberately chosen to eliminate any possibility of hardware constraints influencing the outcome of the WAF comparison, thereby ensuring the results accurately reflect the inherent capabilities of each solution.



Findings


In this section, we describe the configuration of each product tested, along with its Security Quality or True Positive Rate score (higher is better), its Detection Quality or False Positive Rate score (lower is better), and its Balanced Accuracy score (higher is better).


In the test, we used each product's default profile settings without any tuning and, when available, also an additional profile that provides the highest Security Quality available in the product.


Microsoft Azure WAF


Azure WAF is a cloud-based service implementing ModSecurity with OWASP Core RuleSet. The Microsoft Azure WAF was configured with the Default suggested OWASP CRS 3.2 ruleset.



The results were as follows:

Security Quality (True Positive Rate): 98.547%

Detection Quality (False Positive Rate): 38.346%

Balanced Accuracy: 80.1%


CloudFlare WAF


CloudFlare WAF is a cloud-based service based on ModSecurity with OWASP Core RuleSet. CloudFlare provides a Managed ruleset as well as a full OWASP Core Ruleset. We tested the product activating both rulesets.

The results were as follows:

Security Quality (True Positive Rate): 69.297%

Detection Quality (False Positive Rate): 0.055%

Balanced Accuracy: 84.62%


AWS WAF


AWS WAF is a cloud-based service implementing ModSecurity. AWS provides a default Managed ruleset and optional additional paid-for ruleset from leading vendors such as F5.


We tested the service in two configurations.


AWS WAF – AWS managed ruleset:


The results were as follows:

Security Quality (True Positive Rate): 76.434%

Detection Quality (False Positive Rate): 4.383%

Balanced Accuracy: 86.03%


AWS WAF – AWS managed ruleset plus F5 Rules:


The results were as follows:

Security Quality (True Positive Rate): 79.603%

Detection Quality (False Positive Rate): 4.693%

Balanced Accuracy: 87.46%


NGINX ModSecurity


ModSecurity is a signature-based engine that has been the most popular open-source WAF engine on the market for 20 years. Many of the solutions tested here use it as a base. It is available as an add-on for NGINX but will reach End-of-Life in July 2024.


We tested NGINX with ModSecurity using the OWASP Core Rule Set v3.3.4 (the latest version at the time of testing) with default settings:

The results were as follows:

Security Quality (True Positive Rate): 86.716%

Detection Quality (False Positive Rate): 10.604%

Balanced Accuracy: 88.06%


F5 NGINX AppProtect


NGINX AppProtect WAF is a paid add-on to NGINX Plus and NGINX Plus Ingress, based on the traditional F5 signature-based WAF solution. AppProtect WAF comes with two policies - Default and Strict. The Default policy provides OWASP-Top-10 protection. The Strict policy, which includes over 6,000 signatures, is recommended by NGINX for “protecting sensitive applications that require more security but with a higher risk of false positives.”


We tested the product with both policies.


NGINX AppProtect – Default policy:

The results were as follows:

Security Quality (True Positive Rate): 78.132%

Detection Quality (False Positive Rate): 2.01%

Balanced Accuracy: 88.06%


NGINX AppProtect – Strict policy:

The results were as follows:

Security Quality (True Positive Rate): 98.186%

Detection Quality (False Positive Rate): 13.151%

Balanced Accuracy: 92.52%


open-appsec / CloudGuard AppSec


open-appsec/CloudGuard AppSec by Check Point is a machine-learning-based WAF that uses supervised and unsupervised machine-learning models to determine whether traffic is malicious.


We tested the product in two configurations using out-of-the-box settings and with no learning period.


Default - activate protections when Confidence is High and above:

The results were as follows:

Security Quality (True Positive Rate): 98.895%

Detection Quality (False Positive Rate): 4.253%

Balanced Accuracy: 97.32%


Critical - activate protections when Confidence is Critical:

The results were as follows:

Security Quality (True Positive Rate): 94.405%

Detection Quality (False Positive Rate): 0.814%

Balanced Accuracy: 96.8%


Analysis


Understanding the metrics used in this comparison is key to interpreting the results accurately. Below we delve deeper into each one of these metrics.


Security Quality (True Positive Rate)


The True Positive Rate gauges the WAF's ability to correctly detect and block malicious requests. A higher TPR is desirable as it suggests a more robust protection against attacks.




In the test, the highest True Positive Rate (TPR) was achieved by open-appsec / CloudGuard AppSec, registering a TPR of 98.895% with the out-of-the-box Default profile.


Close on its heels were Azure WAF (OWASP 3.2 ruleset) with a TPR of 98.547%, also with default settings, and NGINX AppProtect with a TPR of 98.186% using the Strict profile.


The remaining WAFs showed variable TPRs, with CloudFlare demonstrating the lowest TPR at 69.297%.


Detection Quality (False Positive Rate)


The False Positive Rate measures the WAF's ability to correctly identify and allow legitimate requests. A lower FPR means the WAF is better at recognizing correct traffic and letting it pass, which is critical to avoid unnecessary business disruptions and administration overhead.


In the test, the lowest False Positive Rate (FPR) was achieved by CloudFlare, registering a near-perfect 0.055%. open-appsec / CloudGuard AppSec followed with an FPR of 0.814% using the Critical profile, and the NGINX AppProtect Default profile came close behind with 2.01%.


Microsoft Azure WAF with the OWASP 3.2 ruleset exhibited a very high FPR of 38.346%.



Below is a graphical representation of Security Quality and Detection Quality results. The visualization provides an immediate, intuitive understanding of how well each WAF solution achieves the dual goals of blocking malicious requests and allowing legitimate ones. WAF solutions that appear towards the top right of the graph have achieved a strong balance between these two objectives.


Balanced Accuracy


Balanced Accuracy provides a more holistic view of the WAF's performance, considering both Security Quality and Detection Quality. Higher balanced accuracy indicates an optimal WAF solution that balances attack detection and legitimate traffic allowance.



open-appsec/CloudGuard AppSec with the Default profile led the pack with a BA of 97.32%, closely followed by the same product with the Critical profile, registering a BA of 96.8%. Azure WAF with the OWASP 3.2 ruleset had the lowest BA, standing at 80.1%.



Summary


For WAF products to deliver on the promise of protecting web applications and APIs, they must excel in both Security Quality and Detection Quality. This article provides a valuable lab-based comparison that shows how leading WAF solutions perform under real-world conditions. It reveals a significant difference in product performance, with some products posing a significant security risk or requiring very heavy tuning, both initially and ongoing, before they can be used in real-world environments.


We hope that by sharing not just the findings of our test, but also the methodology, datasets, and tooling, we enable users to test the solutions they use and contribute to much-needed transparency in the industry. We were truly surprised by some of the results and invite readers to reproduce the test and ask questions.


Finally, we are proud that open-appsec / CloudGuard AppSec proves once again that the best way to implement web application security is with a combination of supervised and unsupervised machine-learning engines, as they provide not just the best Security Quality and Detection Quality but also the best protection against zero-day attacks. The product was the only one that blocked zero-day attacks such as Log4Shell, Spring4Shell, Text4Shell, and the Claroty WAF bypass.



 

open-appsec is an open-source project that builds on machine learning to provide preemptive web application and API threat protection against OWASP-Top-10 and zero-day attacks. It simplifies maintenance, as there is no threat-signature upkeep or exception handling of the kind common in many WAF solutions.


To learn more about how open-appsec works, see this White Paper and the in-depth Video Tutorial. You can also experiment with deployment in the free Playground.



Experiment with open-appsec for Linux, Kubernetes or Kong using a free virtual lab
