May 13, 2026

Data Scraping

By Tendem Team

How Anti-Bot Systems Work and How to Scrape Cloudflare, DataDome, Akamai

If you have tried web scraping at any scale in 2026, you have met the wall. A CAPTCHA appears. A 403 status code returns. A Cloudflare challenge page loads instead of the data you need. Your scraper worked yesterday; today it returns empty results. The site has not changed – its anti-bot system has adapted to your access pattern.

Anti-bot technology has advanced faster than most scraping tools can keep up with. Cloudflare began blocking AI-based scraping by default in July 2025. DataDome now runs over 85,000 customer-specific machine learning models, making each protected website a unique detection challenge (Scrapfly 2026). Modern anti-bot systems do not just block IP addresses – they fingerprint your TLS handshake, analyze your browser environment, monitor your mouse movements, and evaluate whether your behavior looks human or automated.

Understanding how these systems work is the first step to scraping reliably. This article explains the five detection layers that modern anti-bot systems use, why common scraping approaches fail against each one, and the strategies that actually work for collecting data from protected sites in 2026.

The Five Layers of Anti-Bot Detection

Modern anti-bot systems like Cloudflare, DataDome, Akamai, PerimeterX (now HUMAN Security), and Kasada do not rely on a single detection method. They stack five complementary layers, each catching what the others miss. Passing one layer means nothing – you must pass all five simultaneously.

Layer 1: IP Reputation and Rate Analysis

The oldest and simplest layer. Anti-bot systems maintain databases of known datacenter IP ranges, VPN providers, and previously flagged addresses. Any IP with a history of bot-like behavior starts with a low trust score. Beyond reputation, rate analysis counts requests per IP against rolling time windows. Exceed the threshold and the IP is either throttled (served slower responses) or blocked outright.
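
To make the mechanics concrete, here is a minimal sketch of the rolling-window counting this layer relies on. The 60-second window and 100-request threshold are arbitrary illustrative values, not any vendor's actual limits:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # rolling window length (illustrative)
MAX_REQUESTS = 100     # per-IP threshold within the window (arbitrary example)

_requests_by_ip = defaultdict(deque)

def is_rate_limited(ip, now=None):
    """Record a request from `ip` and report whether it exceeds the window threshold."""
    now = time.time() if now is None else now
    window = _requests_by_ip[ip]
    # Evict timestamps that have fallen outside the rolling window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    return len(window) > MAX_REQUESTS

# The 101st request inside one minute from the same IP trips the limit:
for _ in range(120):
    blocked = is_rate_limited("203.0.113.7")
print(blocked)  # True
```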

Why basic scrapers fail here: a scraper making requests from a single IP address – or from a known datacenter range – triggers this layer immediately. Even with IP rotation, datacenter proxies carry inherently low trust scores because anti-bot vendors maintain lists of datacenter IP ranges from major cloud providers.

Layer 2: TLS Fingerprinting

When your scraper connects over HTTPS, a TLS handshake occurs before any HTTP data transfers. During this handshake, your client reveals its supported cipher suites, TLS extensions, protocol versions, and elliptic curve preferences. The JA3 fingerprinting technique concatenates and hashes these values into a unique identifier (RoundProxies 2026).
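
Conceptually, a JA3 hash is just an MD5 digest of a comma-separated string built from those handshake fields. A minimal sketch, assuming the ClientHello has already been parsed into lists of decimal values (the example values at the bottom are made up for illustration):

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build the canonical JA3 string and return its MD5 hex digest.

    Each argument is a decimal value (tls_version) or a list of decimal values
    in the order they appeared in the ClientHello. Real implementations also
    strip GREASE values before hashing.
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Illustrative field values only:
print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
```

Because Python's requests, Axios, and plain headless Chrome each produce their own fixed combination of these fields, the resulting hash acts as a stable signature that vendors can list and block.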

Why basic scrapers fail here: Python’s requests library produces a JA3 hash that is instantly recognizable as an automated script. Node.js with Axios produces a different but equally detectable fingerprint. Even headless Chrome without stealth modifications has a distinct TLS fingerprint. Cloudflare and DataDome maintain databases of known bot signatures and block matching fingerprints on sight (DEV Community 2026).

Layer 3: Browser Fingerprinting

If you pass TLS inspection, the next layer executes JavaScript in your browser (or headless browser) to collect hundreds of environmental signals: screen resolution, installed fonts, WebGL renderer, audio context, canvas fingerprint, browser plugins, timezone, language settings, and even the performance characteristics of your JavaScript engine.

Headless browsers have detectable properties: navigator.webdriver=true, missing plugins, wrong screen dimensions, no GPU renderer, and inconsistent performance timings. Each anomaly reduces the trust score. Stack enough anomalies and the visit is classified as automated – even though a “real browser” is technically being used.
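
You can inspect a few of these signals yourself. The sketch below uses Playwright against plain headless Chromium purely as an illustration (Playwright is our choice here, not a tool the article endorses); real fingerprinting scripts collect hundreds more properties than these:

```python
from playwright.sync_api import sync_playwright

# JavaScript probe returning a handful of the signals a detection script reads.
PROBE = """() => ({
    webdriver: navigator.webdriver,
    pluginCount: navigator.plugins.length,
    languages: navigator.languages,
    screen: [screen.width, screen.height],
    gpu: (() => {
        const gl = document.createElement('canvas').getContext('webgl');
        if (!gl) return null;
        const ext = gl.getExtension('WEBGL_debug_renderer_info');
        return ext ? gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) : null;
    })(),
})"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # Plain headless Chromium leaks several anomalies here (webdriver flag,
    # empty plugin list, software GPU renderer).
    print(page.evaluate(PROBE))
    browser.close()
```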

Layer 4: Behavioral Analysis

The most sophisticated layer monitors how the visitor interacts with the page. Real users generate inconsistent, organic telemetry: erratic mouse movements, variable scroll speeds, occasional misclicks, and natural pauses between actions. Automated scripts produce linear movement traces, perfectly consistent timing, or no interaction telemetry at all (iWeb Scraping 2026).

DataDome and Kasada deploy JavaScript collection scripts that gather coordinate sequences from mouse movement, acceleration patterns in scroll events, timing intervals between clicks, and keyboard event sequences. This behavioral data feeds into classifiers that produce high-confidence automated/human classifications – particularly on login flows, checkout pages, and search result interactions.
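
Production classifiers are machine learning models trained on enormous session datasets, but the intuition fits in a toy heuristic: scripted cursors tend to move in nearly straight lines at nearly constant speed. The sketch below is a deliberate simplification for illustration, not any vendor's actual logic:

```python
import statistics

def looks_scripted(points, timestamps):
    """Toy heuristic: flag a cursor trace whose speed barely varies.

    points      -- list of (x, y) cursor coordinates
    timestamps  -- matching list of event times in seconds
    """
    speeds = []
    for (x0, y0), (x1, y1), t0, t1 in zip(points, points[1:], timestamps, timestamps[1:]):
        dt = t1 - t0
        if dt > 0:
            speeds.append(((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / dt)
    if len(speeds) < 2 or statistics.mean(speeds) == 0:
        return True  # no meaningful interaction telemetry at all
    # Human traces vary widely; scripted traces are suspiciously uniform.
    return statistics.stdev(speeds) / statistics.mean(speeds) < 0.1

# A perfectly linear, evenly timed trace gets flagged:
trace = [(i * 10, i * 10) for i in range(20)]
times = [i * 0.05 for i in range(20)]
print(looks_scripted(trace, times))  # True
```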

Layer 5: CAPTCHA and Challenge Systems

When the trust score from the first four layers falls below a threshold, the visitor receives an active challenge. In 2026, this goes beyond traditional image-selection CAPTCHAs. Cloudflare Turnstile operates invisibly, scoring behavioral signals without explicit user interaction. reCAPTCHA v3 assigns a risk score based on the entire session without presenting a visual puzzle. Interactive CAPTCHAs (image selection, puzzle sliding, 3D object rotation) are reserved for the most suspicious visitors (Use Apify 2026).

CAPTCHA solving services cost approximately $1 per 1,000 solves for basic challenges, but success rates vary wildly depending on the challenge type and the solving service quality. For invisible challenges like Turnstile, automated solving is significantly harder because there is no discrete challenge to solve – the entire session is the challenge.

The Major Anti-Bot Vendors in 2026

| Vendor | Notable Clients | Detection Strengths | Difficulty Level |
| --- | --- | --- | --- |
| Cloudflare | 20%+ of all websites; Shopify (99.2%) | TLS fingerprinting, JS challenges, Turnstile, adaptive rules | Moderate–High |
| DataDome | E-commerce, travel, ticketing | 85,000+ custom ML models; per-site behavioral learning | High |
| Akamai Bot Manager | Banks, airlines, major retailers | Device fingerprinting, behavioral analysis, reputation scoring | High |
| PerimeterX (HUMAN) | Enterprise SaaS, media, financial | Behavioral biometrics, predictive analytics | High |
| Kasada | Real estate, hospitality, sports | Proof-of-work challenges, anti-replay, behavioral telemetry | Very High |

Strategies That Actually Work in 2026

1. Use Residential Proxies, Not Datacenter

Residential IPs from real ISPs carry significantly higher trust scores than datacenter IPs. Rotating residential proxies across diverse geographic locations reduces the probability that any single IP accumulates enough requests to trigger rate limiting. The trade-off is cost: residential proxy pricing ranges from $2/GB on annual plans to $8.50/GB pay-as-you-go (TitanNet 2026).
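
Wiring a rotating residential gateway into a scraper is usually a small change: the provider gives you a single proxy endpoint that hands out a different exit IP per request or per session. A minimal sketch with Python's requests library; the hostname, port, and credentials below are placeholders, not any real provider's endpoint:

```python
import requests

# Placeholder credentials and gateway address - substitute your provider's values.
PROXY_USER = "customer-USERNAME"
PROXY_PASS = "PASSWORD"
PROXY_GATEWAY = "gw.residential-provider.example:7777"

proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_GATEWAY}"
proxies = {"http": proxy_url, "https": proxy_url}

resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(resp.json())  # on a rotating plan the exit IP should change across requests
```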

2. Match Your TLS Fingerprint to a Real Browser

Libraries like curl_cffi (Python) impersonate the TLS fingerprint of popular browsers, making your requests indistinguishable from Chrome or Firefox at the handshake layer. This single change eliminates the most common instant-block trigger for Python-based scrapers (RoundProxies 2026).
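
A minimal sketch of that approach. The available impersonation targets (for example, a pinned Chrome version) depend on the curl_cffi release you have installed, and the echo endpoint shown is just one public example of a service that reflects the observed TLS fingerprint back to you:

```python
from curl_cffi import requests as cffi_requests

# impersonate makes the TLS handshake (and HTTP/2 settings) mirror a real Chrome
# build, so the server-observed fingerprint matches a browser, not a Python client.
resp = cffi_requests.get(
    "https://tls.browserleaks.com/json",  # echoes the fingerprint the server saw
    impersonate="chrome",                 # or a pinned target such as "chrome124"
    timeout=30,
)
print(resp.json())
```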

3. Use Stealth Browsers for JavaScript-Heavy Sites

For sites that execute JavaScript challenges, stealth browser tools like Camoufox (Firefox-based, C++-level fingerprint modifications) and SeleniumBase UC Mode provide browser environments that pass fingerprint detection tests. Camoufox consistently achieves 0% detection on CreepJS and BrowserScan tests (RoundProxies 2026). These are not standard headless browsers – they are specifically engineered to appear as real human browser sessions.
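
As one example of this route, here is a minimal SeleniumBase UC Mode sketch. Treat it as the general shape rather than copy-paste code: method names and arguments can differ between SeleniumBase releases, and the URL is a placeholder:

```python
from seleniumbase import SB

# UC Mode launches a patched Chrome build that avoids common automation markers.
with SB(uc=True) as sb:
    # Open the page and allow time for an initial JavaScript challenge to clear.
    sb.uc_open_with_reconnect("https://example.com/protected-page", reconnect_time=4)
    print(sb.get_title())
    html = sb.get_page_source()  # scrape from here as usual
```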

4. Simulate Human Behavior

Add randomized delays between requests (2–5 seconds for protected sites, not the 0.1 seconds that default scraping produces). Implement mouse movement simulation for pages that track interaction telemetry. Vary scroll patterns, viewport sizes, and interaction timing. The goal is not perfection – it is organic inconsistency that matches how real humans browse.
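
A minimal sketch of randomized pacing; the 2–5 second band mirrors the range above, and the occasional longer "reading" pause is our own illustrative addition rather than a documented requirement:

```python
import random
import time

def human_pause(short=(2.0, 5.0), long_pause=(8.0, 20.0), long_every=10):
    """Sleep for a randomized interval, with an occasional longer 'reading' break."""
    if random.randint(1, long_every) == 1:
        time.sleep(random.uniform(*long_pause))
    else:
        time.sleep(random.uniform(*short))

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    # fetch_and_parse(url) would go here
    human_pause()
```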

5. Rotate Everything

Rotate IP addresses, user agents, browser fingerprints, and session tokens. Anti-bot systems correlate signals across requests – a different IP with the same browser fingerprint is just as detectable as the same IP with different user agents. True evasion requires varying all signals simultaneously.
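
The practical point is that identity signals rotate as a bundle, never one at a time. A sketch of per-session identities using requests; the proxy endpoints and user-agent strings below are placeholders you would replace with your provider's gateways and a maintained list of current, internally consistent browser profiles:

```python
import random
import requests

# Placeholder pools - substitute real values from your proxy provider and profile list.
PROXIES = [
    "http://user:pass@res-proxy-1.example:7777",
    "http://user:pass@res-proxy-2.example:7777",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def new_identity():
    """Pick one coherent bundle: the proxy and headers live and die together."""
    proxy = random.choice(PROXIES)
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    session.headers.update({
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session

session = new_identity()                                   # reuse for one browsing "session"...
resp = session.get("https://example.com", timeout=30)
session = new_identity()                                   # ...then rotate the whole bundle at once
```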

6. Use Managed Scraping Infrastructure

For most businesses, the anti-bot arms race is not a fight worth having internally. Managed scraping services and APIs (Bright Data, ScrapFly, Oxylabs) maintain dedicated anti-detection infrastructure – residential proxy pools, stealth browser farms, CAPTCHA solving, and continuous adaptation to vendor updates. This is their core competency, and their economies of scale make it more cost-effective than building equivalent infrastructure in-house.

Where Human Expertise Makes the Difference

Anti-bot systems are designed to detect and block automated access. When automated strategies fail – and they will, periodically, as vendors update their detection – human expertise is what gets scraping operations back online.

Diagnosis is the first and most critical step. When a scraper stops working, the cause could be any of the five detection layers, a combination of several, or a site-specific rule that does not match any standard pattern. Human engineers diagnose the specific detection method by analyzing response headers, challenge types, and block patterns – then engineer a targeted solution rather than guessing at generic fixes.
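
The blocked response itself is usually the first clue, because each vendor leaves recognizable headers and cookies. A rough triage sketch; the markers below are commonly observed but not guaranteed, and vendors change them over time:

```python
import requests

def identify_blocker(resp):
    """Best-effort guess at which anti-bot vendor produced a blocked response."""
    headers = {k.lower(): v.lower() for k, v in resp.headers.items()}
    cookies = set(resp.cookies.keys())
    body = resp.text[:5000].lower()

    if "cf-ray" in headers or headers.get("server", "") == "cloudflare":
        return "Cloudflare"
    if "datadome" in cookies or "datadome" in body:
        return "DataDome"
    if {"_abck", "bm_sz"} & cookies:
        return "Akamai Bot Manager"
    if any(c.startswith("_px") for c in cookies) or "perimeterx" in body:
        return "PerimeterX / HUMAN"
    return "unknown (inspect the challenge markup and headers manually)"

resp = requests.get("https://example.com", timeout=30)
if resp.status_code in (403, 429, 503):
    print(resp.status_code, identify_blocker(resp))
```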

CAPTCHA solving for 2FA-gated content, session management across authenticated workflows, and strategic decisions about scraping intensity (how much data to extract before backing off) all require human judgment that no automated system can provide reliably.

Let Tendem handle the anti-bot complexity – our AI agent manages the extraction while human experts engineer the access strategies that keep data flowing.

The Arms Race Will Continue

Anti-bot systems will keep evolving. Vendors will adopt new fingerprinting techniques, deploy more sophisticated ML models, and implement detection methods that have not been invented yet. The businesses that scrape successfully in 2026 and beyond will not be the ones with the cleverest automation – they will be the ones that combine automated infrastructure with human expertise that can adapt faster than detection systems evolve.

Conclusion

Anti-bot systems in 2026 operate across five detection layers: IP reputation, TLS fingerprinting, browser fingerprinting, behavioral analysis, and CAPTCHA challenges. Bypassing any single layer is straightforward; bypassing all five simultaneously requires residential proxies, TLS impersonation, stealth browsers, behavioral simulation, and continuous adaptation to vendor updates.

For most businesses, this level of infrastructure is not worth building internally. The most reliable and cost-effective approach is to use managed scraping services that maintain anti-detection capabilities as a core competency – combined with human expertise to diagnose and resolve the inevitable cases where automated strategies need adjustment.

Skip the anti-bot headaches – describe your data needs to Tendem’s AI agent and get reliable data delivery without managing proxy infrastructure or detection evasion.

Related Resources

See why pure AI scraping breaks down in our why pure AI scraping fails article.

Understand authenticated scraping in our scraping behind logins guide.

Compare DIY vs managed approaches in our true cost of DIY web scraping article.

Evaluate services in our best web scraping services comparison.

Explore Tendem’s data scraping services.

© Toloka AI BV. All rights reserved.
