
How this scrape works

This scrape pulls every product (name, price, stock, description, image, category) from all 7 Smyths Toys country sites (UK, IE, DE, AT, CH, FR, NL), 58,650 products in total, for free, from one laptop running a small fleet of browser lanes.

The obstacle

Smyths sits behind Imperva / Distil Advanced Bot Protection — a "bouncer" that makes every visitor's browser solve a hidden JavaScript puzzle before it serves any page. Ordinary scraping tools have no JavaScript engine, so they get a "Request unsuccessful" wall instead of data. Even the sitemap (the list of products) is behind it.

What we tried — and what happened

| Approach | Result | Why |
| --- | --- | --- |
| plain curl / Python requests | blocked | No JS engine, so it can't solve the challenge. |
| Firecrawl (cloud, stealth mode) | challenged | Managed scraper still flagged by Imperva. |
| Playwright, headless Chrome | blocked | Distil specifically detects headless browsers. |
| Playwright, headed real Chrome | blocked | Imperva fingerprints the automation control layer itself. |
| Playwright + stealth plugin | blocked | Stealth hides some tells, not the core CDP leak. |
| patchright (patched Playwright driver) + real Chrome | works ✓ | Removes the exact automation signal Distil hunts for. |

The breakthrough: patchright, a patched build of the Playwright browser driver that strips the leftover "I'm a robot" signals from the automation layer driving Chrome. It's the only free tool we tried that walks through Imperva's front door.

How it actually pulls the data

  1. Discovery. A real browser opens each country's sitemap (solving the challenge automatically) and collects every product URL — the pattern is /p/<id>. That's how we know there are 58,650.
  2. Extraction. Each product page already contains a clean JSON-LD block (structured data search engines read) with name, price, availability, brand, image, description. We fetch it inside the warmed browser session and parse it — no need to render the whole page.
  3. Storage. One line per product appended to a file, so the run is fully resumable — crash, reboot, or block, it picks up where it left off.
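The extraction and storage steps above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: it assumes the JSON-LD block is a schema.org Product with an `offers` object, that each stored record carries a `url` key, and that the page HTML has already been fetched inside the warmed browser session (that part is not shown). The function names are hypothetical.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = None  # None means "not inside a JSON-LD script"

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.blocks.append("".join(self._buf))
            self._buf = None


def extract_product(html):
    """Return the first schema.org Product in the page, flattened, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offer = item.get("offers") or {}
            if isinstance(offer, list):
                offer = offer[0] if offer else {}
            brand = item.get("brand")
            return {
                "name": item.get("name"),
                "price": offer.get("price"),
                "availability": offer.get("availability"),
                "brand": brand.get("name") if isinstance(brand, dict) else brand,
                "image": item.get("image"),
                "description": item.get("description"),
            }
    return None


def load_done(path):
    """URLs already stored; unparseable lines (e.g. a torn final write) are skipped."""
    done = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue
    return done


def append_record(path, record):
    """Append one JSON object per line, starting a fresh line after a torn write."""
    torn = False
    if path.exists() and path.stat().st_size > 0:
        with path.open("rb") as fh:
            fh.seek(-1, 2)
            torn = fh.read(1) != b"\n"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(("\n" if torn else "") + json.dumps(record, ensure_ascii=False) + "\n")
```

On restart, a run would call `load_done(Path("products.jsonl"))` (a hypothetical filename) and skip every URL in the returned set. A product whose line was cut off mid-write is simply fetched again.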

Staying under the radar

Imperva counts requests per IP, so every lane (see the fleet, below) deliberately goes slow and human-like.
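The exact pacing rules are not listed on this page, so the following is only a sketch of what "slow and human-like" could mean. The one grounded number is the pace that tripped the limit in this run (about 450 requests in 25 minutes, roughly one every 3.3 seconds). The base delay, jitter range, and break schedule are all assumptions chosen for illustration, and `human_delay` is a hypothetical helper.

```python
import random

# Pace that got challenged in this run: ~450 requests in 25 minutes.
TRIPPED_INTERVAL_S = 25 * 60 / 450  # about 3.3 seconds per request


def human_delay(request_index, rng,
                base_s=2 * TRIPPED_INTERVAL_S,  # assumption: half the tripped rate
                jitter=0.4,                     # assumption: +/- 40% randomness
                break_every=50,                 # assumption: pause every 50 requests
                break_s=(30.0, 90.0)):          # assumption: pause length range
    """Seconds to wait before the next request on one lane."""
    delay = base_s * rng.uniform(1 - jitter, 1 + jitter)
    if request_index > 0 and request_index % break_every == 0:
        delay += rng.uniform(*break_s)  # occasional longer "reading" break
    return delay
```

A lane would call `time.sleep(human_delay(i, rng))` between fetches. With these defaults the shortest possible gap is about 4 seconds, which stays above the interval that triggered the challenge.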

The walls we hit live

| Event | Fix |
| --- | --- |
| Tripped a per-IP rate limit (~450 requests in 25 min) → got challenged | Stop, cool down, resume at a gentler pace. |
| Tried NordVPN to spread load | Automated access on the VPN IP is blocked (datacenter IPs are pre-flagged), but once a human solves one CAPTCHA on it, that VPN session scrapes fine. The current run is on a NordVPN IP. |
| A CAPTCHA appeared | A human solves it once in the live window; the scraper detects the clearance and auto-resumes through that same session. |
| Single lane too slow (~days) | Scaled to a parallel fleet: up to 9 browsers, each on a different NordVPN exit IP (see below). |
| Datacenter IPs degrade | Each Nord exit works in bursts (~100 products) then re-challenges; one exit (San Francisco) went fully dead. A self-healing watchdog relaunches dead lanes, the dead exit was retired, and its share of products was handed to the surviving lanes. |

Key lesson: the real bottleneck is the human-solved challenge. Every IP, even a proxy, needs one to open its session. Speed then comes from running several solved sessions in parallel (next section), with a person tapping the occasional CAPTCHA.
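The "detect the clearance and auto-resume" behaviour could look something like the sketch below. It assumes a blocked response can be recognised by the "Request unsuccessful" wall text quoted earlier. The polling interval, the timeout, and the `fetch_probe` callable (which would re-request a known page through the same browser session) are hypothetical.

```python
import time

# Only the wall text mentioned on this page; any further markers would be guesses.
BLOCK_MARKERS = ("Request unsuccessful",)


def is_blocked(html):
    """True if the response looks like the bot-protection wall, not a product page."""
    return any(marker in html for marker in BLOCK_MARKERS)


def wait_for_clearance(fetch_probe, poll_s=10.0, max_polls=180, sleep=time.sleep):
    """Poll a probe page until the wall disappears (e.g. a human solved the CAPTCHA).

    Returns True once a probe comes back unblocked, False if max_polls run out.
    """
    for _ in range(max_polls):
        if not is_blocked(fetch_probe()):
            return True
        sleep(poll_s)
    return False
```

The lane pauses its queue while `wait_for_clearance` runs and picks up at the next unfinished URL once it returns True. A False result is a reasonable signal to treat the exit as dead.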

Going parallel: the browser fleet

One lane would take days, so we run several browsers at once — each routed through a different NordVPN SOCKS5 exit (a different IP). Because Imperva's rate limit is per-IP, each lane gets its own budget, multiplying throughput.
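One way to split the work across lanes, and to hand a dead exit's share to the survivors, is sketched below. The round-robin split and the lane labels are assumptions for illustration; the page does not say how the real fleet partitions its URLs.

```python
def shard(urls, lanes):
    """Deal product URLs out to lanes round-robin so each lane gets an even share."""
    plan = {lane: [] for lane in lanes}
    for i, url in enumerate(urls):
        plan[lanes[i % len(lanes)]].append(url)
    return plan


def retire_lane(plan, dead, done):
    """Drop a dead lane and spread its unfinished URLs over the surviving lanes."""
    survivors = {lane: list(urls) for lane, urls in plan.items() if lane != dead}
    if not survivors:
        raise RuntimeError("no surviving lanes to take over the work")
    orphaned = [url for url in plan[dead] if url not in done]
    # Hand work to the least-loaded lanes first.
    order = sorted(survivors, key=lambda lane: len(survivors[lane]))
    for i, url in enumerate(orphaned):
        survivors[order[i % len(order)]].append(url)
    return survivors
```

For example, with lanes `["exit-a", "exit-b", "exit-sf"]` (hypothetical labels), calling `retire_lane(plan, "exit-sf", done)` returns a plan with two lanes that between them cover everything `exit-sf` had not yet stored.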

Honest trade-off: these are datacenter IPs, so they degrade — lanes re-challenge after a burst and need the odd re-solve, and an exit can die outright. The genuinely fast, near-zero-CAPTCHA path is paid residential proxies (trusted home IPs). The free fleet trades a little babysitting for $0.

