Pulling every product (name, price, stock, description, image, category) from all 7 country sites of Smyths Toys — 58,650 products across UK, IE, DE, AT, CH, FR, NL — for free, from one laptop running a small fleet of browser lanes.
Smyths sits behind Imperva / Distil Advanced Bot Protection — a "bouncer" that makes every visitor's browser solve a hidden JavaScript puzzle before it serves any page. Ordinary scraping tools have no JavaScript engine, so they get a "Request unsuccessful" wall instead of data. Even the sitemap (the list of products) is behind it.
| Approach | Result | Why |
|---|---|---|
| plain curl / Python requests | blocked | No JS engine, so it can't solve the challenge. |
| Firecrawl (cloud, stealth mode) | challenged | Managed scraper still flagged by Imperva. |
| Playwright — headless Chrome | blocked | Distil specifically detects headless browsers. |
| Playwright — headed real Chrome | blocked | Imperva fingerprints the automation control layer itself. |
| Playwright + stealth plugin | blocked | Stealth hides some tells, not the core CDP leak. |
| patchright (patched Chrome driver) + real Chrome | works ✓ | Removes the exact automation signal Distil hunts for. |
The breakthrough: patchright, a patched browser driver that strips the leftover "I'm a robot" signals from Chrome. Of the free approaches we tried, it is the only one that walks through Imperva's front door.
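As a rough illustration of the winning row in the table, here is a minimal sketch of opening a page through patchright with real Chrome. It assumes the `patchright` Python package is installed and that it mirrors Playwright's sync API, as its documentation describes. The profile directory, the helper names and the specific launch options are illustrative choices, not the exact settings used in this run.

```python
from pathlib import Path


def chrome_launch_options(profile_dir: str) -> dict:
    """Launch options for a headed, real-Chrome persistent profile (assumed values)."""
    return {
        "user_data_dir": str(Path(profile_dir)),  # hypothetical profile location
        "channel": "chrome",    # use the installed Chrome rather than bundled Chromium
        "headless": False,      # headed window, so a person can solve a CAPTCHA
        "no_viewport": True,    # keep the window's natural size
    }


def fetch_html(url: str, profile_dir: str = "./smyths-profile") -> str:
    """Open `url` in a patchright-driven Chrome and return the page HTML."""
    # Imported lazily so the helper above works without patchright installed.
    from patchright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(
            **chrome_launch_options(profile_dir)
        )
        try:
            page = context.pages[0] if context.pages else context.new_page()
            page.goto(url, wait_until="domcontentloaded")
            return page.content()
        finally:
            context.close()
```

A persistent profile is used here so that any clearance cookies earned in one run are still present in the next. Whether that matches the real setup is an assumption.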
/p/<id>. That's how we know there are 58,650.
JSON-LD block (structured data search engines read) with name, price, availability, brand, image, description. We fetch it inside the warmed browser session and parse it; there is no need to render the whole page.
Imperva counts requests per IP, so every lane (see the fleet, below) deliberately goes slow and human-like.
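The JSON-LD step described above can be sketched with the standard library alone. The field layout follows the usual schema.org `Product` / `Offer` shape, which is an assumption: the actual structure on Smyths pages may differ, and the sample HTML is made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_product(html: str):
    """Return a flat dict for the first schema.org Product found, else None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than abort the page
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            if not isinstance(offers, dict):
                offers = {}
            brand = item.get("brand")
            if isinstance(brand, dict):
                brand = brand.get("name")
            return {
                "name": item.get("name"),
                "price": offers.get("price"),
                "currency": offers.get("priceCurrency"),
                "availability": offers.get("availability"),
                "brand": brand,
                "image": item.get("image"),
                "description": item.get("description"),
            }
    return None


# Made-up sample page, for illustration only.
SAMPLE = """<html><head><script type="application/ld+json">
{"@type": "Product", "name": "Example Toy",
 "brand": {"@type": "Brand", "name": "ExampleBrand"},
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "GBP",
            "availability": "https://schema.org/InStock"}}
</script></head><body></body></html>"""

product = extract_product(SAMPLE)
```

Parsing only the `<script>` block keeps the per-page work small, which fits the point that the whole page does not need to be rendered.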
| Event | Fix |
|---|---|
| Tripped a per-IP rate limit (~450 requests in 25 min) → got challenged | Stop, cool down, resume at a gentler pace. |
| Tried NordVPN to spread load | Automated access on the VPN IP is blocked (datacenter IPs are pre-flagged) — but once a human solves one CAPTCHA on it, that VPN session scrapes fine. The current run is on a NordVPN IP. |
| A CAPTCHA appeared | A human solves it once in the live window; the scraper detects the clearance and auto-resumes through that same session. |
| Single lane too slow (~days) | Scaled to a parallel fleet — up to 9 browsers, each on a different NordVPN exit IP (see below). |
| Datacenter IPs degrade | Each Nord exit works in bursts (~100 products) then re-challenges; one exit (San Francisco) went fully dead. A self-healing watchdog relaunches dead lanes, the dead exit was retired, and its share of products was handed to the surviving lanes. |
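The first row of the table gives the one hard number available for pacing: roughly 450 requests in 25 minutes tripped the limit, which works out to about one request every 3.3 seconds. A minimal sketch of a "gentler pace" derived from that figure follows. The safety factor and jitter range are assumed values chosen for illustration, not the ones used in the run.

```python
import random

# Observed in the table above: ~450 requests in 25 minutes triggered a challenge.
OBSERVED_TRIP_REQUESTS = 450
OBSERVED_TRIP_WINDOW_S = 25 * 60


def min_interval_s(safety_factor: float = 2.0) -> float:
    """Seconds between requests, kept `safety_factor` times slower than the trip rate."""
    trip_interval = OBSERVED_TRIP_WINDOW_S / OBSERVED_TRIP_REQUESTS  # about 3.33 s
    return trip_interval * safety_factor


def next_delay_s(rng: random.Random, safety_factor: float = 2.0,
                 jitter: float = 0.5) -> float:
    """Base interval plus up to `jitter` (as a fraction) of random extra wait."""
    base = min_interval_s(safety_factor)
    return base * (1.0 + rng.uniform(0.0, jitter))


def gentler(safety_factor: float, step: float = 1.5) -> float:
    """After a challenge and cool-down, resume with a larger safety factor."""
    return safety_factor * step
```

In a lane's loop this would be used as `time.sleep(next_delay_s(rng))` between product fetches. The jitter only ever adds time, so the average rate stays below the chosen ceiling.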
Key lesson: the real bottleneck is the human-solved challenge — every IP, even a proxy, needs one to open its session. Speed then comes from running several solved sessions in parallel (next section), with a person tapping the occasional CAPTCHA.
One lane would take days, so we run several browsers at once — each routed through a different NordVPN SOCKS5 exit (a different IP). Because Imperva's rate limit is per-IP, each lane gets its own budget, multiplying throughput.
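The bookkeeping behind such a fleet can be sketched as follows: product IDs are dealt out to lanes, and when an exit dies its unfinished share is handed to the survivors. The `Lane` class, the round-robin split and the proxy hostnames are hypothetical. The real fleet's assignment scheme is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    name: str
    proxy_server: str  # placeholder SOCKS5 URL, one distinct exit per lane
    product_ids: list = field(default_factory=list)


def assign_round_robin(product_ids, lanes):
    """Deal product IDs out to lanes one at a time so shares stay even."""
    for i, pid in enumerate(product_ids):
        lanes[i % len(lanes)].product_ids.append(pid)


def retire_lane(dead, lanes, done):
    """Drop a dead lane and hand its unfinished products to the survivors."""
    survivors = [lane for lane in lanes if lane is not dead]
    if not survivors:
        raise ValueError("no surviving lanes to take over the work")
    remaining = [pid for pid in dead.product_ids if pid not in done]
    dead.product_ids = []
    assign_round_robin(remaining, survivors)
    return survivors


# Hypothetical three-lane fleet with made-up proxy hosts and product IDs.
lanes = [
    Lane("lane-1", "socks5://exit-1.example.invalid:1080"),
    Lane("lane-2", "socks5://exit-2.example.invalid:1080"),
    Lane("lane-3", "socks5://exit-3.example.invalid:1080"),
]
assign_round_robin(list(range(9)), lanes)

# lane-2's exit dies after it finished product 1; the rest is redistributed.
survivors = retire_lane(lanes[1], lanes, done={1})
```

Because the rate limit is counted per IP, the split matters more than the scheduling: each lane only needs to respect its own budget, so no cross-lane coordination is required beyond this hand-off.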
Honest trade-off: these are datacenter IPs, so they degrade — lanes re-challenge after a burst and need the odd re-solve, and an exit can die outright. The genuinely fast, near-zero-CAPTCHA path is paid residential proxies (trusted home IPs). The free fleet trades a little babysitting for $0.