How this scrape works

This scrape pulls every product (name, price, stock, description, image, category) from all 7 Smyths Toys country sites (UK, IE, DE, AT, CH, FR, NL), 58,650 products in total, for free, from one laptop running a small fleet of browser lanes.

The obstacle

Smyths sits behind Imperva / Distil Advanced Bot Protection — a "bouncer" that makes every visitor's browser solve a hidden JavaScript puzzle before it serves any page. Ordinary scraping tools have no JavaScript engine, so they get a "Request unsuccessful" wall instead of data. Even the sitemap (the list of products) is behind it.

What we tried — and what happened

Approach	Result	Why
plain `curl` / Python requests	blocked	No JS engine — can't solve the challenge.
Firecrawl (cloud, stealth mode)	challenged	Managed scraper still flagged by Imperva.
Playwright — headless Chrome	blocked	Distil specifically detects headless browsers.
Playwright — headed real Chrome	blocked	Imperva fingerprints the automation control layer itself.
Playwright + stealth plugin	blocked	Stealth hides some tells, not the core CDP leak.
patchright (patched Chrome driver) + real Chrome	works ✓	Removes the exact automation signal Distil hunts for.

The breakthrough: patchright, a patched browser driver that strips the leftover "I'm a robot" signals from Chrome. Of everything we tried, it is the only free tool that got through Imperva's front door.

How it actually pulls the data

Discovery. A real browser opens each country's sitemap (solving the challenge automatically) and collects every product URL — the pattern is /p/<id>. That's how we know there are 58,650.
Extraction. Each product page already contains a clean JSON-LD block (structured data search engines read) with name, price, availability, brand, image, description. We fetch it inside the warmed browser session and parse it — no need to render the whole page.
Storage. One line per product appended to a file, so the run is fully resumable — crash, reboot, or block, it picks up where it left off.
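
The extraction and storage steps above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the function names, the output record layout, and the assumption that the product data is a schema.org `Product` object with a single `offers` entry are all hypothetical. It takes page HTML that has already been fetched and does not cover the browser session.

```python
import json
import re
from pathlib import Path
from typing import Optional, Set

# Matches <script type="application/ld+json"> ... </script> blocks.
JSONLD_RE = re.compile(
    r"""<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>""",
    re.DOTALL | re.IGNORECASE,
)


def extract_product(html: str) -> Optional[dict]:
    """Return the first JSON-LD object typed as Product, or None."""
    for match in JSONLD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # ignore malformed blocks
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            if types == "Product" or (isinstance(types, list) and "Product" in types):
                return item
    return None


def load_done(path: Path) -> Set[str]:
    """Collect URLs already stored, so a restarted run can skip them."""
    done: Set[str] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError):
                continue  # skip lines that do not parse, e.g. a partial write
    return done


def append_record(path: Path, url: str, product: dict) -> None:
    """Append one product as a single JSON line (assumed record layout)."""
    offers = product.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    record = {
        "url": url,
        "name": product.get("name"),
        "price": offers.get("price"),
        "availability": offers.get("availability"),
        "brand": product.get("brand"),
        "image": product.get("image"),
        "description": product.get("description"),
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

On restart, a run would call `load_done` once and skip any URL already in the returned set, which is what makes an append-only file resumable.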

Staying under the radar

Imperva counts requests per IP, so every lane (see the fleet, below) deliberately goes slow and human-like:

Random 0.8–7s gaps between requests (not a robotic fixed interval).
Occasional long "reading" pauses.
Product order shuffled across all countries, so it never marches predictably through one category.
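
A minimal sketch of this pacing, under stated assumptions: the 0.8–7s range is from the list above, but the uniform distribution, the 5% chance of a long pause, and the 20–60s pause length are illustrative guesses, and the function names are hypothetical.

```python
import random
from typing import Iterable, List


def next_delay(rng: random.Random, long_pause_chance: float = 0.05) -> float:
    """Seconds to wait before the next request."""
    delay = rng.uniform(0.8, 7.0)  # random gap, never a fixed interval
    if rng.random() < long_pause_chance:
        delay += rng.uniform(20.0, 60.0)  # assumed length of a "reading" pause
    return delay


def shuffled(urls: Iterable[str], rng: random.Random) -> List[str]:
    """Return the URLs in random order, leaving the input untouched."""
    order = list(urls)
    rng.shuffle(order)
    return order
```

A lane would call `time.sleep(next_delay(rng))` between fetches and walk its URLs in the order returned by `shuffled`.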

The walls we hit live

Event	Fix
Tripped a per-IP rate limit (~450 requests in 25 min) → got challenged	Stop, cool down, resume at a gentler pace.
Tried NordVPN to spread load	Automated access on the VPN IP is blocked (datacenter IPs are pre-flagged) — but once a human solves one CAPTCHA on it, that VPN session scrapes fine. The current run is on a NordVPN IP.
A CAPTCHA appeared	A human solves it once in the live window; the scraper detects the clearance and auto-resumes through that same session.
Single lane too slow (~days)	Scaled to a parallel fleet — up to 9 browsers, each on a different NordVPN exit IP (see below).
Datacenter IPs degrade	Each Nord exit works in bursts (~100 products), then gets re-challenged; one exit (San Francisco) went fully dead. A self-healing watchdog relaunches dead lanes; the dead exit was retired and its share of products was handed to the surviving lanes.

Key lesson: the real bottleneck is the human-solved challenge — every IP, even a proxy, needs one to open its session. Speed then comes from running several solved sessions in parallel (next section), with a person tapping the occasional CAPTCHA.

Going parallel: the browser fleet

One lane would take days, so we run several browsers at once — each routed through a different NordVPN SOCKS5 exit (a different IP). Because Imperva's rate limit is per-IP, each lane gets its own budget, multiplying throughput.
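
As a rough back-of-the-envelope check on "days", the figures already quoted can be combined as below. The assumptions are loud ones: one request per product, gaps drawn uniformly from 0.8–7s (mean 3.9s), and no time lost to page fetches, long pauses, CAPTCHAs, or re-challenges, so real runs will be slower.

```python
TOTAL_PRODUCTS = 58_650

# The rate that tripped the per-IP limit: ~450 requests in 25 minutes.
TRIP_RATE_PER_MIN = 450 / 25  # 18 requests per minute

# Assumed steady pace: mean of a uniform 0.8-7 s gap between requests.
MEAN_GAP_SECONDS = (0.8 + 7.0) / 2
PACE_PER_MIN = 60 / MEAN_GAP_SECONDS


def hours_to_finish(lanes: int, per_lane_per_min: float = PACE_PER_MIN) -> float:
    """Idealised wall-clock hours if every lane runs without interruption."""
    return TOTAL_PRODUCTS / (lanes * per_lane_per_min) / 60


if __name__ == "__main__":
    for lanes in (1, 8, 9):
        print(f"{lanes} lane(s): about {hours_to_finish(lanes):.1f} h")
```

Under these assumptions one lane needs roughly 64 hours (about 2.6 days) and 8 lanes roughly 8 hours, with each lane pacing itself a little below the 18 requests per minute that triggered the challenge.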

Different IP per browser. Each lane uses a separate Nord exit (Amsterdam, Atlanta, Dallas, New York, …) — up to 9, now 8 after one died.
A tiny local relay (gost). Chrome does not support username/password authentication for SOCKS proxies, so a small local relay handles the Nord login and hands each browser a clean local port that needs no credentials.
One human CAPTCHA per lane. Each lane's session is opened by a single hand-solved challenge, then it scrapes its own shard of the catalogue.
Self-healing + resumable. A watchdog checks every 5 min, relaunches any lane that dies, and the work is sharded so nothing is lost or scraped twice.
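
One way the sharding could work is sketched below. It is an assumption, not the fleet's actual scheme: each URL is hashed to a fixed lane, and only URLs whose lane has been retired are re-hashed across the survivors. The function names and the lane numbering are hypothetical.

```python
import hashlib
from typing import FrozenSet


def stable_bucket(url: str, n: int, salt: str = "") -> int:
    """Map a URL to a bucket in [0, n), identically on every run."""
    digest = hashlib.sha256((salt + url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def lane_for(url: str, total_lanes: int, dead: FrozenSet[int] = frozenset()) -> int:
    """Pick the lane responsible for a URL, skipping retired lanes."""
    lane = stable_bucket(url, total_lanes)
    if lane not in dead:
        return lane  # live lanes keep their original share
    survivors = [i for i in range(total_lanes) if i not in dead]
    if not survivors:
        raise RuntimeError("no live lanes left")
    # Spread the retired lane's share across the surviving lanes.
    return survivors[stable_bucket(url, len(survivors), salt="reassign:")]
```

Because every URL maps to exactly one live lane at any time, and each lane skips URLs already written to the output file, nothing should be dropped or fetched twice when an exit is retired.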

Honest trade-off: these are datacenter IPs, so they degrade — lanes re-challenge after a burst and need the odd re-solve, and an exit can die outright. The genuinely fast, near-zero-CAPTCHA path is paid residential proxies (trusted home IPs). The free fleet trades a little babysitting for $0.
