Key Takeaways
API-first beats browser-first. Before writing a single line of Playwright, check for an official API, RSS feed, sitemap, or public dataset. Browser automation should be the last resort, not the first.
robots.txt and ToS are contractual signals. In 2026, EU AI Act transparency rules and U.S. case law (notably hiQ v. LinkedIn and Meta v. Bright Data) have made the compliance posture of a scraping pipeline a board-level concern, not just an engineering preference.
Playwright is the modern default for dynamic sites. Its async architecture, network interception, and first-class Python bindings make it the most maintainable choice for sites that genuinely require a real browser.
Politeness is a feature, not an afterthought. Production pipelines need rate limiting, exponential backoff, a contact-bearing User-Agent, and observability — exactly the same hygiene you'd apply to any outbound integration.
Why This Guide Exists
If you search for "Playwright scraping" today, most results push you toward stealth plugins, residential proxy rotation, and fingerprint spoofing. That content is written for adversarial use cases — and it's a poor fit for the work most data engineers actually do: pulling public, permitted data into a warehouse for analytics, monitoring, or research.
This guide is the opposite. It assumes you have a legitimate data need, a budget for doing it correctly, and a legal team that will eventually ask, "can you show me how this pipeline respects the source site's wishes?"
Everything below is structured to answer that question.
The Compliance Layer: Decide Before You Code
Before touching Playwright, work through this checklist. Skipping it is the single most common reason scraping projects get killed in legal review.
Is there an official API or feed?
In order of preference:
1. Official REST/GraphQL API: almost always the right answer if it exists.
2. Bulk dataset / data dump: Common Crawl, Wikipedia dumps, government open data portals.
3. RSS, Atom, or sitemap.xml: structured, intended for machine consumption.
4. HTML scraping with requests + selectolax: for static pages.
5. Headless browser automation (Playwright): only when the data is rendered client-side or behind interactive state.
Each step down costs more in engineering time, infrastructure, and compliance risk. Most teams skip straight to step 5 and regret it six months later.
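As an illustration of the "sitemap before browser" option, here is a minimal sketch that pulls page URLs out of a sitemap using only the standard library. The function name `sitemap_urls` and the sample XML are assumptions made for this example; a real sitemap would first be downloaded at a polite rate.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Illustrative sitemap body (hypothetical URLs).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
</urlset>"""

print(sitemap_urls(SAMPLE))
```

If the sitemap already lists the pages you need, the crawl frontier is solved without rendering anything.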
Read robots.txt — and honor it
robots.txt is not legally binding in every jurisdiction, but ignoring it is the fastest way to lose a "good faith" argument in court. Python's standard library handles this:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if not rp.can_fetch("MyCompany-DataBot/1.0", "https://example.com/products"):
    raise PermissionError("robots.txt disallows this path")
For production, prefer reppy or protego. Both implement wildcard path patterns (`*` and `$`), which the standard-library parser does not, along with Crawl-delay.
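To show how a published Crawl-delay can feed into your pacing, here is a small sketch that parses a robots.txt body offline with the standard-library parser. The sample rules, the helper names `load_rules` and `delay_for`, and the 2-second default are assumptions for illustration; as noted, the standard-library parser does not handle wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

BOT_NAME = "MyCompany-DataBot"

# Illustrative robots.txt body; in production this comes from the site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


def delay_for(rp: RobotFileParser, default: float = 2.0) -> float:
    """Use the site's Crawl-delay when published, else a conservative default."""
    delay = rp.crawl_delay(BOT_NAME)
    return float(delay) if delay is not None else default


rules = load_rules(SAMPLE_ROBOTS)
print(rules.can_fetch(BOT_NAME, "https://example.com/products"))
print(rules.can_fetch(BOT_NAME, "https://example.com/private/report"))
print(delay_for(rules))
```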
Read the Terms of Service
Look for clauses on:
Automated access (often in §"Acceptable Use")
Rate limits or query caps
Data redistribution rights
Personal data handling — if any field could identify an EU resident, GDPR Article 6 applies whether you're in Europe or not.
If the ToS forbids scraping, your options are: (a) get written permission, (b) use an official API tier, or (c) pick a different data source. There is no clever fourth option.
Identify yourself
Set a User-Agent that names your organization and provides a contact address:
MyCompany-DataBot/1.0 (+https://mycompany.example/bot-info; data-eng@mycompany.example)
This is the single highest-leverage compliance signal you can send. Site operators who would otherwise block you will often whitelist a named, reachable bot.
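One way to keep that string consistent across services is to build it from its parts and refuse to run without contact details. This is only a sketch; `build_user_agent` is a hypothetical helper, not part of any library.

```python
def build_user_agent(bot_name: str, version: str, info_url: str, contact: str) -> str:
    """Compose a User-Agent that names the bot and says how to reach its owner."""
    if not info_url or "@" not in contact:
        # An anonymous bot defeats the purpose of identifying yourself.
        raise ValueError("info_url and a contact email address are required")
    return f"{bot_name}/{version} (+{info_url}; {contact})"


USER_AGENT = build_user_agent(
    "MyCompany-DataBot",
    "1.0",
    "https://mycompany.example/bot-info",
    "data-eng@mycompany.example",
)
print(USER_AGENT)
```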
Playwright vs. Selenium vs. Puppeteer in 2026
| Criterion | Playwright | Selenium 4 | Puppeteer |
|---|---|---|---|
| Python support | First-class (playwright) | First-class | Community port only |
| Async model | Native asyncio | Sync + remote WebDriver | Native (Node) |
| Auto-waiting | Yes, built-in | Manual WebDriverWait | Yes |
| Network interception | Yes, granular | Limited (via CDP) | Yes |
| Multi-browser | Chromium, Firefox, WebKit | All major | Chrome and Firefox |
| Cross-platform CI | Excellent | Excellent | Good |
For new Python projects on dynamic sites, Playwright is the default recommendation. Selenium remains the right call when you need broad legacy browser coverage (IE mode, old Safari) or are integrating with an existing Selenium Grid.
A Production-Grade Polite Scraper in Playwright
The following pattern is what we run in production. It is intentionally boring.
Project layout
scraper/
├── pyproject.toml
├── src/
│   └── scraper/
│       ├── __init__.py
│       ├── compliance.py   # robots.txt + ToS checks
│       ├── fetch.py        # Playwright wrapper
│       ├── parse.py        # selectors → structured records
│       └── pipeline.py     # orchestration
└── tests/
The fetch layer
# src/scraper/fetch.py
import logging
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from playwright.async_api import Browser, BrowserContext, async_playwright

logger = logging.getLogger(__name__)

USER_AGENT = (
    "MyCompany-DataBot/1.0 "
    "(+https://mycompany.example/bot-info; data-eng@mycompany.example)"
)


@asynccontextmanager
async def browser_context() -> AsyncIterator[BrowserContext]:
    """Yield a browser context with an honest User-Agent and always clean up."""
    async with async_playwright() as p:
        browser: Browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=USER_AGENT,
            viewport={"width": 1280, "height": 800},
            locale="en-US",
        )
        try:
            yield context
        finally:
            await context.close()
            await browser.close()


async def fetch_rendered_html(url: str, *, wait_selector: str | None = None) -> str:
    """Load a page, optionally wait for a selector, and return the rendered HTML."""
    async with browser_context() as ctx:
        page = await ctx.new_page()
        response = await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
        if response is None or not response.ok:
            status = response.status if response else None
            raise RuntimeError(f"Bad response for {url}: {status}")
        if wait_selector:
            await page.wait_for_selector(wait_selector, timeout=10_000)
        return await page.content()
Notice what's not here: no stealth plugins, no fingerprint patching, no proxy rotation. The User-Agent is honest. The viewport is standard. The browser runs with headless=True because there is no reason to pretend otherwise.
Rate limiting and backoff
Use one rate limiter per domain (aiolimiter's AsyncLimiter implements a leaky bucket), plus exponential backoff on 429/503:
# src/scraper/pipeline.py
import asyncio
import logging
import random

from aiolimiter import AsyncLimiter

from .fetch import fetch_rendered_html

logger = logging.getLogger(__name__)

# 1 request per 2 seconds per domain; adjust to the site's Crawl-delay or ToS.
DOMAIN_LIMITERS: dict[str, AsyncLimiter] = {}


def limiter_for(domain: str) -> AsyncLimiter:
    """Return the shared limiter for a domain, creating it on first use."""
    if domain not in DOMAIN_LIMITERS:
        DOMAIN_LIMITERS[domain] = AsyncLimiter(max_rate=1, time_period=2)
    return DOMAIN_LIMITERS[domain]


async def polite_fetch(url: str, domain: str, attempt: int = 1) -> str:
    """Fetch through the domain limiter, retrying with exponential backoff."""
    async with limiter_for(domain):
        try:
            return await fetch_rendered_html(url)
        except RuntimeError as e:
            # Note: fetch_rendered_html raises RuntimeError for any non-OK
            # response, so this retries more than just 429/503.
            if attempt >= 5:
                raise
            backoff = (2 ** attempt) + random.uniform(0, 1)
            logger.warning("Backing off %.1fs after error: %s", backoff, e)
            await asyncio.sleep(backoff)
            return await polite_fetch(url, domain, attempt + 1)
If the site publishes a Crawl-delay, use that value. If it doesn't, start at 1 request every 2–5 seconds and only increase the rate after you've contacted the site operator.
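The backoff arithmetic is easier to reason about, and to test, when it is pulled out into a pure function. The sketch below keeps the same `2 ** attempt` plus jitter formula and adds two things that are assumptions of this example rather than part of the pipeline above: a cap on the delay, and preferring a server-supplied Retry-After value when one is available.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 2.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        # The server said how long to wait; take it at its word.
        return max(0.0, retry_after)
    exponential = min(cap, base ** attempt)
    # Up to one second of jitter keeps parallel workers from retrying in lockstep.
    return exponential + random.uniform(0, 1)


for attempt in range(1, 6):
    print(attempt, round(backoff_delay(attempt), 1))
```

With the defaults, the five attempts wait roughly 2, 4, 8, 16, and 32 seconds, each plus up to a second of jitter.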
Caching to avoid re-fetching
Every re-fetch you avoid is one fewer request hitting the target site. A simple SQLite cache keyed on URL + a daily bucket goes a long way:
import datetime as dt
import hashlib
import sqlite3


def cache_key(url: str) -> str:
    """Key on the URL plus today's date so entries roll over daily."""
    today = dt.date.today().isoformat()
    return hashlib.sha256(f"{today}:{url}".encode()).hexdigest()
For larger pipelines, an HTTP cache layer like hishel (for httpx) or a CDN-style proxy cache is worth the setup time.
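To make the daily-bucket idea concrete, here is a minimal sketch of a SQLite-backed page cache. The `PageCache` class, its table layout, and the optional `day` argument (added so the behavior can be exercised without waiting for midnight) are assumptions of this example; it also never deletes old rows, which a real pipeline would want to prune.

```python
import datetime as dt
import hashlib
import sqlite3
from typing import Optional


def cache_key(url: str, day: Optional[dt.date] = None) -> str:
    """Hash of the URL plus a daily bucket."""
    day = day or dt.date.today()
    return hashlib.sha256(f"{day.isoformat()}:{url}".encode()).hexdigest()


class PageCache:
    """Tiny URL -> HTML cache; entries stop matching once the day changes."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (key TEXT PRIMARY KEY, html TEXT NOT NULL)"
        )

    def get(self, url: str, day: Optional[dt.date] = None) -> Optional[str]:
        row = self.conn.execute(
            "SELECT html FROM pages WHERE key = ?", (cache_key(url, day),)
        ).fetchone()
        return row[0] if row else None

    def put(self, url: str, html: str, day: Optional[dt.date] = None) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (key, html) VALUES (?, ?)",
            (cache_key(url, day), html),
        )
        self.conn.commit()


cache = PageCache()
cache.put("https://example.com/products", "<html>cached</html>")
print(cache.get("https://example.com/products"))
```

A caller would check `cache.get(url)` first and only go through the rate-limited fetch on a miss.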
Handling Dynamic Content Without Crossing Lines
The legitimate reason to reach for Playwright is that the data is rendered by JavaScript. A few patterns that stay on the right side of the line:
Prefer the underlying API call
Open DevTools → Network → XHR. Nine times out of ten, the page is fetching its own data from a JSON endpoint. If that endpoint is public and not rate-protected by a "do not call directly" notice, calling it with httpx is faster, cheaper, and gentler on the site than rendering the page.
import httpx

# Run inside an async function; USER_AGENT is the string defined in fetch.py.
async with httpx.AsyncClient(headers={"User-Agent": USER_AGENT}) as client:
    r = await client.get("https://example.com/api/products?page=1")
    r.raise_for_status()
    data = r.json()
Wait for the real signal, not arbitrary sleeps
# Bad: brittle, wastes time, hides bugs
await page.wait_for_timeout(5000)
# Good: tied to actual page state
await page.wait_for_selector("[data-testid='results-loaded']")
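# Use sparingly: Playwright's docs discourage relying on "networkidle"; prefer a concrete selector.
await page.wait_for_load_state("networkidle")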
Block what you don't need
Skip images, fonts, and media you don't need; analytics scripts can be blocked the same way with a URL pattern. This reduces load on the target site and speeds up your pipeline:
await context.route(
    "**/*",
    lambda route: (
        route.abort()
        if route.request.resource_type in {"image", "font", "media"}
        else route.continue_()
    ),
)
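Keeping the block/allow decision in a plain function makes it testable without launching a browser, and gives analytics hosts a place to live. In this sketch the host names in `BLOCKED_HOSTS` and the helper names `should_block` and `handle_route` are hypothetical; the handler relies only on `route.request`, `route.abort()`, and `route.continue_()`.

```python
from urllib.parse import urlparse

BLOCKED_RESOURCE_TYPES = {"image", "font", "media"}
# Hypothetical analytics hosts; replace with the ones you actually observe.
BLOCKED_HOSTS = {"analytics.example.com", "tracker.example.net"}


def should_block(resource_type: str, url: str) -> bool:
    """True if a request is not needed to render the data we want."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    host = urlparse(url).hostname or ""
    return host in BLOCKED_HOSTS


async def handle_route(route) -> None:
    """Playwright route handler: abort unneeded requests, pass the rest through."""
    request = route.request
    if should_block(request.resource_type, request.url):
        await route.abort()
    else:
        await route.continue_()


print(should_block("script", "https://analytics.example.com/t.js"))
```

The handler would be registered once per context with `await context.route("**/*", handle_route)`.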
Observability: Prove Your Pipeline Is Behaving
When legal or the data source asks "are you sure your scraper is polite?", you want logs, not vibes.
Minimum instrumentation:
Per-domain request rate (Prometheus counter or equivalent).
HTTP status code distribution — a spike in 429s means you're being told to slow down.
robots.txt re-check timestamp — re-read at least daily; sites change their policies.
Outbound User-Agent logged on every request, so you can prove what you sent.
import structlog

log = structlog.get_logger()

log.info(
    "fetch",
    url=url,
    domain=domain,
    user_agent=USER_AGENT,
    status=response.status,
    duration_ms=duration_ms,
)
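The status-code distribution can also drive behavior, not just dashboards. The following sketch tracks recent responses per domain and flags when throttling responses become too common. The `StatusMonitor` class, the 100-response window, and the 5% threshold are illustrative assumptions, not recommended values.

```python
from collections import Counter, deque


class StatusMonitor:
    """Tracks recent HTTP status codes for one domain."""

    def __init__(self, window: int = 100, max_throttle_ratio: float = 0.05) -> None:
        self.recent: deque = deque(maxlen=window)  # oldest entries fall off
        self.max_throttle_ratio = max_throttle_ratio

    def record(self, status: int) -> None:
        self.recent.append(status)

    def distribution(self) -> Counter:
        """Count of each status code within the current window."""
        return Counter(self.recent)

    def should_slow_down(self) -> bool:
        """True when 429/503 responses exceed the allowed share of the window."""
        if not self.recent:
            return False
        throttled = sum(1 for s in self.recent if s in (429, 503))
        return throttled / len(self.recent) > self.max_throttle_ratio


monitor = StatusMonitor(window=10, max_throttle_ratio=0.2)
for status in [200] * 7 + [429] * 3:
    monitor.record(status)
print(monitor.distribution(), monitor.should_slow_down())
```

When `should_slow_down()` turns true, the pipeline can lower the limiter's rate or pause the domain entirely.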
When to Stop and Ask for Permission
A short checklist. If you answer "yes" to any of these, pause and contact the site operator:
The site has explicitly blocked your User-Agent or IP range.
You are scraping more than ~10,000 pages from a single domain.
The data includes anything that could identify a natural person.
You plan to redistribute or resell the scraped data.
The ToS is ambiguous and your use case is commercial.
An email to legal@ or partnerships@ costs a day. A cease-and-desist costs a quarter.
FAQ
Is web scraping legal in 2026?
In most jurisdictions, scraping public, non-personal data while honoring robots.txt and ToS is legal. In the U.S., the Ninth Circuit's hiQ v. LinkedIn decision (reaffirmed in 2022 after remand from the Supreme Court) held that scraping publicly accessible data likely does not violate the CFAA. It did not settle contract claims: the district court later found that hiQ had breached LinkedIn's User Agreement, which is why the ToS still matters. The EU's position under GDPR and the AI Act is stricter: any scraping that touches personal data needs a lawful basis under Article 6, regardless of whether the data is technically public. Always consult counsel for your specific use case.
Should I use Playwright or Scrapy?
Use Scrapy when the target site is mostly static HTML and you need throughput, built-in pipelines, and a mature ecosystem of middlewares. Use Playwright when the data is rendered client-side, requires authenticated state, or depends on interactive elements. The two compose well: Scrapy for the crawl frontier and dedupe, Playwright (via scrapy-playwright) only for the pages that genuinely need a browser.
Does Playwright support robots.txt natively?
No. Playwright is a browser automation library; it has no opinion on crawl policy. You enforce robots.txt in your own pipeline code, before Playwright is invoked. The protego library is the most spec-compliant option in Python today.
How do I avoid getting blocked without using stealth plugins?
Honor the site's rate limits, set a contact-bearing User-Agent, cache aggressively, and don't scrape what you don't need. Most blocks are triggered by request volume, not by fingerprinting. If you've followed the politeness rules and are still being blocked, that is the site operator telling you they don't want to be scraped — and the answer is to stop and ask, not to evade.
Is headless mode detectable?
Yes — and that's fine. In a compliant pipeline you are not hiding the fact that you're a bot; you're being a well-behaved bot. The goal is not to impersonate a human, it is to identify yourself and behave proportionately. If a site blocks all headless traffic, treat that as a policy signal and seek an API agreement instead.
What about CAPTCHAs?
If you're hitting CAPTCHAs on a public-data scraping job, you are almost certainly either (a) requesting too fast, (b) scraping a site that does not want to be scraped, or (c) accessing a path that requires authentication you don't have. None of these are solved by CAPTCHA-bypass services. Slow down, re-read the ToS, or reach out to the site operator.