
Selenium + Scrapy Expert Required for 42M-Record Web Data Extraction
Upwork
Remote
About
**Overview**

Seeking an experienced Selenium & Scrapy specialist to build a robust pipeline that extracts ~42M+ data points from target sites, with clean, deduplicated output and reliable scaling.

**Scope**

- Plan selectors, pagination, and the overall crawl strategy
- Use Scrapy for high-throughput pages and Selenium for JS-heavy flows (see the Selenium sketch at the end of this post)
- Proxy rotation, throttling, retries, and session/UA management (see the middleware sketch at the end of this post)
- Handle CAPTCHAs (allowed/ethical methods only), failures, and resume-on-error
- Normalize and validate data; enforce a schema; deduplicate (see the pipeline sketch at the end of this post)
- Export to Parquet/CSV/JSON; deliver to S3/GCS/Blob storage or a database (Postgres/BigQuery)
- Basic CI/CD plus a scheduler (cron/Airflow) and monitoring/logging
- Clear documentation and handover

**Targets**

- Backfill: ~32M data points
- Quality: at least 99% valid records, no more than 0.5% duplicates
- Throughput: propose a realistic rate based on each site's limits

**Deliverables**

- Source code (Scrapy/Selenium) plus a README and a sample .env file
- Configurable settings (concurrency, timeouts, proxies, outputs)
- Data schema and validation rules
- Initial dataset with an integrity report
- Optional: Dockerfile and an Airflow DAG or cron setup

**Requirements**

- 3+ years of large-scale scraping experience (Scrapy, Selenium)
- Python, selectors (XPath/CSS), and regex; anti-bot best practices
- Proxy/IP rotation; Linux; cloud experience (AWS/GCP/Azure)

**Compliance**

- Respect robots.txt, site TOS, and applicable legal constraints

**How to Apply**

- Briefly describe your approach (anti-bot handling, proxies, QA)
- Include relevant large-scale examples (5M+ rows preferred)
- Attach a code sample (a Scrapy middleware/pipeline or Selenium handling a JS flow)
- Estimate throughput and outline your infrastructure plan
- Pricing: fixed or hourly, with a backfill estimate
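
For reference, a minimal sketch of the kind of rotating-proxy downloader middleware the scope calls for; the `PROXY_LIST` setting name and the proxy URL format are illustrative assumptions, not part of the brief:

```python
# Minimal sketch of a rotating-proxy downloader middleware for Scrapy.
# PROXY_LIST is a hypothetical project setting holding proxy URLs, e.g.
# ["http://user:pass@host:port", ...]; adapt to your proxy provider.
import random


class RotatingProxyMiddleware:
    """Assigns a randomly chosen proxy to every outgoing request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Read the (hypothetical) PROXY_LIST setting from the project config.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware honors this meta key.
            request.meta["proxy"] = random.choice(self.proxies)
```

It would be enabled through the standard `DOWNLOADER_MIDDLEWARES` setting; retries and throttling can then ride on Scrapy's stock `RetryMiddleware` and `AUTOTHROTTLE_ENABLED` rather than custom code.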
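
Deduplication can likewise run as an item pipeline. The sketch below drops in-run duplicates by hashing a key field; the `record_id` field name is a placeholder for whatever natural key the target data actually offers:

```python
# Minimal sketch of a deduplicating item pipeline for Scrapy.
import hashlib

from scrapy.exceptions import DropItem


class DedupPipeline:
    """Drops items whose key field has already been seen in this run."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # Hash the key field so the in-memory set stays compact.
        key = hashlib.sha1(str(item.get("record_id")).encode("utf-8")).hexdigest()
        if key in self.seen:
            raise DropItem(f"Duplicate record: {item.get('record_id')}")
        self.seen.add(key)
        return item
```

Registered under `ITEM_PIPELINES`, this catches duplicates within a single run only; hitting the 0.5% target across a ~42M-row backfill would likely also need a persistent key store or a post-load dedup pass.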
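
And for the JS-heavy flows, a minimal Selenium sketch using an explicit wait before handing the rendered HTML back to normal selector-based parsing; the URL and CSS selector are placeholders:

```python
# Minimal sketch of rendering a JS-heavy page with headless Chrome.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/listings")  # placeholder URL
    # Wait until the JS-rendered results container exists before parsing.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.results"))
    )
    html = driver.page_source  # hand off to XPath/CSS extraction
finally:
    driver.quit()
```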