
Python Developer – Low-Latency Web Scraper (1200 Investor Relations Pages)
Upwork
Remote
About
We need a developer to build a Python-based web scraper that monitors around 1,200 investor relations pages from company websites. The scraper should detect when a new update goes live and capture:
• the latest post headline
• the body text (or PDF content if that’s how it’s published)
• a short summary of the post

The target is under 5 seconds latency from when the update appears on the site to when we capture and process it.

What’s involved
• Each IR page has its own quirks (different structures, formats, update methods), so you’ll need to write site-specific logic and maintain it.
• Use Python (3.11+) with async libraries like httpx / aiohttp for fast checks.
• Handle RSS feeds or sitemaps where available, but most will be plain HTML or PDFs.
• Extract titles, timestamps, links, body text.
• Build in PDF parsing (pdfminer.six, pdfplumber etc.) where needed.
• Add simple summarisation (e.g. spaCy/NLTK, HuggingFace if lightweight).
• Make sure we don’t capture duplicates: use hashes, canonical URLs, etc.
• Keep it efficient: we want fast turnaround without hammering sites (respect robots.txt and rate limits).
• Output normalised JSON (we’ll define schema).

Must-have skills
• Strong Python (asyncio, concurrency, handling lots of requests fast).
• Real experience scraping at scale and speed.
• HTML parsing (lxml, BeautifulSoup, selectolax).
• PDF extraction experience.
• Comfortable working with queues / simple pipelines (Redis/Kafka/RabbitMQ).
• Know how to make things observable (basic logging/metrics).

Nice-to-have
• Familiarity with investor relations or financial news.
• Experience handling anti-bot measures.
• Text summarisation / NLP.
• Playwright (Python) for tricky JavaScript-heavy sites (only if really needed).

Deliverable
• A working scraper covering ~1,200 IR pages.
• Near real-time detection (sub 5s on average).
• Robust enough to keep running during market hours.
• Simple way to monitor success/failures.
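
As an illustration of the duplicate-handling requirement (hashes, canonical URLs), here is a minimal sketch using only the standard library. The names `canonical_url`, `content_hash`, `SeenStore` and the `TRACKING_PARAMS` list are hypothetical and not part of the brief, and the in-memory sets stand in for whatever store (for example Redis) the final pipeline would use.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to ignore; extend per site.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def canonical_url(url: str) -> str:
    """Lower-case scheme/host, drop fragment and tracking params, sort the query."""
    parts = urlsplit(url.strip())
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def content_hash(headline: str, body: str) -> str:
    """SHA-256 over whitespace-collapsed, lower-cased headline + body."""
    normalised = " ".join(f"{headline}\n{body}".split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class SeenStore:
    """In-memory stand-in for a persistent dedupe store (assumption)."""

    def __init__(self) -> None:
        self._urls: set[str] = set()
        self._hashes: set[str] = set()

    def is_new(self, url: str, headline: str, body: str) -> bool:
        key_url = canonical_url(url)
        key_hash = content_hash(headline, body)
        # A post counts as a duplicate if either its URL or its text was seen.
        if key_url in self._urls or key_hash in self._hashes:
            return False
        self._urls.add(key_url)
        self._hashes.add(key_hash)
        return True


store = SeenStore()
first = store.is_new("https://Example.com/news/a/?utm_source=x", "Q3 results", "Body text")
repeat = store.is_new("https://example.com/news/a", "Q3 results", "Body text")
```

In this sketch a post whose URL has already been seen is treated as a duplicate even if its text has changed. Whether edited posts should be re-captured is a decision the brief leaves open.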
Sample set of URLs
• SAP (Investor Relations): Main Investor Relations, https://www.sap.com/investors/en.html
• Novo Nordisk (Investor Relations / News): Financial Results & Events Overview, https://www.novonordisk.com/investors/financial-results.html
• Novartis (News / Press Release): Latest News, https://www.novartis.com/news
• Roche (Media / News): Media (Main News Page), https://www.roche.com/media
• Shell Energy (Press / News): Main Newsroom / Media Releases, https://www.shell.com/news-and-insights/newsroom.html
• HSBC (News / Press Release): Media releases, https://www.hsbc.com/news-and-views/news/media-releases
• Unilever (Press / News): Main Press & Media, https://www.unilever.com/news/press-and-media/
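
To put the 5-second target in perspective for roughly 1,200 pages like those in the sample set, the following back-of-the-envelope sketch estimates the request rate and detection delay that plain polling would imply. It assumes one request per page per poll, updates landing at a uniformly random moment within the polling interval, and a fixed fetch-and-process time. The 4-second interval and 1-second processing time are illustrative figures, not values from the brief, and `polling_budget` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PollingBudget:
    interval_s: float
    requests_per_second: float       # aggregate rate across all pages
    mean_detection_delay_s: float    # average time from publish to capture
    worst_detection_delay_s: float   # update lands just after a poll


def polling_budget(n_pages: int, interval_s: float, fetch_and_process_s: float) -> PollingBudget:
    """Estimate load and latency when every page is polled once per interval."""
    return PollingBudget(
        interval_s=interval_s,
        requests_per_second=n_pages / interval_s,
        # On average an update waits half an interval before the next poll.
        mean_detection_delay_s=interval_s / 2 + fetch_and_process_s,
        worst_detection_delay_s=interval_s + fetch_and_process_s,
    )


# Illustrative assumption: poll every 4 s, 1 s to fetch, parse and summarise.
budget = polling_budget(n_pages=1200, interval_s=4.0, fetch_and_process_s=1.0)
print(budget)
```

Under these assumptions the scraper would issue about 300 requests per second in aggregate, or one request every 4 seconds per page, which is why feeds, sitemaps and cheap change checks matter for staying within robots.txt and rate limits.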