Overview
An SEO company needed to regularly scan client websites for illegal, explicit, or forbidden terms.
Each client site had:
- multiple pages
- deep internal links
- varying site structures
The existing workflow relied on manual checks; it was slow, inconsistent, and didn't scale. The new system needed to:
- run multiple crawls concurrently across different client sites
- crawl websites up to a configurable depth
- analyze page content against a large blacklist (~400,000 phrases)
- respect whitelisted contexts to reduce false positives
I helped design and implement an internal backend system to crawl client websites and detect illegal or forbidden content at scale, replacing a manual, error-prone SEO compliance process.
My Role
- Backend developer responsible for end-to-end system design and implementation
- Worked closely with an external vendor to:
- refine requirements
- finalize application flow
- iterate on technical decisions
- Built the entire backend, including crawling, queuing, phrase detection, concurrency control, and reporting
Crawling and Page Analysis
- Implemented a BFS-based web crawler using Puppeteer (see the sketch after this list)
- Enforced:
- configurable depth limits (default up to depth 5)
- duplicate URL elimination using a Set
- Each page was:
- rendered in a headless browser
- scanned for violations
- visually annotated (matched terms highlighted)
- captured via screenshots for audit/reference
- Screenshots were uploaded to S3 for later inspection
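The sketch below shows how such a crawl loop could be structured, assuming Node.js with puppeteer and @aws-sdk/client-s3; the `scan` callback, bucket name, and key scheme are placeholders rather than the original code:

```ts
import puppeteer from "puppeteer";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const SCREENSHOT_BUCKET = "compliance-screenshots"; // assumed bucket name

interface CrawlItem {
  url: string;
  depth: number;
}

// `scan` stands in for the phrase-detection step described further down.
export async function crawlSite(
  rootUrl: string,
  scan: (text: string) => string[],
  maxDepth = 5
): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  const visited = new Set<string>();                       // duplicate-URL elimination
  const queue: CrawlItem[] = [{ url: rootUrl, depth: 0 }]; // BFS frontier

  try {
    while (queue.length > 0) {
      const { url, depth } = queue.shift()!;               // FIFO => breadth-first order
      if (visited.has(url)) continue;
      visited.add(url);

      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: "networkidle2", timeout: 30_000 });

        // Scan the rendered text for blacklisted phrases.
        const text = await page.evaluate(() => document.body.innerText);
        const matches = scan(text);

        if (matches.length > 0) {
          // Visually annotate: wrap text nodes containing a match in a <mark>.
          await page.evaluate((terms) => {
            const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
            const nodes: Text[] = [];
            while (walker.nextNode()) nodes.push(walker.currentNode as Text);
            for (const node of nodes) {
              if (terms.some((t) => node.data.toLowerCase().includes(t.toLowerCase()))) {
                const mark = document.createElement("mark");
                mark.textContent = node.data;
                node.replaceWith(mark);
              }
            }
          }, matches);

          // Capture a full-page screenshot and upload it to S3 for audit.
          const shot = await page.screenshot({ fullPage: true });
          await s3.send(new PutObjectCommand({
            Bucket: SCREENSHOT_BUCKET,
            Key: `screenshots/${encodeURIComponent(url)}.png`,
            Body: shot,
            ContentType: "image/png",
          }));
        }

        // Collect same-origin links and push them one level deeper.
        if (depth < maxDepth) {
          const links = await page.evaluate(() =>
            Array.from(document.querySelectorAll("a[href]"), (a) => (a as HTMLAnchorElement).href)
          );
          for (const link of links) {
            if (new URL(link).origin === new URL(rootUrl).origin && !visited.has(link)) {
              queue.push({ url: link, depth: depth + 1 });
            }
          }
        }
      } finally {
        await page.close();
      }
    }
  } finally {
    await browser.close();
  }
}
```

Keeping the frontier as a plain FIFO queue is what makes the traversal breadth-first, and the Set is what keeps deeply interlinked sites from being re-crawled.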
Queuing and Concurrency
- Used Redis + BullMQ for job orchestration
- Designed the queue to support:
- multiple concurrent crawl jobs
- multiple sites per crawl
- multiple pages per site
- Initially queued entire payloads; refactored to queue only task IDs once inputs grew large (see the worker sketch after this list)
- Controlled concurrency via:
- BullMQ workers
- p-limit for fine-grained async control
- Managed browser and tab lifecycles dynamically (sketched below):
- scaled tabs based on available memory
- capped tabs per browser
- spawned additional browser processes when needed
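The sketch below shows one way this wiring could look, assuming BullMQ and p-limit; loadTask, crawlPage, and the concurrency numbers are illustrative placeholders rather than production values:

```ts
import { Queue, Worker } from "bullmq";
import pLimit from "p-limit";

const connection = { host: "localhost", port: 6379 }; // assumed Redis connection

// Jobs carry only a task ID; the full crawl definition is fetched by the worker.
const crawlQueue = new Queue<{ taskId: string }>("crawl", { connection });

export async function enqueueCrawl(taskId: string): Promise<void> {
  await crawlQueue.add("crawl-site", { taskId }, { attempts: 3 });
}

// Placeholders standing in for the real task loader and page crawler.
async function loadTask(taskId: string): Promise<{ pages: string[] }> {
  return { pages: [] };
}
async function crawlPage(url: string): Promise<void> {}

const JOBS_PER_WORKER = 3;  // concurrent crawl jobs handled by one worker process
const PAGES_IN_FLIGHT = 5;  // concurrent pages within a single crawl job

new Worker<{ taskId: string }>(
  "crawl",
  async (job) => {
    const task = await loadTask(job.data.taskId);
    const limit = pLimit(PAGES_IN_FLIGHT);  // fine-grained async throttle per job
    await Promise.all(task.pages.map((url) => limit(() => crawlPage(url))));
  },
  { connection, concurrency: JOBS_PER_WORKER }
);
```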
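The memory-aware tab scaling could be approximated along these lines; the per-tab budget and per-browser cap are assumed numbers, and acquirePage is a hypothetical helper:

```ts
import { freemem } from "node:os";
import puppeteer, { Browser, Page } from "puppeteer";

const MB_PER_TAB = 300;          // assumed memory budget per open tab
const MAX_TABS_PER_BROWSER = 8;  // hard cap on tabs per browser process

// How many tabs the host can currently afford, based on free memory.
function tabBudget(): number {
  const freeMb = freemem() / (1024 * 1024);
  return Math.max(1, Math.min(MAX_TABS_PER_BROWSER, Math.floor(freeMb / MB_PER_TAB)));
}

// Reuse an existing browser while it is under budget; otherwise spawn
// another browser process rather than piling more tabs onto one instance.
export async function acquirePage(browsers: Browser[]): Promise<Page> {
  for (const browser of browsers) {
    if ((await browser.pages()).length < tabBudget()) {
      return browser.newPage();
    }
  }
  const fresh = await puppeteer.launch({ headless: true });
  browsers.push(fresh);
  return fresh.newPage();
}
```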
Phrase Detection at Scale
- Early approach:
- regex-based matching
- whitelisted phrases masked before blacklist scanning
- As the blacklist grew to ~400k phrases, matching performance degraded
- Migrated to the Aho–Corasick algorithm (sketched below) for:
- linear-time multi-pattern matching
- significantly improved performance and memory usage
- Preserved whitelist context logic to avoid false positives
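Below is a self-contained sketch of the matcher: a compact Aho–Corasick automaton plus the whitelist masking pass that runs before it. It is an illustrative reimplementation, not the production code:

```ts
interface Match {
  phrase: string;
  end: number; // index of the last matched character in the scanned text
}

// Minimal Aho–Corasick automaton: a trie over every blacklist phrase plus
// failure links built breadth-first, so the text is scanned in a single pass
// regardless of how many phrases the blacklist contains.
class AhoCorasick {
  private next: Array<Map<string, number>> = [new Map()];
  private fail: number[] = [0];
  private out: string[][] = [[]];

  constructor(patterns: string[]) {
    for (const p of patterns) this.insert(p.toLowerCase());
    this.buildFailureLinks();
  }

  private insert(pattern: string): void {
    let node = 0;
    for (let i = 0; i < pattern.length; i++) {
      const ch = pattern[i];
      if (!this.next[node].has(ch)) {
        this.next[node].set(ch, this.next.length);
        this.next.push(new Map());
        this.fail.push(0);
        this.out.push([]);
      }
      node = this.next[node].get(ch)!;
    }
    this.out[node].push(pattern);
  }

  private buildFailureLinks(): void {
    const queue = [...this.next[0].values()]; // root's children keep fail = 0
    while (queue.length > 0) {
      const node = queue.shift()!;
      for (const [ch, child] of this.next[node]) {
        let f = this.fail[node];
        while (f !== 0 && !this.next[f].has(ch)) f = this.fail[f];
        this.fail[child] = this.next[f].get(ch) ?? 0;
        // Inherit matches that end at the failure target (shared suffixes).
        this.out[child].push(...this.out[this.fail[child]]);
        queue.push(child);
      }
    }
  }

  search(text: string): Match[] {
    const matches: Match[] = [];
    const lower = text.toLowerCase();
    let node = 0;
    for (let i = 0; i < lower.length; i++) {
      const ch = lower[i];
      while (node !== 0 && !this.next[node].has(ch)) node = this.fail[node];
      node = this.next[node].get(ch) ?? 0;
      for (const phrase of this.out[node]) matches.push({ phrase, end: i });
    }
    return matches;
  }
}

// Whitelisted contexts are blanked out before the blacklist scan so they
// cannot trigger false positives, mirroring the earlier regex-era masking.
function maskWhitelisted(text: string, whitelist: string[]): string {
  let masked = text;
  for (const phrase of whitelist) {
    const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    masked = masked.replace(new RegExp(escaped, "gi"), (m) => " ".repeat(m.length));
  }
  return masked;
}

// Build the automaton once per process and reuse it for every page:
// const matcher = new AhoCorasick(blacklistPhrases);  // ~400k phrases
// const hits = matcher.search(maskWhitelisted(pageText, whitelistPhrases));
```

Replacing whitelisted spans with spaces (rather than deleting them) keeps character offsets aligned with the original text, so any highlighting based on match positions still lands in the right place.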
Challenges & Key Learnings
- Process instability: crawlers would stop mid-run → improved logging and lifecycle management
- Memory spikes from Puppeteer → added memory-aware scaling for browsers and tabs
- Blocked or explicit sites → hardened error handling and diagnostics
- Input growth:
- inputs grew from kilobytes to hundreds of megabytes
- refactored ingestion pipeline to:
- upload inputs to S3
- stream and parse JSON in workers (sketched below)
- handle async + streaming edge cases
- Optimized large database operations using batching and query tuning
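A sketch of the reworked ingestion path, assuming the AWS SDK v3 client and the stream-json package; streamInput and its bucket/key arguments are illustrative:

```ts
import { Readable } from "node:stream";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { chain } from "stream-chain";
import { parser } from "stream-json";
import { streamArray } from "stream-json/streamers/StreamArray";

const s3 = new S3Client({});

// Streams a large JSON array (e.g. the list of pages to scan) from S3 and
// hands each element to a handler without holding the whole file in memory.
export async function streamInput(
  bucket: string,
  key: string,
  onItem: (item: unknown) => Promise<void>
): Promise<void> {
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));

  const pipeline = chain([
    Body as Readable,  // in Node.js the SDK returns a readable stream
    parser(),
    streamArray(),     // emits { key: index, value: element } per array item
  ]);

  for await (const { value } of pipeline) {
    // The loop applies backpressure: the stream pauses while each item is awaited.
    await onItem(value);
  }
}
```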
Outcome
- Replaced a largely manual compliance workflow with a fully automated system
- Enabled:
- continuous scanning of 4,000+ pages daily
- consistent and repeatable violation detection
- The system was in active testing and iteration when I transitioned away
If I Were Rebuilding It Today
- Design concurrency and resource limits upfront
- Separate the core detection engine from crawling and reporting earlier
- Invest earlier in observability and failure diagnostics