Overview

An SEO company needed to regularly scan client websites for illegal, explicit, or forbidden terms.
Each client site had:

  • multiple pages
  • deep internal links
  • varying site structures

The existing workflow relied on manual checks and was slow, inconsistent, and unable to scale. The new system needed to:

  • run multiple crawls concurrently across different client sites
  • crawl websites up to a configurable depth
  • analyze page content against a large blacklist (~400,000 phrases)
  • respect whitelisted contexts to reduce false positives

I helped design and implement an internal backend system to crawl client websites and detect illegal or forbidden content at scale, replacing a manual, error-prone SEO compliance process.

My Role

  • Backend developer responsible for end-to-end system design and implementation
  • Worked closely with an external vendor to:
    • refine requirements
    • finalize application flow
    • iterate on technical decisions
  • Built the entire backend, including crawling, queuing, phrase detection, concurrency control, and reporting

Crawling and Page Analysis

  • Implemented a BFS-based web crawler using Puppeteer (see the sketch after this list)
  • Enforced:
    • configurable depth limits (default up to depth 5)
    • duplicate URL elimination using a Set
  • Each page was:
    • rendered in a headless browser
    • scanned for violations
    • visually annotated (matched terms highlighted)
    • captured via screenshots for audit/reference
  • Screenshots were uploaded to S3 for later inspection
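
The crawl loop can be sketched roughly as below. This is a minimal illustration, not the production code: scanAndAnnotate and uploadScreenshot are hypothetical placeholders for the detection and S3 upload steps, and the same-origin filter is an assumption about how internal links were scoped.

```ts
import puppeteer, { Browser, Page } from "puppeteer";

// Hypothetical helpers standing in for the detection and S3 upload steps.
declare function scanAndAnnotate(page: Page): Promise<void>;
declare function uploadScreenshot(url: string, image: Uint8Array): Promise<void>;

interface CrawlTask {
  url: string;
  depth: number;
}

export async function crawlSite(rootUrl: string, maxDepth = 5): Promise<void> {
  const browser: Browser = await puppeteer.launch({ headless: true });
  const visited = new Set<string>([rootUrl]);               // duplicate URL elimination
  const queue: CrawlTask[] = [{ url: rootUrl, depth: 0 }];  // BFS frontier

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: "networkidle2", timeout: 30_000 });

      // Scan the rendered DOM, highlight matched terms in-page,
      // then capture the annotated page for audit.
      await scanAndAnnotate(page);
      const screenshot = await page.screenshot({ fullPage: true });
      await uploadScreenshot(url, screenshot);

      if (depth < maxDepth) {
        // Collect links and enqueue unseen same-origin URLs one level deeper.
        const links = await page.$$eval("a[href]", (anchors) =>
          anchors.map((a) => (a as HTMLAnchorElement).href)
        );
        for (const link of links) {
          if (new URL(link).origin === new URL(rootUrl).origin && !visited.has(link)) {
            visited.add(link);
            queue.push({ url: link, depth: depth + 1 });
          }
        }
      }
    } finally {
      await page.close();
    }
  }

  await browser.close();
}
```

The Set-based visited check is what keeps deep internal linking from turning into repeat visits of the same pages.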

Queuing and Concurrency

  • Used Redis + BullMQ for job orchestration
  • Designed the queue to support:
    • multiple concurrent crawl jobs
    • multiple sites per crawl
    • multiple pages per site
  • Initially queued entire payloads; as inputs grew large, refactored to queue task IDs only (see the worker sketch after this list)
  • Controlled concurrency via:
    • BullMQ workers
    • p-limit for fine-grained async control
  • Managed browser and tab lifecycles dynamically:
    • scaled tabs based on available memory
    • capped tabs per browser
    • spawned additional browser processes when needed
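
A rough sketch of the queue/worker split is below, assuming a local Redis connection. loadTask and crawlPage are hypothetical placeholders for the payload lookup and the per-page crawl, and the concurrency numbers are illustrative rather than the production values.

```ts
import { Queue, Worker, Job } from "bullmq";
import pLimit from "p-limit";

// Hypothetical helpers: payload lookup by ID and a single-page crawl.
declare function loadTask(taskId: string): Promise<{ urls: string[] }>;
declare function crawlPage(url: string): Promise<void>;

const connection = { host: "127.0.0.1", port: 6379 };   // Redis (assumed local)

// Producer side: enqueue only a task ID; the full input stays in the DB/S3.
export const crawlQueue = new Queue("site-crawls", { connection });

export async function enqueueCrawl(taskId: string): Promise<void> {
  await crawlQueue.add("crawl-site", { taskId });
}

// Worker side: BullMQ `concurrency` bounds how many crawl jobs run at once
// in this process, while p-limit caps concurrent page work within a job.
const pageLimit = pLimit(5);

export const crawlWorker = new Worker(
  "site-crawls",
  async (job: Job<{ taskId: string }>) => {
    const task = await loadTask(job.data.taskId);
    await Promise.all(task.urls.map((url) => pageLimit(() => crawlPage(url))));
  },
  { connection, concurrency: 3 }
);
```

Queuing IDs instead of full payloads keeps Redis job bodies small as inputs grow; the worker re-hydrates the task just before running it.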

Phrase Detection at Scale

  • Early approach:
    • regex-based matching
    • whitelisted phrases masked before blacklist scanning
  • As the blacklist grew to ~400k phrases, performance degraded
  • Migrated to the Aho–Corasick algorithm (sketched below) for:
    • linear-time multi-pattern matching
    • significantly improved performance and memory usage
  • Preserved whitelist context logic to avoid false positives
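
To make the switch concrete, here is a compact from-scratch sketch of the automaton (the production system may well have used an existing library); it shows why scan time stops depending on the number of blacklist phrases. Whitelist handling is not shown here: as before, whitelisted contexts are masked out of the text before it reaches the matcher.

```ts
// Aho–Corasick: build a trie over all blacklist phrases, add failure links,
// then scan the text once; matching cost is O(text length + matches),
// independent of how many phrases are loaded.
export class AhoCorasick {
  private next: Array<Map<string, number>> = [new Map()]; // trie edges per node
  private fail: number[] = [0];                           // failure links
  private out: string[][] = [[]];                         // phrases ending at node

  constructor(patterns: string[]) {
    for (const p of patterns) this.insert(p.toLowerCase());
    this.buildFailureLinks();
  }

  private insert(pattern: string): void {
    let node = 0;
    for (let i = 0; i < pattern.length; i++) {
      const ch = pattern[i];
      if (!this.next[node].has(ch)) {
        this.next[node].set(ch, this.next.length);
        this.next.push(new Map());
        this.fail.push(0);
        this.out.push([]);
      }
      node = this.next[node].get(ch)!;
    }
    this.out[node].push(pattern);
  }

  private buildFailureLinks(): void {
    const queue: number[] = [...this.next[0].values()];   // depth-1 nodes fail to root
    while (queue.length > 0) {
      const node = queue.shift()!;
      for (const [ch, child] of this.next[node]) {
        // Walk failure links until a state with an edge on `ch` is found.
        let f = this.fail[node];
        while (f > 0 && !this.next[f].has(ch)) f = this.fail[f];
        this.fail[child] = this.next[f].get(ch) ?? 0;
        this.out[child].push(...this.out[this.fail[child]]);
        queue.push(child);
      }
    }
  }

  // Returns every blacklist hit as the phrase plus its end index in `text`.
  search(text: string): Array<{ phrase: string; end: number }> {
    const hits: Array<{ phrase: string; end: number }> = [];
    const lower = text.toLowerCase();                     // case-insensitive matching
    let node = 0;
    for (let i = 0; i < lower.length; i++) {
      const ch = lower[i];
      while (node > 0 && !this.next[node].has(ch)) node = this.fail[node];
      node = this.next[node].get(ch) ?? 0;
      for (const phrase of this.out[node]) hits.push({ phrase, end: i });
    }
    return hits;
  }
}

// Usage: build once (the expensive step), then reuse the matcher per page.
// const matcher = new AhoCorasick(blacklistPhrases);
// const violations = matcher.search(pageText);
```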

Challenges & Key Learnings

  • Process instability: crawlers would stop mid-run → improved logging and lifecycle management
  • Memory spikes from Puppeteer → added memory-aware scaling for browsers and tabs
  • Blocked or explicit sites → hardened error handling and diagnostics
  • Input growth:
    • evolved from KB-sized inputs to hundreds of MBs
    • refactored ingestion pipeline to:
      • upload inputs to S3
      • stream and parse JSON in workers (see the sketch after this list)
      • handle async + streaming edge cases
  • Optimized large database operations using batching and query tuning
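
The ingestion refactor can be illustrated roughly as follows. This sketch assumes the AWS SDK v3 and the stream-json package (the original write-up does not name the parser), and processBatch is a hypothetical stand-in for the batched database writes.

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import type { Readable } from "node:stream";
import StreamArray from "stream-json/streamers/StreamArray";

// Hypothetical stand-in for the batched DB write.
declare function processBatch(records: unknown[]): Promise<void>;

const s3 = new S3Client({});
const BATCH_SIZE = 500; // illustrative batch size

// Stream a large JSON array from S3 and process it in fixed-size batches,
// instead of buffering a multi-hundred-MB payload in memory.
export async function ingestInput(bucket: string, key: string): Promise<void> {
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  if (!Body) throw new Error(`Empty object: s3://${bucket}/${key}`);

  const source = Body as Readable;
  const records = source.pipe(StreamArray.withParser()); // emits { key, value } per item

  let batch: unknown[] = [];
  for await (const { value } of records) {
    batch.push(value);
    if (batch.length >= BATCH_SIZE) {
      await processBatch(batch);   // iteration pauses here, giving natural backpressure
      batch = [];
    }
  }
  if (batch.length > 0) await processBatch(batch);
}
```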

Outcome

  • Replaced a largely manual compliance workflow with a fully automated system
  • Enabled:
    • continuous scanning of 4,000+ pages daily
    • consistent and repeatable violation detection
  • The system was still in active testing and iteration when I transitioned off the project

If I Were Rebuilding It Today

  • Design concurrency and resource limits upfront
  • Separate the core detection engine from crawling and reporting earlier
  • Invest earlier in observability and failure diagnostics