Overview

An SEO company needed to regularly scan client websites for illegal, explicit, or forbidden terms.
Each client site had:

  • multiple pages
  • deep internal links
  • varying site structures

The existing workflow relied on manual checks and was slow, inconsistent, and unable to scale. The new system needed to:

  • run multiple crawls concurrently across different client sites
  • crawl websites up to a configurable depth
  • analyze page content against a large blacklist (~400,000 phrases)
  • respect whitelisted contexts to reduce false positives

I helped design and implement an internal backend system to crawl client websites and detect illegal or forbidden content at scale, replacing a manual, error-prone SEO compliance process.

My Role

  • Backend developer responsible for end-to-end system design and implementation
  • Worked closely with an external vendor to:
    • refine requirements
    • finalize application flow
    • iterate on technical decisions
  • Built the entire backend, including crawling, queuing, phrase detection, concurrency control, and reporting

Crawling and Page Analysis

  • Implemented a BFS-based web crawler using Puppeteer (see the sketch after this list)
  • Enforced:
    • configurable depth limits (default up to depth 5)
    • duplicate URL elimination using a Set
  • Each page was:
    • rendered in a headless browser
    • scanned for violations
    • visually annotated (matched terms highlighted)
    • captured via screenshots for audit/reference
  • Screenshots were uploaded to S3 for later inspection
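
The crawl loop can be sketched roughly as below. This is a minimal illustration, not the production code: scanAndAnnotate and uploadScreenshot are hypothetical placeholders for the detection and S3 upload steps, and the same-origin filter is an assumption about how internal links were scoped.

```ts
import puppeteer, { Browser, Page } from "puppeteer";

// Hypothetical helpers standing in for the detection and S3 upload steps.
declare function scanAndAnnotate(page: Page): Promise<void>;
declare function uploadScreenshot(url: string, image: Uint8Array): Promise<void>;

interface CrawlTask {
  url: string;
  depth: number;
}

export async function crawlSite(rootUrl: string, maxDepth = 5): Promise<void> {
  const browser: Browser = await puppeteer.launch({ headless: true });
  const visited = new Set<string>([rootUrl]);               // duplicate URL elimination
  const queue: CrawlTask[] = [{ url: rootUrl, depth: 0 }];  // BFS frontier

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: "networkidle2", timeout: 30_000 });

      // Scan the rendered DOM, highlight matched terms in-page,
      // then capture the annotated page for audit.
      await scanAndAnnotate(page);
      const screenshot = await page.screenshot({ fullPage: true });
      await uploadScreenshot(url, screenshot);

      if (depth < maxDepth) {
        // Collect links and enqueue unseen same-origin URLs one level deeper.
        const links = await page.$$eval("a[href]", (anchors) =>
          anchors.map((a) => (a as HTMLAnchorElement).href)
        );
        for (const link of links) {
          if (new URL(link).origin === new URL(rootUrl).origin && !visited.has(link)) {
            visited.add(link);
            queue.push({ url: link, depth: depth + 1 });
          }
        }
      }
    } finally {
      await page.close();
    }
  }

  await browser.close();
}
```

The Set-based visited check is what keeps deep internal linking from turning into repeat visits of the same pages.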

Queuing and Concurrency

  • Used Redis + BullMQ for job orchestration
  • Designed the queue to support:
    • multiple concurrent crawl jobs
    • multiple sites per crawl
    • multiple pages per site
  • Initially queued entire payloads; as inputs grew large, refactored to queue task IDs only (see the worker sketch after this list)
  • Controlled concurrency via:
    • BullMQ workers
    • p-limit for fine-grained async control
  • Managed browser and tab lifecycles dynamically:
    • scaled tabs based on available memory
    • capped tabs per browser
    • spawned additional browser processes when needed
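
A rough sketch of the queue/worker split is below, assuming a local Redis connection. loadTask and crawlPage are hypothetical placeholders for the payload lookup and the per-page crawl, and the concurrency numbers are illustrative rather than the production values.

```ts
import { Queue, Worker, Job } from "bullmq";
import pLimit from "p-limit";

// Hypothetical helpers: payload lookup by ID and a single-page crawl.
declare function loadTask(taskId: string): Promise<{ urls: string[] }>;
declare function crawlPage(url: string): Promise<void>;

const connection = { host: "127.0.0.1", port: 6379 };   // Redis (assumed local)

// Producer side: enqueue only a task ID; the full input stays in the DB/S3.
export const crawlQueue = new Queue("site-crawls", { connection });

export async function enqueueCrawl(taskId: string): Promise<void> {
  await crawlQueue.add("crawl-site", { taskId });
}

// Worker side: BullMQ `concurrency` bounds how many crawl jobs run at once
// in this process, while p-limit caps concurrent page work within a job.
const pageLimit = pLimit(5);

export const crawlWorker = new Worker(
  "site-crawls",
  async (job: Job<{ taskId: string }>) => {
    const task = await loadTask(job.data.taskId);
    await Promise.all(task.urls.map((url) => pageLimit(() => crawlPage(url))));
  },
  { connection, concurrency: 3 }
);
```

Queuing IDs instead of full payloads keeps Redis job bodies small as inputs grow; the worker re-hydrates the task just before running it.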

Phrase Detection at Scale

  • Early approach:
    • regex-based matching
    • whitelisted phrases masked before blacklist scanning
  • As the blacklist grew to ~400k phrases, performance degraded
  • Migrated to the Aho–Corasick algorithm (sketched below) for:
    • linear-time multi-pattern matching
    • significantly improved performance and memory usage
  • Preserved whitelist context logic to avoid false positives
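
To make the switch concrete, here is a compact from-scratch sketch of the automaton (the production system may well have used an existing library); it shows why scan time stops depending on the number of blacklist phrases. Whitelist handling is not shown here: as before, whitelisted contexts are masked out of the text before it reaches the matcher.

```ts
// Aho–Corasick: build a trie over all blacklist phrases, add failure links,
// then scan the text once; matching cost is O(text length + matches),
// independent of how many phrases are loaded.
export class AhoCorasick {
  private next: Array<Map<string, number>> = [new Map()]; // trie edges per node
  private fail: number[] = [0];                           // failure links
  private out: string[][] = [[]];                         // phrases ending at node

  constructor(patterns: string[]) {
    for (const p of patterns) this.insert(p.toLowerCase());
    this.buildFailureLinks();
  }

  private insert(pattern: string): void {
    let node = 0;
    for (let i = 0; i < pattern.length; i++) {
      const ch = pattern[i];
      if (!this.next[node].has(ch)) {
        this.next[node].set(ch, this.next.length);
        this.next.push(new Map());
        this.fail.push(0);
        this.out.push([]);
      }
      node = this.next[node].get(ch)!;
    }
    this.out[node].push(pattern);
  }

  private buildFailureLinks(): void {
    const queue: number[] = [...this.next[0].values()];   // depth-1 nodes fail to root
    while (queue.length > 0) {
      const node = queue.shift()!;
      for (const [ch, child] of this.next[node]) {
        // Walk failure links until a state with an edge on `ch` is found.
        let f = this.fail[node];
        while (f > 0 && !this.next[f].has(ch)) f = this.fail[f];
        this.fail[child] = this.next[f].get(ch) ?? 0;
        this.out[child].push(...this.out[this.fail[child]]);
        queue.push(child);
      }
    }
  }

  // Returns every blacklist hit as the phrase plus its end index in `text`.
  search(text: string): Array<{ phrase: string; end: number }> {
    const hits: Array<{ phrase: string; end: number }> = [];
    const lower = text.toLowerCase();                     // case-insensitive matching
    let node = 0;
    for (let i = 0; i < lower.length; i++) {
      const ch = lower[i];
      while (node > 0 && !this.next[node].has(ch)) node = this.fail[node];
      node = this.next[node].get(ch) ?? 0;
      for (const phrase of this.out[node]) hits.push({ phrase, end: i });
    }
    return hits;
  }
}

// Usage: build once (the expensive step), then reuse the matcher per page.
// const matcher = new AhoCorasick(blacklistPhrases);
// const violations = matcher.search(pageText);
```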

Challenges & Key Learnings

  • Process instability: crawlers would stop mid-run → improved logging and lifecycle management
  • Memory spikes from Puppeteer → added memory-aware scaling for browsers and tabs
  • Blocked or explicit sites → hardened error handling and diagnostics
  • Input growth:
    • evolved from KB-sized inputs to hundreds of MBs
    • refactored ingestion pipeline to:
      • upload inputs to S3
      • stream and parse JSON in workers (see the sketch after this list)
      • handle async + streaming edge cases
  • Optimized large database operations using batching and query tuning
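
The ingestion refactor can be illustrated roughly as follows. This sketch assumes the AWS SDK v3 and the stream-json package (the original write-up does not name the parser), and processBatch is a hypothetical stand-in for the batched database writes.

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import type { Readable } from "node:stream";
import StreamArray from "stream-json/streamers/StreamArray";

// Hypothetical stand-in for the batched DB write.
declare function processBatch(records: unknown[]): Promise<void>;

const s3 = new S3Client({});
const BATCH_SIZE = 500; // illustrative batch size

// Stream a large JSON array from S3 and process it in fixed-size batches,
// instead of buffering a multi-hundred-MB payload in memory.
export async function ingestInput(bucket: string, key: string): Promise<void> {
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  if (!Body) throw new Error(`Empty object: s3://${bucket}/${key}`);

  const source = Body as Readable;
  const records = source.pipe(StreamArray.withParser()); // emits { key, value } per item

  let batch: unknown[] = [];
  for await (const { value } of records) {
    batch.push(value);
    if (batch.length >= BATCH_SIZE) {
      await processBatch(batch);   // iteration pauses here, giving natural backpressure
      batch = [];
    }
  }
  if (batch.length > 0) await processBatch(batch);
}
```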

Outcome

  • Replaced a largely manual compliance workflow with a fully automated system
  • Enabled:
    • continuous scanning of 4,000+ pages daily
    • consistent and repeatable violation detection
  • The system was still in active testing and iteration when I transitioned off the project

If I Were Rebuilding It Today

  • Design concurrency and resource limits upfront
  • Separate the core detection engine from crawling and reporting earlier
  • Invest earlier in observability and failure diagnostics