Andrew Harris
Engineering Logs

Notes from running large-scale data acquisition in production.

Architecture, economics, and reliability - nothing about marketing funnels or "thought leadership."

Short, specific teardowns of how acquisition systems actually fail, scale, and pay for themselves. Written for the people running them.

New post every ~2 weeks

Architecture · Economics · Anti-bot

Coming soon

First posts queued, in order of priority. If a topic here is what you're wrestling with, just email me - I'll send notes ahead of the writeup.

Log #1 · Coming soon

Why most scraping projects fail

The recurring failure modes I see across audits: source choice before architecture, cost-per-document never measured, anti-bot strategy that can't survive a single vendor change, and "scrape now, structure later" pipelines that never get structured. A field guide based on a decade of production systems.

Log #2 · Coming soon

Crawl economics: what a page actually costs you

Most teams price scraping at proxy cost. Real cost is proxy + compute + retry amplification + anti-bot vendor + human babysitting + data quality cleanup + replacement when the source breaks. A model for cost-per-acquired-document and where the actual margins live.
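
Until the full writeup lands, here is a rough sketch of how those terms might combine into one number. It is not the model from the log: the class, field names, and every figure are hypothetical placeholders. It also assumes retries are independent, so expected attempts per successful fetch is 1 / success rate.

```python
from dataclasses import dataclass


@dataclass
class CrawlCosts:
    """Hypothetical monthly inputs, in USD. Names and values are illustrative only."""
    proxy_per_request: float       # proxy spend per attempted request
    compute_per_request: float     # compute spend per attempted request
    anti_bot_vendor: float         # fixed monthly vendor fee
    human_babysitting: float       # engineer time spent keeping crawlers alive
    quality_cleanup: float         # cost of cleaning data after the fact
    source_replacement: float      # amortized cost of rebuilding when a source breaks


def retry_amplification(success_rate: float) -> float:
    """Expected attempts per successful fetch, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def cost_per_acquired_document(
    costs: CrawlCosts,
    docs_fetched: int,
    success_rate: float,
    usable_rate: float,
) -> float:
    """Total monthly cost divided by documents that are actually usable."""
    if not 0 < usable_rate <= 1:
        raise ValueError("usable_rate must be in (0, 1]")
    per_request = costs.proxy_per_request + costs.compute_per_request
    # Per-request costs are paid on every attempt, including failed ones.
    variable = per_request * retry_amplification(success_rate) * docs_fetched
    fixed = (
        costs.anti_bot_vendor
        + costs.human_babysitting
        + costs.quality_cleanup
        + costs.source_replacement
    )
    usable_docs = docs_fetched * usable_rate
    return (variable + fixed) / usable_docs


if __name__ == "__main__":
    costs = CrawlCosts(
        proxy_per_request=0.0015,
        compute_per_request=0.0005,
        anti_bot_vendor=1000.0,
        human_babysitting=3000.0,
        quality_cleanup=1500.0,
        source_replacement=500.0,
    )
    full = cost_per_acquired_document(
        costs, docs_fetched=1_000_000, success_rate=0.5, usable_rate=0.8
    )
    print(f"proxy-only estimate: ${costs.proxy_per_request:.4f} per document")
    print(f"full estimate:       ${full:.4f} per document")
```

With these made-up inputs the full figure comes out several times the proxy-only figure, and most of the gap comes from fixed costs and the usable-document denominator, not from the proxy line.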

Log #3 · Coming soon

When Bright Data is the wrong solution

Managed scraping vendors are good at exactly the problems they're designed for and quietly terrible at the rest. A decision framework for build vs. buy on proxies, scraping APIs, and full managed data feeds - written from the buyer side, not the vendor side.

Log #4 · Coming soon

Why RAG fails without an acquisition strategy

"We'll just embed everything" is not an acquisition strategy. Why most AI products plateau on data quality and what a real ingestion pipeline looks like - source selection, extraction tier, structure-first vs. structure-later, refresh cadence, and the part nobody talks about: continuous coverage.

Get notified

No newsletter system yet - just email me and I'll add you to a small "ship list" that gets a one-line note when each log goes live.