Scraped a rival's full catalog, with receipts

My role

Competitive-intelligence work behind the Cognilium legal-directory strategy. The scraper, the dataset, and the analysis are mine.

The problem

We needed a full, defensible competitive map of the legal-AI directory space: real data with a chain of custody, not guesses and not a scrape we had to take on faith.

What I did

Reverse-engineered DreamLegal's API instead of scraping HTML: the site is a Next.js app with a public JSON API, so I read the endpoints directly. Probed 51, found the 5 that carried 100% of the data (product list, per-product score, reviews, taxonomy, blogs).
Caught the API lying: every full pass returned a random ~65% sample of the catalog while always reporting a total of 651, so one scrape looked complete but quietly missed a third. I wrote a saturation scraper (paginate, dedupe by id, stop after consecutive zero-gain passes); 10 passes converged exactly to 651.
Made every record auditable: ~1,700 polite requests at ~1 req / 0.5s, identifiable user-agent, robots respected, and an immutable raw layer where each response is stored verbatim with a SHA256 hash and full request/response metadata. The normalized layer regenerates from raw.
Cross-checked 3 more platforms (Legaltech Hub, LawNext, r/legaltech) by HTML parsing under the same evidence discipline, which yielded 2,791 more vendor entities.
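To make the "immutable raw layer" idea concrete, here is a minimal sketch of how a response could be stored verbatim under its SHA256 hash and checked later. It is an illustration under stated assumptions, not the actual toolchain: the function names `store_raw` and `verify_raw`, the file layout, the `requests.jsonl` log name, and the example URL are all hypothetical.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def store_raw(raw_dir: Path, url: str, status: int, headers: dict, body: bytes) -> str:
    """Store a response body verbatim, named by its SHA-256 hash, and log the request metadata."""
    digest = hashlib.sha256(body).hexdigest()
    raw_dir.mkdir(parents=True, exist_ok=True)

    body_path = raw_dir / f"{digest}.body"
    if not body_path.exists():  # never overwrite: raw files are write-once
        body_path.write_bytes(body)

    # Append-only metadata log (hypothetical layout): one JSON line per request.
    meta = {
        "sha256": digest,
        "url": url,
        "status": status,
        "response_headers": headers,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
    with (raw_dir / "requests.jsonl").open("a", encoding="utf-8") as log:
        log.write(json.dumps(meta) + "\n")
    return digest


def verify_raw(raw_dir: Path, digest: str) -> bool:
    """Re-hash the stored bytes and confirm they still match the recorded digest."""
    stored = (raw_dir / f"{digest}.body").read_bytes()
    return hashlib.sha256(stored).hexdigest() == digest


# Usage with a made-up response; the URL is a placeholder, not a real endpoint.
raw_dir = Path(tempfile.mkdtemp())
digest = store_raw(
    raw_dir,
    url="https://example.invalid/api/products?page=1",
    status=200,
    headers={"content-type": "application/json"},
    body=b'{"total": 651, "items": []}',
)
print(digest, verify_raw(raw_dir, digest))
```

Because the normalized layer is derived only from these stored bytes, any record can be traced back to a file whose name is its own checksum.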

The result

A complete, auditable 651-product dataset, every record traceable to a hashed raw response, downloadable below. Building it also surfaced two things the raw numbers hid: an elaborate scoring 'moat' built on ~0 real engagement, and an apparent market leader running ~86% bot traffic.

The judgment call: what the AI couldn't do

Two catches the raw data hid. First, the API was non-deterministic: it returned a random ~65% slice each pass while reporting a fixed total, so a normal scrape silently missed a third and still looked complete. I only caught it because the unique count drifted between runs; the fix was a saturation scraper that converged to the true 651. Second, the apparent market leader's traffic was ~86% synthetic. The data was right; the obvious read was wrong. Catching both was the whole value.
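The saturation idea (paginate, dedupe by id, stop after consecutive zero-gain passes) can be sketched as follows. This is a simplified illustration, not the original scraper: `saturate`, `patience`, `max_passes`, and the simulated API are assumptions, and the fake API returns a rotating ~65% slice rather than a random one so the example is reproducible.

```python
import itertools
from typing import Callable, Dict, Iterable, List


def saturate(
    fetch_pass: Callable[[], Iterable[dict]],
    patience: int = 3,
    max_passes: int = 50,
) -> Dict[str, dict]:
    """Repeat full passes, dedupe by id, stop after `patience` consecutive zero-gain passes."""
    seen: Dict[str, dict] = {}
    zero_gain = 0
    for _ in range(max_passes):
        before = len(seen)
        for record in fetch_pass():
            seen.setdefault(record["id"], record)  # keep the first copy of each id
        if len(seen) == before:
            zero_gain += 1
            if zero_gain >= patience:
                break
        else:
            zero_gain = 0  # any new id resets the stop counter
    return seen


# Simulated stand-in for the API (hypothetical data): 651 products in total,
# but each full pass only exposes about 65% of them.
CATALOG = [{"id": f"p{i}", "name": f"Product {i}"} for i in range(651)]


def make_fake_pass() -> Callable[[], List[dict]]:
    counter = itertools.count()

    def fake_pass() -> List[dict]:
        k = next(counter)  # shift the visible slice on every pass
        return [p for i, p in enumerate(CATALOG) if (i + k) % 20 < 13]

    return fake_pass


single_pass = make_fake_pass()()
products = saturate(make_fake_pass())
print(len(single_pass), len(products))
```

A single pass here undercounts the catalog, while the saturated result recovers all 651 ids. With a truly random sample, a larger `patience` lowers the chance of stopping while a few ids are still unseen.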

The signal in the noise

86% synthetic / bot

14% real

One rival looked like the market leader on raw traffic. The shape of it told the truth.

By the numbers

651 products, full catalog

~86% bot traffic caught

51 API endpoints probed

10 saturation passes

Proof

Download: Full 651-product dataset + the methodology (below).

On request: Scraper toolchain + the per-platform analysis reports.

Take it with you

Download the full scraped dataset (JSON, 651 products)

Scrape methodology (how it was built + what it caught)
