Available for new work · Lahore, PK · LinkedIn · GitHub · Upwork

Scraped a rival's full catalog, with receipts

My role

Competitive-intelligence work behind the Cognilium legal-directory strategy. The scraper, the dataset, and the analysis are mine.

The problem
We needed a full, defensible competitive map of the legal-AI directory space: real data with a chain of custody, not guesses and not a scrape we had to take on faith.
What I did
  • Reverse-engineered DreamLegal's API instead of scraping HTML: the site is a Next.js app with a public JSON API, so I read the endpoints directly. Probed 51, found the 5 that carried 100% of the data (product list, per-product score, reviews, taxonomy, blogs).
  • Caught the API lying: every full pass returned a random ~65% sample of the catalog while always reporting a total of 651, so one scrape looked complete but quietly missed a third. I wrote a saturation scraper (paginate, dedupe by id, stop after consecutive zero-gain passes); 10 passes converged exactly to 651.
  • Made every record auditable: ~1,700 polite requests at ~1 request every 0.5 s, with an identifiable user-agent and robots.txt respected, plus an immutable raw layer where each response is stored verbatim with a SHA-256 hash and full request/response metadata. The normalized layer regenerates from raw.
  • Cross-checked 3 more platforms (Legaltech Hub, LawNext, r/legaltech) by HTML parsing under the same evidence discipline, which added 2,791 more vendor entities.
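A minimal sketch of how a write-once raw layer like the one described above could work. This is not the actual toolchain: the function names (`store_raw`, `verify_raw`), the file layout, and the `manifest.jsonl` format are assumptions for illustration.

```python
import hashlib
import json
import time
from pathlib import Path


def store_raw(raw_dir: Path, url: str, status: int, headers: dict, body: bytes) -> str:
    """Store one response body verbatim and log its metadata; return the SHA-256."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(body).hexdigest()

    # Bodies are content-addressed. Mode "xb" refuses to overwrite an
    # existing file, so stored bytes are written once and never replaced.
    body_path = raw_dir / f"{digest}.bin"
    if not body_path.exists():
        with open(body_path, "xb") as fh:
            fh.write(body)

    # One manifest line per request, appended in fetch order.
    record = {
        "url": url,
        "status": status,
        "headers": headers,
        "sha256": digest,
        "bytes": len(body),
        "fetched_at": time.time(),
    }
    with open(raw_dir / "manifest.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return digest


def verify_raw(raw_dir: Path) -> list:
    """Return the digests whose stored bytes no longer match the recorded hash."""
    mismatched = []
    with open(raw_dir / "manifest.jsonl", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            data = (raw_dir / f"{record['sha256']}.bin").read_bytes()
            if hashlib.sha256(data).hexdigest() != record["sha256"]:
                mismatched.append(record["sha256"])
    return mismatched
```

Because the normalized layer is derived only from these files, anyone can re-run `verify_raw` and then regenerate the dataset to confirm that no record was edited after capture.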
The result
A complete, auditable 651-product dataset, with every record traceable to a hashed raw response, downloadable below. Building it also surfaced two things the raw numbers hid: an elaborate scoring 'moat' built on ~0 real engagement, and an apparent market leader running ~86% bot traffic.
The judgment call: what the AI couldn't do

Two catches the raw data hid. First, the API was non-deterministic: it returned a random ~65% slice each pass while reporting a fixed total, so a normal scrape silently missed a third and still looked complete. I only caught it because the unique count drifted between runs; the fix was a saturation scraper that converged to the true 651. Second, the apparent market leader's traffic was ~86% synthetic. The data was right; the obvious read was wrong. Catching both was the whole value.
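The saturation idea can be sketched as follows. This is an illustration under assumptions, not the production scraper: `saturate`, `patience`, and `fake_pass` are hypothetical names, and `fake_pass` is a deterministic stand-in that returns a rotating two-thirds slice of the catalog rather than the random ~65% sample the real API returned.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Record = Dict[str, Any]


def saturate(
    fetch_pass: Callable[[int], Iterable[Record]],
    patience: int = 2,
    max_passes: int = 50,
) -> Tuple[Dict[Any, Record], int]:
    """Repeat full passes, dedupe by id, stop after `patience` zero-gain passes."""
    seen: Dict[Any, Record] = {}
    zero_gain_streak = 0
    passes = 0
    while passes < max_passes and zero_gain_streak < patience:
        before = len(seen)
        for record in fetch_pass(passes):
            # Keep the first copy of each id; later duplicates add nothing.
            seen.setdefault(record["id"], record)
        passes += 1
        gained = len(seen) - before
        zero_gain_streak = zero_gain_streak + 1 if gained == 0 else 0
    return seen, passes


def fake_pass(pass_index: int, catalog_size: int = 651) -> Iterable[Record]:
    """Stand-in API: each pass silently omits a different third of the ids."""
    return [
        {"id": i}
        for i in range(1, catalog_size + 1)
        if (i + pass_index) % 3 != 0
    ]


catalog, passes_used = saturate(fake_pass)
print(len(catalog), passes_used)
```

With a truly random sample, a zero-gain pass can occur while items are still missing, so `patience` trades extra requests for confidence. Comparing the unique count against the total the API reports is a cheap second check. If each pass were an independent uniform 65% sample (an assumption), the expected number of unseen products after n passes would be 651 × 0.35^n, which is under 0.02 by n = 10.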

The signal in the noise
86% synthetic / bot
14% real

One rival looked like the market leader on raw traffic. The shape of it told the truth.

By the numbers
651
products, full catalog
~86%
bot traffic caught
51
API endpoints probed
10
saturation passes
Proof
Download: Full 651-product dataset + the methodology (below).
On request: Scraper toolchain + the per-platform analysis reports.