# Legal-tech directory, scrape methodology

A short record of how this dataset was built, the integrity rules behind it, and what the analysis caught. The dataset itself is `legaltech-directory-dataset.json` (651 product records).

## Target and approach

- Primary target: DreamLegal, the only at-scale competitor in the AI contract-review directory space.
- Method: API reverse-engineering, not HTML scraping. DreamLegal runs a Next.js site with a public JSON API; the scraper reads those endpoints directly. A browser was used only for full-page screenshots as visual evidence, never for the data layer.
- Surface: 51 endpoints were probed in total to map what the platform exposes (product list, per-product score, reviews, taxonomy, blogs, paid placements, comparisons); 5 core endpoints carried all of the usable data.

## Integrity (every record is auditable)

- Polite, identifiable user-agent with a research contact. Rate limited to about one request every 0.5 s, with exponential backoff on errors. robots.txt was respected (it is permissive and sets no crawl-delay).
- Immutable raw layer: every response is saved verbatim with a sibling metadata file carrying source URL, fetch timestamp, request body, response status and headers, and a SHA-256 hash of the body. The normalized layer regenerates from raw, so nothing is ever silently edited.
- Idempotent re-runs: re-running skips work already done; any single product can be refreshed end to end.
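The raw-layer and idempotency rules above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the function names (`save_raw`, `verify_raw`, `backoff_delay`), the `.body` / `.meta.json` file naming, and the 60-second backoff cap are all assumptions made for the example, and the 0.5 s base delay simply reuses the pacing figure above.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_raw(raw_dir, key, url, request_body, status, headers, body):
    """Write a response body verbatim plus a sibling metadata file.

    Returns False and writes nothing if this key was already captured,
    which is what makes a re-run skip work already done.
    File naming here is hypothetical.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    body_path = raw_dir / f"{key}.body"
    meta_path = raw_dir / f"{key}.meta.json"

    if body_path.exists() and meta_path.exists():
        return False

    meta = {
        "source_url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "request_body": request_body,
        "status": status,
        "headers": dict(headers),
        "sha256": hashlib.sha256(body).hexdigest(),
    }
    body_path.write_bytes(body)  # bytes in, bytes out: no re-encoding
    meta_path.write_text(json.dumps(meta, indent=2, sort_keys=True))
    return True


def verify_raw(raw_dir, key):
    """Re-hash the stored body and compare it with the recorded hash."""
    raw_dir = Path(raw_dir)
    body = (raw_dir / f"{key}.body").read_bytes()
    meta = json.loads((raw_dir / f"{key}.meta.json").read_text())
    return hashlib.sha256(body).hexdigest() == meta["sha256"]


def backoff_delay(attempt, base=0.5, cap=60.0):
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as raw_dir:
        args = (raw_dir, "product-123", "https://example.invalid/api/product/123",
                None, 200, {"content-type": "application/json"}, b'{"id": 123}')
        print(save_raw(*args))  # first capture writes both files
        print(save_raw(*args))  # second call finds them and skips
        print(verify_raw(raw_dir, "product-123"))
```

Keeping the hash in a sidecar file rather than inside the body means the body stays byte-identical to what the server sent, so any later audit can re-hash it without parsing anything.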

## The catch (the part a single-pass scraper misses)

- The API was lying. Every full pass returned a random ~65% sample of the catalog while always reporting a total of 651, so a single scrape looked complete but was not, and each run returned a different subset. The fix was a saturation scraper: paginate repeatedly, dedupe by id, and stop after consecutive zero-gain passes. Ten passes converged exactly to 651, which showed that the reported total was honest and the pagination was broken. A human had to notice the unique count drifting between runs.
- The engagement moat was empty. The platform ships an elaborate scoring system with tiers and percentiles, but summed engagement across the catalog was effectively zero (views, bookmarks, and reviews all at or near 0). It is infrastructure for engagement the platform has not yet earned.
- The "market leader" was bots. The largest catalog by size drew ~86% of its traffic from one country at ~1.1 pages per visit, the signature of referral or bot traffic, not legal-practitioner attention. Catalog size did not equal audience.
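The saturation loop described in the first bullet above can be sketched like this. It is an illustration under assumptions, not the original scraper: `saturate`, `make_flaky_catalog`, the `patience` default of 3, and the `max_passes` guard are hypothetical, and the flaky catalog is a local simulation of the observed behavior (a random ~65% sample per pass out of 651), not a call to any real endpoint.

```python
import random


def saturate(fetch_pass, patience=3, max_passes=50):
    """Repeat full passes, dedupe by id, stop after `patience` consecutive
    passes that add no new ids (or after `max_passes` as a safety limit).

    `fetch_pass` is any callable returning an iterable of dicts with an "id".
    Returns (records_by_id, passes_made).
    """
    seen = {}
    zero_gain = 0
    passes = 0
    while passes < max_passes and zero_gain < patience:
        passes += 1
        before = len(seen)
        for record in fetch_pass():
            seen.setdefault(record["id"], record)  # keep first copy of each id
        gained = len(seen) - before
        zero_gain = zero_gain + 1 if gained == 0 else 0
    return seen, passes


def make_flaky_catalog(total=651, fraction=0.65, seed=0):
    """Simulate a paginated API whose full pass returns a random subset."""
    rng = random.Random(seed)
    catalog = [{"id": i} for i in range(total)]
    per_pass = int(total * fraction)

    def fetch_pass():
        return rng.sample(catalog, per_pass)

    return fetch_pass


if __name__ == "__main__":
    records, passes = saturate(make_flaky_catalog())
    print(f"{len(records)} unique ids after {passes} passes")
```

The zero-gain stopping rule is a heuristic: a run of empty passes can happen by chance while a few records are still missing. Comparing the unique count against the total the API reports, as was possible here, is the stronger check.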

## Secondary platforms

- Legaltech Hub, LawNext Directory, and r/legaltech were captured by HTML parsing (no clean API), under the same evidence discipline (raw HTML + per-page hash).

## Honest note

This was competitive research. The listing data belongs to the platforms it came from; this export is shared as proof of method and rigor, not as a product.
