diff --git a/docs/superpowers/specs/2026-04-21-facebook-comet-rewrite-design.md b/docs/superpowers/specs/2026-04-21-facebook-comet-rewrite-design.md
new file mode 100644
index 0000000..cd7e0a8
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-21-facebook-comet-rewrite-design.md
@@ -0,0 +1,226 @@
+# Facebook Comet Rewrite Design
+
+## Summary
+
+Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction.
+The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy `require`, `__bbox`, and `marketplace_product_details_page` structures as the main parsing contract.
+
+## Goals
+
+- Replace both Facebook search and item-detail extraction with a current-shape parser.
+- Keep authenticated direct HTTP requests as the primary fetch strategy.
+- Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
+- Detect auth-gated, unavailable, and unknown responses explicitly.
+- Update tests so they model current route markers and failure modes instead of legacy page objects.
+
+## Non-Goals
+
+- Reworking non-Facebook scrapers.
+- Converting the scraper to browser-only automation.
+- Preserving old parser behavior for `marketplace_product_details_page` or `__bbox`-driven item extraction.
+- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.
+
+## Current State
+
+The current implementation in `packages/core/src/scrapers/facebook.ts` still uses authenticated HTTP requests, which remains correct.
+The search path parses embedded script JSON and looks for `marketplace_search.feed_units.edges`.
+The item-detail path is centered on legacy extraction paths such as:
+
+- `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
+- nested `__bbox.require[...]` variations
+- recursive search through `parsed.require`
+
+Live evidence gathered earlier in this session and by the isolated research subagent shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:
+
+- `XCometMarketplaceSearchController`
+- `XCometMarketplacePermalinkController`
+- `routing_namespace":"fb_comet"`
+- `use_ssr_state_manager":true`
+- `ServerJS`
+- `Bootloader`
+- `data-sjs`
+- `data-btmanifest`
+
+The same live investigation also showed that authenticated item pages no longer expose the old `marketplace_product_details_page` marker reliably, while live search still returns usable results.
+
+## Chosen Approach
+
+Use a hybrid Comet-bootstrap parser.
+
+The scraper will:
+
+1. Fetch authenticated HTML directly.
+2. Classify the response using current route and auth markers.
+3. Parse inline bootstrap/state payloads using route-specific probes.
+4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.
+
+This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.
+
+## Design
+
+### Route Classification
+
+Add a small response-classification layer before data extraction.
+It should identify these states from the fetched response URL and HTML:
+
+- `auth_gated`
+- `unavailable`
+- `search`
+- `item`
+- `unknown`
+
+Signals to use:
+
+- final URL containing `/login/` or login-shell text
+- final URL containing `unavailable_product=1`
+- search controller markers such as `XCometMarketplaceSearchController`
+- item controller markers such as `XCometMarketplacePermalinkController`
+- shared Comet markers such as `routing_namespace":"fb_comet"`
+
+This classification layer becomes the top-level contract for both fetch functions.
+
+### Search Extraction
+
+The search path will be rewritten around Comet search-route markers.
+
+Primary behavior:
+
+- fetch the Marketplace search HTML with auth cookies
+- confirm the response class is `search`
+- extract inline bootstrap/state blobs from script tags and page attributes
+- probe for route-specific search payloads associated with `XCometMarketplaceSearchController`
+- map decoded search results into summary listing records
+
+Search summary fields should remain aligned with the current public output shape:
+
+- item URL
+- title
+- formatted price and normalized cents when possible
+- city/address summary when present
+- seller summary when present in the search payload
+- category/status/media fields only when they are present with stable meaning
+
+Fallback behavior:
+
+- if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
+- use item links matching `/marketplace/item/` as the anchor for fallback extraction
+- treat fallback results as summary-only data, not rich detail data
+
+### Item Extraction
+
+The item-detail path will be rewritten around the Comet permalink route.
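
Both the search and item paths begin by confirming the response class, so a minimal sketch of that classification step may help make the contract concrete. The names `ResponseClass` and `classifyResponse` are hypothetical and do not come from the existing scraper. The precedence order (auth and availability signals first, then the item marker, then the search marker) is an assumption, and the "login-shell text" signal is left out because the spec does not say what text to match.

```typescript
// Hypothetical type and function names, used only for illustration.
type ResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

// Classify a fetched response from its final URL and HTML body.
// Assumption: URL-based auth/availability signals take precedence over
// route controller markers found in the HTML.
function classifyResponse(finalUrl: string, html: string): ResponseClass {
  if (finalUrl.includes("/login/")) return "auth_gated";
  if (finalUrl.includes("unavailable_product=1")) return "unavailable";
  if (html.includes("XCometMarketplacePermalinkController")) return "item";
  if (html.includes("XCometMarketplaceSearchController")) return "search";
  // Shared Comet markers alone do not identify a route, so anything
  // without a recognized controller marker is reported as unknown.
  return "unknown";
}
```

Under this sketch, a caller could treat `unknown` as an explicit failure mode rather than an empty result set, in line with the goal of detecting unrecognized responses explicitly.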
+
+Primary behavior:
+
+- fetch the item permalink HTML with auth cookies
+- confirm the response class is `item`
+- extract inline bootstrap/state blobs from script tags and page attributes
+- probe for permalink payloads associated with `XCometMarketplacePermalinkController`
+- decode the richest recoverable item record and map it into `FacebookListingDetails`
+
+Priority item fields:
+
+- item ID and permalink URL
+- title
+- formatted price and normalized cents when possible
+- condition
+- description
+- listed age / creation date when derivable
+- approximate location
+- seller name and seller ID when present
+- listing status when the payload makes it explicit
+
+Fallback behavior:
+
+- if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
+- prioritize title, price, condition, description, location text, and seller module content
+- return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing
+
+### Bootstrap Parsing Strategy
+
+The parser should stop assuming a single stable JSON path.
+Instead, it should work in two phases:
+
+1. Discover candidate bootstrap payloads.
+2. Score candidates against the expected route shape.
+
+Candidate discovery inputs:
+
+- raw `