docs: add facebook comet rewrite design

2026-04-21 23:02:47 -04:00
parent 45cff20377
commit ba889a1f9d
1 changed files with 226 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-21-facebook-comet-rewrite-design.md
+++ b/docs/superpowers/specs/2026-04-21-facebook-comet-rewrite-design.md
@@ -0,0 +1,226 @@
+# Facebook Comet Rewrite Design
+
+## Summary
+
+Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction.
+The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy `require`, `__bbox`, and `marketplace_product_details_page` structures as the main parsing contract.
+
+## Goals
+
+- Replace both Facebook search and item-detail extraction with a current-shape parser.
+- Keep authenticated direct HTTP requests as the primary fetch strategy.
+- Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
+- Detect auth-gated, unavailable, and unknown responses explicitly.
+- Update tests so they model current route markers and failure modes instead of legacy page objects.
+
+## Non-Goals
+
+- Reworking non-Facebook scrapers.
+- Converting the scraper to browser-only automation.
+- Preserving old parser behavior for `marketplace_product_details_page` or `__bbox`-driven item extraction.
+- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.
+
+## Current State
+
+The current implementation in `packages/core/src/scrapers/facebook.ts` still uses authenticated HTTP requests, which remains correct.
+The search path parses embedded script JSON and looks for `marketplace_search.feed_units.edges`.
+The item-detail path is centered on legacy extraction paths such as:
+
+- `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
+- nested `__bbox.require[...]` variations
+- recursive search through `parsed.require`
+
+Live evidence gathered earlier in this session and by the isolated research subagent shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:
+
+- `XCometMarketplaceSearchController`
+- `XCometMarketplacePermalinkController`
+- `routing_namespace":"fb_comet"`
+- `use_ssr_state_manager":true`
+- `ServerJS`
+- `Bootloader`
+- `data-sjs`
+- `data-btmanifest`
+
+The same live investigation also showed that authenticated item pages no longer reliably expose the old `marketplace_product_details_page` marker, while live search still returns usable results.
+
+## Chosen Approach
+
+Use a hybrid Comet-bootstrap parser.
+
+The scraper will:
+
+1. Fetch authenticated HTML directly.
+2. Classify the response using current route and auth markers.
+3. Parse inline bootstrap/state payloads using route-specific probes.
+4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.
+
+This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.
+
+## Design
+
+### Route Classification
+
+Add a small response-classification layer before data extraction.
+It should identify these states from the fetched response URL and HTML:
+
+- `auth_gated`
+- `unavailable`
+- `search`
+- `item`
+- `unknown`
+
+Signals to use:
+
+- final URL containing `/login/`, or HTML containing login-shell text
+- final URL containing `unavailable_product=1`
+- search controller markers such as `XCometMarketplaceSearchController`
+- item controller markers such as `XCometMarketplacePermalinkController`
+- shared Comet markers such as `routing_namespace":"fb_comet"`
+
+This classification layer becomes the top-level contract for both fetch functions.
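
The classification described above could be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: `Classification`, `classifyFacebookResponse`, and the `LOGIN_SHELL_HINTS` value are hypothetical, and the rule that the permalink marker wins when both controller names appear is an assumption.

```typescript
// Hypothetical sketch: the type and function names below are illustrative
// and are not taken from packages/core/src/scrapers/facebook.ts.
type ResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

interface Classification {
  kind: ResponseClass;
  // True when the shared Comet routing marker is present in the HTML.
  comet: boolean;
}

const SEARCH_MARKER = "XCometMarketplaceSearchController";
const ITEM_MARKER = "XCometMarketplacePermalinkController";
const COMET_MARKER = 'routing_namespace":"fb_comet"';

// Assumption: a login shell can be recognized by a login form id. The real
// login-shell text should come from captured fixtures.
const LOGIN_SHELL_HINTS = ['id="login_form"'];

function contains(haystack: string, needle: string): boolean {
  return haystack.indexOf(needle) !== -1;
}

function classifyFacebookResponse(finalUrl: string, html: string): Classification {
  const comet = contains(html, COMET_MARKER);
  const loginShell = LOGIN_SHELL_HINTS.some((hint) => contains(html, hint));

  // Auth and availability are checked first so that a login page or a stale
  // item is never passed on to a data parser.
  if (contains(finalUrl, "/login/") || loginShell) return { kind: "auth_gated", comet };
  if (contains(finalUrl, "unavailable_product=1")) return { kind: "unavailable", comet };

  // Assumption: if both controller names appear, the permalink marker wins.
  if (contains(html, ITEM_MARKER)) return { kind: "item", comet };
  if (contains(html, SEARCH_MARKER)) return { kind: "search", comet };
  return { kind: "unknown", comet };
}
```

Returning the `comet` flag alongside the class is one way to let an `unknown` result still report whether the page looked Comet-driven, which may help when writing error messages.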
+
+### Search Extraction
+
+The search path will be rewritten around Comet search-route markers.
+
+Primary behavior:
+
+- fetch the Marketplace search HTML with auth cookies
+- confirm the response class is `search`
+- extract inline bootstrap/state blobs from script tags and page attributes
+- probe for route-specific search payloads associated with `XCometMarketplaceSearchController`
+- map decoded search results into summary listing records
+
+Search summary fields should remain aligned with the current public output shape:
+
+- item URL
+- title
+- formatted price and normalized cents when possible
+- city/address summary when present
+- seller summary when present in the search payload
+- category/status/media fields only when they are present with stable meaning
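
For the "normalized cents when possible" field, one possible helper is sketched below. `normalizePriceToCents` is a hypothetical name, and the sketch assumes US-style formatting (comma thousands separators, dot decimal separator). Other locales would need separate handling.

```typescript
// Hypothetical helper: converts a formatted price such as "$1,234.50" into
// integer cents. Assumes US-style separators; returns null when no number
// can be recognized (for example "Contact seller").
function normalizePriceToCents(formatted: string): number | null {
  const match = /(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{1,2}))?/.exec(formatted.trim());
  if (match === null) return null;

  // Integer arithmetic avoids floating-point rounding on values like 19.99.
  const whole = parseInt((match[1] ?? "0").replace(/,/g, ""), 10);
  const fraction = match[2] ?? "";
  let cents = 0;
  if (fraction.length === 1) cents = parseInt(fraction, 10) * 10;
  if (fraction.length === 2) cents = parseInt(fraction, 10);
  return whole * 100 + cents;
}
```

Returning `null` instead of guessing keeps the "when possible" wording honest: callers can still show the formatted price when no numeric value is recoverable.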
+
+Fallback behavior:
+
+- if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
+- use item links matching `/marketplace/item/<id>` as the anchor for fallback extraction
+- treat fallback results as summary-only data, not rich detail data
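
The anchor-based fallback could look roughly like the sketch below. `FallbackListingSummary` and `extractFallbackSummaries` are hypothetical names, and the regular expression assumes double-quoted `href` attributes. Real rendered HTML may need a proper HTML parser instead.

```typescript
// Hypothetical summary shape for fallback results: summary-only data.
interface FallbackListingSummary {
  id: string;
  url: string;
  // Visible anchor text with tags stripped; title and price still need to
  // be separated from it by later text-pattern rules.
  text: string;
}

function extractFallbackSummaries(html: string): FallbackListingSummary[] {
  // Assumption: listing anchors use double-quoted href values.
  const anchorPattern =
    /<a\b[^>]*href="[^"]*\/marketplace\/item\/(\d+)[^"]*"[^>]*>([\s\S]*?)<\/a>/g;
  const seen: Record<string, boolean> = {};
  const results: FallbackListingSummary[] = [];

  let match: RegExpExecArray | null;
  while ((match = anchorPattern.exec(html)) !== null) {
    const id = match[1] ?? "";
    // The same listing is often linked more than once; keep the first hit.
    if (id === "" || seen[id]) continue;
    seen[id] = true;

    const text = (match[2] ?? "")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim();
    results.push({ id, url: `https://www.facebook.com/marketplace/item/${id}/`, text });
  }
  return results;
}
```

Rebuilding the URL from the numeric ID drops tracking query parameters, so fallback results compare cleanly against structured results for the same listing.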
+
+### Item Extraction
+
+The item-detail path will be rewritten around the Comet permalink route.
+
+Primary behavior:
+
+- fetch the item permalink HTML with auth cookies
+- confirm the response class is `item`
+- extract inline bootstrap/state blobs from script tags and page attributes
+- probe for permalink payloads associated with `XCometMarketplacePermalinkController`
+- decode the richest recoverable item record and map it into `FacebookListingDetails`
+
+Priority item fields:
+
+- item ID and permalink URL
+- title
+- formatted price and normalized cents when possible
+- condition
+- description
+- listed age / creation date when derivable
+- approximate location
+- seller name and seller ID when present
+- listing status when the payload makes it explicit
+
+Fallback behavior:
+
+- if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
+- prioritize title, price, condition, description, location text, and seller module content
+- return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing
+
+### Bootstrap Parsing Strategy
+
+The parser should stop assuming a single stable JSON path.
+Instead, it should work in two phases:
+
+1. Discover candidate bootstrap payloads.
+2. Score candidates against the expected route shape.
+
+Candidate discovery inputs:
+
+- raw `<script>` contents
+- `data-sjs` and related page attributes
+- `ServerJS` / `Bootloader` inline blobs
+- route controller names
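
As an illustration of the discovery phase, the sketch below collects script bodies that parse as JSON and records whether the tag carried `data-sjs`. The names `BootstrapCandidate` and `discoverBootstrapCandidates` are hypothetical, and treating only whole-body JSON as a candidate is an assumption. Blobs embedded inside JavaScript calls would need extra handling.

```typescript
// Hypothetical candidate record produced by the discovery phase.
interface BootstrapCandidate {
  source: "data-sjs" | "script";
  value: unknown;
}

function discoverBootstrapCandidates(html: string): BootstrapCandidate[] {
  const scriptPattern = /<script\b([^>]*)>([\s\S]*?)<\/script>/g;
  const candidates: BootstrapCandidate[] = [];

  let match: RegExpExecArray | null;
  while ((match = scriptPattern.exec(html)) !== null) {
    const attrs = match[1] ?? "";
    const body = (match[2] ?? "").trim();

    // Assumption: only bodies that are entirely JSON count as candidates.
    const first = body.charAt(0);
    if (first !== "{" && first !== "[") continue;

    try {
      const value: unknown = JSON.parse(body);
      const source = /\bdata-sjs\b/.test(attrs) ? "data-sjs" : "script";
      candidates.push({ source, value });
    } catch {
      // Not valid JSON: skip it rather than fail the whole page.
    }
  }
  return candidates;
}
```

Skipping unparseable scripts instead of throwing keeps one malformed blob from hiding other usable payloads on the same page.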
+
+Candidate scoring for search should favor objects that contain repeated result-card semantics, item IDs, listing links, titles, prices, or location summaries.
+Candidate scoring for item pages should favor objects that contain singular listing semantics, title, price, condition, description, location, seller, or permalink context.
+
+The parser should not depend on one hard-coded object name surviving forever.
+Instead, it should look for route-specific semantic clusters and choose the strongest candidate.
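
A minimal sketch of the scoring phase follows, assuming candidates are already-parsed JSON values. The key lists are hypothetical placeholders: the real payload key names are not established by this spec and would have to come from captured fixtures.

```typescript
// Hypothetical key list for item pages; real payloads may use other names.
const ITEM_SEMANTIC_KEYS = ["title", "price", "condition", "description", "location", "seller"];

// Counts how many wanted keys appear anywhere inside a parsed payload.
function scoreCandidate(value: unknown, wantedKeys: string[], depth: number = 0): number {
  // The depth cap guards against pathological nesting.
  if (depth > 12 || value === null || typeof value !== "object") return 0;

  let score = 0;
  if (Array.isArray(value)) {
    for (const entry of value) score += scoreCandidate(entry, wantedKeys, depth + 1);
    return score;
  }

  const record = value as Record<string, unknown>;
  for (const key of Object.keys(record)) {
    if (wantedKeys.indexOf(key) !== -1) score += 1;
    score += scoreCandidate(record[key], wantedKeys, depth + 1);
  }
  return score;
}

// Returns the highest-scoring candidate, or null when nothing scores above zero.
function pickStrongestCandidate(candidates: unknown[], wantedKeys: string[]): unknown {
  let best: unknown = null;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate, wantedKeys);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Because repeated result cards add to the total, the same function favors list-shaped payloads when given search-oriented keys, which matches the "repeated result-card semantics" criterion. A production version would likely weight keys differently per route.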
+
+### Legacy Removal
+
+The legacy parsing paths in the Facebook scraper should no longer serve as the primary strategy.
+Specifically:
+
+- delete old item-detail extraction paths centered on `marketplace_product_details_page`
+- delete legacy-first `require` / `__bbox` navigation tables
+- delete tests whose only purpose is to preserve those legacy paths
+
+If a minimal legacy compatibility branch remains, it must be a last-resort fallback behind the new route-aware parser and should not shape test fixtures or design decisions.
+
+### Error Handling
+
+Facebook fetches should now fail with one of these explicit, route-aware outcomes:
+
+1. Missing/invalid auth cookie input.
+2. Auth-gated response.
+3. Unavailable or stale item response.
+4. Search or item route detected, but no decodable data found.
+5. Unknown response shape.
+
+Error messages should name the actual class of failure instead of implying that every parse miss is caused by expired cookies.
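
One way to keep the five outcomes distinct is a discriminated union, sketched below. The kind names, the `FacebookFailure` shape, and the message wording are all hypothetical. The point is only that each class of failure gets its own tag and message.

```typescript
// Hypothetical failure kinds, one per outcome in the list above.
type FacebookFailureKind =
  | "invalid_cookies"
  | "auth_gated"
  | "unavailable"
  | "undecodable_route"
  | "unknown_shape";

interface FacebookFailure {
  kind: FacebookFailureKind;
  // Set only when a search or item route was detected.
  route?: "search" | "item";
  message: string;
}

function describeFailure(kind: FacebookFailureKind, route?: "search" | "item"): FacebookFailure {
  switch (kind) {
    case "invalid_cookies":
      return { kind, message: "Facebook auth cookies are missing or malformed." };
    case "auth_gated":
      return { kind, message: "Facebook returned a login page; the session may have expired." };
    case "unavailable":
      return { kind, message: "The listing is unavailable or stale." };
    case "undecodable_route":
      return {
        kind,
        route,
        message: `Detected the ${route ?? "unspecified"} route but found no decodable data.`,
      };
    case "unknown_shape":
      return { kind, message: "The response did not match any known Facebook route shape." };
  }
}
```

Callers can branch on `kind` rather than parsing message text, so a parse miss is never reported as an expired-cookie problem.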
+
+### Testing Strategy
+
+Follow TDD for the rewrite.
+Write failing tests for the new route-aware parser before replacing production code.
+
+Coverage targets:
+
+1. Search responses classify correctly from current Comet controller markers.
+2. Item responses classify correctly from current Comet controller markers.
+3. Login-gated and unavailable responses are detected before parsing.
+4. Search bootstrap parsing produces summary listing results from current-shape fixtures.
+5. Item bootstrap parsing produces rich listing details from current-shape fixtures.
+6. Search fallback extraction works when route markers exist but structured payload decoding fails.
+7. Item fallback extraction works when route markers exist but structured payload decoding fails.
+8. Legacy-only item fixtures are removed or rewritten so they no longer define the contract.
+
+Verification targets after implementation:
+
+- `bun test packages/core/test/facebook-core.test.ts`
+- `bun test packages/core/test/facebook-integration.test.ts`
+- a live authenticated Facebook probe covering search and item routes
+
+## Public API Surface
+
+Keep the current public function names unless the rewrite proves that a signature change is required:
+
+- `fetchFacebookItems(...)`
+- `fetchFacebookItem(...)`
+- `extractFacebookMarketplaceData(...)`
+- `extractFacebookItemData(...)`
+
+The internals should change substantially, but callers should not need a new integration surface for this rewrite.
+
+## Risks
+
+- Facebook may change bootstrap payload naming again. Route/controller markers are more stable than exact nested object paths, but they are still not guaranteed.
+- Search and item pages may each contain multiple partial payloads, making candidate ranking important.
+- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs clear precedence rules.
+- Live fixtures can drift from production quickly, so tests must model route semantics rather than exact one-off payloads where possible.
+
+## Rollout Notes
+
+The code, fixtures, and tests should change together.
+There should be no mixed state where the implementation is Comet-aware but the tests still encode `marketplace_product_details_page` as the primary contract.