# Facebook Comet Rewrite Design

## Summary

Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction. The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy `require`, `__bbox`, and `marketplace_product_details_page` structures as the main parsing contract.

## Goals

- Replace both Facebook search and item-detail extraction with a current-shape parser.
- Keep authenticated direct HTTP requests as the primary fetch strategy.
- Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
- Detect auth-gated, unavailable, and unknown responses explicitly.
- Update tests so they model current route markers and failure modes instead of legacy page objects.

## Non-Goals

- Reworking non-Facebook scrapers.
- Converting the scraper to browser-only automation.
- Preserving old parser behavior for `marketplace_product_details_page` or `__bbox`-driven item extraction.
- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.

## Current State

The current implementation in `packages/core/src/scrapers/facebook.ts` still uses authenticated HTTP requests, which remains correct. The search path parses embedded script JSON and looks for `marketplace_search.feed_units.edges`.
The item-detail path is centered on legacy extraction paths such as:

- `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
- nested `__bbox.require[...]` variations
- recursive search through `parsed.require`

Live evidence gathered earlier in this session and by the isolated research subagent shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:

- `XCometMarketplaceSearchController`
- `XCometMarketplacePermalinkController`
- `routing_namespace":"fb_comet"`
- `use_ssr_state_manager":true`
- `ServerJS`
- `Bootloader`
- `data-sjs`
- `data-btmanifest`

The same live investigation also showed that authenticated item pages no longer expose the old `marketplace_product_details_page` marker reliably, while live search still returns usable results.

## Chosen Approach

Use a hybrid Comet-bootstrap parser. The scraper will:

1. Fetch authenticated HTML directly.
2. Classify the response using current route and auth markers.
3. Parse inline bootstrap/state payloads using route-specific probes.
4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.

This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.

## Design

### Route Classification

Add a small response-classification layer before data extraction.
It should identify these states from the fetched response URL and HTML:

- `auth_gated`
- `unavailable`
- `search`
- `item`
- `unknown`

Signals to use:

- final URL containing `/login/` or login-shell text
- final URL containing `unavailable_product=1`
- search controller markers such as `XCometMarketplaceSearchController`
- item controller markers such as `XCometMarketplacePermalinkController`
- shared Comet markers such as `routing_namespace":"fb_comet"`

This classification layer becomes the top-level contract for both fetch functions.

### Search Extraction

The search path will be rewritten around Comet search-route markers.

Primary behavior:

- fetch the Marketplace search HTML with auth cookies
- confirm the response class is `search`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for route-specific search payloads associated with `XCometMarketplaceSearchController`
- map decoded search results into summary listing records

Search summary fields should remain aligned with the current public output shape:

- item URL
- title
- formatted price and normalized cents when possible
- city/address summary when present
- seller summary when present in the search payload
- category/status/media fields only when they are present with stable meaning

Fallback behavior:

- if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
- use item links matching `/marketplace/item/` as the anchor for fallback extraction
- treat fallback results as summary-only data, not rich detail data

### Item Extraction

The item-detail path will be rewritten around the Comet permalink route.
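
Like the search path, the item path starts from the response class. A minimal sketch of the classification layer described under Route Classification follows. The names `ResponseClass`, `classifyResponse`, and `LOGIN_SHELL_PATTERN` are hypothetical, the login-shell pattern is a placeholder rather than an observed marker, and the precedence order (auth, then unavailable, then route markers) is an assumption the design does not spell out.

```typescript
// Hypothetical sketch: these names are illustrative and are not taken from
// the existing facebook.ts implementation.
type ResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

// Route markers listed in this design.
const SEARCH_MARKER = "XCometMarketplaceSearchController";
const ITEM_MARKER = "XCometMarketplacePermalinkController";

// ASSUMPTION: placeholder for "login-shell text". The real pattern has to be
// taken from observed auth-gated responses.
const LOGIN_SHELL_PATTERN = /id="login_form"/;

function classifyResponse(finalUrl: string, html: string): ResponseClass {
  // ASSUMPTION: auth and availability signals take precedence over route
  // markers, so a login shell is never misread as a search or item page.
  if (finalUrl.includes("/login/") || LOGIN_SHELL_PATTERN.test(html)) {
    return "auth_gated";
  }
  if (finalUrl.includes("unavailable_product=1")) {
    return "unavailable";
  }
  if (html.includes(SEARCH_MARKER)) {
    return "search";
  }
  if (html.includes(ITEM_MARKER)) {
    return "item";
  }
  // Shared Comet markers alone (for example routing_namespace) do not
  // identify a route, so such pages stay "unknown".
  return "unknown";
}
```

Under this sketch, each fetch function would branch on the returned class before attempting any payload decoding, so an `auth_gated` or `unavailable` response surfaces as an explicit state rather than as an empty result.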
Primary behavior:

- fetch the item permalink HTML with auth cookies
- confirm the response class is `item`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for permalink payloads associated with `XCometMarketplacePermalinkController`
- decode the richest recoverable item record and map it into `FacebookListingDetails`

Priority item fields:

- item ID and permalink URL
- title
- formatted price and normalized cents when possible
- condition
- description
- listed age / creation date when derivable
- approximate location
- seller name and seller ID when present
- listing status when the payload makes it explicit

Fallback behavior:

- if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
- prioritize title, price, condition, description, location text, and seller module content
- return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing

### Bootstrap Parsing Strategy

The parser should stop assuming a single stable JSON path. Instead, it should work in two phases:

1. Discover candidate bootstrap payloads.
2. Score candidates against the expected route shape.

Candidate discovery inputs:

- raw `