Facebook Comet Rewrite Design
Summary
Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction.
The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy require, __bbox, and marketplace_product_details_page structures as the main parsing contract.
Goals
- Replace both Facebook search and item-detail extraction with a current-shape parser.
- Keep authenticated direct HTTP requests as the primary fetch strategy.
- Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
- Detect auth-gated, unavailable, and unknown responses explicitly.
- Update tests so they model current route markers and failure modes instead of legacy page objects.
Non-Goals
- Reworking non-Facebook scrapers.
- Converting the scraper to browser-only automation.
- Preserving old parser behavior for marketplace_product_details_page or __bbox-driven item extraction.
- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.
Current State
The current implementation in packages/core/src/scrapers/facebook.ts still uses authenticated HTTP requests, which remains correct.
The search path parses embedded script JSON and looks for marketplace_search.feed_units.edges.
The item-detail path is centered on legacy extraction paths such as:
- parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target
- nested __bbox.require[...] variations
- recursive search through parsed.require
Live evidence gathered earlier in this session and by the isolated research subagent shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:
- XCometMarketplaceSearchController
- XCometMarketplacePermalinkController
- routing_namespace":"fb_comet"
- use_ssr_state_manager":true
- ServerJS
- Bootloader
- data-sjs
- data-btmanifest
The same live investigation also showed that authenticated item pages no longer expose the old marketplace_product_details_page marker reliably, while live search still returns usable results.
Chosen Approach
Use a hybrid Comet-bootstrap parser.
The scraper will:
- Fetch authenticated HTML directly.
- Classify the response using current route and auth markers.
- Parse inline bootstrap/state payloads using route-specific probes.
- Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.
This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.
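The four steps above can be sketched as a single precedence rule: classify first, try the structured payload second, and reach for rendered HTML only when the structured path yields nothing. The sketch below is illustrative only. Every name in it (RouteClass, ExtractionSteps, runExtraction) is hypothetical and not taken from facebook.ts, and the three steps are passed in as functions so the ordering can be read without committing to any parser detail.

```typescript
// Sketch only: all names are hypothetical, not part of the existing scraper.
type RouteClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

interface ExtractionSteps<T> {
  classify: (finalUrl: string, html: string) => RouteClass;
  decodeBootstrap: (html: string) => T | null; // structured Comet payload path
  extractRendered: (html: string) => T | null; // rendered-HTML fallback path
}

type ExtractionOutcome<T> =
  | { kind: "ok"; source: "bootstrap" | "rendered_html"; data: T }
  | { kind: "failed"; reason: "unexpected_route" | "no_decodable_data"; routeClass: RouteClass };

function runExtraction<T>(
  expected: "search" | "item",
  finalUrl: string,
  html: string,
  steps: ExtractionSteps<T>
): ExtractionOutcome<T> {
  // Classify before any parsing is attempted.
  const routeClass = steps.classify(finalUrl, html);
  if (routeClass !== expected) {
    return { kind: "failed", reason: "unexpected_route", routeClass: routeClass };
  }
  // Structured bootstrap/state decoding has priority.
  const decoded = steps.decodeBootstrap(html);
  if (decoded !== null) {
    return { kind: "ok", source: "bootstrap", data: decoded };
  }
  // Rendered-HTML extraction runs only after the route matched and decoding failed.
  const rendered = steps.extractRendered(html);
  if (rendered !== null) {
    return { kind: "ok", source: "rendered_html", data: rendered };
  }
  return { kind: "failed", reason: "no_decodable_data", routeClass: routeClass };
}
```

Recording the source of each result is one possible way to keep fallback data distinguishable from structured data downstream.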
Design
Route Classification
Add a small response-classification layer before data extraction. It should identify these states from the fetched response URL and HTML:
- auth_gated
- unavailable
- search
- item
- unknown
Signals to use:
- final URL containing /login/ or login-shell text
- final URL containing unavailable_product=1
- search controller markers such as XCometMarketplaceSearchController
- item controller markers such as XCometMarketplacePermalinkController
- shared Comet markers such as routing_namespace":"fb_comet"
This classification layer becomes the top-level contract for both fetch functions.
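A minimal sketch of such a classifier, using only the signals listed above. The function name, the result shape, and the precedence order (auth, then unavailable, then item, then search) are assumptions made for illustration. The design does not say what "login-shell text" looks like, so the sketch takes those strings as a caller-supplied list rather than guessing them.

```typescript
// Sketch only: names, result shape, and precedence order are assumptions.
type FacebookResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

interface ClassifiedResponse {
  responseClass: FacebookResponseClass;
  isComet: boolean; // true when the shared Comet routing marker is present
}

function hasMarker(text: string, marker: string): boolean {
  return text.indexOf(marker) !== -1;
}

function classifyFacebookResponse(
  finalUrl: string,
  html: string,
  loginShellMarkers: string[] // assumption: caller supplies known login-shell strings
): ClassifiedResponse {
  const isComet = hasMarker(html, 'routing_namespace":"fb_comet"');

  let loginShell = false;
  for (const marker of loginShellMarkers) {
    if (hasMarker(html, marker)) loginShell = true;
  }

  // Auth and availability signals are checked before route markers so that a
  // login page is never mistaken for a search or item route.
  let responseClass: FacebookResponseClass = "unknown";
  if (hasMarker(finalUrl, "/login/") || loginShell) {
    responseClass = "auth_gated";
  } else if (hasMarker(finalUrl, "unavailable_product=1")) {
    responseClass = "unavailable";
  } else if (hasMarker(html, "XCometMarketplacePermalinkController")) {
    responseClass = "item";
  } else if (hasMarker(html, "XCometMarketplaceSearchController")) {
    responseClass = "search";
  }
  return { responseClass: responseClass, isComet: isComet };
}
```

Whether a single page can carry both controller names is not established by the investigation, so the item-before-search order here is a placeholder that fixtures should confirm or overturn.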
Search Extraction
The search path will be rewritten around Comet search-route markers.
Primary behavior:
- fetch the Marketplace search HTML with auth cookies
- confirm the response class is search
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for route-specific search payloads associated with XCometMarketplaceSearchController
- map decoded search results into summary listing records
Search summary fields should remain aligned with the current public output shape:
- item URL
- title
- formatted price and normalized cents when possible
- city/address summary when present
- seller summary when present in the search payload
- category/status/media fields only when they are present with stable meaning
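"Normalized cents when possible" can be illustrated with a small helper. This is a sketch under stated assumptions, not the existing normalization code: it assumes US-style formatting (comma thousands separator, dot decimal separator), treats the literal text "Free" as zero cents, and returns null whenever no number can be read. The function name is hypothetical.

```typescript
// Sketch only: assumes US-style price formatting; the function name is hypothetical.
function parsePriceToCents(formatted: string): number | null {
  // Assumption: a listing priced as "Free" maps to zero cents.
  if (/^\s*free\s*$/i.test(formatted)) return 0;

  // Whole part with optional thousands separators, then up to two decimals.
  const match = /(\d[\d,]*)(?:\.(\d{1,2}))?/.exec(formatted);
  if (match === null) return null;

  const wholeText = match[1];
  if (wholeText === undefined) return null;
  const whole = parseInt(wholeText.replace(/,/g, ""), 10);

  // Pad a single decimal digit so that ".5" is read as 50 cents.
  const fractionText = match[2] === undefined ? "" : match[2];
  const cents = fractionText.length === 0 ? 0 : parseInt((fractionText + "0").slice(0, 2), 10);

  // Integer arithmetic only, which avoids floating-point rounding.
  return whole * 100 + cents;
}
```

Returning null instead of throwing keeps the formatted price usable in the summary record even when the cents value cannot be derived.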
Fallback behavior:
- if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
- use item links matching /marketplace/item/<id> as the anchor for fallback extraction
- treat fallback results as summary-only data, not rich detail data
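The link-anchored fallback could start from something like the sketch below, which collects unique item IDs from /marketplace/item/<id> links in the rendered HTML. It assumes item IDs are purely numeric and builds absolute URLs on www.facebook.com. The type and function names are hypothetical. Title and price recovery from the surrounding text is deliberately left out, since it depends on markup that has not been captured in fixtures yet.

```typescript
// Sketch only: assumes numeric item IDs; names are hypothetical.
interface FallbackListingLink {
  id: string;
  url: string;
}

function extractFallbackItemLinks(html: string): FallbackListingLink[] {
  const pattern = /\/marketplace\/item\/(\d+)/g;
  const seen: { [id: string]: boolean } = {};
  const links: FallbackListingLink[] = [];

  let match: RegExpExecArray | null = pattern.exec(html);
  while (match !== null) {
    const id = match[1];
    // Keep first-seen order and drop repeats (cards often link the same item twice).
    if (id !== undefined && seen[id] !== true) {
      seen[id] = true;
      links.push({ id: id, url: "https://www.facebook.com/marketplace/item/" + id + "/" });
    }
    match = pattern.exec(html);
  }
  return links;
}
```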
Item Extraction
The item-detail path will be rewritten around the Comet permalink route.
Primary behavior:
- fetch the item permalink HTML with auth cookies
- confirm the response class is item
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for permalink payloads associated with XCometMarketplacePermalinkController
- decode the richest recoverable item record and map it into FacebookListingDetails
Priority item fields:
- item ID and permalink URL
- title
- formatted price and normalized cents when possible
- condition
- description
- listed age / creation date when derivable
- approximate location
- seller name and seller ID when present
- listing status when the payload makes it explicit
Fallback behavior:
- if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
- prioritize title, price, condition, description, location text, and seller module content
- return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing
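The partial-data rule needs a concrete definition of "core user-facing fields". The sketch below assumes, purely for illustration, that title and formatted price are the core pair and that everything else is optional. The field names are hypothetical and do not reflect the actual FacebookListingDetails shape.

```typescript
// Sketch only: field names and the choice of core fields are assumptions.
interface PartialItemFields {
  title?: string;
  formattedPrice?: string;
  condition?: string;
  description?: string;
  locationText?: string;
  sellerName?: string;
}

const CORE_ITEM_FIELDS: Array<keyof PartialItemFields> = ["title", "formattedPrice"];
const ALL_ITEM_FIELDS: Array<keyof PartialItemFields> = [
  "title", "formattedPrice", "condition", "description", "locationText", "sellerName"
];

function isPresent(value: string | undefined): boolean {
  return value !== undefined && value.trim() !== "";
}

// Reports whether a fallback record is worth returning and which fields it lacks.
function assessPartialItem(fields: PartialItemFields): { usable: boolean; missing: string[] } {
  const missing: string[] = [];
  let usable = true;
  for (const key of ALL_ITEM_FIELDS) {
    if (!isPresent(fields[key])) {
      missing.push(key);
      // Only a missing core field makes the record unusable.
      if (CORE_ITEM_FIELDS.indexOf(key) !== -1) usable = false;
    }
  }
  return { usable: usable, missing: missing };
}
```

Surfacing the list of missing fields alongside the record lets callers and tests see how partial a fallback result is.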
Bootstrap Parsing Strategy
The parser should stop assuming a single stable JSON path. Instead, it should work in two phases:
- Discover candidate bootstrap payloads.
- Score candidates against the expected route shape.
Candidate discovery inputs:
- raw <script> contents
- data-sjs and related page attributes
- ServerJS / Bootloader inline blobs
- route controller names
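Candidate discovery over raw script contents might look like the following sketch. It assumes that the interesting payloads are script elements whose entire body is a JSON document, and it records whether each one carries a data-sjs attribute. Blobs embedded inside executable JavaScript are not handled here. The names are hypothetical, and a regex scan is used only to keep the example dependency-free.

```typescript
// Sketch only: assumes whole-script JSON bodies; names are hypothetical.
interface BootstrapCandidate {
  hasDataSjs: boolean;
  payload: unknown;
}

function discoverBootstrapCandidates(html: string): BootstrapCandidate[] {
  const scriptPattern = /<script\b([^>]*)>([\s\S]*?)<\/script>/gi;
  const candidates: BootstrapCandidate[] = [];

  let match: RegExpExecArray | null = scriptPattern.exec(html);
  while (match !== null) {
    const attributes = match[1] || "";
    const body = (match[2] || "").trim();
    const first = body.charAt(0);
    // Cheap pre-filter: only bodies that look like JSON objects or arrays are parsed.
    if (first === "{" || first === "[") {
      try {
        candidates.push({
          hasDataSjs: attributes.indexOf("data-sjs") !== -1,
          payload: JSON.parse(body)
        });
      } catch (_err) {
        // Not valid JSON: skip it, since discovery should never throw on a bad script.
      }
    }
    match = scriptPattern.exec(html);
  }
  return candidates;
}
```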
Candidate scoring for search should favor objects that contain repeated result-card semantics, item IDs, listing links, titles, prices, or location summaries. Candidate scoring for item pages should favor objects that contain singular listing semantics, title, price, condition, description, location, seller, or permalink context.
The parser should not depend on one hard-coded object name surviving forever. Instead, it should look for route-specific semantic clusters and choose the strongest candidate.
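One way to express "semantic clusters" is to count how often a set of wanted keys appears anywhere inside a candidate, instead of following a fixed path. In the sketch below, search scoring rewards repetition (many cards) while item scoring rewards breadth (many distinct fields on one listing). The scoring rule, the depth limit, and all names are assumptions, and the wanted key names passed in would have to come from real fixtures.

```typescript
// Sketch only: the scoring rule, depth limit, and names are assumptions.
type KeyCounts = { [key: string]: number };

function collectKeyCounts(value: unknown, wanted: string[], counts: KeyCounts, depthLeft: number): void {
  if (depthLeft <= 0 || value === null || typeof value !== "object") return;
  if (Array.isArray(value)) {
    for (const element of value) collectKeyCounts(element, wanted, counts, depthLeft - 1);
    return;
  }
  const record = value as { [key: string]: unknown };
  for (const key of Object.keys(record)) {
    if (wanted.indexOf(key) !== -1) counts[key] = (counts[key] || 0) + 1;
    collectKeyCounts(record[key], wanted, counts, depthLeft - 1);
  }
}

function scoreCandidate(candidate: unknown, wanted: string[], mode: "search" | "item"): number {
  const counts: KeyCounts = {};
  collectKeyCounts(candidate, wanted, counts, 12);
  let distinct = 0;
  let total = 0;
  for (const key of Object.keys(counts)) {
    distinct += 1;
    total += counts[key] || 0;
  }
  // Search favors repeated card-like hits; item favors distinct fields.
  return mode === "search" ? total : distinct;
}

function pickStrongestCandidate(
  candidates: unknown[],
  wanted: string[],
  mode: "search" | "item"
): { candidate: unknown; score: number } | null {
  let best: { candidate: unknown; score: number } | null = null;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate, wanted, mode);
    // Candidates with no hits at all are never selected.
    if (score > 0 && (best === null || score > best.score)) {
      best = { candidate: candidate, score: score };
    }
  }
  return best;
}
```

Ties go to the earliest candidate in this sketch. A real implementation would likely need an explicit tie-break, given that pages may contain multiple partial payloads.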
Legacy Removal
The old Facebook scraper should be removed as a primary strategy. Specifically:
- delete old item-detail extraction paths centered on marketplace_product_details_page
- delete legacy-first require / __bbox navigation tables
- delete tests whose only purpose is to preserve those legacy paths
If a minimal legacy compatibility branch remains, it must be a last-resort fallback behind the new route-aware parser and should not shape test fixtures or design decisions.
Error Handling
Facebook responses should now fail with explicit route-aware outcomes:
- Missing/invalid auth cookie input.
- Auth-gated response.
- Unavailable or stale item response.
- Search or item route detected, but no decodable data found.
- Unknown response shape.
Error messages should name the actual class of failure instead of implying that every parse miss is caused by expired cookies.
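The five outcomes map naturally onto a discriminated union, which lets the compiler check that every failure class gets its own message. The sketch below is one possible shape. The type name, the kind strings, and the message wording are all hypothetical.

```typescript
// Sketch only: type name, kind strings, and message wording are hypothetical.
type FacebookScrapeFailure =
  | { kind: "missing_auth_cookies" }
  | { kind: "auth_gated"; finalUrl: string }
  | { kind: "unavailable"; finalUrl: string }
  | { kind: "no_decodable_data"; route: "search" | "item" }
  | { kind: "unknown_response"; finalUrl: string };

function describeFailure(failure: FacebookScrapeFailure): string {
  switch (failure.kind) {
    case "missing_auth_cookies":
      return "Facebook auth cookies are missing or invalid.";
    case "auth_gated":
      return "Facebook returned a login-gated response at " + failure.finalUrl + ".";
    case "unavailable":
      return "The listing is unavailable or stale (" + failure.finalUrl + ").";
    case "no_decodable_data":
      // Deliberately avoids blaming credentials: the route was recognized.
      return "Recognized the " + failure.route + " route but found no decodable listing data.";
    case "unknown_response":
      return "Unrecognized Facebook response shape at " + failure.finalUrl + ".";
    default: {
      // Compile-time exhaustiveness check: adding a new kind breaks the build here.
      const unreachable: never = failure;
      return unreachable;
    }
  }
}
```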
Testing Strategy
Follow TDD for the rewrite. Write failing tests for the new route-aware parser before replacing production code.
Coverage targets:
- Search responses classify correctly from current Comet controller markers.
- Item responses classify correctly from current Comet controller markers.
- Login-gated and unavailable responses are detected before parsing.
- Search bootstrap parsing produces summary listing results from current-shape fixtures.
- Item bootstrap parsing produces rich listing details from current-shape fixtures.
- Search fallback extraction works when route markers exist but structured payload decoding fails.
- Item fallback extraction works when route markers exist but structured payload decoding fails.
- Legacy-only item fixtures are removed or rewritten so they no longer define the contract.
Verification target after implementation:
- bun test packages/core/test/facebook-core.test.ts
- bun test packages/core/test/facebook-integration.test.ts
- a live authenticated Facebook probe covering search and item routes
Public API Surface
Keep the current public function names unless the rewrite proves that a signature change is required:
- fetchFacebookItems(...)
- fetchFacebookItem(...)
- extractFacebookMarketplaceData(...)
- extractFacebookItemData(...)
The internals should change substantially, but callers should not need a new integration surface for this rewrite.
Risks
- Facebook may change bootstrap payload naming again, so route/controller markers are more stable than exact nested object paths but still not guaranteed.
- Search and item pages may each contain multiple partial payloads, making candidate ranking important.
- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs clear precedence rules.
- Live fixtures can drift from production quickly, so tests must model route semantics rather than exact one-off payloads where possible.
Rollout Notes
The code, fixtures, and tests should change together.
There should be no mixed state where the implementation is Comet-aware but the tests still encode marketplace_product_details_page as the primary contract.