Dmytro Stanchiev ba889a1f9d docs: add facebook comet rewrite design

2026-04-21 23:02:47 -04:00

Facebook Comet Rewrite Design

Summary

Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction. The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy require, __bbox, and marketplace_product_details_page structures as the main parsing contract.

Goals

Replace both Facebook search and item-detail extraction with a current-shape parser.
Keep authenticated direct HTTP requests as the primary fetch strategy.
Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
Detect auth-gated, unavailable, and unknown responses explicitly.
Update tests so they model current route markers and failure modes instead of legacy page objects.

Non-Goals

Reworking non-Facebook scrapers.
Converting the scraper to browser-only automation.
Preserving old parser behavior for marketplace_product_details_page or __bbox-driven item extraction.
Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.

Current State

The current implementation in packages/core/src/scrapers/facebook.ts still uses authenticated HTTP requests, which remains correct. The search path parses embedded script JSON and looks for marketplace_search.feed_units.edges. The item-detail path is centered on legacy extraction paths such as:

parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target
nested __bbox.require[...] variations
recursive search through parsed.require

Live evidence, gathered earlier in this work session and independently by an isolated research subagent, shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:

XCometMarketplaceSearchController
XCometMarketplacePermalinkController
routing_namespace":"fb_comet"
use_ssr_state_manager":true
ServerJS
Bootloader
data-sjs
data-btmanifest

The same live investigation also showed that authenticated item pages no longer expose the old marketplace_product_details_page marker reliably, while live search still returns usable results.

Chosen Approach

Use a hybrid Comet-bootstrap parser.

The scraper will:

Fetch authenticated HTML directly.
Classify the response using current route and auth markers.
Parse inline bootstrap/state payloads using route-specific probes.
Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.

This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.
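
The fallback rule in the last step can be sketched as a small wrapper. This is a minimal illustration, not the implementation in facebook.ts: `Decoder`, `extractWithFallback`, and the `source` tag are hypothetical names, and the two decoders are passed in so the sketch makes no claim about how either one works.

```typescript
// Hypothetical types and names for illustration only.
type Decoder<T> = (html: string) => T | null;

interface Extraction<T> {
  source: "bootstrap" | "rendered_html";
  data: T;
}

function extractWithFallback<T>(
  html: string,
  routeMarker: string,
  decodeBootstrap: Decoder<T>,
  decodeRendered: Decoder<T>,
): Extraction<T> | null {
  // Without the route marker the page is not the expected route,
  // so rendered-HTML fallback is not attempted at all.
  if (!html.includes(routeMarker)) return null;

  // Structured bootstrap/state payloads are always tried first.
  const structured = decodeBootstrap(html);
  if (structured !== null) return { source: "bootstrap", data: structured };

  // Markers are present but decoding failed: fall back to rendered HTML.
  const rendered = decodeRendered(html);
  return rendered === null ? null : { source: "rendered_html", data: rendered };
}
```

Tagging the result with its source lets callers treat fallback data as lower-fidelity without inspecting the fields.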

Design

Route Classification

Add a small response-classification layer before data extraction. It should identify these states from the fetched response URL and HTML:

auth_gated
unavailable
search
item
unknown

Signals to use:

final URL containing /login/, or login-shell text in the returned HTML
final URL containing unavailable_product=1
search controller markers such as XCometMarketplaceSearchController
item controller markers such as XCometMarketplacePermalinkController
shared Comet markers such as routing_namespace":"fb_comet"

This classification layer becomes the top-level contract for both fetch functions.
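
A minimal sketch of such a classifier, assuming the states and signals listed above. The names `ResponseClass` and `classifyFacebookResponse` are hypothetical, the login-form pattern is one assumed reading of "login-shell text", and the precedence order (auth first, then availability, then route markers) is a design assumption rather than something the live evidence establishes.

```typescript
// Hypothetical names; not the existing API of facebook.ts.
type ResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

const SEARCH_MARKER = "XCometMarketplaceSearchController";
const ITEM_MARKER = "XCometMarketplacePermalinkController";

// Assumed heuristic for login-shell text: a form that posts to a /login/ path.
const LOGIN_SHELL_PATTERN = /<form[^>]+action="[^"]*\/login\//i;

function classifyFacebookResponse(finalUrl: string, html: string): ResponseClass {
  // Auth and availability signals are checked before route markers so that
  // a gated page is never handed to a data extractor.
  if (finalUrl.includes("/login/") || LOGIN_SHELL_PATTERN.test(html)) {
    return "auth_gated";
  }
  if (finalUrl.includes("unavailable_product=1")) {
    return "unavailable";
  }
  // Checking the item marker before the search marker is an arbitrary choice
  // in this sketch; a page carrying both markers would need a real tiebreak.
  if (html.includes(ITEM_MARKER)) return "item";
  if (html.includes(SEARCH_MARKER)) return "search";
  return "unknown";
}
```

The shared Comet marker is left out of this sketch; it could be added as a precondition for the two route classes if false positives turn out to be a problem.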

Search Extraction

The search path will be rewritten around Comet search-route markers.

Primary behavior:

fetch the Marketplace search HTML with auth cookies
confirm the response class is search
extract inline bootstrap/state blobs from script tags and page attributes
probe for route-specific search payloads associated with XCometMarketplaceSearchController
map decoded search results into summary listing records

Search summary fields should remain aligned with the current public output shape:

item URL
title
formatted price and normalized cents when possible
city/address summary when present
seller summary when present in the search payload
category/status/media fields only when they are present with stable meaning
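
As an illustration of "normalized cents when possible", the sketch below parses a formatted price into integer cents and returns null when it cannot. `normalizePriceToCents` is a hypothetical helper, and it assumes US-style formatting such as "$1,234.50"; other currencies and locales would need additional rules.

```typescript
// Hypothetical helper. Assumes an optional "$", comma thousands separators,
// and at most two decimal digits. Anything else yields null.
function normalizePriceToCents(formatted: string): number | null {
  const pattern = /^\$?\s*(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{1,2}))?$/;
  const match = pattern.exec(formatted.trim());
  if (match === null) return null;

  // Whole-dollar part with thousands separators removed.
  const dollars = Number((match[1] || "0").replace(/,/g, ""));

  // A single decimal digit means tenths of a dollar: ".5" is 50 cents.
  const fraction = match[2] || "";
  const cents = fraction.length === 1 ? Number(fraction) * 10 : Number(fraction || "0");

  return dollars * 100 + cents;
}
```

Returning null instead of guessing keeps the formatted price usable on its own when the string is in an unexpected shape.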

Fallback behavior:

if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
use item links matching /marketplace/item/<id> as the anchor for fallback extraction
treat fallback results as summary-only data, not rich detail data
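
A minimal sketch of the anchor step in this fallback, assuming item IDs are numeric. `FallbackAnchor` and `extractFallbackItemLinks` are hypothetical names, and the canonical URL form built here is an assumption; the sketch only collects and de-duplicates item links and does not attempt to read titles or prices from surrounding text.

```typescript
// Hypothetical shape for a fallback anchor; summary fields would be attached later.
interface FallbackAnchor {
  id: string;
  url: string;
}

function extractFallbackItemLinks(html: string): FallbackAnchor[] {
  const seen = new Set<string>();
  const anchors: FallbackAnchor[] = [];
  // Matches /marketplace/item/<id>, with <id> assumed to be all digits.
  const pattern = /\/marketplace\/item\/(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const id = match[1];
    // The same listing is often linked more than once per card; keep the first.
    if (!id || seen.has(id)) continue;
    seen.add(id);
    anchors.push({ id, url: `https://www.facebook.com/marketplace/item/${id}/` });
  }
  return anchors;
}
```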

Item Extraction

The item-detail path will be rewritten around the Comet permalink route.

Primary behavior:

fetch the item permalink HTML with auth cookies
confirm the response class is item
extract inline bootstrap/state blobs from script tags and page attributes
probe for permalink payloads associated with XCometMarketplacePermalinkController
decode the richest recoverable item record and map it into FacebookListingDetails

Priority item fields:

item ID and permalink URL
title
formatted price and normalized cents when possible
condition
description
listed age / creation date when derivable
approximate location
seller name and seller ID when present
listing status when the payload makes it explicit

Fallback behavior:

if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
prioritize title, price, condition, description, location text, and seller module content
return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing

Bootstrap Parsing Strategy

The parser should stop assuming a single stable JSON path. Instead, it should work in two phases:

Discover candidate bootstrap payloads.
Score candidates against the expected route shape.

Candidate discovery inputs:

raw <script> contents
data-sjs and related page attributes
ServerJS / Bootloader inline blobs
route controller names
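
The discovery phase can be sketched as a scan over inline script bodies. `discoverBootstrapCandidates` is a hypothetical name, and the sketch assumes candidate payloads appear as script bodies that are themselves valid JSON; it does not cover attribute-embedded payloads or JSON wrapped in JavaScript calls, which would need their own probes.

```typescript
// Hypothetical helper: returns every inline <script> body that parses as JSON.
function discoverBootstrapCandidates(html: string): unknown[] {
  const candidates: unknown[] = [];
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptPattern.exec(html)) !== null) {
    const body = (match[1] || "").trim();
    // Cheap pre-filter: only bodies that look like a JSON object or array.
    const first = body.charAt(0);
    if (first !== "{" && first !== "[") continue;
    try {
      candidates.push(JSON.parse(body));
    } catch {
      // Looked like JSON but was not; ignore it and keep scanning.
    }
  }
  return candidates;
}
```

Discovery is deliberately permissive here: it makes no judgment about which payload matters and leaves that to the scoring phase.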

Candidate scoring for search should favor objects that contain repeated result-card semantics, item IDs, listing links, titles, prices, or location summaries. Candidate scoring for item pages should favor objects that contain singular listing semantics, title, price, condition, description, location, seller, or permalink context.

The parser should not depend on one hard-coded object name surviving forever. Instead, it should look for route-specific semantic clusters and choose the strongest candidate.
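
One way to sketch the scoring phase is to count semantic keys anywhere in a candidate. All names below are hypothetical, and the key lists are placeholders: the document does not confirm the real field names in Comet payloads, so a working scorer would need keys taken from current fixtures.

```typescript
type RouteKind = "search" | "item";

// Assumed key names for illustration; not confirmed payload fields.
const SEMANTIC_KEYS: Record<RouteKind, string[]> = {
  search: ["id", "title", "price", "location", "url"],
  item: ["title", "price", "condition", "description", "location", "seller"],
};

// Walks the candidate and counts how often each object key occurs.
function collectKeyCounts(value: unknown, counts: Map<string, number>, depth = 0): void {
  if (depth > 50 || value === null || typeof value !== "object") return;
  if (Array.isArray(value)) {
    for (const entry of value) collectKeyCounts(entry, counts, depth + 1);
    return;
  }
  const record = value as Record<string, unknown>;
  for (const key of Object.keys(record)) {
    counts.set(key, (counts.get(key) || 0) + 1);
    collectKeyCounts(record[key], counts, depth + 1);
  }
}

function scoreCandidate(candidate: unknown, route: RouteKind): number {
  const counts = new Map<string, number>();
  collectKeyCounts(candidate, counts);
  let score = 0;
  for (const key of SEMANTIC_KEYS[route]) {
    const occurrences = counts.get(key) || 0;
    // Search rewards repetition (many result cards); item rewards breadth
    // (one listing with many distinct fields), so each key counts once.
    score += route === "search" ? occurrences : Math.min(occurrences, 1);
  }
  return score;
}

// Returns the highest-scoring candidate, or undefined if none scores above zero.
function pickStrongestCandidate(candidates: unknown[], route: RouteKind): unknown {
  let best: unknown = undefined;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate, route);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Counting keys is crude, but it has the property the design asks for: no single object name has to survive for the right candidate to win.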

Legacy Removal

The old Facebook scraper should be removed as a primary strategy. Specifically:

delete old item-detail extraction paths centered on marketplace_product_details_page
delete legacy-first require / __bbox navigation tables
delete tests whose only purpose is to preserve those legacy paths

If a minimal legacy compatibility branch remains, it must be a last-resort fallback behind the new route-aware parser and should not shape test fixtures or design decisions.

Error Handling

Facebook responses should now fail with explicit route-aware outcomes:

Missing/invalid auth cookie input.
Auth-gated response.
Unavailable or stale item response.
Search or item route detected, but no decodable data found.
Unknown response shape.

Error messages should name the actual class of failure instead of implying that every parse miss is caused by expired cookies.
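
The five outcomes above map naturally onto a discriminated union. The type and function names and the message wording below are hypothetical and shown only as a sketch; the point is that each failure class carries its own tag and message instead of collapsing into a generic cookie error.

```typescript
// Hypothetical failure type; one variant per outcome listed above.
type FacebookScrapeFailure =
  | { kind: "missing_auth" }
  | { kind: "auth_gated"; finalUrl: string }
  | { kind: "unavailable"; finalUrl: string }
  | { kind: "undecodable"; route: "search" | "item" }
  | { kind: "unknown_shape" };

function describeFailure(failure: FacebookScrapeFailure): string {
  switch (failure.kind) {
    case "missing_auth":
      return "Facebook auth cookies are missing or invalid.";
    case "auth_gated":
      return `Facebook returned a login page (${failure.finalUrl}).`;
    case "unavailable":
      return `The listing is unavailable or stale (${failure.finalUrl}).`;
    case "undecodable":
      return `Facebook ${failure.route} route detected, but no decodable data was found.`;
    case "unknown_shape":
      return "Facebook returned an unrecognized response shape.";
  }
}
```

Because the switch is exhaustive over `kind`, adding a new failure variant later produces a compile error here until its message is written.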

Testing Strategy

Follow TDD for the rewrite. Write failing tests for the new route-aware parser before replacing production code.

Coverage targets:

Search responses classify correctly from current Comet controller markers.
Item responses classify correctly from current Comet controller markers.
Login-gated and unavailable responses are detected before parsing.
Search bootstrap parsing produces summary listing results from current-shape fixtures.
Item bootstrap parsing produces rich listing details from current-shape fixtures.
Search fallback extraction works when route markers exist but structured payload decoding fails.
Item fallback extraction works when route markers exist but structured payload decoding fails.
Old legacy-only item fixtures are removed or rewritten so they no longer define the contract.

Verification targets after implementation:

bun test packages/core/test/facebook-core.test.ts
bun test packages/core/test/facebook-integration.test.ts
a live authenticated Facebook probe covering search and item routes

Public API Surface

Keep the current public function names unless the rewrite proves that a signature change is required:

fetchFacebookItems(...)
fetchFacebookItem(...)
extractFacebookMarketplaceData(...)
extractFacebookItemData(...)

The internals should change substantially, but callers should not need a new integration surface for this rewrite.

Risks

Facebook may change bootstrap payload naming again. Route and controller markers are more stable than exact nested object paths, but they are still not guaranteed.
Search and item pages may each contain multiple partial payloads, making candidate ranking important.
Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs clear precedence rules.
Live fixtures can drift from production quickly, so tests must model route semantics rather than exact one-off payloads where possible.

Rollout Notes

The code, fixtures, and tests should change together. There should be no mixed state where the implementation is Comet-aware but the tests still encode marketplace_product_details_page as the primary contract.
