# Facebook Comet Rewrite Design

## Summary

Replace the legacy Facebook Marketplace scraper with a route-aware implementation built
around current Comet bootstrap markers and route-specific extraction.
The new scraper will keep authenticated direct HTTP fetches as the primary transport,
but it will stop treating legacy `require`, `__bbox`, and
`marketplace_product_details_page` structures as the main parsing contract.

## Goals

- Replace both Facebook search and item-detail extraction with a current-shape parser.
- Keep authenticated direct HTTP requests as the primary fetch strategy.
- Parse route-specific Comet bootstrap/state payloads before falling back to
rendered-HTML extraction.
- Detect auth-gated, unavailable, and unknown responses explicitly.
- Update tests so they model current route markers and failure modes instead of legacy
page objects.

## Non-Goals

- Reworking non-Facebook scrapers.
- Converting the scraper to browser-only automation.
- Preserving old parser behavior for `marketplace_product_details_page` or
`__bbox`-driven item extraction.
- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively
before implementation.

## Current State

The current implementation in `packages/core/src/scrapers/facebook.ts` still uses
authenticated HTTP requests, which remains correct.
The search path parses embedded script JSON and looks for
`marketplace_search.feed_units.edges`. The item-detail path is centered on legacy
extraction paths such as:

- `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
- nested `__bbox.require[...]` variations
- recursive search through `parsed.require`

Live evidence gathered earlier in this session and by the isolated research subagent
shows that current Facebook Marketplace pages are Comet route-driven and expose markers
such as:

- `XCometMarketplaceSearchController`
- `XCometMarketplacePermalinkController`
- `routing_namespace":"fb_comet"`
- `use_ssr_state_manager":true`
- `ServerJS`
- `Bootloader`
- `data-sjs`
- `data-btmanifest`

The same live investigation also showed that authenticated item pages no longer expose
the old `marketplace_product_details_page` marker reliably, while live search still
returns usable results.

## Chosen Approach

Use a hybrid Comet-bootstrap parser.

The scraper will:

1. Fetch authenticated HTML directly.
2. Classify the response using current route and auth markers.
3. Parse inline bootstrap/state payloads using route-specific probes.
4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the
payload cannot be decoded into the expected search or item shape.

This keeps the cheaper direct-HTTP transport while shifting the parser contract from
legacy page-object names to current Comet route structure.

## Design

### Route Classification

Add a small response-classification layer before data extraction.
It should identify these states from the fetched response URL and HTML:

- `auth_gated`
- `unavailable`
- `search`
- `item`
- `unknown`

Signals to use:

- final URL containing `/login/`, or login-shell text in the HTML
- final URL containing `unavailable_product=1`
- search controller markers such as `XCometMarketplaceSearchController`
- item controller markers such as `XCometMarketplacePermalinkController`
- shared Comet markers such as `routing_namespace":"fb_comet"`

This classification layer becomes the top-level contract for both fetch functions.
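As a rough illustration of the classification layer, the sketch below maps a final URL and HTML body onto the five states. The names `ResponseClass` and `classifyFacebookResponse`, the login-shell phrase, and the choice to check the item marker before the search marker are all assumptions made for this example, not existing code. The shared `routing_namespace` marker is left out of the sketch for brevity.

```typescript
// Hypothetical names: ResponseClass and classifyFacebookResponse are illustrative only.
type ResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

const SEARCH_MARKER = "XCometMarketplaceSearchController";
const ITEM_MARKER = "XCometMarketplacePermalinkController";
// Assumption: the real login-shell text may differ and should come from fixtures.
const LOGIN_SHELL_PATTERN = /log in to continue/i;

function classifyFacebookResponse(finalUrl: string, html: string): ResponseClass {
  // Auth and availability checks run first so a login shell is never parsed as data.
  if (finalUrl.includes("/login/") || LOGIN_SHELL_PATTERN.test(html)) {
    return "auth_gated";
  }
  if (finalUrl.includes("unavailable_product=1")) {
    return "unavailable";
  }
  // Assumption: if both controller markers appear, the item route wins.
  if (html.includes(ITEM_MARKER)) return "item";
  if (html.includes(SEARCH_MARKER)) return "search";
  return "unknown";
}
```

Both fetch functions could call this once per response and branch on the result before any payload parsing.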

### Search Extraction

The search path will be rewritten around Comet search-route markers.

Primary behavior:

- fetch the Marketplace search HTML with auth cookies
- confirm the response class is `search`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for route-specific search payloads associated with
`XCometMarketplaceSearchController`
- map decoded search results into summary listing records

Search summary fields should remain aligned with the current public output shape:

- item URL
- title
- formatted price and normalized cents when possible
- city/address summary when present
- seller summary when present in the search payload
- category/status/media fields only when they are present with stable meaning
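The "normalized cents when possible" field could be derived with a small helper like the one below. This is a minimal sketch, assuming US-style formatting such as `$1,234.50` and assuming that a price shown as `Free` should map to zero. `parsePriceToCents` is a hypothetical name, and it returns `null` whenever no number can be recovered.

```typescript
// Hypothetical helper: assumes US-style prices ("$1,234.50") and treats "Free" as 0.
function parsePriceToCents(formatted: string): number | null {
  const text = formatted.trim();
  if (/^free$/i.test(text)) return 0;
  // Digits with optional thousands separators, then an optional 1-2 digit fraction.
  const match = text.match(/(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{1,2}))?/);
  if (match === null) return null;
  const [, wholePart = "", fractionPart = ""] = match;
  const whole = Number(wholePart.replace(/,/g, ""));
  // Pad "5" to "50" so "$7.5" becomes 750 cents rather than 705.
  const fraction = Number(fractionPart.padEnd(2, "0"));
  return whole * 100 + fraction;
}
```

Returning `null` instead of guessing keeps the formatted price usable on its own when normalization is not possible.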

Fallback behavior:

- if search route markers are present but structured payload decoding fails, extract
listing summaries from rendered HTML anchors and text patterns
- use item links matching `/marketplace/item/<id>` as the anchor for fallback extraction
- treat fallback results as summary-only data, not rich detail data
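The anchor-based fallback might start from something like the sketch below, which only recovers item IDs and canonical URLs. It assumes numeric item IDs and double-quoted `href` attributes. `FallbackSummary` and `extractFallbackSummaries` are hypothetical names, and title or price recovery from surrounding text is deliberately left out.

```typescript
// Hypothetical shape: a summary-only record recovered from rendered anchors.
interface FallbackSummary {
  id: string;
  url: string;
}

function extractFallbackSummaries(html: string): FallbackSummary[] {
  const seen = new Map<string, FallbackSummary>();
  // Assumption: hrefs are double-quoted and item IDs are numeric.
  const hrefPattern = /href="(?:https:\/\/www\.facebook\.com)?\/marketplace\/item\/(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    const id = match[1];
    // The same listing is often linked more than once; keep the first occurrence.
    if (id !== undefined && !seen.has(id)) {
      seen.set(id, { id, url: `https://www.facebook.com/marketplace/item/${id}/` });
    }
  }
  return Array.from(seen.values());
}
```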

### Item Extraction

The item-detail path will be rewritten around the Comet permalink route.

Primary behavior:

- fetch the item permalink HTML with auth cookies
- confirm the response class is `item`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for permalink payloads associated with `XCometMarketplacePermalinkController`
- decode the richest recoverable item record and map it into `FacebookListingDetails`

Priority item fields:

- item ID and permalink URL
- title
- formatted price and normalized cents when possible
- condition
- description
- listed age / creation date when derivable
- approximate location
- seller name and seller ID when present
- listing status when the payload makes it explicit

Fallback behavior:

- if permalink route markers are present but no stable payload object is decodable,
extract data from rendered HTML text structure
- prioritize title, price, condition, description, location text, and seller module
content
- return partial item data when core user-facing fields are present rather than failing
solely because deeper commerce metadata is missing
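One way to express the partial-data rule is a small acceptance gate, sketched below. `PartialItem` is a stand-in for illustration and not the real `FacebookListingDetails` type, and treating title plus price as the "core user-facing fields" is an assumption the implementation may want to revisit.

```typescript
// Hypothetical stand-in type; the real FacebookListingDetails shape is not shown here.
interface PartialItem {
  title?: string;
  price?: string;
  condition?: string;
  description?: string;
  location?: string;
  sellerName?: string;
}

// Assumption: title and price together count as the core user-facing fields.
function acceptPartialItem(candidate: PartialItem): PartialItem | null {
  const hasCore = Boolean(candidate.title?.trim()) && Boolean(candidate.price?.trim());
  if (!hasCore) return null;
  // Drop blank strings so callers can tell "missing" apart from "present".
  const cleaned: PartialItem = {};
  for (const key of Object.keys(candidate) as (keyof PartialItem)[]) {
    const value = candidate[key]?.trim();
    if (value) cleaned[key] = value;
  }
  return cleaned;
}
```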

### Bootstrap Parsing Strategy

The parser should stop assuming a single stable JSON path.
Instead, it should work in two phases:

1. Discover candidate bootstrap payloads.
2. Score candidates against the expected route shape.

Candidate discovery inputs:

- raw `<script>` contents
- `data-sjs` and related page attributes
- `ServerJS` / `Bootloader` inline blobs
- route controller names
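For the discovery phase, a minimal sketch covering only the raw `<script>` input could look like this. It assumes that the interesting blobs are script bodies that parse as JSON, which is an assumption about current pages rather than a confirmed contract. `discoverCandidates` is a hypothetical name, and attribute-based inputs such as `data-sjs` values are not handled here.

```typescript
// Hypothetical helper: collects every <script> body that parses as JSON.
function discoverCandidates(html: string): unknown[] {
  const candidates: unknown[] = [];
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptPattern.exec(html)) !== null) {
    const body = (match[1] ?? "").trim();
    // Only bodies that look like a JSON object or array are worth parsing.
    if (!body.startsWith("{") && !body.startsWith("[")) continue;
    try {
      candidates.push(JSON.parse(body));
    } catch {
      // Plain JavaScript or truncated JSON is skipped, not treated as a failure.
    }
  }
  return candidates;
}
```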

Candidate scoring for search should favor objects that contain repeated result-card
semantics, item IDs, listing links, titles, prices, or location summaries.
Candidate scoring for item pages should favor objects that contain singular listing
semantics, title, price, condition, description, location, seller, or permalink context.

The parser should not depend on one hard-coded object name surviving forever.
Instead, it should look for route-specific semantic clusters and choose the strongest
candidate.
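The scoring phase can be sketched as key counting over each candidate. The key lists below are placeholders, since real payload key names are an assumption until current fixtures confirm them. `scoreCandidate` and `pickStrongest` are hypothetical names, and the cap of five repeats for search is an arbitrary illustrative choice.

```typescript
// Hypothetical key lists; replace with key names observed in current fixtures.
const SEARCH_KEYS = ["id", "title", "price", "location"];
const ITEM_KEYS = ["title", "price", "condition", "description", "location", "seller"];

type Route = "search" | "item";

// Counts how often each object key appears anywhere in the candidate tree.
function collectKeys(value: unknown, counts: Map<string, number>): void {
  if (Array.isArray(value)) {
    for (const entry of value) collectKeys(entry, counts);
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      counts.set(key, (counts.get(key) ?? 0) + 1);
      collectKeys(child, counts);
    }
  }
}

function scoreCandidate(candidate: unknown, route: Route): number {
  const counts = new Map<string, number>();
  collectKeys(candidate, counts);
  let score = 0;
  for (const key of route === "search" ? SEARCH_KEYS : ITEM_KEYS) {
    const seen = counts.get(key) ?? 0;
    // Search rewards repeated cards (capped); item pages reward breadth of fields.
    score += route === "search" ? Math.min(seen, 5) : seen > 0 ? 1 : 0;
  }
  return score;
}

// Returns the highest-scoring candidate, or undefined when nothing scores above zero.
function pickStrongest(candidates: unknown[], route: Route): unknown {
  let best: unknown = undefined;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate, route);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

A zero-score result maps naturally onto the "route detected, but no decodable data found" outcome, which is where the rendered-HTML fallback would take over.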

### Legacy Removal

The legacy Facebook parsing paths should be removed as the primary strategy.
Specifically:

- delete old item-detail extraction paths centered on `marketplace_product_details_page`
- delete legacy-first `require` / `__bbox` navigation tables
- delete tests whose only purpose is to preserve those legacy paths

If a minimal legacy compatibility branch remains, it must be a last-resort fallback
behind the new route-aware parser and should not shape test fixtures or design
decisions.

### Error Handling

Facebook responses should now fail with explicit route-aware outcomes:

1. Missing/invalid auth cookie input.
2. Auth-gated response.
3. Unavailable or stale item response.
4. Search or item route detected, but no decodable data found.
5. Unknown response shape.

Error messages should name the actual class of failure instead of implying that every
parse miss is caused by expired cookies.
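A single error type with a discriminating `kind` is one way to carry the five outcomes above. The sketch below uses hypothetical names (`FacebookFailureKind`, `FacebookScrapeError`) and placeholder message text. It is not the existing error surface of the scraper.

```typescript
// Hypothetical names; the five kinds mirror the outcomes listed above.
type FacebookFailureKind =
  | "invalid_auth_input"
  | "auth_gated"
  | "unavailable"
  | "no_decodable_data"
  | "unknown_response";

// Placeholder wording; each message names its own failure class.
const FAILURE_MESSAGES: Record<FacebookFailureKind, string> = {
  invalid_auth_input: "Facebook auth cookies are missing or malformed",
  auth_gated: "Facebook returned a login-gated response",
  unavailable: "Facebook reported the item as unavailable",
  no_decodable_data: "Route detected but no decodable payload was found",
  unknown_response: "Facebook returned an unrecognized response shape",
};

class FacebookScrapeError extends Error {
  readonly kind: FacebookFailureKind;

  constructor(kind: FacebookFailureKind, detail: string) {
    super(`${FAILURE_MESSAGES[kind]}: ${detail}`);
    this.name = "FacebookScrapeError";
    this.kind = kind;
  }
}
```

Callers can then branch on `error.kind` instead of matching message strings, for example prompting for fresh cookies only when the kind is `auth_gated` or `invalid_auth_input`.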

### Testing Strategy

Follow TDD for the rewrite.
Write failing tests for the new route-aware parser before replacing production code.

Coverage targets:

1. Search responses classify correctly from current Comet controller markers.
2. Item responses classify correctly from current Comet controller markers.
3. Login-gated and unavailable responses are detected before parsing.
4. Search bootstrap parsing produces summary listing results from current-shape
fixtures.
5. Item bootstrap parsing produces rich listing details from current-shape fixtures.
6. Search fallback extraction works when route markers exist but structured payload
decoding fails.
7. Item fallback extraction works when route markers exist but structured payload
decoding fails.
8. Legacy-only item fixtures are removed or rewritten so they no longer define the
contract.

Verification targets after implementation:

- `bun test packages/core/test/facebook-core.test.ts`
- `bun test packages/core/test/facebook-integration.test.ts`
- a live authenticated Facebook probe covering search and item routes

## Public API Surface

Keep the current public function names unless the rewrite proves that a signature change
is required:

- `fetchFacebookItems(...)`
- `fetchFacebookItem(...)`
- `extractFacebookMarketplaceData(...)`
- `extractFacebookItemData(...)`

The internals should change substantially, but callers should not need a new integration
surface for this rewrite.

## Risks

- Facebook may change bootstrap payload naming again, so route/controller markers are
more stable than exact nested object paths but still not guaranteed.
- Search and item pages may each contain multiple partial payloads, making candidate
ranking important.
- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs
clear precedence rules.
- Live fixtures can drift from production quickly, so tests must model route semantics
rather than exact one-off payloads where possible.

## Rollout Notes

The code, fixtures, and tests should change together.
There should be no mixed state where the implementation is Comet-aware but the tests
still encode `marketplace_product_details_page` as the primary contract.