docs: add facebook comet rewrite design
@@ -0,0 +1,226 @@
# Facebook Comet Rewrite Design

## Summary

Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction.
The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy `require`, `__bbox`, and `marketplace_product_details_page` structures as the main parsing contract.
## Goals

- Replace both Facebook search and item-detail extraction with a current-shape parser.
- Keep authenticated direct HTTP requests as the primary fetch strategy.
- Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
- Detect auth-gated, unavailable, and unknown responses explicitly.
- Update tests so they model current route markers and failure modes instead of legacy page objects.
## Non-Goals

- Reworking non-Facebook scrapers.
- Converting the scraper to browser-only automation.
- Preserving old parser behavior for `marketplace_product_details_page` or `__bbox`-driven item extraction.
- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.
## Current State

The current implementation in `packages/core/src/scrapers/facebook.ts` still uses authenticated HTTP requests, which remains correct.
The search path parses embedded script JSON and looks for `marketplace_search.feed_units.edges`.
The item-detail path is centered on legacy extraction paths such as:

- `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
- nested `__bbox.require[...]` variations
- recursive search through `parsed.require`

Live evidence gathered earlier in this session and by the isolated research subagent shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:

- `XCometMarketplaceSearchController`
- `XCometMarketplacePermalinkController`
- `routing_namespace":"fb_comet"`
- `use_ssr_state_manager":true`
- `ServerJS`
- `Bootloader`
- `data-sjs`
- `data-btmanifest`

The same live investigation also showed that authenticated item pages no longer expose the old `marketplace_product_details_page` marker reliably, while live search still returns usable results.
## Chosen Approach

Use a hybrid Comet-bootstrap parser.

The scraper will:

1. Fetch authenticated HTML directly.
2. Classify the response using current route and auth markers.
3. Parse inline bootstrap/state payloads using route-specific probes.
4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.

This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.
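As a rough illustration of steps 2 through 4, the control flow could be sketched as follows. All names here (`RouteClass`, `ExtractionSteps`, `extractWithFallback`) are hypothetical and not part of the existing code; the fetch step is left out so the sketch stays free of network access, and the three steps are injected so each can be tested in isolation.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the existing API.
type RouteClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

interface ExtractionSteps<T> {
  classify(finalUrl: string, html: string): RouteClass;
  parseBootstrap(html: string): T | null;
  parseRenderedHtml(html: string): T | null;
}

type ExtractionResult<T> =
  | { ok: true; source: "bootstrap" | "rendered_html"; data: T }
  | { ok: false; reason: RouteClass | "no_decodable_data" };

function extractWithFallback<T>(
  expected: "search" | "item",
  finalUrl: string,
  html: string,
  steps: ExtractionSteps<T>
): ExtractionResult<T> {
  // Step 2: classify first, so auth-gated or unavailable pages are never parsed.
  const route = steps.classify(finalUrl, html);
  if (route !== expected) return { ok: false, reason: route };

  // Step 3: structured bootstrap/state payloads are the primary source.
  const decoded = steps.parseBootstrap(html);
  if (decoded !== null) return { ok: true, source: "bootstrap", data: decoded };

  // Step 4: rendered-HTML fallback runs only after the route was confirmed.
  const rendered = steps.parseRenderedHtml(html);
  if (rendered !== null) return { ok: true, source: "rendered_html", data: rendered };

  return { ok: false, reason: "no_decodable_data" };
}
```

Tagging each result with its `source` is one possible way to let callers and tests distinguish rich bootstrap data from summary-level fallback data.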
## Design

### Route Classification

Add a small response-classification layer before data extraction.
It should identify these states from the fetched response URL and HTML:

- `auth_gated`
- `unavailable`
- `search`
- `item`
- `unknown`

Signals to use:

- final URL containing `/login/` or login-shell text
- final URL containing `unavailable_product=1`
- search controller markers such as `XCometMarketplaceSearchController`
- item controller markers such as `XCometMarketplacePermalinkController`
- shared Comet markers such as `routing_namespace":"fb_comet"`

This classification layer becomes the top-level contract for both fetch functions.
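A minimal sketch of such a classifier, assuming plain substring checks are sufficient. The names `ResponseClass`, `classifyFacebookResponse`, and `looksLikeLoginShell` are hypothetical. The login-shell heuristic (a form posting to a `/login/` action) and the URL tie-break used when both controller markers appear are assumptions that would need checking against live responses. The shared `fb_comet` marker is not used here.

```typescript
// Hypothetical sketch: names and heuristics are assumptions, not the existing API.
type ResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

const SEARCH_MARKER = "XCometMarketplaceSearchController";
const ITEM_MARKER = "XCometMarketplacePermalinkController";

// Assumed heuristic for "login-shell text": a form whose action targets /login/.
function looksLikeLoginShell(html: string): boolean {
  return /<form\b[^>]*\baction="[^"]*\/login\//i.test(html);
}

function classifyFacebookResponse(finalUrl: string, html: string): ResponseClass {
  // Auth and availability checks come first so gated pages are never parsed as data.
  if (finalUrl.indexOf("/login/") !== -1 || looksLikeLoginShell(html)) return "auth_gated";
  if (finalUrl.indexOf("unavailable_product=1") !== -1) return "unavailable";

  const hasSearch = html.indexOf(SEARCH_MARKER) !== -1;
  const hasItem = html.indexOf(ITEM_MARKER) !== -1;

  // Assumption: if both markers appear, the final URL path decides the route.
  if (hasSearch && hasItem) {
    return finalUrl.indexOf("/marketplace/item/") !== -1 ? "item" : "search";
  }
  if (hasItem) return "item";
  if (hasSearch) return "search";
  return "unknown";
}
```

Both fetch functions could call this once and branch on the result before any payload decoding is attempted.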
### Search Extraction

The search path will be rewritten around Comet search-route markers.

Primary behavior:

- fetch the Marketplace search HTML with auth cookies
- confirm the response class is `search`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for route-specific search payloads associated with `XCometMarketplaceSearchController`
- map decoded search results into summary listing records

Search summary fields should remain aligned with the current public output shape:

- item URL
- title
- formatted price and normalized cents when possible
- city/address summary when present
- seller summary when present in the search payload
- category/status/media fields only when they are present with stable meaning

Fallback behavior:

- if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
- use item links matching `/marketplace/item/<id>` as the anchor for fallback extraction
- treat fallback results as summary-only data, not rich detail data
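The link-anchored part of the search fallback could look roughly like this. It is a sketch under several assumptions: item IDs are numeric, links appear in double-quoted `href` attributes, and the canonical URL form is `https://www.facebook.com/marketplace/item/<id>/`. The names `FallbackSearchSummary` and `extractFallbackItemLinks` are hypothetical, and title or price text extraction is deliberately left out.

```typescript
// Hypothetical sketch: summary-only records keyed by item links found in rendered HTML.
interface FallbackSearchSummary {
  id: string;
  url: string;
}

function extractFallbackItemLinks(html: string): FallbackSearchSummary[] {
  // Assumption: IDs are numeric and hrefs are double-quoted.
  const pattern = /href="[^"]*\/marketplace\/item\/(\d+)/g;
  const seen: { [id: string]: boolean } = {};
  const results: FallbackSearchSummary[] = [];

  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const id = match[1];
    // A result card usually links to the same item more than once; keep the first.
    if (seen[id]) continue;
    seen[id] = true;
    results.push({ id, url: `https://www.facebook.com/marketplace/item/${id}/` });
  }
  return results;
}
```

Returning only `id` and `url` keeps the fallback honest about being summary-level data; richer fields would come from the bootstrap path.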
### Item Extraction

The item-detail path will be rewritten around the Comet permalink route.

Primary behavior:

- fetch the item permalink HTML with auth cookies
- confirm the response class is `item`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for permalink payloads associated with `XCometMarketplacePermalinkController`
- decode the richest recoverable item record and map it into `FacebookListingDetails`

Priority item fields:

- item ID and permalink URL
- title
- formatted price and normalized cents when possible
- condition
- description
- listed age / creation date when derivable
- approximate location
- seller name and seller ID when present
- listing status when the payload makes it explicit

Fallback behavior:

- if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
- prioritize title, price, condition, description, location text, and seller module content
- return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing
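Two small pieces of the item path lend themselves to a sketch: normalizing a formatted price into cents, and deciding whether a partial record is worth returning. Everything here is hypothetical. `PartialItemDetails` is an illustrative stand-in rather than the real `FacebookListingDetails` shape, `priceToCents` assumes US-style formatting such as `$1,234.56`, and treating title plus price as the "core" fields is an assumption the design does not pin down.

```typescript
// Hypothetical stand-in; the real FacebookListingDetails type may differ.
interface PartialItemDetails {
  id: string;
  title?: string;
  priceFormatted?: string;
  priceCents?: number;
  condition?: string;
  description?: string;
}

// Assumes US-style formatting ("$1,234.56"); other locales are not handled.
function priceToCents(formatted: string): number | null {
  const match = /(\d[\d,]*)(?:\.(\d{1,2}))?/.exec(formatted);
  if (match === null) return null; // e.g. "Free" or an empty string
  const dollars = parseInt(match[1].replace(/,/g, ""), 10);
  const fraction = match[2] === undefined ? "00" : match[2];
  // Pad a single fractional digit so "5.5" means 50 cents, not 5.
  const cents = parseInt(fraction.length === 1 ? fraction + "0" : fraction, 10);
  return dollars * 100 + cents;
}

// Assumption: title and formatted price count as the core user-facing fields.
function isUsablePartialItem(item: PartialItemDetails): boolean {
  return Boolean(item.title) && Boolean(item.priceFormatted);
}
```

With these assumptions, a fallback record that carries a title and a price is returned even when condition, description, or seller data is missing.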
### Bootstrap Parsing Strategy

The parser should stop assuming a single stable JSON path.
Instead, it should work in two phases:

1. Discover candidate bootstrap payloads.
2. Score candidates against the expected route shape.

Candidate discovery inputs:

- raw `<script>` contents
- `data-sjs` and related page attributes
- `ServerJS` / `Bootloader` inline blobs
- route controller names

Candidate scoring for search should favor objects that contain repeated result-card semantics, item IDs, listing links, titles, prices, or location summaries.
Candidate scoring for item pages should favor objects that contain singular listing semantics, title, price, condition, description, location, seller, or permalink context.

The parser should not depend on one hard-coded object name surviving forever.
Instead, it should look for route-specific semantic clusters and choose the strongest candidate.
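The two phases could be sketched as below. This is only an illustration: it discovers candidates from `<script>` bodies that parse as JSON and scores them by counting hint keys anywhere in the tree. The function names and the hint key names passed in (for example `title` or `price`) are assumptions, since the real payload keys are not documented here, and a production scorer would likely weight signals rather than count them.

```typescript
// Hypothetical sketch: key names used as hints are assumptions about payload shape.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Phase 1: collect every <script> body that parses as a JSON object or array.
function discoverCandidates(html: string): Json[] {
  const pattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  const candidates: Json[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const body = match[1].trim();
    if (body.charAt(0) !== "{" && body.charAt(0) !== "[") continue;
    try {
      candidates.push(JSON.parse(body) as Json);
    } catch {
      // Not valid JSON; ignore this script tag.
    }
  }
  return candidates;
}

// Phase 2: count how many hint keys occur anywhere inside a candidate.
function scoreCandidate(value: Json, hints: string[]): number {
  if (value === null || typeof value !== "object") return 0;
  let score = 0;
  if (Array.isArray(value)) {
    for (const child of value) score += scoreCandidate(child, hints);
    return score;
  }
  for (const key of Object.keys(value)) {
    if (hints.indexOf(key) !== -1) score += 1;
    score += scoreCandidate(value[key], hints);
  }
  return score;
}

// Choose the strongest candidate; null when nothing matches any hint.
function pickStrongest(candidates: Json[], hints: string[]): Json | null {
  let best: Json | null = null;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate, hints);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Because the scorer looks at key clusters instead of a fixed path, renaming a wrapper object in the payload would not by itself break candidate selection.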
### Legacy Removal

The old Facebook scraper should be removed as a primary strategy.
Specifically:

- delete old item-detail extraction paths centered on `marketplace_product_details_page`
- delete legacy-first `require` / `__bbox` navigation tables
- delete tests whose only purpose is to preserve those legacy paths

If a minimal legacy compatibility branch remains, it must be a last-resort fallback behind the new route-aware parser and should not shape test fixtures or design decisions.
### Error Handling

Facebook responses should now fail with explicit route-aware outcomes:

1. Missing/invalid auth cookie input.
2. Auth-gated response.
3. Unavailable or stale item response.
4. Search or item route detected, but no decodable data found.
5. Unknown response shape.

Error messages should name the actual class of failure instead of implying that every parse miss is caused by expired cookies.
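One way to keep messages tied to the failure class is a small discriminated set of failure kinds with a single message function. The kind names, `describeFailure`, and the message wording are all hypothetical; only the five outcome classes come from the list above.

```typescript
// Hypothetical sketch: kind names and message text are illustrative only.
type FacebookFailureKind =
  | "missing_auth"
  | "auth_gated"
  | "unavailable"
  | "no_decodable_data"
  | "unknown_shape";

function describeFailure(kind: FacebookFailureKind, route?: "search" | "item"): string {
  switch (kind) {
    case "missing_auth":
      return "Facebook auth cookies are missing or invalid.";
    case "auth_gated":
      return "Facebook returned a login page; the session is not authenticated.";
    case "unavailable":
      return "The Facebook listing is unavailable or no longer exists.";
    case "no_decodable_data":
      // The route was recognized, so this is a parser problem rather than an auth problem.
      return `Facebook ${route === undefined ? "route" : route + " route"} detected, but no decodable data was found.`;
    case "unknown_shape":
      return "Facebook returned an unrecognized response shape.";
  }
}
```

Only the first two kinds mention authentication, so a parse miss on a recognized route is never reported as a cookie problem.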
### Testing Strategy

Follow TDD for the rewrite.
Write failing tests for the new route-aware parser before replacing production code.

Coverage targets:

1. Search responses classify correctly from current Comet controller markers.
2. Item responses classify correctly from current Comet controller markers.
3. Login-gated and unavailable responses are detected before parsing.
4. Search bootstrap parsing produces summary listing results from current-shape fixtures.
5. Item bootstrap parsing produces rich listing details from current-shape fixtures.
6. Search fallback extraction works when route markers exist but structured payload decoding fails.
7. Item fallback extraction works when route markers exist but structured payload decoding fails.
8. Legacy-only item fixtures are removed or rewritten so they no longer define the contract.

Verification targets after implementation:

- `bun test packages/core/test/facebook-core.test.ts`
- `bun test packages/core/test/facebook-integration.test.ts`
- a live authenticated Facebook probe covering search and item routes
## Public API Surface

Keep the current public function names unless the rewrite proves that a signature change is required:

- `fetchFacebookItems(...)`
- `fetchFacebookItem(...)`
- `extractFacebookMarketplaceData(...)`
- `extractFacebookItemData(...)`

The internals should change substantially, but callers should not need a new integration surface for this rewrite.
## Risks

- Facebook may change bootstrap payload naming again. Route/controller markers are more stable than exact nested object paths, but they are still not guaranteed.
- Search and item pages may each contain multiple partial payloads, making candidate ranking important.
- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs clear precedence rules.
- Live fixtures can drift from production quickly, so tests must model route semantics rather than exact one-off payloads where possible.
## Rollout Notes

The code, fixtures, and tests should change together.
There should be no mixed state where the implementation is Comet-aware but the tests still encode `marketplace_product_details_page` as the primary contract.