# Facebook Comet Rewrite Design

## Summary

Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction.
The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy `require`, `__bbox`, and `marketplace_product_details_page` structures as the main parsing contract.

## Goals

- Replace both Facebook search and item-detail extraction with a current-shape parser.
- Keep authenticated direct HTTP requests as the primary fetch strategy.
- Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
- Detect auth-gated, unavailable, and unknown responses explicitly.
- Update tests so they model current route markers and failure modes instead of legacy page objects.

## Non-Goals

- Reworking non-Facebook scrapers.
- Converting the scraper to browser-only automation.
- Preserving old parser behavior for `marketplace_product_details_page` or `__bbox`-driven item extraction.
- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.

## Current State

The current implementation in `packages/core/src/scrapers/facebook.ts` still uses authenticated HTTP requests, which remains the right transport.
The search path parses embedded script JSON and looks for `marketplace_search.feed_units.edges`.
The item-detail path is centered on legacy extraction paths such as:

- `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
- nested `__bbox.require[...]` variations
- recursive search through `parsed.require`

Live evidence gathered earlier in this session and by the isolated research subagent shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:

- `XCometMarketplaceSearchController`
- `XCometMarketplacePermalinkController`
- `"routing_namespace":"fb_comet"`
- `"use_ssr_state_manager":true`
- `ServerJS`
- `Bootloader`
- `data-sjs`
- `data-btmanifest`

The same live investigation also showed that authenticated item pages no longer reliably expose the old `marketplace_product_details_page` marker, while live search still returns usable results.

## Chosen Approach

Use a hybrid Comet-bootstrap parser.

The scraper will:

1. Fetch authenticated HTML directly.
2. Classify the response using current route and auth markers.
3. Parse inline bootstrap/state payloads using route-specific probes.
4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.

This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.
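
The ordering of steps 3 and 4 can be sketched as a small helper. This is a minimal illustration, not the planned implementation: `Extraction`, `extractWithFallback`, and both decoder parameters are hypothetical names introduced here, and the route-marker check is passed in as a boolean so the sketch stays self-contained.

```typescript
// Hypothetical result type: records which path produced the data.
type Extraction<T> =
  | { ok: true; data: T; source: "bootstrap" | "rendered_html" }
  | { ok: false; reason: "no_route_markers" | "no_decodable_data" };

// Hypothetical helper: tries the structured bootstrap decoder first and only
// falls back to rendered-HTML extraction when route markers were present.
function extractWithFallback<T>(
  html: string,
  hasRouteMarkers: boolean,
  decodeBootstrap: (html: string) => T | null,
  decodeRenderedHtml: (html: string) => T | null,
): Extraction<T> {
  // Without route markers the page is not a known route, so no fallback runs.
  if (!hasRouteMarkers) return { ok: false, reason: "no_route_markers" };

  const structured = decodeBootstrap(html);
  if (structured !== null) return { ok: true, data: structured, source: "bootstrap" };

  const rendered = decodeRenderedHtml(html);
  if (rendered !== null) return { ok: true, data: rendered, source: "rendered_html" };

  return { ok: false, reason: "no_decodable_data" };
}
```

Tagging the result with its `source` would let callers and tests tell rich bootstrap data apart from summary-only fallback data.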

## Design

### Route Classification

Add a small response-classification layer before data extraction.
It should identify these states from the fetched response URL and HTML:

- `auth_gated`
- `unavailable`
- `search`
- `item`
- `unknown`

Signals to use:

- final URL containing `/login/`, or HTML containing login-shell text
- final URL containing `unavailable_product=1`
- search controller markers such as `XCometMarketplaceSearchController`
- item controller markers such as `XCometMarketplacePermalinkController`
- shared Comet markers such as `"routing_namespace":"fb_comet"`

This classification layer becomes the top-level contract for both fetch functions.
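
A minimal sketch of such a classifier follows. The names `ResponseClass` and `classifyFacebookResponse` are hypothetical, and the login-shell hints are placeholder assumptions, since this document does not specify what login-shell text looks like. The check order (auth, then unavailable, then route controllers) is one reasonable choice, not a settled rule.

```typescript
// Hypothetical names; not existing exports of facebook.ts.
type ResponseClass = "auth_gated" | "unavailable" | "search" | "item" | "unknown";

const SEARCH_CONTROLLER = "XCometMarketplaceSearchController";
const ITEM_CONTROLLER = "XCometMarketplacePermalinkController";

// Assumed placeholders for "login-shell text"; real pages may need other hints.
const LOGIN_SHELL_HINTS = ['id="login_form"', 'name="pass"'];

function classifyFacebookResponse(finalUrl: string, html: string): ResponseClass {
  // Auth and availability checks run first so a login shell is never parsed as data.
  if (finalUrl.includes("/login/") || LOGIN_SHELL_HINTS.some((hint) => html.includes(hint))) {
    return "auth_gated";
  }
  if (finalUrl.includes("unavailable_product=1")) return "unavailable";

  // Route controllers decide between item and search pages.
  if (html.includes(ITEM_CONTROLLER)) return "item";
  if (html.includes(SEARCH_CONTROLLER)) return "search";
  return "unknown";
}
```

If a single page turns out to carry both controller names, the item-before-search order above would need revisiting, for example by also consulting the requested URL.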

### Search Extraction

The search path will be rewritten around Comet search-route markers.

Primary behavior:

- fetch the Marketplace search HTML with auth cookies
- confirm the response class is `search`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for route-specific search payloads associated with `XCometMarketplaceSearchController`
- map decoded search results into summary listing records

Search summary fields should remain aligned with the current public output shape:

- item URL
- title
- formatted price and normalized cents when possible
- city/address summary when present
- seller summary when present in the search payload
- category/status/media fields only when they are present with stable meaning
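
The "normalized cents when possible" rule could look roughly like the sketch below. `priceToCents` is a hypothetical helper, and it assumes US-style formatting (comma thousands separator, dot decimal point); other locales and non-numeric prices simply return `null`.

```typescript
// Hypothetical helper: converts a formatted price string to integer cents.
// Assumes US-style formatting; returns null when no number can be read.
function priceToCents(formatted: string): number | null {
  // Either a comma-grouped number (1,250) or a plain run of digits (1250),
  // optionally followed by one or two decimal digits.
  const match = formatted.match(/(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{1,2}))?/);
  if (!match) return null;

  const whole = Number((match[1] ?? "0").replace(/,/g, ""));
  // Pad a single decimal digit so "5.5" means 50 cents, not 5.
  const fraction = match[2] ? Number(match[2].padEnd(2, "0")) : 0;
  return whole * 100 + fraction;
}
```

Keeping the arithmetic in integers avoids floating-point rounding when the value is stored or compared.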

Fallback behavior:

- if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
- use item links matching `/marketplace/item/<id>` as the anchor for fallback extraction
- treat fallback results as summary-only data, not rich detail data
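
As a rough illustration of the anchor step in the fallback, the sketch below collects unique item links from rendered HTML. `FallbackLink` and `extractItemLinks` are hypothetical names, and the sketch assumes item IDs are numeric and that `https://www.facebook.com` is the base for canonical URLs. Title and price extraction around each anchor are left out.

```typescript
// Hypothetical shape for a fallback anchor; summary fields would be added later.
interface FallbackLink {
  id: string;
  url: string;
}

// Scans rendered HTML for /marketplace/item/<id> links and de-duplicates by ID.
// Assumes numeric IDs and a www.facebook.com base URL.
function extractItemLinks(html: string): FallbackLink[] {
  const seen = new Map<string, FallbackLink>();
  const pattern = /\/marketplace\/item\/(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const id = match[1];
    if (id !== undefined && !seen.has(id)) {
      seen.set(id, { id, url: `https://www.facebook.com/marketplace/item/${id}/` });
    }
  }
  // Map preserves insertion order, so links come back in page order.
  return Array.from(seen.values());
}
```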

### Item Extraction

The item-detail path will be rewritten around the Comet permalink route.

Primary behavior:

- fetch the item permalink HTML with auth cookies
- confirm the response class is `item`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for permalink payloads associated with `XCometMarketplacePermalinkController`
- decode the richest recoverable item record and map it into `FacebookListingDetails`

Priority item fields:

- item ID and permalink URL
- title
- formatted price and normalized cents when possible
- condition
- description
- listed age / creation date when derivable
- approximate location
- seller name and seller ID when present
- listing status when the payload makes it explicit

Fallback behavior:

- if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
- prioritize title, price, condition, description, location text, and seller module content
- return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing

### Bootstrap Parsing Strategy

The parser should stop assuming a single stable JSON path.
Instead, it should work in two phases:

1. Discover candidate bootstrap payloads.
2. Score candidates against the expected route shape.

Candidate discovery inputs:

- raw `<script>` contents
- `data-sjs` and related page attributes
- `ServerJS` / `Bootloader` inline blobs
- route controller names
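
Candidate discovery from raw `<script>` contents might start out like the following sketch. `discoverCandidates` is a hypothetical name, and scanning HTML with a regular expression is a simplification; a real implementation may prefer an HTML parser and would also need to cover page attributes and non-JSON `ServerJS` / `Bootloader` blobs.

```typescript
// Hypothetical helper: collects JSON payloads found in inline <script> tags.
// Scripts that are not pure JSON (for example executable JavaScript) are skipped.
function discoverCandidates(html: string): unknown[] {
  const candidates: unknown[] = [];
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptPattern.exec(html)) !== null) {
    const body = (match[1] ?? "").trim();
    // Cheap pre-filter: only attempt bodies that look like a JSON object or array.
    if (!body.startsWith("{") && !body.startsWith("[")) continue;
    try {
      candidates.push(JSON.parse(body));
    } catch {
      // Not valid JSON; ignore this script and keep scanning.
    }
  }
  return candidates;
}
```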

Candidate scoring for search should favor objects that contain repeated result-card semantics, item IDs, listing links, titles, prices, or location summaries.
Candidate scoring for item pages should favor objects that contain singular listing semantics, title, price, condition, description, location, seller, or permalink context.

The parser should not depend on one hard-coded object name surviving forever.
Instead, it should look for route-specific semantic clusters and choose the strongest candidate.
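
One simple way to express "strongest semantic cluster" is to count how many distinct hint words appear among a candidate's keys. The sketch below does that for item pages. `scoreCandidate`, `pickBestCandidate`, and the hint list are hypothetical; the hint words come from the field names listed in this document, not from confirmed payload keys, and real scoring for search would also need to reward repeated result cards.

```typescript
// Assumed hint words, taken from the item fields this design cares about.
const ITEM_KEY_HINTS = ["title", "price", "condition", "description", "location", "seller"];

// Counts how many distinct hints match a key anywhere in the decoded payload.
function scoreCandidate(payload: unknown, hints: string[]): number {
  const found = new Set<string>();
  const stack: unknown[] = [payload];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node === null || typeof node !== "object") continue;
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      const lowered = key.toLowerCase();
      for (const hint of hints) {
        if (lowered.includes(hint)) found.add(hint);
      }
      stack.push(value); // descend into nested objects and arrays
    }
  }
  return found.size;
}

// Returns the highest-scoring candidate, or null when nothing matches any hint.
function pickBestCandidate(candidates: unknown[], hints: string[]): unknown {
  let best: unknown = null;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = scoreCandidate(candidate, hints);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Because the score depends on key semantics rather than a fixed path, a payload that moves to a different nesting level would still be found.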

### Legacy Removal

The legacy Facebook parsing paths should be removed as the primary strategy.
Specifically:

- delete old item-detail extraction paths centered on `marketplace_product_details_page`
- delete legacy-first `require` / `__bbox` navigation tables
- delete tests whose only purpose is to preserve those legacy paths

If a minimal legacy compatibility branch remains, it must be a last-resort fallback behind the new route-aware parser and should not shape test fixtures or design decisions.

### Error Handling

Facebook fetches should now fail with explicit route-aware outcomes:

1. Missing/invalid auth cookie input.
2. Auth-gated response.
3. Unavailable or stale item response.
4. Search or item route detected, but no decodable data found.
5. Unknown response shape.

Error messages should name the actual class of failure instead of implying that every parse miss is caused by expired cookies.
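
The five outcomes above map naturally onto a tagged error type. The sketch below is one possible shape; `FacebookFailureKind`, `FacebookScrapeError`, and the message wording are all hypothetical and not part of the current codebase.

```typescript
// Hypothetical failure kinds, one per outcome listed above.
type FacebookFailureKind =
  | "invalid_auth_input"
  | "auth_gated"
  | "unavailable"
  | "no_decodable_data"
  | "unknown_response";

// Illustrative wording; each message names its own failure class.
const FAILURE_MESSAGES: Record<FacebookFailureKind, string> = {
  invalid_auth_input: "Facebook auth cookie input is missing or invalid",
  auth_gated: "Facebook returned a login-gated response",
  unavailable: "Facebook listing is unavailable or stale",
  no_decodable_data: "Facebook route detected, but no decodable data was found",
  unknown_response: "Facebook returned an unrecognized response shape",
};

class FacebookScrapeError extends Error {
  readonly kind: FacebookFailureKind;

  constructor(kind: FacebookFailureKind, detail?: string) {
    super(detail ? `${FAILURE_MESSAGES[kind]}: ${detail}` : FAILURE_MESSAGES[kind]);
    this.name = "FacebookScrapeError";
    this.kind = kind;
  }
}
```

Callers could then branch on `error.kind` instead of matching message text, which keeps a parse miss from being reported as an expired cookie.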

### Testing Strategy

Follow TDD for the rewrite.
Write failing tests for the new route-aware parser before replacing production code.

Coverage targets:

1. Search responses classify correctly from current Comet controller markers.
2. Item responses classify correctly from current Comet controller markers.
3. Login-gated and unavailable responses are detected before parsing.
4. Search bootstrap parsing produces summary listing results from current-shape fixtures.
5. Item bootstrap parsing produces rich listing details from current-shape fixtures.
6. Search fallback extraction works when route markers exist but structured payload decoding fails.
7. Item fallback extraction works when route markers exist but structured payload decoding fails.
8. Old legacy-only item fixtures are removed or rewritten so they no longer define the contract.

Verification targets after implementation:

- `bun test packages/core/test/facebook-core.test.ts`
- `bun test packages/core/test/facebook-integration.test.ts`
- a live authenticated Facebook probe covering search and item routes

## Public API Surface

Keep the current public function names unless the rewrite proves that a signature change is required:

- `fetchFacebookItems(...)`
- `fetchFacebookItem(...)`
- `extractFacebookMarketplaceData(...)`
- `extractFacebookItemData(...)`

The internals should change substantially, but callers should not need a new integration surface for this rewrite.

## Risks

- Facebook may change bootstrap payload naming again. Route/controller markers are more stable than exact nested object paths, but they are still not guaranteed.
- Search and item pages may each contain multiple partial payloads, making candidate ranking important.
- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs clear precedence rules.
- Live fixtures can drift from production quickly, so tests must model route semantics rather than exact one-off payloads where possible.

## Rollout Notes

The code, fixtures, and tests should change together.
There should be no mixed state where the implementation is Comet-aware but the tests still encode `marketplace_product_details_page` as the primary contract.