chore: format markdown

Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
2026-05-01 11:42:54 -04:00
parent d2c3c07e7d
commit 7ab33d0b02
15 changed files with 925 additions and 417 deletions
--- a/docs/superpowers/specs/2026-04-21-facebook-comet-rewrite-design.md
+++ b/docs/superpowers/specs/2026-04-21-facebook-comet-rewrite-design.md
@@ -2,35 +2,46 @@

 ## Summary

-Replace the legacy Facebook Marketplace scraper with a route-aware implementation built around current Comet bootstrap markers and route-specific extraction.
-The new scraper will keep authenticated direct HTTP fetches as the primary transport, but it will stop treating legacy `require`, `__bbox`, and `marketplace_product_details_page` structures as the main parsing contract.
+Replace the legacy Facebook Marketplace scraper with a route-aware implementation built
+around current Comet bootstrap markers and route-specific extraction.
+The new scraper will keep authenticated direct HTTP fetches as the primary transport,
+but it will stop treating legacy `require`, `__bbox`, and
+`marketplace_product_details_page` structures as the main parsing contract.

 ## Goals

 - Replace both Facebook search and item-detail extraction with a current-shape parser.
 - Keep authenticated direct HTTP requests as the primary fetch strategy.
- Parse route-specific Comet bootstrap/state payloads before falling back to rendered-HTML extraction.
+- Parse route-specific Comet bootstrap/state payloads before falling back to
+  rendered-HTML extraction.
 - Detect auth-gated, unavailable, and unknown responses explicitly.
- Update tests so they model current route markers and failure modes instead of legacy page objects.
+- Update tests so they model current route markers and failure modes instead of legacy
+  page objects.

 ## Non-Goals

 - Reworking non-Facebook scrapers.
 - Converting the scraper to browser-only automation.
- Preserving old parser behavior for `marketplace_product_details_page` or `__bbox`-driven item extraction.
- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively before implementation.
+- Preserving old parser behavior for `marketplace_product_details_page` or
+  `__bbox`-driven item extraction.
+- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively
+  before implementation.

 ## Current State

-The current implementation in `packages/core/src/scrapers/facebook.ts` still uses authenticated HTTP requests, which remains correct.
-The search path parses embedded script JSON and looks for `marketplace_search.feed_units.edges`.
-The item-detail path is centered on legacy extraction paths such as:
+The current implementation in `packages/core/src/scrapers/facebook.ts` still uses
+authenticated HTTP requests, which remains correct.
+The search path parses embedded script JSON and looks for
+`marketplace_search.feed_units.edges`. The item-detail path is centered on legacy
+extraction paths such as:

 - `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
 - nested `__bbox.require[...]` variations
 - recursive search through `parsed.require`

-Live evidence gathered earlier in this session and by the isolated research subagent shows that current Facebook Marketplace pages are Comet route-driven and expose markers such as:
+Live evidence gathered earlier in this session and by the isolated research subagent
+shows that current Facebook Marketplace pages are Comet route-driven and expose markers
+such as:

 - `XCometMarketplaceSearchController`
 - `XCometMarketplacePermalinkController`
@@ -41,7 +52,9 @@ Live evidence gathered earlier in this session and by the isolated research suba
 - `data-sjs`
 - `data-btmanifest`

-The same live investigation also showed that authenticated item pages no longer expose the old `marketplace_product_details_page` marker reliably, while live search still returns usable results.
+The same live investigation also showed that authenticated item pages no longer expose
+the old `marketplace_product_details_page` marker reliably, while live search still
+returns usable results.

 ## Chosen Approach

@@ -52,9 +65,11 @@ The scraper will:
 1. Fetch authenticated HTML directly.
 2. Classify the response using current route and auth markers.
 3. Parse inline bootstrap/state payloads using route-specific probes.
-4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the payload cannot be decoded into the expected search or item shape.
+4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the
+   payload cannot be decoded into the expected search or item shape.

-This keeps the cheaper direct-HTTP transport while shifting the parser contract from legacy page-object names to current Comet route structure.
+This keeps the cheaper direct-HTTP transport while shifting the parser contract from
+legacy page-object names to current Comet route structure.
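
Step 2 of the list above (classify before parsing) could look roughly like the following. This is a minimal sketch, not the code in `facebook.ts`: the two controller names come from the spec, while `ResponseClass`, `classifyResponse`, and the login and unavailable marker strings are hypothetical placeholders that have not been checked against live pages.

```typescript
// Hypothetical response classes; the spec names the outcomes, not these identifiers.
type ResponseClass = "search" | "item" | "login" | "unavailable" | "unknown";

// Controller names are taken from the spec.
const SEARCH_MARKER = "XCometMarketplaceSearchController";
const ITEM_MARKER = "XCometMarketplacePermalinkController";

// ASSUMPTION: placeholder marker strings for illustration only.
const LOGIN_MARKERS = ['id="login_form"', "/login/?next="];
const UNAVAILABLE_MARKERS = ["This listing is no longer available"];

function containsAny(html: string, markers: string[]): boolean {
  return markers.some((marker) => html.indexOf(marker) !== -1);
}

function classifyResponse(html: string): ResponseClass {
  // Auth and availability are checked first so gated pages never reach a parser.
  if (containsAny(html, LOGIN_MARKERS)) return "login";
  if (containsAny(html, UNAVAILABLE_MARKERS)) return "unavailable";
  if (html.indexOf(SEARCH_MARKER) !== -1) return "search";
  if (html.indexOf(ITEM_MARKER) !== -1) return "item";
  return "unknown";
}
```

Checking the gate markers before the route markers means a login page that happens to embed a controller name is still reported as `login` rather than parsed.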

 ## Design

@@ -88,7 +103,8 @@ Primary behavior:
 - fetch the Marketplace search HTML with auth cookies
 - confirm the response class is `search`
 - extract inline bootstrap/state blobs from script tags and page attributes
- probe for route-specific search payloads associated with `XCometMarketplaceSearchController`
+- probe for route-specific search payloads associated with
+  `XCometMarketplaceSearchController`
 - map decoded search results into summary listing records

 Search summary fields should remain aligned with the current public output shape:
@@ -102,7 +118,8 @@ Search summary fields should remain aligned with the current public output shape

 Fallback behavior:

- if search route markers are present but structured payload decoding fails, extract listing summaries from rendered HTML anchors and text patterns
+- if search route markers are present but structured payload decoding fails, extract
+  listing summaries from rendered HTML anchors and text patterns
 - use item links matching `/marketplace/item/<id>` as the anchor for fallback extraction
 - treat fallback results as summary-only data, not rich detail data
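
The anchor-based fallback described above can be illustrated as follows. This is a sketch under assumptions: `FallbackSummary` and `extractItemLinks` are hypothetical names, and the listing ID is assumed to be numeric. It only collects the `/marketplace/item/<id>` anchors; title and price extraction from surrounding text is left out.

```typescript
// Hypothetical summary-only record for fallback extraction.
interface FallbackSummary {
  id: string;
  path: string;
}

function extractItemLinks(html: string): FallbackSummary[] {
  // ASSUMPTION: listing IDs are numeric.
  const pattern = /\/marketplace\/item\/(\d+)/g;
  const seen: Record<string, boolean> = {};
  const results: FallbackSummary[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const id = match[1];
    // The same listing is often linked more than once; keep the first occurrence.
    if (!id || seen[id]) continue;
    seen[id] = true;
    results.push({ id, path: "/marketplace/item/" + id });
  }
  return results;
}
```

Deduplicating by ID keeps the result summary-only and stable in document order.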

@@ -132,9 +149,12 @@ Priority item fields:

 Fallback behavior:

- if permalink route markers are present but no stable payload object is decodable, extract data from rendered HTML text structure
- prioritize title, price, condition, description, location text, and seller module content
- return partial item data when core user-facing fields are present rather than failing solely because deeper commerce metadata is missing
+- if permalink route markers are present but no stable payload object is decodable,
+  extract data from rendered HTML text structure
+- prioritize title, price, condition, description, location text, and seller module
+  content
+- return partial item data when core user-facing fields are present rather than failing
+  solely because deeper commerce metadata is missing

 ### Bootstrap Parsing Strategy

@@ -151,11 +171,14 @@ Candidate discovery inputs:
 - `ServerJS` / `Bootloader` inline blobs
 - route controller names

-Candidate scoring for search should favor objects that contain repeated result-card semantics, item IDs, listing links, titles, prices, or location summaries.
-Candidate scoring for item pages should favor objects that contain singular listing semantics, title, price, condition, description, location, seller, or permalink context.
+Candidate scoring for search should favor objects that contain repeated result-card
+semantics, item IDs, listing links, titles, prices, or location summaries.
+Candidate scoring for item pages should favor objects that contain singular listing
+semantics, title, price, condition, description, location, seller, or permalink context.

 The parser should not depend on one hard-coded object name surviving forever.
-Instead, it should look for route-specific semantic clusters and choose the strongest candidate.
+Instead, it should look for route-specific semantic clusters and choose the strongest
+candidate.
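
One way to picture "choose the strongest candidate" for item pages is a key-hint score over every object in a decoded payload. The sketch below is illustrative only: `Json`, `ITEM_HINTS`, `scoreItemCandidate`, `collectObjects`, and `pickBestItemCandidate` are hypothetical names, and the hint words are borrowed from the field list in the spec rather than from real payload keys.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// ASSUMPTION: hint words taken from the spec's field list, not from real payload keys.
const ITEM_HINTS = ["title", "price", "condition", "description", "location", "seller"];

// Score one object by how many distinct hints appear among its own key names.
function scoreItemCandidate(value: Json): number {
  if (value === null || typeof value !== "object" || Array.isArray(value)) return 0;
  const keys = Object.keys(value).map((key) => key.toLowerCase());
  let score = 0;
  for (const hint of ITEM_HINTS) {
    if (keys.some((key) => key.indexOf(hint) !== -1)) score += 1;
  }
  return score;
}

// Walk the payload depth-first and collect every plain object.
function collectObjects(value: Json, out: Json[] = []): Json[] {
  if (value === null || typeof value !== "object") return out;
  if (Array.isArray(value)) {
    for (const child of value) collectObjects(child, out);
    return out;
  }
  out.push(value);
  for (const key of Object.keys(value)) {
    const child = value[key];
    if (child !== undefined) collectObjects(child, out);
  }
  return out;
}

// Return the highest-scoring object, or null when nothing matches any hint.
function pickBestItemCandidate(root: Json): { candidate: Json; score: number } | null {
  let best: { candidate: Json; score: number } | null = null;
  for (const candidate of collectObjects(root)) {
    const score = scoreItemCandidate(candidate);
    if (score > 0 && (best === null || score > best.score)) best = { candidate, score };
  }
  return best;
}
```

Because the score depends on key semantics rather than a fixed path, a renamed wrapper object does not by itself break selection. A search-side scorer would instead reward arrays of similar objects.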

 ### Legacy Removal

@@ -166,7 +189,9 @@ Specifically:
 - delete legacy-first `require` / `__bbox` navigation tables
 - delete tests whose only purpose is to preserve those legacy paths

-If a minimal legacy compatibility branch remains, it must be a last-resort fallback behind the new route-aware parser and should not shape test fixtures or design decisions.
+If a minimal legacy compatibility branch remains, it must be a last-resort fallback
+behind the new route-aware parser and should not shape test fixtures or design
+decisions.

 ### Error Handling

@@ -178,7 +203,8 @@ Facebook responses should now fail with explicit route-aware outcomes:
 4. Search or item route detected, but no decodable data found.
 5. Unknown response shape.

-Error messages should name the actual class of failure instead of implying that every parse miss is caused by expired cookies.
+Error messages should name the actual class of failure instead of implying that every
+parse miss is caused by expired cookies.
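
A discriminated union is one way to keep failure messages tied to the actual failure class. The sketch below is hedged: this diff shows only outcomes 4 and 5 of the list, so the `auth_gated` and `unavailable` variants are inferred from the Goals section, and `FacebookFailure` and `describeFailure` are hypothetical names.

```typescript
// ASSUMPTION: variant names are illustrative; only the last two map to list items
// visible in this diff, the first two are inferred from the Goals section.
type FacebookFailure =
  | { kind: "auth_gated" }
  | { kind: "unavailable" }
  | { kind: "no_decodable_data"; route: "search" | "item" }
  | { kind: "unknown_shape" };

function describeFailure(failure: FacebookFailure): string {
  switch (failure.kind) {
    case "auth_gated":
      return "Facebook returned a login-gated page; cookies may be missing or expired.";
    case "unavailable":
      return "Facebook reported the listing or page as unavailable.";
    case "no_decodable_data":
      return "Detected the " + failure.route + " route, but found no decodable data.";
    case "unknown_shape":
      return "Facebook returned an unrecognized response shape.";
  }
}
```

Only the `auth_gated` branch mentions cookies, so a parse miss on a valid route is no longer reported as an authentication problem.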

 ### Testing Strategy

@@ -190,11 +216,15 @@ Coverage targets:
 1. Search responses classify correctly from current Comet controller markers.
 2. Item responses classify correctly from current Comet controller markers.
 3. Login-gated and unavailable responses are detected before parsing.
-4. Search bootstrap parsing produces summary listing results from current-shape fixtures.
+4. Search bootstrap parsing produces summary listing results from current-shape
+   fixtures.
 5. Item bootstrap parsing produces rich listing details from current-shape fixtures.
-6. Search fallback extraction works when route markers exist but structured payload decoding fails.
-7. Item fallback extraction works when route markers exist but structured payload decoding fails.
-8. Old legacy-only item fixtures are removed or rewritten so they no longer define the contract.
+6. Search fallback extraction works when route markers exist but structured payload
+   decoding fails.
+7. Item fallback extraction works when route markers exist but structured payload
+   decoding fails.
+8. Old legacy-only item fixtures are removed or rewritten so they no longer define the
+   contract.

 Verification target after implementation:

@@ -204,23 +234,30 @@ Verification target after implementation:

 ## Public API Surface

-Keep the current public function names unless the rewrite proves that a signature change is required:
+Keep the current public function names unless the rewrite proves that a signature change
+is required:

 - `fetchFacebookItems(...)`
 - `fetchFacebookItem(...)`
 - `extractFacebookMarketplaceData(...)`
 - `extractFacebookItemData(...)`

-The internals should change substantially, but callers should not need a new integration surface for this rewrite.
+The internals should change substantially, but callers should not need a new integration
+surface for this rewrite.

 ## Risks

- Facebook may change bootstrap payload naming again, so route/controller markers are more stable than exact nested object paths but still not guaranteed.
- Search and item pages may each contain multiple partial payloads, making candidate ranking important.
- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs clear precedence rules.
- Live fixtures can drift from production quickly, so tests must model route semantics rather than exact one-off payloads where possible.
+- Facebook may change bootstrap payload naming again, so route/controller markers are
+  more stable than exact nested object paths but still not guaranteed.
+- Search and item pages may each contain multiple partial payloads, making candidate
+  ranking important.
+- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs
+  clear precedence rules.
+- Live fixtures can drift from production quickly, so tests must model route semantics
+  rather than exact one-off payloads where possible.

 ## Rollout Notes

 The code, fixtures, and tests should change together.
-There should be no mixed state where the implementation is Comet-aware but the tests still encode `marketplace_product_details_page` as the primary contract.
+There should be no mixed state where the implementation is Comet-aware but the tests
+still encode `marketplace_product_details_page` as the primary contract.