chore: format markdown
Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
@@ -2,35 +2,46 @@

## Summary

Replace the legacy Facebook Marketplace scraper with a route-aware implementation built
around current Comet bootstrap markers and route-specific extraction.
The new scraper will keep authenticated direct HTTP fetches as the primary transport,
but it will stop treating legacy `require`, `__bbox`, and
`marketplace_product_details_page` structures as the main parsing contract.

## Goals

- Replace both Facebook search and item-detail extraction with a current-shape parser.
- Keep authenticated direct HTTP requests as the primary fetch strategy.
- Parse route-specific Comet bootstrap/state payloads before falling back to
  rendered-HTML extraction.
- Detect auth-gated, unavailable, and unknown responses explicitly.
- Update tests so they model current route markers and failure modes instead of legacy
  page objects.

## Non-Goals

- Reworking non-Facebook scrapers.
- Converting the scraper to browser-only automation.
- Preserving old parser behavior for `marketplace_product_details_page` or
  `__bbox`-driven item extraction.
- Reverse-engineering every internal Facebook bootstrap payload shape exhaustively
  before implementation.

## Current State

The current implementation in `packages/core/src/scrapers/facebook.ts` still uses
authenticated HTTP requests, which remains correct.
The search path parses embedded script JSON and looks for
`marketplace_search.feed_units.edges`. The item-detail path is centered on legacy
extraction paths such as:

- `parsed.require[0][3].__bbox.result.data.viewer.marketplace_product_details_page.target`
- nested `__bbox.require[...]` variations
- recursive search through `parsed.require`

Live evidence gathered earlier in this session and by the isolated research subagent
shows that current Facebook Marketplace pages are Comet route-driven and expose markers
such as:

- `XCometMarketplaceSearchController`
- `XCometMarketplacePermalinkController`
@@ -41,7 +52,9 @@ Live evidence gathered earlier in this session and by the isolated research suba
- `data-sjs`
- `data-btmanifest`

The same live investigation also showed that authenticated item pages no longer expose
the old `marketplace_product_details_page` marker reliably, while live search still
returns usable results.

## Chosen Approach

@@ -52,9 +65,11 @@ The scraper will:
1. Fetch authenticated HTML directly.
2. Classify the response using current route and auth markers.
3. Parse inline bootstrap/state payloads using route-specific probes.
4. Fall back to rendered-HTML extraction only when bootstrap markers are present but the
   payload cannot be decoded into the expected search or item shape.

This keeps the cheaper direct-HTTP transport while shifting the parser contract from
legacy page-object names to current Comet route structure.
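
As a rough illustration of the classification step, the sketch below checks a fetched HTML string for the two Comet controller markers named in this document. The `ResponseClass` type, the `classifyResponse` function, and the login-form pattern used as an auth marker are all assumptions made for this example; the real auth and unavailable markers would need to come from live fixtures.

```typescript
// Hypothetical response classes; the names are illustrative only.
type ResponseClass = "search" | "item" | "auth_gated" | "unknown";

// Route markers named in this document.
const SEARCH_MARKER = "XCometMarketplaceSearchController";
const ITEM_MARKER = "XCometMarketplacePermalinkController";

// Assumption: a login form in the body signals an auth-gated response.
const AUTH_GATE_PATTERN = /<form[^>]+id="login_form"/i;

function classifyResponse(html: string): ResponseClass {
  // Auth gating is checked first so it is detected before any parsing.
  if (AUTH_GATE_PATTERN.test(html)) return "auth_gated";
  if (html.includes(ITEM_MARKER)) return "item";
  if (html.includes(SEARCH_MARKER)) return "search";
  return "unknown";
}
```

Keeping classification separate from decoding means a parse miss on a correctly classified page can be reported as such, rather than being mistaken for an auth problem.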

## Design

@@ -88,7 +103,8 @@ Primary behavior:
- fetch the Marketplace search HTML with auth cookies
- confirm the response class is `search`
- extract inline bootstrap/state blobs from script tags and page attributes
- probe for route-specific search payloads associated with
  `XCometMarketplaceSearchController`
- map decoded search results into summary listing records
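
The blob-extraction step could look roughly like the sketch below, which collects the JSON bodies of script tags carrying the `data-sjs` attribute. The function name `extractSjsBlobs` is hypothetical, and a regular expression is used only for brevity; it assumes attribute values do not contain `>`, so a real implementation might prefer an HTML parser.

```typescript
// Collect parsed JSON bodies from <script ... data-sjs ...> tags.
function extractSjsBlobs(html: string): unknown[] {
  const blobs: unknown[] = [];
  const scriptPattern = /<script\b([^>]*)>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptPattern.exec(html)) !== null) {
    const attrs = match[1];
    const body = match[2];
    // Only scripts flagged with the data-sjs attribute are candidates.
    if (!/\bdata-sjs\b/.test(attrs)) continue;
    try {
      blobs.push(JSON.parse(body));
    } catch {
      // Skip blobs that are not valid JSON instead of failing the whole page.
    }
  }
  return blobs;
}
```

Returning `unknown[]` forces the route-specific probes to validate each blob's shape before using it.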

Search summary fields should remain aligned with the current public output shape:
@@ -102,7 +118,8 @@ Search summary fields should remain aligned with the current public output shape

Fallback behavior:

- if search route markers are present but structured payload decoding fails, extract
  listing summaries from rendered HTML anchors and text patterns
- use item links matching `/marketplace/item/<id>` as the anchor for fallback extraction
- treat fallback results as summary-only data, not rich detail data
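
A minimal sketch of the link-anchored fallback is shown below. It scans rendered HTML for `href` values containing `/marketplace/item/<id>` and returns one summary per distinct ID. The `FallbackListingSummary` shape and the `extractFallbackItemLinks` name are hypothetical, and the sketch assumes item IDs are numeric; title, price, and location extraction from the surrounding text is left out.

```typescript
// Hypothetical summary-only record produced by the fallback path.
interface FallbackListingSummary {
  id: string;
  url: string;
}

function extractFallbackItemLinks(html: string): FallbackListingSummary[] {
  const seen = new Set<string>();
  const results: FallbackListingSummary[] = [];
  // Assumption: item IDs are numeric and appear directly after /marketplace/item/.
  const linkPattern = /href="([^"]*\/marketplace\/item\/(\d+)[^"]*)"/g;
  let match: RegExpExecArray | null;
  while ((match = linkPattern.exec(html)) !== null) {
    const id = match[2];
    // The same listing is often linked more than once; keep the first hit only.
    if (seen.has(id)) continue;
    seen.add(id);
    results.push({ id, url: `https://www.facebook.com/marketplace/item/${id}/` });
  }
  return results;
}
```

Normalizing the URL from the ID rather than reusing the raw `href` drops tracking query parameters from the fallback output.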
@@ -132,9 +149,12 @@ Priority item fields:

Fallback behavior:

- if permalink route markers are present but no stable payload object is decodable,
  extract data from rendered HTML text structure
- prioritize title, price, condition, description, location text, and seller module
  content
- return partial item data when core user-facing fields are present rather than failing
  solely because deeper commerce metadata is missing

### Bootstrap Parsing Strategy

@@ -151,11 +171,14 @@ Candidate discovery inputs:
- `ServerJS` / `Bootloader` inline blobs
- route controller names

Candidate scoring for search should favor objects that contain repeated result-card
semantics, item IDs, listing links, titles, prices, or location summaries.
Candidate scoring for item pages should favor objects that contain singular listing
semantics, title, price, condition, description, location, seller, or permalink context.

The parser should not depend on one hard-coded object name surviving forever.
Instead, it should look for route-specific semantic clusters and choose the strongest
candidate.
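
One way to read "choose the strongest candidate" for item pages is sketched below: walk a decoded blob, score every nested object by how many item-like key hints it carries, and keep the best one above a threshold. The hint list, the substring matching on key names, the `minScore` default, and all function names are assumptions for illustration; real scoring would be tuned against current-shape fixtures.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type JsonRecord = { [key: string]: Json };

// Hypothetical key hints derived from the item semantics listed above.
const ITEM_HINTS = ["title", "price", "condition", "description", "location", "seller"];

function isRecord(value: Json): value is JsonRecord {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Count how many distinct hints appear in the candidate's own key names.
function scoreItemCandidate(candidate: JsonRecord): number {
  const keys = Object.keys(candidate).map((key) => key.toLowerCase());
  return ITEM_HINTS.filter((hint) => keys.some((key) => key.includes(hint))).length;
}

// Depth-first collection of every object nested anywhere in the blob.
function collectRecords(root: Json, out: JsonRecord[] = []): JsonRecord[] {
  if (Array.isArray(root)) {
    for (const child of root) collectRecords(child, out);
  } else if (isRecord(root)) {
    out.push(root);
    for (const key of Object.keys(root)) collectRecords(root[key], out);
  }
  return out;
}

function pickStrongestItemCandidate(root: Json, minScore = 2): JsonRecord | null {
  let best: JsonRecord | null = null;
  let bestScore = 0;
  for (const record of collectRecords(root)) {
    const score = scoreItemCandidate(record);
    // Strictly greater: on ties the first candidate encountered wins.
    if (score > bestScore) {
      best = record;
      bestScore = score;
    }
  }
  return bestScore >= minScore ? best : null;
}
```

Because the score depends on clusters of field names rather than on a fixed path, a renamed wrapper object does not break selection as long as the listing fields keep recognizable names.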

### Legacy Removal

@@ -166,7 +189,9 @@ Specifically:
- delete legacy-first `require` / `__bbox` navigation tables
- delete tests whose only purpose is to preserve those legacy paths

If a minimal legacy compatibility branch remains, it must be a last-resort fallback
behind the new route-aware parser and should not shape test fixtures or design
decisions.

### Error Handling

@@ -178,7 +203,8 @@ Facebook responses should now fail with explicit route-aware outcomes:
4. Search or item route detected, but no decodable data found.
5. Unknown response shape.

Error messages should name the actual class of failure instead of implying that every
parse miss is caused by expired cookies.
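
A discriminated union is one possible way to keep failure classes explicit. The sketch below is an assumption, not the planned implementation: the `ScrapeFailure` type, its `kind` values, and the message wording are hypothetical, and the kinds are drawn only from the failure classes this document names (auth-gated, unavailable, route detected without decodable data, unknown shape).

```typescript
// Hypothetical failure classes; names and messages are illustrative only.
type ScrapeFailure =
  | { kind: "auth_gated" }
  | { kind: "unavailable" }
  | { kind: "route_without_data"; route: "search" | "item" }
  | { kind: "unknown_shape" };

function describeFailure(failure: ScrapeFailure): string {
  switch (failure.kind) {
    case "auth_gated":
      return "Facebook returned a login-gated response; cookies may be missing or expired.";
    case "unavailable":
      return "Facebook reported the requested content as unavailable.";
    case "route_without_data":
      return `Facebook ${failure.route} route detected, but no decodable data was found.`;
    case "unknown_shape":
      return "Facebook returned an unrecognized response shape.";
  }
}
```

With this shape, only the `auth_gated` branch mentions cookies, so other parse misses no longer point callers at credentials.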

### Testing Strategy

@@ -190,11 +216,15 @@ Coverage targets:
1. Search responses classify correctly from current Comet controller markers.
2. Item responses classify correctly from current Comet controller markers.
3. Login-gated and unavailable responses are detected before parsing.
4. Search bootstrap parsing produces summary listing results from current-shape
   fixtures.
5. Item bootstrap parsing produces rich listing details from current-shape fixtures.
6. Search fallback extraction works when route markers exist but structured payload
   decoding fails.
7. Item fallback extraction works when route markers exist but structured payload
   decoding fails.
8. Old legacy-only item fixtures are removed or rewritten so they no longer define the
   contract.

Verification target after implementation:

@@ -204,23 +234,30 @@ Verification target after implementation:

## Public API Surface

Keep the current public function names unless the rewrite proves that a signature change
is required:

- `fetchFacebookItems(...)`
- `fetchFacebookItem(...)`
- `extractFacebookMarketplaceData(...)`
- `extractFacebookItemData(...)`

The internals should change substantially, but callers should not need a new integration
surface for this rewrite.

## Risks

- Facebook may change bootstrap payload naming again, so route/controller markers are
  more stable than exact nested object paths but still not guaranteed.
- Search and item pages may each contain multiple partial payloads, making candidate
  ranking important.
- Fallback rendered-HTML extraction may be noisier than bootstrap decoding and needs
  clear precedence rules.
- Live fixtures can drift from production quickly, so tests must model route semantics
  rather than exact one-off payloads where possible.

## Rollout Notes

The code, fixtures, and tests should change together.
There should be no mixed state where the implementation is Comet-aware but the tests
still encode `marketplace_product_details_page` as the primary contract.