ca-marketplace-scraper/docs/superpowers/specs/2026-05-01-facebook-anti-bot-challenge-solver-design.md
Dmytro Stanchiev ec545723bb feat(facebook): add challenge detection and session warming utilities
facebook-challenge.ts: session warmup, header construction, and challenge type detection. Spec document for the anti-bot challenge solver design.
2026-05-02 19:03:00 -04:00

# Facebook Marketplace Anti-Bot Challenge Solver Design
## Summary
Add a challenge-detection and challenge-solving layer to the Facebook Marketplace
scraper so it can handle anti-bot gates (checkpoint pages, token rotation, cookie
requirements) programmatically.
Build the solver in pure Bun — no browser automation in production.
Use `agent-browser` only for one-time debug reconnaissance.
## Goals
- Identify which anti-bot challenge(s) Facebook Marketplace triggers against
programmatic HTTP requests.
- Implement detection + solving for each discovered challenge type.
- Wire the solver into `fetchFacebookItems` and `fetchFacebookItem` so challenges are
handled transparently.
- Follow the same pattern as the existing `ebay-challenge.ts` (detect → solve → retry
with clearance).
- Zero browser automation at runtime.
Pure `fetch` + `Bun` APIs + npm packages only.
## Non-Goals
- Solving login/auth-wall challenges (those require fresh cookies — not solvable
programmatically).
- Full account login automation (cookies must be provided by the user).
- Browser-based scraping or Puppeteer/Playwright integration.
- Solving challenges for non-Marketplace Facebook endpoints.
## Current State
The Facebook scraper (`packages/core/src/scrapers/facebook.ts`) fetches Marketplace
search and item pages via authenticated `fetch`, using cookies from the
`FACEBOOK_COOKIE` environment variable. It:
- Sends a browser-like header set (`sec-ch-ua`, `user-agent`, etc.)
- Parses SSR HTML for embedded JSON in script tags
- Has no challenge detection: if Facebook returns a challenge page, the scraper
fails silently (no listings are parsed and the result is classified as “unknown”)
- Depends entirely on cookie freshness
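The browser-like header set from the first bullet could be assembled by a small helper. This is a minimal sketch, not the scraper's actual code: the function name `buildFacebookHeaders`, the Chrome version, and every header value are illustrative assumptions, and the real list should be taken from the reconnaissance HAR.

```typescript
// Hypothetical helper: the name and all header values below are illustrative assumptions.
function buildFacebookHeaders(cookie: string): Record<string, string> {
  const headers: Record<string, string> = {
    "user-agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
  };
  // Attach the cookie header only when a non-empty value was provided
  const trimmed = cookie.trim();
  if (trimmed.length > 0) headers["cookie"] = trimmed;
  return headers;
}
```

Passing the cookie in as a parameter, rather than reading the environment inside the helper, keeps the function pure and easy to unit test.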
The eBay scraper already follows the challenge-solver pattern in this codebase:
`ebay.ts` uses `warmEbaySession()`, `isChallengeRedirect()`, `isChallengeHtml()`, and
`solveEbayChallenge()` from `ebay-challenge.ts`.
## Chosen Approach
**Reconnaissance-first development:**
1. Use `agent-browser` (debug only) to capture a real Facebook Marketplace browsing
session via HAR.
2. Probe programmatic `fetch` to see what Facebook returns without a browser.
3. Diff the two to identify the gap (missing headers? missing cookies? missing JS
execution?).
4. Build a modular solver in `packages/core/src/utils/facebook-challenge.ts` that
detects each challenge type and applies the appropriate fix.
5. Wire it into `facebook.ts` following the eBay pattern.
## Design
### File Plan
| File | Purpose |
| --- | --- |
| `packages/core/src/utils/facebook-challenge.ts` | Challenge detection, solving, and cookie/session utilities |
| `packages/core/src/scrapers/facebook.ts` | Modified: warmup, challenge detection before parsing, retry loop |
| `packages/core/test/facebook-challenge.test.ts` | Unit tests with mock challenge HTML fixtures |
### Flow
```
fetchFacebookItems(searchUrl)
├── warmFacebookSession() → GET facebook.com/ (collect datr + Akamai cookies)
├── fetchHtml(searchUrl) → receives response
├── detectFacebookChallenge(response)
│ ├── checkpoint/challenge HTML → solveCheckpointChallenge()
│ ├── redirect to /login → fail (cookies expired)
│ ├── missing required cookies → regenerate session
│ ├── 429 rate limit → backoff + retry (existing http.ts handles this)
│ └── no challenge → proceed to parsing
├── if solveCheckpointChallenge succeeds → retry fetchHtml with clearance cookie
└── parse results
```
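The flow above can be sketched as a small retry loop. This is an illustration under assumptions, not the planned implementation: the `FlowDeps` interface, the `ChallengeType` member names, the `maxAttempts` parameter, and the function name `fetchWithChallengeHandling` are all hypothetical, and dependencies are injected so the loop can run against stubs. Rate limiting (429) is left out because the spec delegates it to `http.ts`.

```typescript
// Hypothetical names throughout; only the detect → solve → retry shape comes from the spec.
type ChallengeType = "login_wall" | "checkpoint" | "missing_cookies";
type Cookies = Record<string, string>;

interface PageResponse { status: number; url: string; html: string }
interface ChallengeResult { solved: boolean; cookies?: Cookies; error?: string }

interface FlowDeps {
  warm: () => Promise<Cookies>;
  fetchHtml: (url: string, cookies: Cookies) => Promise<PageResponse>;
  detect: (res: PageResponse) => ChallengeType | null;
  solve: (html: string, cookies: Cookies) => Promise<ChallengeResult>;
  parse: (html: string) => string[];
}

async function fetchWithChallengeHandling(
  url: string,
  deps: FlowDeps,
  maxAttempts = 2,
): Promise<string[]> {
  let cookies = await deps.warm();
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await deps.fetchHtml(url, cookies);
    const challenge = deps.detect(res);
    if (challenge === null) return deps.parse(res.html); // no challenge: parse results
    if (challenge === "login_wall") return []; // cookies expired: not solvable here
    if (challenge === "missing_cookies") {
      // Regenerate the session, then retry on the next iteration
      cookies = { ...cookies, ...(await deps.warm()) };
      continue;
    }
    // Checkpoint: solve, then retry with the clearance cookies merged in
    const result = await deps.solve(res.html, cookies);
    if (!result.solved) return [];
    cookies = { ...cookies, ...(result.cookies ?? {}) };
  }
  return []; // attempts exhausted: return empty results instead of throwing
}
```

Capping the number of attempts prevents an endless loop when Facebook keeps serving the same challenge after a supposedly successful solve.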
### Challenge Types (to be confirmed by reconnaissance)
| Type | Expected Signal | Solving Strategy |
| --- | --- | --- |
| Login wall | Redirect to `/login` or HTML `"You must log in"` | Fail — user must provide fresh cookies |
| Checkpoint page | HTML contains `checkpoint` or `challenge` path | Parse hidden form fields, compute proof-of-work if present, submit answer endpoint |
| `datr` cookie missing | No `datr` in cookie jar → request fails | Fetch homepage first to obtain `datr` (session warmup) |
| DTSG token needed | Form submissions fail with CSRF error | Extract `fb_dtsg` from page HTML, include in request body |
| GraphQL header check | Request blocked without internal headers | Extract `x-fb-friendly-name` from browser HAR, replicate |
| Akamai/bot-manager | Redirect loops or blank pages without Akamai cookies | Homepage warmup to collect `bm_sv`, `bm_mi`, etc. |
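A detector for the first two rows of the table might look like the sketch below. It assumes the signals listed above turn out to be accurate, which reconnaissance still has to confirm. The `ChallengeType` member names and the `pathOf` helper are hypothetical. Cookie-related rows are omitted because they depend on the cookie jar, which is not part of this signature.

```typescript
// Hypothetical member names; the spec does not define ChallengeType yet.
type ChallengeType = "login_wall" | "checkpoint" | "rate_limited";

// Resolve relative redirect targets against the Facebook origin and return the path
function pathOf(target: string): string {
  try {
    return new URL(target, "https://www.facebook.com").pathname;
  } catch {
    return "";
  }
}

function detectFacebookChallenge(
  html: string,
  status: number,
  url: string,
  headers: Record<string, string | undefined>,
): ChallengeType | null {
  if (status === 429) return "rate_limited";

  // On a 3xx response the destination is in `location`; otherwise use the final URL
  const isRedirect = status >= 300 && status < 400;
  const path = pathOf(isRedirect ? headers["location"] ?? url : url);
  if (path.startsWith("/login")) return "login_wall";
  if (path.startsWith("/checkpoint") || path.startsWith("/challenge")) return "checkpoint";

  if (html.includes("You must log in")) return "login_wall";
  // Match a form posting to a checkpoint/challenge path rather than the bare word,
  // so listings that merely mention "checkpoint" are not flagged
  if (/<form\b[^>]*\baction=["'][^"']*\/(?:checkpoint|challenge)\b/i.test(html)) {
    return "checkpoint";
  }
  return null;
}
```

Checking the form `action` instead of searching for the word `checkpoint` anywhere in the HTML is a deliberate choice here to limit false positives on ordinary search results.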
### Key Modules
**`facebook-challenge.ts`:**
```
// Session warmup — fetch homepage to prime cookies
warmFacebookSession(): Promise<Record<string, string>>
// Challenge detection
detectFacebookChallenge(html, status, url, headers): ChallengeType | null
// Checkpoint solver
solveCheckpointChallenge(html, cookies): Promise<ChallengeResult>
// DTSG token extraction
extractDtsg(html): string | null
// Cookie jar management (shared with ebay.ts pattern)
mergeCookies(...): Record<string, string>
```
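The cookie utilities listed above could be sketched as follows. This is one possible shape under stated assumptions: the `CookieJar` alias, `parseSetCookies`, `toCookieHeader`, the variadic `mergeCookies` parameters, and the injected `getSetCookies` function are all hypothetical (the spec's `warmFacebookSession()` takes no arguments). The raw `Set-Cookie` values would come from the homepage response, for example via `Headers.getSetCookie()` where the runtime provides it.

```typescript
type CookieJar = Record<string, string>;

// Keep only the leading name=value pair of each Set-Cookie value; attributes are ignored
function parseSetCookies(setCookieHeaders: string[]): CookieJar {
  const jar: CookieJar = {};
  for (const header of setCookieHeaders) {
    const pair = header.split(";", 1)[0] ?? "";
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip malformed entries that have no cookie name
    jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  return jar;
}

// Later jars override earlier ones (assumed precedence rule)
function mergeCookies(...jars: CookieJar[]): CookieJar {
  const merged: CookieJar = {};
  for (const jar of jars) Object.assign(merged, jar);
  return merged;
}

// Serialize a jar into a Cookie request header value
function toCookieHeader(jar: CookieJar): string {
  return Object.entries(jar)
    .map(([name, value]) => `${name}=${value}`)
    .join("; ");
}

// The fetcher is injected for testability; it returns the homepage's raw Set-Cookie values
async function warmFacebookSession(
  getSetCookies: (url: string) => Promise<string[]>,
  provided: CookieJar = {},
): Promise<CookieJar> {
  const fresh = parseSetCookies(await getSetCookies("https://www.facebook.com/"));
  // Assumption: user-provided cookies win over freshly issued ones
  return mergeCookies(fresh, provided);
}
```

Whether provided cookies should override freshly issued ones is a design choice that reconnaissance may reverse; the sketch only makes the precedence explicit.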
**`ChallengeResult` type:**
```ts
interface ChallengeResult {
  solved: boolean;
  cookies?: Record<string, string>; // clearance cookies to replay
  token?: string;                   // challenge response token
  error?: string;                   // why it failed
}
```
### Error Handling
- Solver failure → return a `ChallengeResult` with `{ solved: false, error: "..." }`;
the scraper logs a warning and returns empty results (never throws).
- Unrecognized challenge → log the response URL and HTML snippet for future analysis.
- Rate limits → handled by existing `http.ts` exponential backoff (no change needed).
- Solver timeout → 30s cap on any challenge computation, fall back to `solved: false`.
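The timeout and never-throw rules above could be enforced by one wrapper around any solver. A minimal sketch, assuming a hypothetical `solveWithTimeout` helper and `SOLVER_TIMEOUT_MS` constant; it uses `Promise.race`, which stops waiting for the solver but does not cancel work the solver has already started.

```typescript
interface ChallengeResult {
  solved: boolean;
  cookies?: Record<string, string>;
  token?: string;
  error?: string;
}

const SOLVER_TIMEOUT_MS = 30_000; // 30s cap from the spec

// Hypothetical wrapper: resolves to a failed result on timeout or on any solver exception
async function solveWithTimeout(
  solve: () => Promise<ChallengeResult>,
  timeoutMs: number = SOLVER_TIMEOUT_MS,
): Promise<ChallengeResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ChallengeResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ solved: false, error: `solver timed out after ${timeoutMs}ms` }),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([solve(), timeout]);
  } catch (err) {
    // Convert exceptions into a failed result so the scraper never throws
    return { solved: false, error: err instanceof Error ? err.message : String(err) };
  } finally {
    // Clear the timer so a fast solver does not keep the process alive
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A call site would then look like `await solveWithTimeout(() => solveCheckpointChallenge(html, cookies))`, and the scraper only needs to branch on `solved`.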
### Testing
| Test | What It Verifies |
| --- | --- |
| `detectFacebookChallenge` with sample checkpoint HTML | Correctly identifies checkpoint challenge |
| `detectFacebookChallenge` with normal search HTML | Returns null (no false positives) |
| `detectFacebookChallenge` with login redirect | Identifies the login wall (auth-gated response) |
| `solveCheckpointChallenge` with known PoW params | Produces correct answer |
| `warmFacebookSession` with mocked fetch | Collects expected cookies |
| `extractDtsg` with sample page HTML | Extracts the DTSG token |
| Integration: fetch → challenge → solve → retry → results | End-to-end mock flow |
| Solver throws → scraper returns empty, no crash | Graceful fallback |
| Unrecognized challenge type → scraper logs a warning, returns empty | No unhandled challenge crashes |
Test data will use anonymized HTML fixtures (no real user data).
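As an illustration of a fixture-driven check, the `extractDtsg` row could be exercised against a made-up snippet like the one below. This sketch assumes the token appears in a hidden `<input name="fb_dtsg">` element; the real page may embed it elsewhere (for example inside a script tag), which reconnaissance has to confirm. The fixture content and token value are invented.

```typescript
// Assumed markup: <input type="hidden" name="fb_dtsg" value="...">
function extractDtsg(html: string): string | null {
  const inputs = html.match(/<input\b[^>]*>/gi) ?? [];
  for (const tag of inputs) {
    // Attribute order varies, so test name and value independently within each tag
    if (!/\bname=["']fb_dtsg["']/i.test(tag)) continue;
    const value = /\bvalue=["']([^"']*)["']/i.exec(tag);
    if (value !== null) return value[1] ?? null;
  }
  return null;
}

// Anonymized fixture: every value here is made up
const CHECKPOINT_FIXTURE = `
<form method="post" action="/checkpoint/submit/">
  <input type="hidden" name="jazoest" value="2987">
  <input type="hidden" value="FAKE_TOKEN:123" name="fb_dtsg">
</form>`;

const token = extractDtsg(CHECKPOINT_FIXTURE); // "FAKE_TOKEN:123"
const none = extractDtsg("<html><body>No form here</body></html>"); // null
```

In the repository these checks would live in `packages/core/test/facebook-challenge.test.ts` with the fixture loaded from a file rather than inlined.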
## Reconnaissance Steps (debug-only, one-time)
1. **Probe programmatically:** `fetch` Marketplace search with/without cookies, record
status code and HTML.
2. **Browser session:** `agent-browser` → log into Facebook → navigate Marketplace →
record HAR.
3. **Diff analysis:** Compare browser request headers against our programmatic
headers.
4. **Cookie inventory:** List all cookies from browser session, identify which are
essential.
5. **Challenge trigger:** Identify what change in request signature triggers a
challenge.
6. **Replay test:** Replay the browser's exact request via `fetch` to confirm
headers/cookies are the differentiator.
All reconnaissance artifacts are saved under `docs/facebook-challenge/`.
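The diff analysis in step 3 could be scripted rather than done by eye. The sketch below assumes the browser headers come from a HAR entry's `request.headers` array (a list of `{ name, value }` objects in the HAR format); the `diffHeaders` function and `HeaderDiff` type are hypothetical names.

```typescript
interface HarHeader { name: string; value: string }

interface HeaderDiff {
  missing: string[];   // sent by the browser, absent from our request
  different: string[]; // sent by both, with different values
  extra: string[];     // sent by us, absent from the browser request
}

function diffHeaders(browser: HarHeader[], ours: Record<string, string>): HeaderDiff {
  // Header names are case-insensitive, so compare them lowercased
  const browserMap = new Map<string, string>();
  for (const h of browser) browserMap.set(h.name.toLowerCase(), h.value);
  const ourMap = new Map<string, string>();
  for (const [name, value] of Object.entries(ours)) ourMap.set(name.toLowerCase(), value);

  const diff: HeaderDiff = { missing: [], different: [], extra: [] };
  browserMap.forEach((value, name) => {
    // HTTP/2 pseudo-headers (":method", ":path", ...) cannot be set through fetch
    if (name.startsWith(":")) return;
    const mine = ourMap.get(name);
    if (mine === undefined) diff.missing.push(name);
    else if (mine !== value) diff.different.push(name);
  });
  ourMap.forEach((_value, name) => {
    if (!browserMap.has(name)) diff.extra.push(name);
  });
  return diff;
}
```

The `missing` list is the first place to look for the differentiator in step 5; `different` entries such as `user-agent` or `sec-ch-ua` are the next candidates.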
## Decisions Deferred to Post-Reconnaissance
- Exact challenge types and solving strategies (depends on what Facebook actually uses).
- Whether a PoW solver, CAPTCHA solver, or token-extraction approach is needed.
- npm package dependencies (only add what the reconnaissance proves necessary).