feat: port upstream scraper improvements to monorepo

Kijiji improvements:
- Add error classes: NetworkError, ParseError, RateLimitError, ValidationError
- Add exponential backoff with jitter for retries
- Add request timeout (30s abort)
- Add pagination support (SearchOptions.maxPages)
- Add location/category mappings and resolution functions
- Add enhanced DetailedListing interface with images, seller info, attributes
- Add GraphQL client for seller details

Facebook improvements:
- Add parseFacebookCookieString() for parsing cookie strings
- Add ensureFacebookCookies() with env var fallback
- Add extractFacebookItemData() with multiple extraction paths
- Add fetchFacebookItem() for individual item fetching
- Add extraction metrics and API stability monitoring
- Add vehicle-specific field extraction
- Improve error handling with specific guidance for auth errors

Shared utilities:
- Update http.ts with new error classes and improved fetchHtml

Documentation:
- Port KIJIJI.md, FMARKETPLACE.md, AGENTS.md from upstream

Tests:
- Port kijiji-core, kijiji-integration, kijiji-utils tests
- Port facebook-core, facebook-integration tests
- Add test setup file

Scripts:
- Port parse-facebook-cookies.ts script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
448
KIJIJI.md
Normal file
@@ -0,0 +1,448 @@
# Kijiji API Findings

## Overview
Kijiji is a Canadian classifieds marketplace that uses a modern web application built with Next.js and Apollo GraphQL. The search results are powered by a GraphQL API with client-side state management.

## Initial Page Load (Homepage)
- **URL**: https://www.kijiji.ca/
- **Architecture**: Server-side rendered React application with Next.js
- **Data Sources**:
  - Static assets loaded from `webapp-static.ca-kijiji-production.classifiedscloud.io`
  - Image media served from `media.kijiji.ca/api/v1/`
  - No initial API calls for listings - data appears to be embedded in HTML

## Search Results Page
- **URL Pattern**: `https://www.kijiji.ca/b-[location]/[keywords]/k0l0`
- **Example**: `https://www.kijiji.ca/b-canada/iphone/k0l0`
- **Technology Stack**: Next.js with Apollo GraphQL client
- **Data Structure**: Uses `__APOLLO_STATE__` global object containing normalized GraphQL cache

### GraphQL Data Structure

#### Data Location
Search results data is embedded in the Next.js page props under `__NEXT_DATA__.props.pageProps.__APOLLO_STATE__`. The data is pre-rendered on the server and sent to the client. Each page (including pagination) has its own pre-rendered data.
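
As a rough illustration of the data location described above, the following sketch pulls the Apollo state out of a page's HTML. It assumes the page carries the usual Next.js `<script id="__NEXT_DATA__">` JSON tag; `extractApolloStateSketch` is a hypothetical name and is not the scraper's actual `extractApolloState()`.

```typescript
// Hypothetical helper for illustration; not the real extractApolloState().
type ApolloState = Record<string, unknown>;

function extractApolloStateSketch(html: string): ApolloState | null {
  // Assumption: Next.js serializes page props as JSON inside this script tag.
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/,
  );
  if (!match) return null;
  try {
    const nextData = JSON.parse(match[1] ?? "") as {
      props?: { pageProps?: { __APOLLO_STATE__?: ApolloState } };
    };
    return nextData.props?.pageProps?.__APOLLO_STATE__ ?? null;
  } catch {
    // Malformed or empty JSON is treated as "no state found".
    return null;
  }
}
```

Returning `null` instead of throwing keeps the caller in charge of deciding whether a missing state is a parse error or just an unexpected page.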

#### Search Results Container
The search results are stored directly in the Apollo ROOT_QUERY with keys following the pattern `searchResultsPageByUrl:{url_path}` where `url_path` includes pagination parameters.

```json
{
  "searchResultsPageByUrl:/b-buy-sell/canada/iphone/k0c10l0": { ... },
  "searchResultsPageByUrl:/b-buy-sell/canada/iphone/k0c10l0?page=2": { ... }
}
```
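
A minimal sketch of how such keys could be looked up, assuming the key format is exactly `searchResultsPageByUrl:` followed by the URL path as shown above. Both function names are hypothetical.

```typescript
type RootQuery = Record<string, unknown>;

const SEARCH_KEY_PREFIX = "searchResultsPageByUrl:";

// Returns the ROOT_QUERY entry for one URL path, or undefined if absent.
function findSearchResultsEntry(rootQuery: RootQuery, urlPath: string): unknown {
  return rootQuery[`${SEARCH_KEY_PREFIX}${urlPath}`];
}

// Lists every URL path that has a search results entry in the cache.
function listSearchResultPaths(rootQuery: RootQuery): string[] {
  return Object.keys(rootQuery)
    .filter((key) => key.startsWith(SEARCH_KEY_PREFIX))
    .map((key) => key.slice(SEARCH_KEY_PREFIX.length));
}
```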

#### Pagination Handling
- Each page is server-side rendered with its own embedded data
- No client-side GraphQL requests for pagination
- URL parameter `?page=N` controls which page data is embedded
- Offset in searchString corresponds to `(page-1) * limit`
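
The page-to-offset relationship can be written out as a small helper. This is only a sketch; `pageToOffset` is a hypothetical name, and the default limit of 40 is taken from the results-per-page figure noted elsewhere in this document.

```typescript
// Converts a 1-based page number to the offset used in the search string.
function pageToOffset(page: number, limit = 40): number {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be a positive integer (1-based)");
  }
  return (page - 1) * limit;
}
```

For example, page 1 maps to offset 0 and page 2 maps to offset 40 with the default limit.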

#### Search Parameters in URL
- `k0c{CATEGORY}l{LOCATION}` - Category and location IDs
- `?page=N` - Page number (1-based)
- Data contains `offset` and `limit` for API-style pagination

#### Individual Listing Structure
```json
{
  "id": "1732061412",
  "title": "iPhone 13",
  "description": "iPhone 13, always had a screen protector on it...",
  "imageCount": 3,
  "imageUrls": ["https://media.kijiji.ca/api/v1/ca-prod-fsbo-ads/images/..."],
  "categoryId": 760,
  "url": "https://www.kijiji.ca/v-cell-phone/...",
  "activationDate": "2026-01-21T16:51:16.000Z",
  "sortingDate": "2026-01-21T16:51:16.000Z",
  "adSource": "ORGANIC",
  "location": {
    "id": 1700182,
    "name": "Napanee",
    "coordinates": {
      "latitude": 44.48774,
      "longitude": -76.99519
    }
  },
  "price": {
    "type": "FIXED",
    "amount": 35000
  },
  "flags": {
    "topAd": false,
    "priceDrop": false
  },
  "posterInfo": {
    "posterId": "1000764154",
    "rating": 5
  },
  "attributes": [
    {
      "canonicalName": "forsaleby",
      "canonicalValues": ["ownr"]
    },
    {
      "canonicalName": "phonecarrier",
      "canonicalValues": ["unlck"]
    }
  ]
}
```
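
The `attributes` array in the listing above is easier to query once it is keyed by `canonicalName`. The sketch below shows one way to do that; the `ListingAttribute` type and `attributesToRecord` helper are illustrative names, not part of the scraper.

```typescript
// Shape of one entry in a listing's "attributes" array, as observed above.
interface ListingAttribute {
  canonicalName: string;
  canonicalValues: string[];
}

// Indexes attributes by canonicalName; a later duplicate name overwrites an earlier one.
function attributesToRecord(
  attributes: ListingAttribute[],
): Record<string, string[]> {
  const record: Record<string, string[]> = {};
  for (const attribute of attributes) {
    record[attribute.canonicalName] = attribute.canonicalValues;
  }
  return record;
}
```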

### URL Parameters
- `sort=MATCH` - Sort by relevance
- `order=DESC` - Descending order
- `type=OFFER` - Show offerings (not wanted ads)
- `offset=0` - Pagination offset
- `limit=40` - Results per page
- `topAdCount=6` - Number of promoted ads
- `keywords=iphone` - Search keywords
- `category=0` - Category ID (0 = All Categories)
- `location=0` - Location ID (0 = Canada)
- `eaTopAdPosition=1` - Purpose not yet determined

### Image API
- **Endpoint**: `https://media.kijiji.ca/api/v1/`
- **Pattern**: `/ca-prod-fsbo-ads/images/{uuid}?rule=kijijica-{size}-jpg`
- **Sizes**: 200, 300, 400, 500 pixels
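
Putting the endpoint, pattern, and sizes together, an image URL could be assembled as in this sketch. `buildImageUrl` is a hypothetical helper, and only the four sizes listed above are assumed to be valid.

```typescript
// Sizes observed for the "rule" query parameter.
const IMAGE_SIZES = [200, 300, 400, 500] as const;
type ImageSize = (typeof IMAGE_SIZES)[number];

// Builds a media URL for a listing image UUID at one of the observed sizes.
function buildImageUrl(uuid: string, size: ImageSize): string {
  const base = "https://media.kijiji.ca/api/v1/ca-prod-fsbo-ads/images";
  return `${base}/${encodeURIComponent(uuid)}?rule=kijijica-${size}-jpg`;
}
```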

### Categories and Locations

#### Category Structure
Categories are hierarchical with parent-child relationships. The main categories under "Buy & Sell" include:

| ID | Name | Total Results (iPhone search) |
|----|------|------------------------------|
| 10 | Buy & Sell | 19956 |
| 12 | Arts & Collectibles | 149 |
| 767 | Audio | 481 |
| 253 | Baby Items | 13 |
| 931 | Bags & Luggage | 8 |
| 644 | Bikes | 46 |
| 109 | Books | 21 |
| 103 | Cameras & Camcorders | 101 |
| 104 | CDs, DVDs & Blu-ray | 102 |
| 274 | Clothing | 83 |
| 16 | Computers | 285 |
| 128 | Computer Accessories | 363 |
| 29659001 | Electronics | 2006 |
| 17220001 | Free Stuff | 23 |
| 235 | Furniture | 29 |
| 638 | Garage Sales | 5 |
| 140 | Health & Special Needs | 30 |
| 139 | Hobbies & Crafts | 10 |
| 107 | Home Appliances | 23 |
| 717 | Home - Indoor | 27 |
| 727 | Home Renovation Materials | 14 |
| 133 | Jewellery & Watches | 83 |
| 17 | Musical Instruments | 34 |
| 132 | Phones | 15518 |
| 111 | Sporting Goods & Exercise | 30 |
| 110 | Tools | 25 |
| 108 | Toys & Games | 38 |
| 15093001 | TVs & Video | 15 |
| 141 | Video Games & Consoles | 96 |
| 26 | Other | 286 |

#### Location Structure
Locations are also hierarchical, with provinces and territories under the top-level "Canada" location:

| ID | Name | Total Results (iPhone search) |
|----|------|------------------------------|
| 0 | Canada | - |
| 9001 | Québec | 2516 |
| 9002 | Nova Scotia | 875 |
| 9003 | Alberta | 2317 |
| 9004 | Ontario | 12507 |
| 9005 | New Brunswick | 118 |
| 9006 | Manitoba | 919 |
| 9007 | British Columbia | 306 |
| 9008 | Newfoundland | 27 |
| 9009 | Saskatchewan | 336 |
| 9010 | Territories | 7 |
| 9011 | Prince Edward Island | 31 |

#### URL Patterns
- Categories: `/b-{category-slug}/canada/{keywords}/k0c{CATEGORY_ID}l0`
- Locations: `/b-buy-sell/{location-slug}/iphone/k0c10l{LOCATION_ID}`
- Combined: `/b-{category-slug}/{location-slug}/{keywords}/k0c{CATEGORY_ID}l{LOCATION_ID}`
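
Going the other way, the category and location IDs can be read back out of a search path. The sketch below assumes the trailing segment always has the form `k0c{CATEGORY_ID}l{LOCATION_ID}`, with the `c` part missing when no category is selected (treated here as category 0). `parseSearchCode` is a hypothetical name.

```typescript
interface SearchCode {
  categoryId: number;
  locationId: number;
}

// Parses the trailing "k0c{category}l{location}" segment of a search path.
// A query string or fragment after the segment is ignored.
function parseSearchCode(urlPath: string): SearchCode | null {
  const match = urlPath.match(/\/k0(?:c(\d+))?l(\d+)(?:[?#].*)?$/);
  if (!match) return null;
  return {
    // Assumption: a missing "c" part means category 0 (All Categories).
    categoryId: match[1] ? Number(match[1]) : 0,
    locationId: Number(match[2]),
  };
}
```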

### Pagination
- Uses offset-based pagination
- 40 results per page
- Total count provided in pagination metadata

## Authentication & User Management
- **Authentication System**: OAuth2-based using CIS (Customer Identity Service)
- **Identity Provider**: `id.kijiji.ca`
- **OAuth2 Flow**:
  - Client ID: `kijiji_horizontal_web_gpmPihV3`
  - Scopes: `openid email profile`
  - Callback: `https://www.kijiji.ca/api/auth/callback/cis`
- **Session Management**: Cookie-based with encrypted session data
- **Anonymous Access**: Full search functionality available without login
- **User Features**: Saved searches, messaging, and flagging require authentication

## Posting API
- **Posting Flow**: Requires authentication, redirects to login if not authenticated
- **Posting URL**: `https://www.kijiji.ca/p-post-ad.html`
- **Authentication Required**: Yes, redirects to `/consumer/login` for unauthenticated users
- **Post-Creation**: Likely uses authenticated GraphQL mutations (not observed in anonymous browsing)

## GraphQL API Endpoint
- **URL**: `https://www.kijiji.ca/anvil/api`
- **Method**: POST
- **Content-Type**: application/json
- **Headers**:
  - `apollo-require-preflight: true`
  - Standard CORS headers
- **Authentication**: No authentication required for basic queries (uses cookies for session tracking)
- **Technology**: Apollo GraphQL server
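
Based on the endpoint details above, a request could be assembled roughly as follows. This is a sketch only: `buildGraphqlRequest` is a hypothetical helper, and the `{ operationName, query, variables }` body is the conventional GraphQL-over-HTTP shape, assumed rather than confirmed for this endpoint.

```typescript
interface GraphqlRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds (but does not send) a POST request for the observed endpoint.
function buildGraphqlRequest(
  operationName: string,
  query: string,
  variables: Record<string, unknown> = {},
): GraphqlRequest {
  return {
    url: "https://www.kijiji.ca/anvil/api",
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "apollo-require-preflight": "true",
      },
      // Assumed body shape: the standard GraphQL-over-HTTP JSON payload.
      body: JSON.stringify({ operationName, query, variables }),
    },
  };
}
```

The result can be passed to `fetch(request.url, request.init)`. Keeping construction separate from sending makes the request easy to inspect and test without network access.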

### Sample GraphQL Queries Discovered

#### Get Search Categories
```graphql
query getSearchCategories($locale: String!) {
  searchCategories {
    id
    localizedName(locale: $locale)
    parentId
    __typename
  }
}
```

Variables: `{"locale": "en-CA"}`

Response includes hierarchical category structure with IDs and localized names.
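
Since each category carries a `parentId`, the flat response can be folded into a tree. The sketch below assumes `id` and `parentId` are numeric and that top-level categories have a `parentId` that is `null` or not present in the list. The types and `buildCategoryTree` are hypothetical names.

```typescript
interface FlatCategory {
  id: number;
  localizedName: string;
  parentId: number | null;
}

interface CategoryNode extends FlatCategory {
  children: CategoryNode[];
}

// Groups a flat category list into a tree using parentId links.
// Categories whose parent is null, unknown, or themselves become roots.
function buildCategoryTree(flat: FlatCategory[]): CategoryNode[] {
  const nodes = new Map<number, CategoryNode>();
  for (const category of flat) {
    nodes.set(category.id, { ...category, children: [] });
  }
  const roots: CategoryNode[] = [];
  nodes.forEach((node) => {
    const parent =
      node.parentId === null ? undefined : nodes.get(node.parentId);
    if (parent && parent !== node) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  });
  return roots;
}
```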

#### Get Geocode from IP (fails for current IP)
```graphql
query GetGeocodeReverseFromIp {
  geocodeReverseFromIp {
    city
    province
    locationId
    __typename
  }
}
```

This query fails for the current IP address, suggesting geolocation-based features may not work or require different IP ranges.

#### Get Category Path
```graphql
query GetCategoryPath($categoryId: Int!, $locale: String, $locationId: Int) {
  category(id: $categoryId) {
    id
    localizedName(locale: $locale)
    parentId
    searchSeoUrl(locationId: $locationId)
    categoryPaths {
      id
      localizedName(locale: $locale)
      parentId
      searchSeoUrl(locationId: $locationId)
      __typename
    }
    __typename
  }
}
```

Variables: `{"categoryId": 10, "locationId": 0, "locale": "en-CA"}`

## Latest Findings (2026-01-21)

### Client-Side GraphQL Queries Observed
- **getSearchCategories**: Retrieves category hierarchy for search filters
- **GetGeocodeReverseFromIp**: Attempts to geolocate user (fails for current IP)

### GraphQL Schema Insights
Testing direct GraphQL queries revealed:
- Field `searchResults` does not exist on the `Query` type
- Suggested alternatives: `searchResultsPage` or `searchUrl`
- This suggests that search is exposed through operations such as `searchResultsPage` or `searchUrl` rather than a `searchResults` field

The embedded Apollo state approach appears to be the primary method for accessing search data, with GraphQL used for auxiliary operations like categories and geolocation.

### Server-Side Rendering Architecture
Search results are fully server-side rendered with data embedded in HTML. Each page (including pagination) contains its own pre-rendered data. No client-side GraphQL requests are made for:

- Initial search results
- Pagination navigation
- Search result data

### Network Analysis Findings
- GraphQL endpoint: `https://www.kijiji.ca/anvil/api`
- Method: POST
- Content-Type: application/json
- Headers include: `apollo-require-preflight: true`
- Cookies required for session tracking

### Embedded Data Structure
Search results data is embedded in the HTML within the Next.js `__NEXT_DATA__.props.pageProps.__APOLLO_STATE__` object. The data includes:

- Individual ad listings with complete metadata
- Pagination information
- Filter options and counts
- Category/location hierarchies

### Current Scraper Implementation
The existing `src/kijiji.ts` implementation correctly parses the embedded Apollo state:

- Uses `extractApolloState()` to parse `__NEXT_DATA__` from HTML
- Filters Apollo keys containing "Listing" to find ad data
- Extracts `url`, `title`, and other metadata from each listing
- Successfully scrapes listings without needing API authentication
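
The key-filtering step in the list above might look like the following sketch. It is not the code in `src/kijiji.ts`; `collectListingSummaries` and `ListingSummary` are hypothetical names, and entries without a string `url` and `title` are simply skipped.

```typescript
interface ListingSummary {
  url: string;
  title: string;
}

// Collects url/title pairs from Apollo cache entries whose key contains "Listing".
function collectListingSummaries(
  apolloState: Record<string, unknown>,
): ListingSummary[] {
  const summaries: ListingSummary[] = [];
  for (const key of Object.keys(apolloState)) {
    if (!key.includes("Listing")) continue;
    const value = apolloState[key];
    if (typeof value !== "object" || value === null) continue;
    const entry = value as Record<string, unknown>;
    // Skip cache entries that lack the fields we need.
    if (typeof entry.url === "string" && typeof entry.title === "string") {
      summaries.push({ url: entry.url, title: entry.title });
    }
  }
  return summaries;
}
```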

### Authentication Status
- **Search functionality**: No authentication required - all search and listing data accessible anonymously
- **Posting functionality**: Requires authentication (redirects to login)
- **User features**: Saved searches, messaging require authentication
- **Rate limiting**: May apply but not observed in anonymous browsing

### Pagination Implementation
- Each page is a separate server-rendered route
- URL pattern: `/b-{location}/{keywords}/page-{number}/k0c{category_id}l{location_id}` (the `c{category_id}` part is omitted for All Categories)
- No client-side pagination API calls
- 40 results per page (observed)
- Example: `/b-canada/iphone/page-2/k0l0` for page 2 of iPhone search

## URL Pattern Analysis

### Search URL Structure
`https://www.kijiji.ca/b-{category_slug}/{location_slug}/{keywords}/k0c{category_id}l{location_id}`

#### Examples Observed:
- All categories, Canada: `/b-canada/iphone/k0l0` (no `c` segment = All Categories, l0 = Canada)
- Cell phones category: `/b-cell-phones/canada/iphone/k0c132l0` (c132 = Cell Phones)
- With pagination: `/b-canada/iphone/page-2/k0l0`

#### URL Components:
- `c{CATEGORY_ID}`: Category ID (0 = All Categories, 132 = Cell Phones, etc.)
- `l{LOCATION_ID}`: Location ID (0 = Canada, 1700272 = GTA, etc.)
- `page-{N}`: Pagination (1-based, optional)
- Keywords are slugified in URL path
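
The components above can be combined into a URL builder along these lines. This is a sketch under assumptions: the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is a guess at how keywords are slugified, page 1 is assumed to need no `page-` segment, and `buildSearchUrl`, `slugifyKeywords`, and `SearchUrlParts` are hypothetical names.

```typescript
interface SearchUrlParts {
  categorySlug?: string; // e.g. "cell-phones"; omit for All Categories
  categoryId?: number; // e.g. 132; omit for All Categories
  locationSlug: string; // e.g. "canada"
  locationId: number; // e.g. 0 for Canada
  keywords: string;
  page?: number; // 1-based
}

// Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
function slugifyKeywords(keywords: string): string {
  return keywords
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function buildSearchUrl(parts: SearchUrlParts): string {
  const hasCategory =
    parts.categorySlug !== undefined && parts.categoryId !== undefined;
  const segments: string[] = hasCategory
    ? [`b-${parts.categorySlug}`, parts.locationSlug]
    : [`b-${parts.locationSlug}`];
  segments.push(slugifyKeywords(parts.keywords));
  // Assumption: page 1 carries no "page-" segment.
  if (parts.page !== undefined && parts.page > 1) {
    segments.push(`page-${parts.page}`);
  }
  segments.push(
    hasCategory
      ? `k0c${parts.categoryId}l${parts.locationId}`
      : `k0l${parts.locationId}`,
  );
  return `https://www.kijiji.ca/${segments.join("/")}`;
}
```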

### Current Implementation Status
The existing scraper in `src/kijiji.ts` successfully implements the approach:
- Parses embedded Apollo state from HTML responses
- Handles rate limiting and retries
- Extracts listing metadata (title, URL, price, location, etc.)
- Works without authentication for search operations
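
The commit message mentions exponential backoff with jitter for retries. The sketch below shows one common form of that technique ("full jitter"), not the scraper's actual code. The function names, the base delay of 1 second, the 30 second cap, and the default of 3 attempts are all assumptions made for illustration.

```typescript
// Full-jitter backoff: a random delay in [0, min(maxMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  baseMs = 1000,
  maxMs = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Runs a task, sleeping with jittered backoff between failed attempts.
// Rethrows the last error once maxAttempts is exhausted.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Injecting `random` keeps the delay calculation deterministic in tests, while callers in production can rely on the `Math.random` default.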

## Listing Details Page

### Overview
Similar to search results, listing details pages use server-side rendering with embedded Apollo GraphQL state in the HTML. No dedicated API endpoint serves individual listing data - all information is pre-rendered on the server.

### Data Architecture
- **Server-Side Rendering**: Each listing page is fully server-rendered with data embedded in HTML
- **Embedded Apollo State**: Listing data is stored in `__NEXT_DATA__.props.pageProps.__APOLLO_STATE__`
- **Client-Side GraphQL**: Additional data (categories, campaigns, similar listings, user profiles) fetched via GraphQL API

### Listing Data Structure
The main listing data follows the same pattern as search results:

```json
{
  "id": "1705585530",
  "title": "We Pay top cash for iPhone 17 pro max, iPhone 17 pro, iPhone Air",
  "description": "Buying All Brand new Apple iPhones sealed/Unsealed...",
  "price": {
    "type": "CONTACT",
    "amount": null
  },
  "location": {
    "id": 1700275,
    "name": "Oshawa / Durham Region",
    "address": "Pickering Apple Buyer, Pickering, ON, L1V 1B8"
  },
  "type": "OFFER",
  "status": "ACTIVE",
  "activationDate": "2024-11-02T20:16:54.000Z",
  "endDate": "3000-01-01T00:00:00.000Z",
  "metrics": {
    "views": 1720
  },
  "posterInfo": {
    "posterId": "1044934581",
    "rating": null
  },
  "attributes": [
    {
      "canonicalName": "forsaleby",
      "canonicalValues": ["business"]
    },
    {
      "canonicalName": "phonecarrier",
      "canonicalValues": ["unlocked"]
    }
  ]
}
```

### Client-Side GraphQL Queries
When loading a listing details page, the following GraphQL queries are executed:

#### 1. getSearchCategories
- **Purpose**: Category hierarchy for navigation
- **Variables**: `{"locale": "en-CA"}`
- **Response**: Hierarchical category structure

#### 2. getCampaignsForVip
- **Purpose**: Advertisement targeting data
- **Variables**: `{"placement": "vip", "locationId": 1700275, "categoryId": 760, "platform": "desktop"}`
- **Response**: Campaign/ads data (usually null)

#### 3. GetReviewSummary
- **Purpose**: Seller review statistics
- **Variables**: `{"userId": "1044934581"}`
- **Response**: Review count and score (usually 0 for new sellers)

#### 4. GetProfileMetrics
- **Purpose**: Seller profile information
- **Variables**: `{"profileId": "1044934581"}`
- **Response**: Member since date, account type

#### 5. GetListingsSimilar
- **Purpose**: Similar listings for cross-selling
- **Variables**: `{"listingId": "1705585530", "limit": 10, "isExternalId": false}`
- **Response**: Array of similar listings with basic metadata

#### 6. GetGeocodeReverseFromIp
- **Purpose**: Geolocation-based features
- **Variables**: `{}`
- **Response**: Fails with 404 for most IPs

### Implementation Status
The existing `parseListing()` function in `src/kijiji.ts` successfully extracts listing details from embedded Apollo state:

- ✅ Extracts title, description, price, location
- ✅ Handles contact-based pricing ("Please Contact")
- ✅ Parses creation date, view count, listing status
- ✅ Extracts seller information and address
- ✅ Works without authentication or API keys

### Key Findings
1. **No Dedicated Listing API**: As with search results, no separate GraphQL query was observed for individual listing data
2. **Complete Data Available**: All listing information is embedded in the initial HTML response
3. **Additional Context Fetched**: Secondary GraphQL queries provide complementary data (reviews, similar listings)
4. **Consistent Architecture**: Same Apollo state embedding pattern as search pages

### Current Scraper Implementation
The scraper successfully extracts listing details by:
1. Fetching the listing URL HTML
2. Parsing embedded `__NEXT_DATA__` Apollo state
3. Extracting the `Listing:{id}` object from Apollo cache
4. Mapping fields to typed `ListingDetails` interface
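
Steps 3 and 4 above can be sketched as follows, using the field names from the listing JSON shown earlier. This is not the real `parseListing()`; `mapListingSketch`, `RawListing`, and `ListingDetailsSketch` are hypothetical, and the price amount is passed through unchanged because its unit is not confirmed in this document.

```typescript
// Subset of the embedded listing fields used by this sketch.
interface RawListing {
  id: string;
  title: string;
  description?: string | null;
  price?: { type: string; amount: number | null } | null;
  location?: { id: number; name: string; address?: string | null } | null;
  metrics?: { views?: number | null } | null;
}

interface ListingDetailsSketch {
  id: string;
  title: string;
  description: string;
  priceType: string | null;
  priceAmount: number | null; // raw embedded value; null for "CONTACT" pricing
  locationName: string | null;
  address: string | null;
  views: number | null;
}

// Looks up "Listing:{id}" in the Apollo cache and maps it to a flat object.
function mapListingSketch(
  apolloState: Record<string, unknown>,
  id: string,
): ListingDetailsSketch | null {
  const raw = apolloState[`Listing:${id}`] as RawListing | undefined;
  if (!raw) return null;
  return {
    id: raw.id,
    title: raw.title,
    description: raw.description ?? "",
    priceType: raw.price?.type ?? null,
    priceAmount: raw.price?.amount ?? null,
    locationName: raw.location?.name ?? null,
    address: raw.location?.address ?? null,
    views: raw.metrics?.views ?? null,
  };
}
```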

This approach worked reliably in testing without authentication, and no rate limiting was observed on individual listing fetches.

## Next Steps
- Explore posting/authentication APIs (requires user login)
- Investigate if GraphQL API can be used for programmatic access with proper authentication
- Test rate limiting patterns and optimal scraping strategies
- Document additional category and location ID mappings