diff --git a/KIJIJI.md b/KIJIJI.md
new file mode 100644
index 0000000..64eebf1
--- /dev/null
+++ b/KIJIJI.md
@@ -0,0 +1,448 @@
+# Kijiji API Findings
+
+## Overview
+Kijiji is a Canadian classifieds marketplace that uses a modern web application built with Next.js and Apollo GraphQL. The search results are powered by a GraphQL API with client-side state management.
+
+## Initial Page Load (Homepage)
+- **URL**: https://www.kijiji.ca/
+- **Architecture**: Server-side rendered React application with Next.js
+- **Data Sources**:
+  - Static assets loaded from `webapp-static.ca-kijiji-production.classifiedscloud.io`
+  - Image media served from `media.kijiji.ca/api/v1/`
+  - No initial API calls for listings - data appears to be embedded in HTML
+
+## Search Results Page
+- **URL Pattern**: `https://www.kijiji.ca/b-[location]/[keywords]/k0l0`
+- **Example**: `https://www.kijiji.ca/b-canada/iphone/k0l0`
+- **Technology Stack**: Next.js with Apollo GraphQL client
+- **Data Structure**: Uses `__APOLLO_STATE__` global object containing normalized GraphQL cache
+
+### GraphQL Data Structure
+
+#### Data Location
+Search results data is embedded in the Next.js page props under `__NEXT_DATA__.props.pageProps.__APOLLO_STATE__`. The data is pre-rendered on the server and sent to the client. Each page (including pagination) has its own pre-rendered data.
+
+#### Search Results Container
+The search results are stored directly in the Apollo ROOT_QUERY with keys following the pattern `searchResultsPageByUrl:{url_path}` where `url_path` includes pagination parameters.
+
+```json
+{
+  "searchResultsPageByUrl:/b-buy-sell/canada/iphone/k0c10l0": { ... },
+  "searchResultsPageByUrl:/b-buy-sell/canada/iphone/k0c10l0?page=2": { ... }
+}
+```
+
+#### Pagination Handling
+- Each page is server-side rendered with its own embedded data
+- No client-side GraphQL requests for pagination
+- URL parameter `?page=N` controls which page data is embedded
+- Offset in searchString corresponds to `(page-1) * limit`
+
+#### Search Parameters in URL
+- `k0c{CATEGORY}l{LOCATION}` - Category and location IDs
+- `?page=N` - Page number (1-based)
+- Data contains `offset` and `limit` for API-style pagination
+
+#### Individual Listing Structure
+```json
+{
+  "id": "1732061412",
+  "title": "iPhone 13",
+  "description": "iPhone 13, always had a screen protector on it...",
+  "imageCount": 3,
+  "imageUrls": ["https://media.kijiji.ca/api/v1/ca-prod-fsbo-ads/images/..."],
+  "categoryId": 760,
+  "url": "https://www.kijiji.ca/v-cell-phone/...",
+  "activationDate": "2026-01-21T16:51:16.000Z",
+  "sortingDate": "2026-01-21T16:51:16.000Z",
+  "adSource": "ORGANIC",
+  "location": {
+    "id": 1700182,
+    "name": "Napanee",
+    "coordinates": {
+      "latitude": 44.48774,
+      "longitude": -76.99519
+    }
+  },
+  "price": {
+    "type": "FIXED",
+    "amount": 35000
+  },
+  "flags": {
+    "topAd": false,
+    "priceDrop": false
+  },
+  "posterInfo": {
+    "posterId": "1000764154",
+    "rating": 5
+  },
+  "attributes": [
+    {
+      "canonicalName": "forsaleby",
+      "canonicalValues": ["ownr"]
+    },
+    {
+      "canonicalName": "phonecarrier",
+      "canonicalValues": ["unlck"]
+    }
+  ]
+}
+```
+
+### URL Parameters
+- `sort=MATCH` - Sort by relevance
+- `order=DESC` - Descending order
+- `type=OFFER` - Show offerings (not wanted ads)
+- `offset=0` - Pagination offset
+- `limit=40` - Results per page
+- `topAdCount=6` - Number of promoted ads
+- `keywords=iphone` - Search keywords
+- `category=0` - Category ID (0 = All Categories)
+- `location=0` - Location ID (0 = Canada)
+- `eaTopAdPosition=1` - Purpose unknown
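The relationship between the 1-based `?page=N` parameter and the `offset`/`limit` values listed above can be sketched in TypeScript. This is a minimal illustration, not code from `src/kijiji.ts`: the helper names `pageToOffset` and `buildSearchString` are hypothetical, the parameter order is an assumption, and `topAdCount` and `eaTopAdPosition` are left out because their semantics are not confirmed.

```typescript
// Hypothetical helpers: only the parameter names and default values come from the findings.
interface SearchStringOptions {
  keywords: string;
  page?: number; // 1-based page number, as in `?page=N`
  limit?: number; // 40 results per page were observed
  category?: number; // 0 = All Categories
  location?: number; // 0 = Canada
}

// Offset corresponds to (page - 1) * limit.
function pageToOffset(page: number, limit: number): number {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be an integer >= 1");
  }
  return (page - 1) * limit;
}

// Builds a query string using the documented parameter names (order is assumed).
function buildSearchString(opts: SearchStringOptions): string {
  const limit = opts.limit ?? 40;
  const params = new URLSearchParams({
    sort: "MATCH",
    order: "DESC",
    type: "OFFER",
    offset: String(pageToOffset(opts.page ?? 1, limit)),
    limit: String(limit),
    keywords: opts.keywords,
    category: String(opts.category ?? 0),
    location: String(opts.location ?? 0),
  });
  return params.toString();
}
```

For example, `buildSearchString({ keywords: "iphone", page: 2 })` yields a string whose `offset` is `40`, matching the second page at 40 results per page.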
+
+### Image API
+- **Endpoint**: `https://media.kijiji.ca/api/v1/`
+- **Pattern**: `/ca-prod-fsbo-ads/images/{uuid}?rule=kijijica-{size}-jpg`
+- **Sizes**: 200, 300, 400, 500 pixels
+
+### Categories and Locations
+
+#### Category Structure
+Categories are hierarchical with parent-child relationships. The main categories under "Buy & Sell" include:
+
+| ID | Name | Total Results (iPhone search) |
+|----|------|------------------------------|
+| 10 | Buy & Sell | 19956 |
+| 12 | Arts & Collectibles | 149 |
+| 767 | Audio | 481 |
+| 253 | Baby Items | 13 |
+| 931 | Bags & Luggage | 8 |
+| 644 | Bikes | 46 |
+| 109 | Books | 21 |
+| 103 | Cameras & Camcorders | 101 |
+| 104 | CDs, DVDs & Blu-ray | 102 |
+| 274 | Clothing | 83 |
+| 16 | Computers | 285 |
+| 128 | Computer Accessories | 363 |
+| 29659001 | Electronics | 2006 |
+| 17220001 | Free Stuff | 23 |
+| 235 | Furniture | 29 |
+| 638 | Garage Sales | 5 |
+| 140 | Health & Special Needs | 30 |
+| 139 | Hobbies & Crafts | 10 |
+| 107 | Home Appliances | 23 |
+| 717 | Home - Indoor | 27 |
+| 727 | Home Renovation Materials | 14 |
+| 133 | Jewellery & Watches | 83 |
+| 17 | Musical Instruments | 34 |
+| 132 | Phones | 15518 |
+| 111 | Sporting Goods & Exercise | 30 |
+| 110 | Tools | 25 |
+| 108 | Toys & Games | 38 |
+| 15093001 | TVs & Video | 15 |
+| 141 | Video Games & Consoles | 96 |
+| 26 | Other | 286 |
+
+#### Location Structure
+Locations are also hierarchical, with provinces and territories under the main "Canada" location:
+
+| ID | Name | Total Results (iPhone search) |
+|----|------|------------------------------|
+| 0 | Canada | - |
+| 9001 | Québec | 2516 |
+| 9002 | Nova Scotia | 875 |
+| 9003 | Alberta | 2317 |
+| 9004 | Ontario | 12507 |
+| 9005 | New Brunswick | 118 |
+| 9006 | Manitoba | 919 |
+| 9007 | British Columbia | 306 |
+| 9008 | Newfoundland | 27 |
+| 9009 | Saskatchewan | 336 |
+| 9010 | Territories | 7 |
+| 9011 | Prince Edward Island | 31 |
+
+#### URL Patterns
+- Categories: `/b-{category-slug}/canada/{keywords}/k0c{CATEGORY_ID}l0`
+- Locations: `/b-buy-sell/{location-slug}/iphone/k0c10l{LOCATION_ID}`
+- Combined: `/b-{category-slug}/{location-slug}/{keywords}/k0c{CATEGORY_ID}l{LOCATION_ID}`
+
+### Pagination
+- Uses offset-based pagination
+- 40 results per page
+- Total count provided in pagination metadata
+
+## Authentication & User Management
+- **Authentication System**: OAuth2-based using CIS (Customer Identity Service)
+- **Identity Provider**: `id.kijiji.ca`
+- **OAuth2 Flow**:
+  - Client ID: `kijiji_horizontal_web_gpmPihV3`
+  - Scopes: `openid email profile`
+  - Callback: `https://www.kijiji.ca/api/auth/callback/cis`
+- **Session Management**: Cookie-based with encrypted session data
+- **Anonymous Access**: Full search functionality available without login
+- **User Features**: Saved searches, messaging, flagging require authentication
+
+## Posting API
+- **Posting Flow**: Requires authentication, redirects to login if not authenticated
+- **Posting URL**: `https://www.kijiji.ca/p-post-ad.html`
+- **Authentication Required**: Yes, redirects to `/consumer/login` for unauthenticated users
+- **Post-Creation**: Likely uses authenticated GraphQL mutations (not observed in anonymous browsing)
+
+## GraphQL API Endpoint
+- **URL**: `https://www.kijiji.ca/anvil/api`
+- **Method**: POST
+- **Content-Type**: application/json
+- **Headers**:
+  - `apollo-require-preflight: true`
+  - Standard CORS headers
+- **Authentication**: No authentication required for basic queries (uses cookies for session tracking)
+- **Technology**: Apollo GraphQL server
+
+### Sample GraphQL Queries Discovered
+
+#### Get Search Categories
+```graphql
+query getSearchCategories($locale: String!) {
+  searchCategories {
+    id
+    localizedName(locale: $locale)
+    parentId
+    __typename
+  }
+}
+```
+
+Variables: `{"locale": "en-CA"}`
+
+Response includes hierarchical category structure with IDs and localized names.
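As a rough sketch of how the `getSearchCategories` query above might be sent to the documented endpoint, the TypeScript below assembles the POST request without issuing it. The `buildCategoriesRequest` helper and the `GraphQLRequest` type are hypothetical names. The `{ operationName, query, variables }` body is the conventional GraphQL-over-HTTP JSON shape, and whether the Kijiji server accepts exactly this payload is an assumption.

```typescript
// Endpoint and header values are taken from the findings; everything else is illustrative.
const GRAPHQL_ENDPOINT = "https://www.kijiji.ca/anvil/api";

const GET_SEARCH_CATEGORIES = `
  query getSearchCategories($locale: String!) {
    searchCategories {
      id
      localizedName(locale: $locale)
      parentId
      __typename
    }
  }
`;

// Hypothetical container for the pieces a fetch() call would need.
interface GraphQLRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds the request only; it performs no network I/O.
function buildCategoriesRequest(locale: string = "en-CA"): GraphQLRequest {
  return {
    url: GRAPHQL_ENDPOINT,
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "apollo-require-preflight": "true",
      },
      body: JSON.stringify({
        operationName: "getSearchCategories",
        query: GET_SEARCH_CATEGORIES,
        variables: { locale },
      }),
    },
  };
}
```

A caller could then run `const req = buildCategoriesRequest(); await fetch(req.url, req.init);`. Keeping request construction separate from the network call makes the payload easy to inspect and test offline.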
+
+#### Get Geocode from IP (fails for current IP)
+```graphql
+query GetGeocodeReverseFromIp {
+  geocodeReverseFromIp {
+    city
+    province
+    locationId
+    __typename
+  }
+}
+```
+
+This query fails for the current IP address, suggesting geolocation-based features may not work or require different IP ranges.
+
+#### Get Category Path
+```graphql
+query GetCategoryPath($categoryId: Int!, $locale: String, $locationId: Int) {
+  category(id: $categoryId) {
+    id
+    localizedName(locale: $locale)
+    parentId
+    searchSeoUrl(locationId: $locationId)
+    categoryPaths {
+      id
+      localizedName(locale: $locale)
+      parentId
+      searchSeoUrl(locationId: $locationId)
+      __typename
+    }
+    __typename
+  }
+}
+```
+
+Variables: `{"categoryId": 10, "locationId": 0, "locale": "en-CA"}`
+
+## Latest Findings (2026-01-21)
+
+### Client-Side GraphQL Queries Observed
+- **getSearchCategories**: Retrieves category hierarchy for search filters
+- **GetGeocodeReverseFromIp**: Attempts to geolocate user (fails for current IP)
+
+### GraphQL Schema Insights
+Testing direct GraphQL queries revealed:
+- Field "searchResults" does not exist on Query type
+- Suggested alternatives: "searchResultsPage" or "searchUrl"
+- This suggests the search functionality may use different GraphQL operations than direct queries
+
+The embedded Apollo state approach appears to be the primary method for accessing search data, with GraphQL used for auxiliary operations like categories and geolocation.
+
+### Server-Side Rendering Architecture
+Search results are fully server-side rendered with data embedded in HTML. Each page (including pagination) contains its own pre-rendered data. No client-side GraphQL requests are made for:
+
+- Initial search results
+- Pagination navigation
+- Search result data
+
+### Network Analysis Findings
+- GraphQL endpoint: `https://www.kijiji.ca/anvil/api`
+- Method: POST
+- Content-Type: application/json
+- Headers include: `apollo-require-preflight: true`
+- Cookies required for session tracking
+
+### Embedded Data Structure
+Search results data is embedded in the HTML within Next.js `__NEXT_DATA__.props.pageProps.__APOLLO_STATE__` object. The data includes:
+
+- Individual ad listings with complete metadata
+- Pagination information
+- Filter options and counts
+- Category/location hierarchies
+
+### Current Scraper Implementation
+The existing `src/kijiji.ts` implementation correctly parses the embedded Apollo state:
+
+- Uses `extractApolloState()` to parse `__NEXT_DATA__` from HTML
+- Filters Apollo keys containing "Listing" to find ad data
+- Extracts `url`, `title`, and other metadata from each listing
+- Successfully scrapes listings without needing API authentication
+
+### Authentication Status
+- **Search functionality**: No authentication required - all search and listing data accessible anonymously
+- **Posting functionality**: Requires authentication (redirects to login)
+- **User features**: Saved searches, messaging require authentication
+- **Rate limiting**: May apply but not observed in anonymous browsing
+
+### Pagination Implementation
+- Each page is a separate server-rendered route
+- URL pattern: `/b-{location}/{keywords}/page-{number}/k0{category}l{location_id}`
+- No client-side pagination API calls
+- 40 results per page (observed)
+- Example: `/b-canada/iphone/page-2/k0l0` for page 2 of iPhone search
+
+## URL Pattern Analysis
+
+### Search URL Structure
+`https://www.kijiji.ca/b-{category_slug}/{location_slug}/{keywords}/k0c{category_id}l{location_id}`
+
+#### Examples Observed:
+- All categories, Canada: `/b-canada/iphone/k0l0` (c0 = All Categories, l0 = Canada)
+- Cell phones category: `/b-cell-phones/canada/iphone/k0c132l0` (c132 = Cell Phones)
+- With pagination: `/b-canada/iphone/page-2/k0l0`
+
+#### URL Components:
+- `c{CATEGORY_ID}`: Category ID (0 = All Categories, 132 = Cell Phones, etc.)
+- `l{LOCATION_ID}`: Location ID (0 = Canada, 1700272 = GTA, etc.)
+- `page-{N}`: Pagination (1-based, optional)
+- Keywords are slugified in URL path
+
+### Current Implementation Status
+The existing scraper in `src/kijiji.ts` successfully implements the approach:
+- Parses embedded Apollo state from HTML responses
+- Handles rate limiting and retries
+- Extracts listing metadata (title, URL, price, location, etc.)
+- Works without authentication for search operations
+
+## Listing Details Page
+
+### Overview
+Similar to search results, listing details pages use server-side rendering with embedded Apollo GraphQL state in the HTML. No dedicated API endpoint serves individual listing data - all information is pre-rendered on the server.
+
+### Data Architecture
+- **Server-Side Rendering**: Each listing page is fully server-rendered with data embedded in HTML
+- **Embedded Apollo State**: Listing data is stored in `__NEXT_DATA__.props.pageProps.__APOLLO_STATE__`
+- **Client-Side GraphQL**: Additional data (categories, campaigns, similar listings, user profiles) fetched via GraphQL API
+
+### Listing Data Structure
+The main listing data follows the same pattern as search results:
+
+```json
+{
+  "id": "1705585530",
+  "title": "We Pay top cash for iPhone 17 pro max, iPhone 17 pro, iPhone Air",
+  "description": "Buying All Brand new Apple iPhones sealed/Unsealed...",
+  "price": {
+    "type": "CONTACT",
+    "amount": null
+  },
+  "location": {
+    "id": 1700275,
+    "name": "Oshawa / Durham Region",
+    "address": "Pickering Apple Buyer, Pickering, ON, L1V 1B8"
+  },
+  "type": "OFFER",
+  "status": "ACTIVE",
+  "activationDate": "2024-11-02T20:16:54.000Z",
+  "endDate": "3000-01-01T00:00:00.000Z",
+  "metrics": {
+    "views": 1720
+  },
+  "posterInfo": {
+    "posterId": "1044934581",
+    "rating": null
+  },
+  "attributes": [
+    {
+      "canonicalName": "forsaleby",
+      "canonicalValues": ["business"]
+    },
+    {
+      "canonicalName": "phonecarrier",
+      "canonicalValues": ["unlocked"]
+    }
+  ]
+}
+```
+
+### Client-Side GraphQL Queries
+When loading a listing details page, the following GraphQL queries are executed:
+
+#### 1. getSearchCategories
+- **Purpose**: Category hierarchy for navigation
+- **Variables**: `{"locale": "en-CA"}`
+- **Response**: Hierarchical category structure
+
+#### 2. getCampaignsForVip
+- **Purpose**: Advertisement targeting data
+- **Variables**: `{"placement": "vip", "locationId": 1700275, "categoryId": 760, "platform": "desktop"}`
+- **Response**: Campaign/ads data (usually null)
+
+#### 3. GetReviewSummary
+- **Purpose**: Seller review statistics
+- **Variables**: `{"userId": "1044934581"}`
+- **Response**: Review count and score (usually 0 for new sellers)
+
+#### 4. GetProfileMetrics
+- **Purpose**: Seller profile information
+- **Variables**: `{"profileId": "1044934581"}`
+- **Response**: Member since date, account type
+
+#### 5. GetListingsSimilar
+- **Purpose**: Similar listings for cross-selling
+- **Variables**: `{"listingId": "1705585530", "limit": 10, "isExternalId": false}`
+- **Response**: Array of similar listings with basic metadata
+
+#### 6. GetGeocodeReverseFromIp
+- **Purpose**: Geolocation-based features
+- **Variables**: `{}`
+- **Response**: Fails with 404 for most IPs
+
+### Implementation Status
+The existing `parseListing()` function in `src/kijiji.ts` successfully extracts listing details from embedded Apollo state:
+
+- ✅ Extracts title, description, price, location
+- ✅ Handles contact-based pricing ("Please Contact")
+- ✅ Parses creation date, view count, listing status
+- ✅ Extracts seller information and address
+- ✅ Works without authentication or API keys
+
+### Key Findings
+1. **No Dedicated Listing API**: Unlike search results, there's no separate GraphQL query for individual listing data
+2. **Complete Data Available**: All listing information is embedded in the initial HTML response
+3. **Additional Context Fetched**: Secondary GraphQL queries provide complementary data (reviews, similar listings)
+4. **Consistent Architecture**: Same Apollo state embedding pattern as search pages
+
+### Current Scraper Implementation
+The scraper successfully extracts listing details by:
+1. Fetching the listing URL HTML
+2. Parsing embedded `__NEXT_DATA__` Apollo state
+3. Extracting the `Listing:{id}` object from Apollo cache
+4. Mapping fields to typed `ListingDetails` interface
+
+This approach works reliably without requiring authentication or dealing with rate limiting on individual listing fetches.
+
+## Next Steps
+- Explore posting/authentication APIs (requires user login)
+- Investigate if GraphQL API can be used for programmatic access with proper authentication
+- Test rate limiting patterns and optimal scraping strategies
+- Document additional category and location ID mappings
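Steps 2 and 3 of the extraction flow listed under "Current Scraper Implementation" (parse the embedded `__NEXT_DATA__` Apollo state, then look up the `Listing:{id}` entry) can be sketched as follows. This is not the code in `src/kijiji.ts`: `parseApolloState` and `findListing` are hypothetical names, and the sketch assumes the payload sits in the `<script id="__NEXT_DATA__">` tag that Next.js conventionally emits.

```typescript
// Illustrative only: a normalized Apollo cache is treated as an opaque key/value map.
type ApolloState = Record<string, unknown>;

// Pulls __NEXT_DATA__.props.pageProps.__APOLLO_STATE__ out of a page's HTML.
// Returns null when the script tag, the JSON, or the expected path is missing.
function parseApolloState(html: string): ApolloState | null {
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  if (!match) return null;
  try {
    const nextData = JSON.parse(match[1]);
    const state = nextData?.props?.pageProps?.__APOLLO_STATE__;
    return state && typeof state === "object" ? (state as ApolloState) : null;
  } catch {
    // Malformed JSON is treated the same as missing data.
    return null;
  }
}

// Looks up a single cache entry using the `Listing:{id}` key pattern.
function findListing(state: ApolloState, id: string): unknown {
  return state[`Listing:${id}`] ?? null;
}
```

Returning `null` instead of throwing keeps a caller's retry and fallback logic simple, though a production scraper may prefer distinct errors for "tag missing" versus "JSON invalid".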