chore: format markdown

Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
2026-05-01 11:42:54 -04:00
parent d2c3c07e7d
commit 7ab33d0b02
15 changed files with 925 additions and 417 deletions

145
KIJIJI.md

@@ -1,9 +1,13 @@
# Kijiji API Findings
## Overview
Kijiji is a Canadian classifieds marketplace that uses a modern web application built
with Next.js and Apollo GraphQL. The search results are powered by a GraphQL API with
client-side state management.
## Initial Page Load (Homepage)
- **URL**: https://www.kijiji.ca/
- **Architecture**: Server-side rendered React application with Next.js
- **Data Sources**:
@@ -12,18 +16,27 @@ Kijiji is a Canadian classifieds marketplace that uses a modern web application
- No initial API calls for listings - data appears to be embedded in HTML
## Search Results Page
- **URL Pattern**: `https://www.kijiji.ca/b-[location]/[keywords]/k0l0`
- **Example**: `https://www.kijiji.ca/b-canada/iphone/k0l0`
- **Technology Stack**: Next.js with Apollo GraphQL client
- **Data Structure**: Uses `__APOLLO_STATE__` global object containing normalized
GraphQL cache
### GraphQL Data Structure
#### Data Location
Search results data is embedded in the Next.js page props under
`__NEXT_DATA__.props.pageProps.__APOLLO_STATE__`. The data is pre-rendered on the server
and sent to the client.
Each page (including pagination) has its own pre-rendered data.
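Reading the embedded state can be sketched as follows. This is a minimal illustration, not the code in `src/kijiji.ts`: the function name `readApolloStateFromHtml` and the `ApolloState` alias are hypothetical, and it assumes the page uses the usual Next.js `<script id="__NEXT_DATA__">` JSON tag.

```typescript
// Hypothetical alias: the normalized Apollo cache is treated as an opaque key/value map.
type ApolloState = Record<string, unknown>;

// Hypothetical helper: pulls __APOLLO_STATE__ out of a server-rendered HTML page.
function readApolloStateFromHtml(html: string): ApolloState | null {
  // Assumption: page props are serialized into a <script id="__NEXT_DATA__"> tag.
  const match = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  if (!match) return null;

  let nextData: any;
  try {
    nextData = JSON.parse(match[1] ?? "");
  } catch {
    return null; // The tag was present but did not contain valid JSON.
  }

  // Path taken from the notes: __NEXT_DATA__.props.pageProps.__APOLLO_STATE__
  const state = nextData?.props?.pageProps?.__APOLLO_STATE__;
  return state !== null && typeof state === "object" ? (state as ApolloState) : null;
}
```

Returning `null` instead of throwing lets a caller tell "page layout changed" apart from a network failure.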
#### Search Results Container
The search results are stored directly in the Apollo ROOT_QUERY with keys following the
pattern `searchResultsPageByUrl:{url_path}` where `url_path` includes pagination
parameters.
```json
{
@@ -33,17 +46,20 @@ The search results are stored directly in the Apollo ROOT_QUERY with keys follow
```
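Locating the search container by its key prefix might look like the sketch below. The helper name `findSearchResultEntries` is hypothetical, and the shape of each value is deliberately left as `unknown` because the full structure is not reproduced here.

```typescript
// Prefix taken from the key pattern `searchResultsPageByUrl:{url_path}`.
const SEARCH_KEY_PREFIX = "searchResultsPageByUrl:";

// Hypothetical helper: lists every search-results entry found under ROOT_QUERY.
function findSearchResultEntries(
  apolloState: Record<string, unknown>,
): Array<{ urlPath: string; value: unknown }> {
  const rootQuery = apolloState["ROOT_QUERY"];
  if (typeof rootQuery !== "object" || rootQuery === null) return [];

  const root = rootQuery as Record<string, unknown>;
  return Object.keys(root)
    .filter((key) => key.startsWith(SEARCH_KEY_PREFIX))
    .map((key) => ({
      // Everything after the prefix is the URL path, pagination parameters included.
      urlPath: key.slice(SEARCH_KEY_PREFIX.length),
      value: root[key],
    }));
}
```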
#### Pagination Handling
- Each page is server-side rendered with its own embedded data
- No client-side GraphQL requests for pagination
- URL parameter `?page=N` controls which page data is embedded
- Offset in searchString corresponds to `(page-1) * limit`
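The page-to-offset relationship above is simple arithmetic. A small sketch, with hypothetical helper names, assuming 1-based page numbers as noted:

```typescript
// Converts a 1-based page number into the offset used in the embedded search data.
function pageToOffset(page: number, limit: number): number {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be an integer >= 1");
  }
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be an integer >= 1");
  }
  return (page - 1) * limit;
}

// Inverse mapping: which 1-based page contains the item at a given offset.
function offsetToPage(offset: number, limit: number): number {
  if (!Number.isInteger(offset) || offset < 0) {
    throw new RangeError("offset must be an integer >= 0");
  }
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be an integer >= 1");
  }
  return Math.floor(offset / limit) + 1;
}
```

With the 40 results per page reported in these notes, `pageToOffset(3, 40)` gives an offset of 80.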
#### Search Parameters in URL
- `k0c{CATEGORY}l{LOCATION}` - Category and location IDs
- `?page=N` - Page number (1-based)
- Data contains `offset` and `limit` for API-style pagination
#### Individual Listing Structure
```json
{
"id": "1732061412",
@@ -90,6 +106,7 @@ The search results are stored directly in the Apollo ROOT_QUERY with keys follow
```
### URL Parameters
- `sort=MATCH` - Sort by relevance
- `order=DESC` - Descending order
- `type=OFFER` - Show offerings (not wanted ads)
@@ -102,6 +119,7 @@ The search results are stored directly in the Apollo ROOT_QUERY with keys follow
- `eaTopAdPosition=1` - Purpose unknown
### Image API
- **Endpoint**: `https://media.kijiji.ca/api/v1/`
- **Pattern**: `/ca-prod-fsbo-ads/images/{uuid}?rule=kijijica-{size}-jpg`
- **Sizes**: 200, 300, 400, 500 pixels
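Building an image URL from the endpoint, pattern, and sizes listed above could be sketched like this. The names `buildImageUrl` and `ImageSize` are hypothetical; only the URL pieces come from the observations.

```typescript
// Endpoint and sizes as observed; the trailing slash is dropped to avoid "//" when joining.
const IMAGE_API_BASE = "https://media.kijiji.ca/api/v1";
const IMAGE_SIZES = [200, 300, 400, 500] as const;
type ImageSize = (typeof IMAGE_SIZES)[number];

// Hypothetical helper: fills the {uuid} and {size} placeholders of the observed pattern.
function buildImageUrl(uuid: string, size: ImageSize): string {
  const path = `/ca-prod-fsbo-ads/images/${encodeURIComponent(uuid)}`;
  return `${IMAGE_API_BASE}${path}?rule=kijijica-${size}-jpg`;
}
```

Restricting `size` to the four observed values turns an unsupported size into a compile-time error rather than a broken image link.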
@@ -109,10 +127,12 @@ The search results are stored directly in the Apollo ROOT_QUERY with keys follow
### Categories and Locations
#### Category Structure
Categories are hierarchical with parent-child relationships.
The main categories under “Buy & Sell” include:
| ID | Name | Total Results (iPhone search) |
| --- | --- | --- |
| 10 | Buy & Sell | 19956 |
| 12 | Arts & Collectibles | 149 |
| 767 | Audio | 481 |
@@ -145,10 +165,11 @@ Categories are hierarchical with parent-child relationships. The main categories
| 26 | Other | 286 |
#### Location Structure
Locations are also hierarchical, with provinces/states under the main “Canada” location:
| ID | Name | Total Results (iPhone search) |
| --- | --- | --- |
| 0 | Canada | - |
| 9001 | Québec | 2516 |
| 9002 | Nova Scotia | 875 |
@@ -163,16 +184,20 @@ Locations are also hierarchical, with provinces/states under the main "Canada" l
| 9011 | Prince Edward Island | 31 |
#### URL Patterns
- Categories: `/b-{category-slug}/canada/{keywords}/k0c{CATEGORY_ID}l0`
- Locations: `/b-buy-sell/{location-slug}/iphone/k0c10l{LOCATION_ID}`
- Combined:
`/b-{category-slug}/{location-slug}/{keywords}/k0c{CATEGORY_ID}l{LOCATION_ID}`
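The combined pattern can be turned into a small path builder. This is a sketch only: `SearchTarget`, `slugifyKeywords`, and `buildSearchPath` are hypothetical names, and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption, since the exact slugification Kijiji applies was not verified.

```typescript
// Hypothetical input shape; slugs and IDs must come from the category/location tables.
interface SearchTarget {
  categorySlug: string;
  categoryId: number;
  locationSlug: string;
  locationId: number;
  keywords: string;
}

// Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
function slugifyKeywords(keywords: string): string {
  return keywords
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Fills the combined pattern:
// /b-{category-slug}/{location-slug}/{keywords}/k0c{CATEGORY_ID}l{LOCATION_ID}
function buildSearchPath(target: SearchTarget): string {
  const keywordSlug = slugifyKeywords(target.keywords);
  return `/b-${target.categorySlug}/${target.locationSlug}/${keywordSlug}` +
    `/k0c${target.categoryId}l${target.locationId}`;
}
```

For example, category slug `buy-sell` with ID 10, location slug `canada` with ID 0, and keywords `iPhone` yields `/b-buy-sell/canada/iphone/k0c10l0`.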
### Pagination
- Uses offset-based pagination
- 40 results per page
- Total count provided in pagination metadata
## Authentication & User Management
- **Authentication System**: OAuth2-based using CIS (Customer Identity Service)
- **Identity Provider**: `id.kijiji.ca`
- **OAuth2 Flow**:
@@ -184,24 +209,30 @@ Locations are also hierarchical, with provinces/states under the main "Canada" l
- **User Features**: Saved searches, messaging, flagging require authentication
## Posting API
- **Posting Flow**: Requires authentication, redirects to login if not authenticated
- **Posting URL**: `https://www.kijiji.ca/p-post-ad.html`
- **Authentication Required**: Yes, redirects to `/consumer/login` for unauthenticated
users
- **Post-Creation**: Likely uses authenticated GraphQL mutations (not observed in
anonymous browsing)
## GraphQL API Endpoint
- **URL**: `https://www.kijiji.ca/anvil/api`
- **Method**: POST
- **Content-Type**: application/json
- **Headers**:
- `apollo-require-preflight: true`
- Standard CORS headers
- **Authentication**: No authentication required for basic queries (uses cookies for
session tracking)
- **Technology**: Apollo GraphQL server
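A request matching the endpoint details above could be assembled as in this sketch. `GraphQLRequest` and `buildGraphQLRequest` are hypothetical names; the body uses the conventional GraphQL-over-HTTP JSON shape (`operationName`, `query`, `variables`), which is assumed rather than confirmed against captured traffic.

```typescript
// Plain description of an HTTP request, kept independent of any HTTP client.
interface GraphQLRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: packages a query for the observed endpoint.
function buildGraphQLRequest(
  operationName: string,
  query: string,
  variables: Record<string, unknown> = {},
): GraphQLRequest {
  return {
    url: "https://www.kijiji.ca/anvil/api",
    method: "POST",
    headers: {
      "content-type": "application/json",
      // Header observed on client requests.
      "apollo-require-preflight": "true",
    },
    // Assumed body shape: the conventional GraphQL JSON envelope.
    body: JSON.stringify({ operationName, query, variables }),
  };
}
```

The returned object can be handed to any HTTP client, for instance `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. Session cookies, if needed, would be added by the caller.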
### Sample GraphQL Queries Discovered
#### Get Search Categories
```graphql
query getSearchCategories($locale: String!) {
searchCategories {
@@ -218,6 +249,7 @@ Variables: `{"locale": "en-CA"}`
Response includes hierarchical category structure with IDs and localized names.
#### Get Geocode from IP (fails for current IP)
```graphql
query GetGeocodeReverseFromIp {
geocodeReverseFromIp {
@@ -229,9 +261,11 @@ query GetGeocodeReverseFromIp {
}
```
This query fails for the current IP address, suggesting geolocation-based features may
not work or require different IP ranges.
#### Get Category Path
```graphql
query GetCategoryPath($categoryId: Int!, $locale: String, $locationId: Int) {
category(id: $categoryId) {
@@ -256,25 +290,33 @@ Variables: `{"categoryId": 10, "locationId": 0, "locale": "en-CA"}`
## Latest Findings (2026-01-21)
### Client-Side GraphQL Queries Observed
- **getSearchCategories**: Retrieves category hierarchy for search filters
- **GetGeocodeReverseFromIp**: Attempts to geolocate user (fails for current IP)
### GraphQL Schema Insights
Testing direct GraphQL queries revealed:
- Field “searchResults” does not exist on Query type
- Suggested alternatives: “searchResultsPage” or “searchUrl”
- This suggests the search functionality may use different GraphQL operations than
direct queries
The embedded Apollo state approach appears to be the primary method for accessing search
data, with GraphQL used for auxiliary operations like categories and geolocation.
### Server-Side Rendering Architecture
Search results are fully server-side rendered with data embedded in HTML. Each page
(including pagination) contains its own pre-rendered data.
No client-side GraphQL requests are made for:
- Initial search results
- Pagination navigation
- Search result data
### Network Analysis Findings
- GraphQL endpoint: `https://www.kijiji.ca/anvil/api`
- Method: POST
- Content-Type: application/json
@@ -282,7 +324,10 @@ Search results are fully server-side rendered with data embedded in HTML. Each p
- Cookies required for session tracking
### Embedded Data Structure
Search results data is embedded in the HTML within Next.js
`__NEXT_DATA__.props.pageProps.__APOLLO_STATE__` object.
The data includes:
- Individual ad listings with complete metadata
- Pagination information
@@ -290,20 +335,24 @@ Search results data is embedded in the HTML within Next.js `__NEXT_DATA__.props.
- Category/location hierarchies
### Current Scraper Implementation
The existing `src/kijiji.ts` implementation correctly parses the embedded Apollo state:
- Uses `extractApolloState()` to parse `__NEXT_DATA__` from HTML
- Filters Apollo keys containing "Listing" to find ad data
- Extracts `url`, `title`, and other metadata from each listing
- Successfully scrapes listings without needing API authentication
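The filtering step in the list above can be illustrated with a short sketch. It is not the code from `src/kijiji.ts`; `ListingSummary` and `collectListings` are hypothetical names, and only the `id`, `url`, and `title` fields mentioned in these notes are read.

```typescript
// Hypothetical summary type limited to fields named in the notes.
interface ListingSummary {
  id: string | null;
  title: string;
  url: string;
}

// Walks the normalized cache and keeps entries whose key contains "Listing".
function collectListings(apolloState: Record<string, unknown>): ListingSummary[] {
  const listings: ListingSummary[] = [];
  for (const key of Object.keys(apolloState)) {
    if (!key.includes("Listing")) continue;

    const entry = apolloState[key];
    if (typeof entry !== "object" || entry === null) continue;
    const record = entry as Record<string, unknown>;

    // Skip cache entries that match the key filter but are not ads
    // (for example, nested objects without a url or title).
    const url = record["url"];
    const title = record["title"];
    if (typeof url !== "string" || typeof title !== "string") continue;

    const id = record["id"];
    listings.push({ id: typeof id === "string" ? id : null, title, url });
  }
  return listings;
}
```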
### Authentication Status
- **Search functionality**: No authentication required - all search and listing data
accessible anonymously
- **Posting functionality**: Requires authentication (redirects to login)
- **User features**: Saved searches, messaging require authentication
- **Rate limiting**: May apply but not observed in anonymous browsing
### Pagination Implementation
- Each page is a separate server-rendered route
- URL pattern: `/b-{location}/{keywords}/page-{number}/k0{category}l{location_id}`
- No client-side pagination API calls
@@ -313,20 +362,24 @@ The existing `src/kijiji.ts` implementation correctly parses the embedded Apollo
## URL Pattern Analysis
### Search URL Structure
`https://www.kijiji.ca/b-{category_slug}/{location_slug}/{keywords}/k0c{category_id}l{location_id}`
#### Examples Observed:
- All categories, Canada: `/b-canada/iphone/k0l0` (c0 = All Categories, l0 = Canada)
- Cell phones category: `/b-cell-phones/canada/iphone/k0c132l0` (c132 = Cell Phones)
- With pagination: `/b-canada/iphone/page-2/k0l0`
#### URL Components:
- `c{CATEGORY_ID}`: Category ID (0 = All Categories, 132 = Cell Phones, etc.)
- `l{LOCATION_ID}`: Location ID (0 = Canada, 1700272 = GTA, etc.)
- `page-{N}`: Pagination (1-based, optional)
- Keywords are slugified in URL path
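Going the other way, the components above can be recovered from a path. The sketch below uses hypothetical names (`ParsedSearchPath`, `parseSearchPath`) and treats the leading `k0` as a fixed literal, since its meaning is not documented here. A missing `c{CATEGORY_ID}` is read as category 0, matching the `/b-canada/iphone/k0l0` example.

```typescript
// Hypothetical result type for the components listed above.
interface ParsedSearchPath {
  categoryId: number;
  locationId: number;
  page: number;
}

// Parses paths such as /b-cell-phones/canada/iphone/k0c132l0
// or /b-canada/iphone/page-2/k0l0.
function parseSearchPath(path: string): ParsedSearchPath | null {
  const withoutQuery = path.split("?")[0] ?? "";
  const segments = withoutQuery.split("/").filter((segment) => segment.length > 0);

  // The ID code is the final segment: k0, optional c{CATEGORY_ID}, then l{LOCATION_ID}.
  const last = segments[segments.length - 1] ?? "";
  const code = /^k0(?:c(\d+))?l(\d+)$/.exec(last);
  if (!code) return null;

  // The optional page-{N} segment sits immediately before the ID code.
  const beforeLast = segments[segments.length - 2] ?? "";
  const pageMatch = /^page-(\d+)$/.exec(beforeLast);

  return {
    categoryId: code[1] !== undefined ? Number(code[1]) : 0,
    locationId: Number(code[2]),
    page: pageMatch ? Number(pageMatch[1]) : 1,
  };
}
```

A keyword that itself slugifies to `page-N` would be misread as pagination, so a caller that controls the keywords may prefer to track the page number separately.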
### Current Implementation Status
The existing scraper in `src/kijiji.ts` successfully implements the approach:
- Parses embedded Apollo state from HTML responses
- Handles rate limiting and retries
@@ -336,14 +389,22 @@ The existing scraper in `src/kijiji.ts` successfully implements the approach:
## Listing Details Page
### Overview
Similar to search results, listing details pages use server-side rendering with embedded
Apollo GraphQL state in the HTML. No dedicated API endpoint serves individual listing
data - all information is pre-rendered on the server.
### Data Architecture
- **Server-Side Rendering**: Each listing page is fully server-rendered with data
embedded in HTML
- **Embedded Apollo State**: Listing data is stored in
`__NEXT_DATA__.props.pageProps.__APOLLO_STATE__`
- **Client-Side GraphQL**: Additional data (categories, campaigns, similar listings,
user profiles) fetched via GraphQL API
### Listing Data Structure
The main listing data follows the same pattern as search results:
```json
@@ -385,40 +446,50 @@ The main listing data follows the same pattern as search results:
```
### Client-Side GraphQL Queries
When loading a listing details page, the following GraphQL queries are executed:
#### 1. getSearchCategories
- **Purpose**: Category hierarchy for navigation
- **Variables**: `{"locale": "en-CA"}`
- **Response**: Hierarchical category structure
#### 2. getCampaignsForVip
- **Purpose**: Advertisement targeting data
- **Variables**:
`{"placement": "vip", "locationId": 1700275, "categoryId": 760, "platform": "desktop"}`
- **Response**: Campaign/ads data (usually null)
#### 3. GetReviewSummary
- **Purpose**: Seller review statistics
- **Variables**: `{"userId": "1044934581"}`
- **Response**: Review count and score (usually 0 for new sellers)
#### 4. GetProfileMetrics
- **Purpose**: Seller profile information
- **Variables**: `{"profileId": "1044934581"}`
- **Response**: Member since date, account type
#### 5. GetListingsSimilar
- **Purpose**: Similar listings for cross-selling
- **Variables**: `{"listingId": "1705585530", "limit": 10, "isExternalId": false}`
- **Response**: Array of similar listings with basic metadata
#### 6. GetGeocodeReverseFromIp
- **Purpose**: Geolocation-based features
- **Variables**: `{}`
- **Response**: Fails with 404 for most IPs
### Implementation Status
The existing `parseListing()` function in `src/kijiji.ts` successfully extracts listing
details from embedded Apollo state:
- ✅ Extracts title, description, price, location
- ✅ Handles contact-based pricing ("Please Contact")
@@ -427,22 +498,30 @@ The existing `parseListing()` function in `src/kijiji.ts` successfully extracts
- ✅ Works without authentication or API keys
### Key Findings
1. **No Dedicated Listing API**: Unlike search results, there's no separate GraphQL
query for individual listing data
2. **Complete Data Available**: All listing information is embedded in the initial HTML
response
3. **Additional Context Fetched**: Secondary GraphQL queries provide complementary data
(reviews, similar listings)
4. **Consistent Architecture**: Same Apollo state embedding pattern as search pages
### Current Scraper Implementation
The scraper successfully extracts listing details by:
1. Fetching the listing URL HTML
2. Parsing embedded `__NEXT_DATA__` Apollo state
3. Extracting the `Listing:{id}` object from Apollo cache
4. Mapping fields to typed `ListingDetails` interface
This approach works reliably without requiring authentication or dealing with rate
limiting on individual listing fetches.
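Steps 3 and 4 of that flow might be sketched as below. This is illustrative only and is not the project's `parseListing()`: `ListingDetailsSketch` and `pickListing` are hypothetical, and the `description` field name is an assumption, since the full listing object is not reproduced in these notes.

```typescript
// Hypothetical, reduced stand-in for the project's ListingDetails interface.
interface ListingDetailsSketch {
  id: string;
  title: string | null;
  description: string | null; // Field name assumed, not confirmed.
  url: string | null;
}

// Looks up the `Listing:{id}` cache entry and maps a few string fields.
function pickListing(
  apolloState: Record<string, unknown>,
  listingId: string,
): ListingDetailsSketch | null {
  const entry = apolloState[`Listing:${listingId}`];
  if (typeof entry !== "object" || entry === null) return null;
  const record = entry as Record<string, unknown>;

  // Reads a field only if it is a string; anything else becomes null.
  const text = (field: string): string | null => {
    const value = record[field];
    return typeof value === "string" ? value : null;
  };

  return {
    id: listingId,
    title: text("title"),
    description: text("description"),
    url: text("url"),
  };
}
```

Mapping missing fields to `null` keeps the result type stable when a listing omits optional data.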
## Next Steps
- Explore posting/authentication APIs (requires user login)
- Investigate if GraphQL API can be used for programmatic access with proper
authentication
- Test rate limiting patterns and optimal scraping strategies
- Document additional category and location ID mappings