chore: ai agent config

Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
2026-04-21 20:19:05 -04:00
parent ffc4a2c5c5
commit 7cf21546e2
65 changed files with 10076 additions and 133 deletions


@@ -0,0 +1,397 @@
# Data Transforms Reference
Patterns for cleaning, normalizing, deduplicating, and enriching
extracted web data. Apply these transforms in Phase 5 (Transform)
between extraction and validation.
---
## Automatic Transforms
Always apply these to every extraction result.
### Whitespace Cleanup
```python
import re
# Replace non-breaking spaces with regular spaces
value = value.replace('\u00a0', ' ')
# Remove zero-width characters
value = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', value)
# Strip leading/trailing whitespace and collapse internal whitespace
value = ' '.join(value.split())
```
Patterns to handle:
- `\n`, `\r`, `\t` inside cell values -> single space
- Multiple consecutive spaces -> single space
- Non-breaking spaces (`&nbsp;`, `\u00a0`) -> regular space
- Zero-width characters -> remove
### HTML Entity Decode
| Entity | Character | Entity | Character |
|:------------|:----------|:-----------|:----------|
| `&amp;` | `&` | `&quot;` | `"` |
| `&lt;` | `<` | `&apos;` | `'` |
| `&gt;` | `>` | `&#39;` | `'` |
| `&nbsp;` | ` ` | `&#8217;` | (curly ') |
| `&mdash;` | (em dash) | `&#8212;` | (em dash) |
```python
import html
value = html.unescape(value)
```
### Unicode Normalization
```python
import unicodedata
value = unicodedata.normalize('NFKC', value)
```
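This handles the cases listed below. Fancy quotes are not among them: NFKC leaves curly quotes unchanged, so replace them in a separate step if standard quotes are needed.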
- Ligatures -> separate characters (e.g. `ﬁ` -> `fi`)
- Full-width characters -> standard (e.g. `Ａ` -> `A`)
- Superscript/subscript numbers -> regular numbers
### Empty Value Standardization
| Input | Markdown Output | JSON Output |
|:------------------------|:----------------|:------------|
| `""` (empty string) | `N/A` | `null` |
| `"-"` or `"--"` | `N/A` | `null` |
| `"N/A"`, `"n/a"`, `"NA"`| `N/A` | `null` |
| `"None"`, `"null"` | `N/A` | `null` |
| `"TBD"`, `"TBA"` | `TBD` | `"TBD"` |
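
The table above can be read as a small lookup. The sketch below is one possible reading, not a fixed implementation: the names `standardize_empty`, `EMPTY_MARKERS`, and `PENDING_MARKERS` are hypothetical, and matching case-insensitively is an assumption. Following the table literally, both `TBD` and `TBA` come out as `TBD`.

```python
# Placeholder strings treated as "no value" (compared case-insensitively; an assumption).
EMPTY_MARKERS = {"", "-", "--", "n/a", "na", "none", "null"}
# Placeholders that carry meaning; the table maps both to "TBD".
PENDING_MARKERS = {"tbd", "tba"}


def standardize_empty(value, output="json"):
    """Map placeholder strings to None (JSON) or 'N/A' (Markdown)."""
    if value is None:
        return None if output == "json" else "N/A"
    text = str(value).strip().lower()
    if text in PENDING_MARKERS:
        return "TBD"
    if text in EMPTY_MARKERS:
        return None if output == "json" else "N/A"
    return value  # real values pass through unchanged
```
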
---
## Price Normalization
Apply when extracting product, pricing, or financial data.
### Extraction Pattern
```python
import re
def normalize_price(raw):
    if not raw:
        return None
    # Remove currency words
    cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw)
    # Extract numeric value (handles 1,234.56 and 1.234,56 formats);
    # the match must contain a digit, so stray punctuation is skipped
    match = re.search(r'\.?\d[\d.,]*', cleaned)
    if not match:
        return None
    num_str = match.group().rstrip('.,')  # drop trailing sentence punctuation
    # Detect format: if last separator is comma with 2 digits after, it's decimal
    if re.search(r',\d{2}$', num_str):
        num_str = num_str.replace('.', '').replace(',', '.')
    else:
        num_str = num_str.replace(',', '')
    try:
        return float(num_str)
    except ValueError:
        # e.g. "1.2.3" is not a valid number
        return None
```
### Currency Detection
| Symbol/Code | Currency | Symbol/Code | Currency |
|:------------|:---------|:------------|:---------|
| `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real |
| `€`, `EUR` | Euro | `£`, `GBP` | British Pound |
| `¥`, `JPY` | Yen | `₹`, `INR` | Indian Rupee |
| `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar |
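
The table above could be applied with a lookup like the following sketch. The function name `detect_currency` and the two constants are hypothetical. The one design point that matters is ordering: multi-character symbols such as `R$` and `C$` must be checked before the bare `$`, otherwise every dollar-style price resolves to USD.

```python
import re

# Multi-character symbols come first so "R$" is not mistaken for "$".
CURRENCY_SYMBOLS = [
    ("US$", "USD"), ("R$", "BRL"), ("C$", "CAD"), ("A$", "AUD"),
    ("$", "USD"), ("€", "EUR"), ("£", "GBP"), ("¥", "JPY"), ("₹", "INR"),
]
CURRENCY_CODES = ("USD", "EUR", "GBP", "BRL", "JPY", "INR", "CAD", "AUD")


def detect_currency(raw):
    """Return an ISO currency code for a raw price string, or None."""
    if not raw:
        return None
    upper = raw.upper()
    # ISO codes that are not embedded in a longer word
    code = re.search(r'(?<![A-Z])(' + '|'.join(CURRENCY_CODES) + r')(?![A-Z])', upper)
    if code:
        return code.group(1)
    for symbol, iso in CURRENCY_SYMBOLS:
        if symbol in upper:
            return iso
    return None
```

A bare `$` is treated as USD here, matching the first row of the table; on a non-US site that default may need a locale hint.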
### Output Format
```json
{
  "price": 29.99,
  "currency": "USD",
  "rawPrice": "$29.99"
}
```
For Markdown, show formatted: `$29.99` (right-aligned in table).
---
## Date Normalization
Normalize all dates to ISO-8601 format.
### Common Formats to Handle
| Input Format | Example | Normalized |
|:------------------------|:---------------------|:-------------------|
| Full text | February 25, 2026 | 2026-02-25 |
| Short text | Feb 25, 2026 | 2026-02-25 |
| US numeric | 02/25/2026 | 2026-02-25 |
| EU numeric | 25/02/2026 | 2026-02-25 |
| ISO already | 2026-02-25 | 2026-02-25 |
| Relative | 3 days ago | (compute from now) |
| Relative | Yesterday | (compute from now) |
| Timestamp | 1740441600 | 2025-02-25 |
| With time | 2026-02-25T14:30:00Z | 2026-02-25 14:30 |
### Ambiguous Dates
When format is ambiguous (e.g. `03/04/2026`):
- Default to US format (MM/DD/YYYY) unless site is clearly non-US
- Check page `lang` attribute or URL TLD for locale hints
- Note ambiguity in delivery notes
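
A minimal sketch of the rules above for slash-separated numeric dates. The function name `parse_numeric_date` and the `day_first` flag are hypothetical; `day_first` stands for whatever locale hint was derived from the page `lang` attribute or the URL TLD. The second return value tells the caller whether the ambiguity should be mentioned in the delivery notes.

```python
from datetime import date


def parse_numeric_date(text, day_first=False):
    """Parse DD/MM/YYYY or MM/DD/YYYY; return (iso_date, ambiguous)."""
    first, second, year = (int(part) for part in text.strip().split('/'))
    if first > 12:
        # Only valid as a day, so the format must be DD/MM/YYYY
        day, month, ambiguous = first, second, False
    elif second > 12:
        # Only valid as a day, so the format must be MM/DD/YYYY
        day, month, ambiguous = second, first, False
    else:
        # Both readings are possible: fall back to the locale hint (US by default)
        ambiguous = first != second
        day, month = (first, second) if day_first else (second, first)
    return date(year, month, day).isoformat(), ambiguous
```

With the default, `03/04/2026` is read as March 4 and flagged as ambiguous; with `day_first=True` it is read as April 3.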
### Relative Date Resolution
```python
from datetime import datetime, timedelta
import re
def resolve_relative_date(text):
    text = text.lower().strip()
    today = datetime.now()
    if 'today' in text: return today.strftime('%Y-%m-%d')
    if 'yesterday' in text: return (today - timedelta(days=1)).strftime('%Y-%m-%d')
    match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', text)
    if match:
        n, unit = int(match.group(1)), match.group(2)
        # Months and years are approximated as 30 and 365 days
        deltas = {
            'hour': timedelta(hours=n),
            'day': timedelta(days=n),
            'week': timedelta(weeks=n),
            'month': timedelta(days=n * 30),
            'year': timedelta(days=n * 365),
        }
        return (today - deltas[unit]).strftime('%Y-%m-%d')
    return text  # Return as-is if it can't be parsed
```
---
## URL Resolution
Convert relative URLs to absolute.
### Patterns
| Input | Base URL | Resolved |
|:-------------------------|:----------------------------|:--------------------------------------|
| `/products/item-1` | `https://example.com/shop` | `https://example.com/products/item-1` |
| `item-1` | `https://example.com/shop/` | `https://example.com/shop/item-1` |
| `//cdn.example.com/img` | `https://example.com` | `https://cdn.example.com/img` |
| `https://other.com/page` | (any) | `https://other.com/page` (absolute) |
### JavaScript Resolution
```javascript
function resolveUrl(relative, base) {
  try { return new URL(relative, base || window.location.href).href; }
  catch { return relative; }
}
```
---
## Phone Normalization
For contact mode extraction.
### Pattern
```python
import re
def normalize_phone(raw):
    if not raw:
        return None
    raw = raw.strip()
    # Remove all non-digit chars, remembering whether there was a leading +
    has_plus = raw.startswith('+')
    digits = re.sub(r'\D', '', raw)
    if len(digits) < 7:
        return None
    # Add + prefix if it was present or the number looks international
    if has_plus or len(digits) >= 11:
        return '+' + digits
    return digits
```
### Format by Context
| Context | Format Example |
|:-----------------|:---------------------|
| JSON output | `"+5511999998888"` |
| Markdown table | `+55 11 99999-8888` |
| CSV output | `"+5511999998888"` |
---
## Deduplication
### Exact Deduplication
```python
def deduplicate(records, key_fields=None):
    """Remove exact duplicate records.
    If key_fields provided, deduplicate by those fields only.
    """
    seen = set()
    unique = []
    for record in records:
        if key_fields:
            key = tuple(record.get(f) for f in key_fields)
        else:
            # Requires hashable values (no nested lists or dicts)
            key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique, len(records) - len(unique)  # returns (unique_list, removed_count)
```
### Near-Duplicate Detection
When records share key fields but differ in details:
1. Group by key fields (e.g. product name + source)
2. For each group, keep the record with fewest null values
3. If tie, keep the first occurrence
4. Report in notes: "Merged N near-duplicate records"
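
The four steps above can be sketched as follows. The function name `merge_near_duplicates` is hypothetical, and "null" is taken to mean a field whose value is `None`; a field that is missing from a record entirely is not counted.

```python
def merge_near_duplicates(records, key_fields):
    """Keep, for each key, the record with the fewest null values."""
    best = {}  # key -> (null_count, record); dicts keep first-insertion order
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        nulls = sum(1 for value in record.values() if value is None)
        # Strict "<" keeps the earlier record on a tie
        if key not in best or nulls < best[key][0]:
            best[key] = (nulls, record)
    kept = [record for _, record in best.values()]
    note = f"Merged {len(records) - len(kept)} near-duplicate records"
    return kept, note
```

The returned note string can go straight into the delivery notes.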
### Dedup Key Selection by Mode
| Mode | Key Fields |
|:---------|:----------------------------------|
| product | name + source (or name + brand) |
| contact | name + email (or name + org) |
| jobs | title + company + location |
| events | title + date + location |
| table | all fields (exact match) |
| list | first 2-3 identifying fields |
---
## Text Cleaning
### Remove Noise
Common noise patterns to strip from extracted text:
| Pattern | Action |
|:-----------------------------------|:--------------------------|
| `\[edit\]`, `\[citation needed\]` | Remove (Wikipedia) |
| `Read more...`, `See more` | Remove (truncation markers)|
| `Sponsored`, `Ad`, `Promoted` | Remove or flag |
| Cookie consent text | Remove |
| Navigation breadcrumbs | Remove |
| Footer boilerplate | Remove |
### Case Normalization
When extracting ALL-CAPS or inconsistent-case text:
```python
def normalize_case(text):
    if text.isupper() and len(text) > 3:
        return text.title()  # ALL CAPS -> Title Case
    return text
```
Only apply when: field is clearly ALL-CAPS input (common in older sites),
user requests it, or data looks better normalized.
---
## Data Type Coercion
### Automatic Type Detection
| Raw Value | Detected Type | Coerced Value |
|:--------------|:--------------|:------------------|
| `"123"` | integer | `123` |
| `"12.99"` | float | `12.99` |
| `"true"` | boolean | `true` |
| `"false"` | boolean | `false` |
| `"2026-02-25"`| date string | `"2026-02-25"` |
| `"$29.99"` | price | `29.99` + currency|
| `"4.5/5"` | rating | `4.5` |
| `"1,234"` | integer | `1234` |
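
A sketch of how the simple rows of the table above (integers, floats, booleans) might be detected. The function name `coerce_value` is hypothetical. Prices, ratings, and dates are left to their own normalizers. Strings with leading zeros, such as postal codes, stay strings; that is a design choice of this sketch, not something the table specifies.

```python
import re


def coerce_value(raw):
    """Best-effort coercion of a string to bool, int, or float."""
    if not isinstance(raw, str):
        return raw
    text = raw.strip()
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    # Integers, optionally with comma thousands separators: "123", "1,234"
    if re.fullmatch(r'-?(?:\d{1,3}(?:,\d{3})+|0|[1-9]\d*)', text):
        return int(text.replace(',', ''))
    # Plain decimals: "12.99"
    if re.fullmatch(r'-?\d+\.\d+', text):
        return float(text)
    return raw  # dates, prices, and ratings are handled elsewhere
```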
### Rating Normalization
```python
import re
def normalize_rating(raw):
    if not raw:
        return None
    number = r'\d+(?:\.\d+)?'  # requires at least one digit
    match = re.search(rf'({number})\s*(?:/\s*({number}))?', str(raw))
    if match:
        score = float(match.group(1))
        max_score = float(match.group(2)) if match.group(2) else 5.0
        if max_score == 0:
            return None  # avoid division by zero on input like "3/0"
        return round(score / max_score * 5, 1)  # Normalize to /5 scale
    return None
```
---
## Enrichment Patterns
### Domain Extraction
Add domain from full URLs:
```python
from urllib.parse import urlparse
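def extract_domain(url):
    try:
        netloc = urlparse(url).netloc
        # Strip only a leading "www." (replace() would also hit later occurrences)
        return netloc[4:] if netloc.startswith('www.') else netloc
    except (TypeError, ValueError, AttributeError):
        return None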
```
### Word Count
For article mode:
```python
def word_count(text):
    return len(text.split()) if text else 0
```
### Relative Time
Add human-readable time since date:
```python
def time_since(date_str):
    from datetime import datetime
    try:
        dt = datetime.fromisoformat(date_str)
        delta = datetime.now() - dt
        if delta.days == 0: return "Today"
        if delta.days == 1: return "Yesterday"
        if delta.days < 7: return f"{delta.days} days ago"
        if delta.days < 30: return f"{delta.days // 7} weeks ago"
        if delta.days < 365: return f"{delta.days // 30} months ago"
        return f"{delta.days // 365} years ago"
    except (TypeError, ValueError):
        # Unparseable input, or a timezone-aware date compared with naive now()
        return None
```
---
## Transform Pipeline Order
Apply transforms in this sequence:
1. **HTML entity decode** - raw text cleanup
2. **Unicode normalization** - character standardization
3. **Whitespace cleanup** - spacing normalization
4. **Empty value standardization** - null/N/A handling
5. **URL resolution** - relative to absolute
6. **Data type coercion** - strings to numbers/dates
7. **Price normalization** - if applicable
8. **Date normalization** - if applicable
9. **Phone normalization** - if applicable
10. **Text cleaning** - noise removal
11. **Deduplication** - remove duplicates
12. **Sorting** - user-requested order
13. **Enrichment** - domain, word count, etc.
Not all steps apply to every extraction. Apply only what's relevant
to the data type and extraction mode.
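
As an illustration of why the order matters, the first three steps can be chained as in the sketch below. The names `clean_text` and `clean_record` are hypothetical. Entity decoding runs first because `&nbsp;` only becomes a real non-breaking space after decoding, and NFKC then turns that into a regular space which the whitespace step can collapse.

```python
import html
import re
import unicodedata


def clean_text(value):
    """Steps 1-3 of the pipeline for a single string value."""
    value = html.unescape(value)                   # 1. HTML entity decode
    value = unicodedata.normalize('NFKC', value)   # 2. Unicode normalization
    value = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', value)  # zero-width chars
    return ' '.join(value.split())                 # 3. Whitespace cleanup


def clean_record(record):
    """Apply clean_text to every string field; leave other types untouched."""
    return {key: clean_text(val) if isinstance(val, str) else val
            for key, val in record.items()}
```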


@@ -0,0 +1,475 @@
# Extraction Patterns Reference
CSS selectors, JavaScript snippets, and domain-specific tips for
common web scraping scenarios.
---
## CSS Selector Patterns
### Tables
```css
/* Standard HTML tables */
table /* All tables */
table.data-table /* Class-based */
table[id*="result"] /* ID contains "result" */
table thead th /* Header cells */
table tbody tr /* Data rows */
table tbody tr td /* Data cells */
table tbody tr td:nth-child(2) /* Specific column (2nd) */
/* Grid layouts acting as tables */
[role="table"] /* ARIA table role */
[role="row"] /* ARIA row */
[role="gridcell"] /* ARIA grid cell */
.table-responsive table /* Bootstrap responsive wrapper */
```
### Product Listings
```css
/* E-commerce product grids */
.product-card, .product-item, .product-tile
[data-product-id] /* Data attribute markers */
.product-name, .product-title, h2.title
.price, .product-price, [data-price]
.price--sale, .price--original /* Sale vs original price */
.rating, .stars, [data-rating]
.availability, .stock-status
.product-image img, .product-thumb img
/* Common e-commerce patterns */
.search-results .result-item
.catalog-grid .catalog-item
.listing .listing-item
```
### Search Results
```css
/* Generic search result patterns */
.search-result, .result-item, .search-entry
.result-title a, .result-link
.result-snippet, .result-description
.result-url, .result-source
.result-date, .result-timestamp
.pagination a, .page-numbers a, [aria-label="Next"]
```
### Contact / Directory
```css
/* People and contact cards */
.team-member, .staff-card, .person, .contact-card
.member-name, .person-name, h3.name
.member-title, .job-title, .role
.member-email a[href^="mailto:"]
.member-phone a[href^="tel:"]
.member-bio, .person-description
.vcard /* hCard microformat */
```
### FAQ / Accordion
```css
/* FAQ and accordion patterns */
.faq-item, .accordion-item, [itemtype*="FAQPage"] [itemprop="mainEntity"]
.faq-question, .accordion-header, [itemprop="name"], summary
.faq-answer, .accordion-body, .accordion-content, [itemprop="acceptedAnswer"]
details, details > summary /* Native HTML accordion */
[role="tabpanel"] /* Tab-based FAQ */
```
### Pricing Tables
```css
/* SaaS pricing page patterns */
.pricing-table, .pricing-card, .plan-card, .pricing-tier
.plan-name, .tier-name, .pricing-title
.plan-price, .pricing-amount, .price-value
.plan-period, .billing-cycle /* monthly/annually */
.plan-features li, .feature-list li
.plan-cta, .pricing-button
[class*="popular"], [class*="recommended"], [class*="featured"] /* highlighted plan */
```
### Job Listings
```css
/* Job board patterns */
.job-listing, .job-card, .job-posting, [itemtype*="JobPosting"]
.job-title, [itemprop="title"]
.company-name, [itemprop="hiringOrganization"]
.job-location, [itemprop="jobLocation"]
.job-salary, [itemprop="baseSalary"]
.job-type, .employment-type
.job-date, [itemprop="datePosted"]
```
### Events
```css
/* Event listing patterns */
.event-card, .event-item, [itemtype*="Event"]
.event-title, [itemprop="name"]
.event-date, [itemprop="startDate"], time[datetime]
.event-location, [itemprop="location"]
.event-description, [itemprop="description"]
.event-speaker, .speaker-name
```
### Navigation / Pagination
```css
/* Pagination controls */
.pagination, .pager, nav[aria-label*="pagination"]
.pagination .next, a[rel="next"]
.pagination .prev, a[rel="prev"]
.page-numbers, .page-link
button[data-page], a[data-page]
.load-more, button.show-more
```
### Articles / Blog Posts
```css
/* Article content */
article, .post, .entry, .article-content
article h1, .post-title, .entry-title
.author, .byline, [rel="author"]
time, .date, .published, .post-date
.post-content, .entry-content, .article-body
.tags a, .categories a, .post-tags a
```
---
## JavaScript Extraction Snippets
### Generic Table Extractor
```javascript
function extractTable(selector) {
  const table = document.querySelector(selector || 'table');
  if (!table) return { error: 'No table found' };
  // Prefer <thead> headers; otherwise fall back to the first row's cells
  let headerCells = Array.from(table.querySelectorAll('thead th'));
  let headerRow = null;
  if (headerCells.length === 0) {
    headerRow = table.querySelector('tr');
    headerCells = headerRow ? Array.from(headerRow.querySelectorAll('th, td')) : [];
  }
  const headers = headerCells.map(el => el.textContent.trim());
  // Data rows: everything except the header row and rows inside <thead>
  const rows = Array.from(table.querySelectorAll('tr'))
    .filter(tr => tr !== headerRow && !tr.closest('thead'))
    .map(tr => Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim()))
    .filter(cells => cells.length > 0);
  return { headers, rows, rowCount: rows.length };
}
JSON.stringify(extractTable());
```
### Multi-Table Extractor
```javascript
function extractAllTables() {
  const tables = document.querySelectorAll('table');
  return Array.from(tables).map((table, idx) => {
    const caption = table.querySelector('caption')?.textContent?.trim()
      || table.getAttribute('aria-label') || `Table ${idx + 1}`;
    const headers = Array.from(
      table.querySelectorAll('thead th, tr:first-child th')
    ).map(el => el.textContent.trim());
    const rows = Array.from(table.querySelectorAll('tbody tr'))
      .map(tr => Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim()))
      .filter(r => r.length > 0);
    return { caption, headers, rows, rowCount: rows.length };
  });
}
JSON.stringify(extractAllTables());
```
### Generic List Extractor
```javascript
function extractList(containerSelector, itemSelector, fieldMap) {
  // fieldMap: { fieldName: { selector: 'CSS', attr: 'href'|'src'|null } }
  const container = document.querySelector(containerSelector);
  if (!container) return { error: 'Container not found' };
  const items = Array.from(container.querySelectorAll(itemSelector));
  const data = items.map(item => {
    const record = {};
    for (const [key, config] of Object.entries(fieldMap)) {
      const sel = typeof config === 'string' ? config : config.selector;
      const attr = typeof config === 'object' ? config.attr : null;
      const el = item.querySelector(sel);
      if (!el) { record[key] = null; continue; }
      record[key] = attr ? el.getAttribute(attr) : el.textContent.trim();
    }
    return record;
  });
  return { data, itemCount: data.length };
}
// Example usage:
JSON.stringify(extractList('.results', '.result-item', {
  title: '.result-title',
  description: '.result-snippet',
  url: { selector: '.result-title a', attr: 'href' },
  date: '.result-date'
}));
```
### JSON-LD Structured Data Extractor
Many pages embed structured data that's easier to parse than DOM:
```javascript
function extractJsonLd(targetType) {
  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
  const allData = Array.from(scripts).map(s => {
    try { return JSON.parse(s.textContent); } catch { return null; }
  }).flat().filter(Boolean); // flat() unwraps blocks whose top level is an array
  // Flatten @graph arrays
  const flat = allData.flatMap(d => d['@graph'] || [d]);
  if (targetType) {
    return flat.filter(d =>
      d['@type'] === targetType ||
      (Array.isArray(d['@type']) && d['@type'].includes(targetType))
    );
  }
  return flat;
}
// Extract products: extractJsonLd('Product')
// Extract articles: extractJsonLd('Article')
// Extract all: extractJsonLd()
JSON.stringify(extractJsonLd());
```
Common JSON-LD types and their useful fields:
- `Product`: name, offers.price, offers.priceCurrency, aggregateRating, brand.name
- `Article`: headline, author.name, datePublished, description, wordCount
- `Organization`: name, address, telephone, email, url
- `BreadcrumbList`: itemListElement[].name (navigation path)
- `FAQPage`: mainEntity[].name (question), mainEntity[].acceptedAnswer.text
- `JobPosting`: title, hiringOrganization.name, jobLocation, baseSalary
- `Event`: name, startDate, endDate, location, performer
### OpenGraph / Meta Tag Extractor
```javascript
function extractMeta() {
  const meta = {};
  document.querySelectorAll('meta[property^="og:"], meta[name^="twitter:"]')
    .forEach(el => {
      const key = el.getAttribute('property') || el.getAttribute('name');
      meta[key] = el.getAttribute('content');
    });
  meta.title = document.title;
  meta.description = document.querySelector('meta[name="description"]')
    ?.getAttribute('content');
  meta.canonical = document.querySelector('link[rel="canonical"]')
    ?.getAttribute('href');
  return meta;
}
JSON.stringify(extractMeta());
```
### Pricing Plan Extractor
```javascript
function extractPricingPlans() {
  const cards = document.querySelectorAll(
    '.pricing-card, .plan-card, .pricing-tier, [class*="pricing"] [class*="card"]'
  );
  return Array.from(cards).map(card => ({
    name: card.querySelector('[class*="name"], [class*="title"], h2, h3')
      ?.textContent?.trim() || null,
    price: card.querySelector('[class*="price"], [class*="amount"]')
      ?.textContent?.trim() || null,
    period: card.querySelector('[class*="period"], [class*="billing"]')
      ?.textContent?.trim() || null,
    features: Array.from(card.querySelectorAll('[class*="feature"] li, ul li'))
      .map(li => li.textContent.trim()),
    highlighted: card.matches('[class*="popular"], [class*="recommended"], [class*="featured"]'),
    ctaText: card.querySelector('a, button')?.textContent?.trim() || null,
    ctaUrl: card.querySelector('a')?.href || null,
  }));
}
JSON.stringify(extractPricingPlans());
```
### FAQ Extractor
```javascript
// Requires extractJsonLd() from the JSON-LD section above
function extractFAQ() {
  // Try JSON-LD first
  const ldFaq = extractJsonLd('FAQPage');
  if (ldFaq.length > 0 && ldFaq[0].mainEntity) {
    return ldFaq[0].mainEntity.map(q => ({
      question: q.name,
      answer: q.acceptedAnswer?.text || null
    }));
  }
  // Try <details>/<summary> pattern
  const details = document.querySelectorAll('details');
  if (details.length > 0) {
    return Array.from(details).map(d => ({
      question: d.querySelector('summary')?.textContent?.trim() || null,
      answer: Array.from(d.children).filter(c => c.tagName !== 'SUMMARY')
        .map(c => c.textContent.trim()).join(' ')
    }));
  }
  // Try accordion pattern
  const items = document.querySelectorAll(
    '.faq-item, .accordion-item, [class*="faq"] [class*="item"]'
  );
  return Array.from(items).map(item => ({
    question: item.querySelector(
      '[class*="question"], [class*="header"], [class*="title"], h3, h4'
    )?.textContent?.trim() || null,
    answer: item.querySelector(
      '[class*="answer"], [class*="body"], [class*="content"], p'
    )?.textContent?.trim() || null
  }));
}
JSON.stringify(extractFAQ());
```
### Link Extractor
```javascript
function extractLinks(scope) {
  const container = scope ? document.querySelector(scope) : document;
  if (!container) return { error: 'Scope not found' };
  const links = Array.from(container.querySelectorAll('a[href]'))
    .map(a => ({
      text: a.textContent.trim(),
      href: a.href,
      title: a.title || null
    }))
    .filter(l => l.text && l.href && !l.href.startsWith('javascript:'));
  return { links, count: links.length };
}
JSON.stringify(extractLinks());
```
### Image Extractor
```javascript
function extractImages(scope) {
  const container = scope ? document.querySelector(scope) : document;
  if (!container) return { error: 'Scope not found' };
  const images = Array.from(container.querySelectorAll('img'))
    .map(img => ({
      src: img.src,
      alt: img.alt || null,
      width: img.naturalWidth,
      height: img.naturalHeight
    }))
    // Skip inline GIF placeholders (commonly used as lazy-load spacers)
    .filter(i => i.src && !i.src.includes('data:image/gif'));
  return { images, count: images.length };
}
JSON.stringify(extractImages());
```
### Scroll-and-Collect Pattern
For pages with lazy-loaded content, use this pattern with Browser automation:
```javascript
// Count items before scroll
function countItems(selector) {
  return document.querySelectorAll(selector).length;
}
```
Then in the workflow:
1. `javascript_tool`: `countItems('.item')` -> get initial count
2. `computer(action="scroll", scroll_direction="down")`
3. `computer(action="wait", duration=2)`
4. `javascript_tool`: `countItems('.item')` -> get new count
5. If new count > old count, repeat from step 2
6. If count unchanged after 2 scrolls, all items loaded
7. Extract all items at once
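
The control flow of the steps above can be sketched as a loop. Everything tool-specific is an assumption here: `count_items` and `scroll_down` are hypothetical callables standing in for the `javascript_tool` and `computer` calls, and `max_scrolls` is an added safety cap that the workflow does not mention.

```python
import time


def scroll_until_stable(count_items, scroll_down, wait_seconds=2,
                        max_stable=2, max_scrolls=50):
    """Scroll until the item count stops growing for max_stable scrolls in a row."""
    count = count_items()          # step 1: initial count
    stable = 0
    scrolls = 0
    while stable < max_stable and scrolls < max_scrolls:
        scroll_down()              # step 2
        time.sleep(wait_seconds)   # step 3
        scrolls += 1
        new_count = count_items()  # step 4
        if new_count > count:      # step 5: more items appeared, keep going
            count = new_count
            stable = 0
        else:                      # step 6: no growth on this scroll
            stable += 1
    return count                   # step 7 (extraction) happens after this returns
```
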
---
## Domain-Specific Tips
### E-Commerce Sites
- Check for JSON-LD `Product` schema first - often has cleaner data than DOM
- Prices may have hidden original/sale price elements
- Availability often encoded in data attributes (`data-available="true"`)
- Product variants (size, color) may require click interactions
- Review data often loaded lazily - scroll to reviews section first
- Many sites have internal APIs at `/api/products` - check Network tab
### Wikipedia
- Tables use class `.wikitable` - always prefer this selector
- Infoboxes use class `.infobox`
- References in `<sup class="reference">` - exclude from text extraction
- Table cells may contain complex nested HTML - use `.textContent.trim()`
- Sortable tables have class `.sortable` with sort buttons in headers
### News Sites
- Article body often in `<article>` or `[itemprop="articleBody"]`
- Paywall indicators: `.paywall`, `.subscribe-wall`, truncated with "Read more"
- Publication date in `<time>` element or `[itemprop="datePublished"]`
- Author in `[itemprop="author"]` or `.byline`
- JSON-LD `NewsArticle` often has complete metadata
### Government / Data Portals
- Often use HTML tables without JavaScript
- May have download links for CSV/Excel - check for `.csv`, `.xlsx` links
- Data dictionaries may be on separate pages
- Look for API endpoints in page source (`/api/`, `.json` links)
- CORS may block direct API access; use Bash curl instead
### Social Media (Public Profiles)
- Content is almost always JS-rendered - use Browser automation
- Rate limiting is aggressive - keep requests minimal
- Infinite scroll is the norm - set clear item limits
- Structure changes frequently - prefer text extraction over selectors
### SaaS Pricing Pages
- Pricing often changes dynamically (monthly vs annual toggle)
- May need to click "Annual" toggle to see annual prices
- Feature comparison tables often use checkmarks (Unicode or SVG)
- Check for hidden elements toggled by billing period selector
### Job Boards
- Most use JSON-LD `JobPosting` schema
- Salary ranges often hidden behind "View salary" buttons
- Location may include remote/hybrid indicators
- Filters are URL-parameter based - useful for pagination
---
## Anti-Patterns to Avoid
| Anti-Pattern | Why It Fails | Better Approach |
|:-------------|:-------------|:----------------|
| Selectors with generated hashes (`.css-1a2b3c`) | Change on every deploy | Use semantic selectors, ARIA roles, data attributes |
| Deeply nested paths (`div > div > div > span`) | Fragile on layout changes | Use closest meaningful class or attribute |
| Index-based (`:nth-child(3)`) for dynamic lists | Order may change | Use content-based identification |
| Selecting by inline styles | Presentation, not semantics | Use classes, IDs, or data attributes |
| Hardcoded wait times for JS content | Too short or too long | Check for content presence in a loop |
| Single selector for variant pages | Different pages differ | Test selector on multiple pages first |
## Robust Selector Priority
Prefer selectors in this order (most stable to least):
1. `[data-testid="..."]`, `[data-id="..."]` - test/data attributes
2. `#unique-id` - unique IDs
3. `[role="..."]`, `[aria-label="..."]` - ARIA attributes
4. `[itemprop="..."]`, `[itemtype="..."]` - microdata / schema.org
5. `.semantic-class` - meaningful class names
6. `tag.class` - element type + class
7. Structural selectors - last resort


@@ -0,0 +1,481 @@
# Output Templates Reference
Complete formatting templates for all supported output formats.
Every output must be wrapped in a delivery envelope with metadata.
---
## Delivery Envelope (Required)
Every extraction result MUST include this metadata wrapper,
regardless of output format:
```markdown
## Extraction Results
**Source:** [Page Title](https://example.com/page)
**Date:** 2026-02-25 14:30 UTC
**Items:** 47 records
**Confidence:** HIGH
**Format:** Markdown Table
---
[DATA GOES HERE]
---
**Notes:**
- Any gaps, anomalies, or observations
- Filters or sorts applied
- Pages scraped (if paginated)
```
---
## Markdown Table Format
### Standard Table
```markdown
| Name | Price | Rating | Availability |
|:---------------|---------:|:------:|:-------------|
| Product Alpha | $29.99 | 4.5 | In Stock |
| Product Beta | $49.99 | 4.2 | In Stock |
| Product Gamma | $119.00 | 4.8 | Pre-order |
| Product Delta | $15.50 | 3.9 | Out of Stock |
```
### Alignment Rules
| Data Type | Alignment | Markdown Syntax |
|:-------------|:----------|:----------------|
| Text | Left | `:---` |
| Numbers | Right | `---:` |
| Ratings/Flags | Center | `:---:` |
| Mixed/Status | Left | `:---` |
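
As a rough sketch of the first two alignment rules, a table renderer can choose the separator per column from the data itself. The function name `render_table` is hypothetical, and treating a column as numeric only when every cell is an `int` or `float` is an assumption; formatted strings such as `$29.99` would need an explicit column hint instead.

```python
def render_table(headers, rows):
    """Render rows as a Markdown table; all-numeric columns are right-aligned."""
    def is_number(cell):
        return isinstance(cell, (int, float)) and not isinstance(cell, bool)

    # A column counts as numeric only if there are rows and every cell is a number
    numeric = [bool(rows) and all(is_number(row[i]) for row in rows)
               for i in range(len(headers))]
    lines = ['| ' + ' | '.join(headers) + ' |']
    lines.append('|' + '|'.join('---:' if n else ':---' for n in numeric) + '|')
    for row in rows:
        lines.append('| ' + ' | '.join(str(cell) for cell in row) + ' |')
    return '\n'.join(lines)
```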
### Table with Summary Row
```markdown
| Product | Units Sold | Revenue |
|:---------------|----------:|-----------:|
| Widget A | 1,234 | $12,340 |
| Widget B | 567 | $8,505 |
| Widget C | 2,890 | $57,800 |
| **Total** | **4,691** | **$78,645**|
```
### Wide Data (Split Tables)
When data has more than 10 columns, split into logical groups:
```markdown
### Basic Information
| Name | Category | Brand | SKU |
|:--------|:---------|:--------|:---------|
| Item A | Tools | Acme | ACM-001 |
### Pricing and Availability
| Name | Price | Sale Price | Stock | Ships In |
|:--------|--------:|-----------:|:------|:---------|
| Item A | $49.99 | $39.99 | 142 | 2 days |
```
### Multi-URL Comparison Table
```markdown
| Source | Product | Price | Rating |
|:-------------|:-----------|--------:|:------:|
| store-a.com | Laptop X | $999 | 4.3 |
| store-b.com | Laptop X | $949 | 4.5 |
| store-c.com | Laptop X | $1,029 | 4.1 |
```
### Truncation Rules
Truncate values longer than 60 characters and mark the cut with a trailing `...`:
```markdown
| Title | Author |
|:------------------------------------------------------------|:--------|
| Introduction to Advanced Machine Learning Techni... | J. Smith|
```
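
One way to apply the 60-character rule is sketched below. The function name `truncate` is hypothetical, and counting the `...` toward the limit is an assumption; the example row above cuts earlier than that, so the exact cut point is a formatting choice.

```python
def truncate(value, limit=60):
    """Shorten text longer than `limit` characters, ending with '...'."""
    if len(value) <= limit:
        return value
    # Reserve three characters for the ellipsis; drop any space left before it
    return value[:limit - 3].rstrip() + '...'
```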
---
## JSON Format
### Standard JSON Output
```json
{
  "metadata": {
    "source": "https://example.com/products",
    "title": "Product Catalog - Example Store",
    "extractedAt": "2026-02-25T14:30:00Z",
    "itemCount": 3,
    "confidence": "HIGH",
    "fields": ["name", "price", "rating", "availability"],
    "notes": []
  },
  "data": [
    {
      "name": "Product Alpha",
      "price": 29.99,
      "currency": "USD",
      "rating": 4.5,
      "availability": "In Stock"
    },
    {
      "name": "Product Beta",
      "price": 49.99,
      "currency": "USD",
      "rating": 4.2,
      "availability": "In Stock"
    },
    {
      "name": "Product Gamma",
      "price": 119.00,
      "currency": "USD",
      "rating": 4.8,
      "availability": "Pre-order"
    }
  ]
}
```
### JSON Key Naming
| Rule | Example |
|:-----------------------|:----------------------------------|
| camelCase | `productName`, `unitPrice` |
| Numbers stay numeric | `29.99` not `"29.99"` |
| Booleans stay boolean | `true` not `"true"` |
| Missing = null | `null` not `""` or `"N/A"` |
| Arrays for multiples | `"tags": ["sale", "new"]` |
| ISO-8601 for dates | `"2026-02-25T14:30:00Z"` |
### Nested JSON (Product with Details)
```json
{
  "metadata": { "..." : "..." },
  "data": [
    {
      "name": "Laptop Pro X",
      "brand": "TechCo",
      "pricing": {
        "current": 999.99,
        "original": 1299.99,
        "currency": "USD",
        "discount": "23%"
      },
      "rating": {
        "score": 4.5,
        "count": 1234
      },
      "specifications": {
        "processor": "M3 Pro",
        "ram": "16 GB",
        "storage": "512 GB SSD",
        "display": "14.2 inch Retina"
      },
      "availability": {
        "inStock": true,
        "shipsIn": "2-3 business days"
      }
    }
  ]
}
```
### Multi-URL JSON
```json
{
  "metadata": {
    "sources": [
      "https://store-a.com/laptop-x",
      "https://store-b.com/laptop-x"
    ],
    "extractedAt": "2026-02-25T14:30:00Z",
    "itemCount": 2,
    "confidence": "HIGH"
  },
  "data": [
    {
      "source": "store-a.com",
      "name": "Laptop X",
      "price": 999,
      "currency": "USD",
      "rating": 4.3
    },
    {
      "source": "store-b.com",
      "name": "Laptop X",
      "price": 949,
      "currency": "USD",
      "rating": 4.5
    }
  ]
}
```
---
## CSV Format
### Standard CSV
```csv
# Source: https://example.com/products
# Extracted: 2026-02-25 14:30 UTC
# Items: 3 | Confidence: HIGH
name,price,currency,rating,availability
"Product Alpha",29.99,USD,4.5,"In Stock"
"Product Beta",49.99,USD,4.2,"In Stock"
"Product Gamma",119.00,USD,4.8,"Pre-order"
```
### CSV Rules
| Rule | Example |
|:-------------------------------------|:-------------------------------|
| Always include header row | `name,price,rating` |
| Quote fields with commas | `"Smith, John"` |
| Quote fields with quotes (escape) | `"He said ""hello"""` |
| Quote fields with newlines | `"Line 1\nLine 2"` |
| UTF-8 encoding with BOM | `\xEF\xBB\xBF` prefix |
| Comma delimiter (standard) | `,` |
| Metadata as comments (# prefix) | `# Source: URL` |
| null/missing as empty field | `field1,,field3` |
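
Most of these rules are covered by Python's standard `csv` module, as in the sketch below. The function name `to_csv` is hypothetical. It uses the module's default minimal quoting, which quotes only fields containing commas, quotes, or newlines, so plain strings come out unquoted, unlike the sample above where every string is quoted.

```python
import csv
import io


def to_csv(records, fields, source):
    """Serialize records to CSV text with a metadata comment and header row."""
    buffer = io.StringIO()
    buffer.write(f"# Source: {source}\n")  # metadata as a comment line
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing or None values become empty fields
        writer.writerow({f: "" if record.get(f) is None else record.get(f)
                         for f in fields})
    return buffer.getvalue()
```

For the BOM rule, writing the result with `open(path, "w", encoding="utf-8-sig", newline="")` adds the `\xEF\xBB\xBF` prefix automatically.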
### Multi-URL CSV
```csv
# Sources: store-a.com, store-b.com
# Extracted: 2026-02-25 14:30 UTC
source,name,price,currency,rating
"store-a.com","Laptop X",999,USD,4.3
"store-b.com","Laptop X",949,USD,4.5
```
---
## Summary Statistics Template
When extracted data contains numeric fields, include a summary block:
```markdown
### Summary Statistics
| Metric | Price | Rating |
|:----------|----------:|-------:|
| Count | 47 | 47 |
| Min | $12.99 | 2.1 |
| Max | $299.99 | 5.0 |
| Average | $67.42 | 4.1 |
| Median | $54.99 | 4.3 |
```
Include only when:
- Data has numeric columns
- More than 5 items extracted
- User would likely benefit from aggregate view (prices, ratings, quantities)
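
The rows of the summary block can be computed with the standard `statistics` module. This is a sketch under stated assumptions: the function name `summarize` is hypothetical, non-numeric and missing values are skipped, and the "more than 5 items" condition is applied to the numeric values that remain. It returns `None` when the block should be omitted.

```python
from statistics import mean, median


def summarize(records, field):
    """Count/min/max/average/median for one numeric field, skipping nulls."""
    values = [r[field] for r in records
              if isinstance(r.get(field), (int, float))]
    if len(values) <= 5:  # the summary applies only to more than 5 items
        return None
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "average": round(mean(values), 2),
        "median": median(values),
    }
```
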
---
## Contact Data Template
```markdown
| Name | Title | Email | Phone |
|:---------------|:-------------------|:---------------------|:---------------|
| Jane Smith | CEO | jane@example.com | +1-555-0101 |
| John Doe | CTO | john@example.com | +1-555-0102 |
| Alice Johnson | VP Engineering | alice@example.com | N/A |
```
---
## Article Extraction Template
```markdown
## Article: [Title]
**Author:** Author Name
**Published:** YYYY-MM-DD
**Source:** [Site Name](URL)
### Summary
[2-3 sentence summary of the article content]
### Key Data Points
- [Factual data point 1]
- [Factual data point 2]
- [Statistical finding]
### Tags
`tag1` `tag2` `tag3`
```
Note: Summarize article content. Do not reproduce full article text
due to copyright.
---
## FAQ Extraction Template
```markdown
### FAQ: [Page Title]
**Source:** [Site Name](URL)
**Items:** 12 questions
| # | Question | Answer (excerpt) |
|--:|:---------|:-----------------|
| 1 | How do I reset my password? | Navigate to Settings > Security and click "Reset..." |
| 2 | What payment methods do you accept? | We accept Visa, Mastercard, PayPal, and bank transfer... |
```
Or as JSON (default for FAQ mode):
```json
{
  "metadata": { "source": "URL", "itemCount": 12, "confidence": "HIGH" },
  "data": [
    { "question": "How do I reset my password?", "answer": "Navigate to...", "category": "Account" },
    { "question": "What payment methods?", "answer": "We accept...", "category": "Billing" }
  ]
}
```
---
## Pricing Plans Template
```markdown
### Pricing: [Product Name]
**Source:** [Site Name](URL)
**Plans:** 3 tiers
| Plan | Monthly | Annual | Highlighted |
|:------------|----------:|----------:|:-----------:|
| Starter | $9/mo | $7/mo | |
| Pro | $29/mo | $24/mo | * |
| Enterprise | Custom | Custom | |
#### Feature Comparison
| Feature | Starter | Pro | Enterprise |
|:----------------------|:-------:|:---:|:----------:|
| Users | 1 | 10 | Unlimited |
| Storage | 5 GB | 50 GB | Unlimited |
| API Access | N/A | Yes | Yes |
| Priority Support | N/A | N/A | Yes |
```
---
## Job Listings Template
```markdown
| Title | Company | Location | Salary | Type | Posted |
|:-------------------|:------------|:---------------|:----------------|:----------|:-----------|
| Senior Engineer | TechCo | Remote, US | $150k - $200k | Full-time | 2026-02-20 |
| Product Manager | StartupXYZ | San Francisco | $130k - $160k | Full-time | 2026-02-18 |
| Data Analyst | DataCorp | London, UK | GBP 55k - 70k | Contract | 2026-02-22 |
```
---
## Events Template
```markdown
| Event | Date | Time | Location | Speakers |
|:-----------------------|:-----------|:--------|:------------------|:---------------|
| Opening Keynote | 2026-03-15 | 09:00 | Main Hall | J. Smith |
| Workshop: AI Basics | 2026-03-15 | 14:00 | Room 201 | A. Johnson |
| Networking Reception | 2026-03-15 | 18:00 | Rooftop Lounge | N/A |
```
---
## Differential (Diff) Output Template
When comparing current extraction with a previous run:
```markdown
## Extraction Results (Diff)
**Source:** [Page Title](URL)
**Date:** 2026-02-25 14:30 UTC
**Compared to:** 2026-02-20 10:00 UTC
**Changes:** +5 new, -2 removed, 3 modified
---
### New Items (+5)
| Name | Price | Rating |
|:---------------|--------:|:------:|
| Product Eta | $39.99 | 4.6 |
| Product Theta | $24.99 | 4.1 |
| ... | | |
### Removed Items (-2)
| Name | Price | Rating |
|:---------------|--------:|:------:|
| ~~Product Alpha~~ | ~~$29.99~~ | ~~4.5~~ |
| ~~Product Beta~~ | ~~$49.99~~ | ~~4.2~~ |
### Modified Items (3)
| Name | Field | Was | Now |
|:---------------|:--------|:-----------|:-----------|
| Product Gamma | Price | $119.00 | $109.00 |
| Product Gamma | Rating | 4.8 | 4.9 |
| Product Delta | Stock | Out of Stock | In Stock |
---
**Summary:**
- 5 new products added since last extraction
- 2 products removed (possibly discontinued)
- Product Gamma had a price drop of $10 and rating increase
- Product Delta is back in stock
```
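
The three sections of the diff template could be derived from two runs roughly as follows. The function name `diff_runs` is hypothetical, and matching records by a single `key` field (here `name`) is an assumption; real data may need a composite key. Each entry in `modified` corresponds to one row of the "Modified Items" table, so one product with two changed fields yields two entries.

```python
def diff_runs(previous, current, key="name"):
    """Compare two extraction runs; return (added, removed, modified)."""
    old = {record[key]: record for record in previous}
    new = {record[key]: record for record in current}
    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    # One entry per changed field of each record present in both runs
    modified = [
        {"key": k, "field": field, "was": old[k].get(field), "now": new[k].get(field)}
        for k in new if k in old
        for field in new[k]
        if old[k].get(field) != new[k].get(field)
    ]
    return added, removed, modified
```
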
---
## Error / Partial Result Template
When extraction partially fails:
```markdown
## Extraction Results (Partial)
**Source:** [Page Title](URL)
**Date:** 2026-02-25 14:30 UTC
**Items:** 23 of ~50 expected records
**Confidence:** LOW
**Strategy:** A (WebFetch) -> escalated to B (Browser)
---
[PARTIAL DATA]
---
**Issues:**
- 27 items could not be extracted (content behind JS rendering)
- Price field missing for 5 items (marked N/A)
- Auto-escalation from WebFetch to Browser recovered 15 additional items
**Suggestions:**
- Re-run with explicit Browser automation for complete results
- Check if site has an API endpoint for direct data access
- Try at a different time if rate-limited
```