chore: ai agent config

Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
2026-04-21 20:19:05 -04:00
parent ffc4a2c5c5
commit 7cf21546e2
65 changed files with 10076 additions and 133 deletions
--- a/.claude/skills/web-scraper/references/data-transforms.md
+++ b/.claude/skills/web-scraper/references/data-transforms.md
@@ -0,0 +1,397 @@
+# Data Transforms Reference
+
+Patterns for cleaning, normalizing, deduplicating, and enriching
+extracted web data. Apply these transforms in Phase 5 (Transform)
+between extraction and validation.
+
+---
+
+## Automatic Transforms
+
+Always apply these to every extraction result.
+
+### Whitespace Cleanup
+
+```python
+import re
+
+# Remove zero-width characters (ZWSP, ZWNJ, ZWJ, BOM)
+value = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', value)
+
+# Strip leading/trailing whitespace and collapse internal runs;
+# str.split() also treats non-breaking spaces (\u00a0) as whitespace
+value = ' '.join(value.split())
+```
+
+Patterns to handle:
+- `\n`, `\r`, `\t` inside cell values -> single space
+- Multiple consecutive spaces -> single space
+- Non-breaking spaces (`&nbsp;`, `\u00a0`) -> regular space
+- Zero-width characters -> remove
+
+### HTML Entity Decode
+
+| Entity      | Character          | Entity     | Character          |
+|:------------|:-------------------|:-----------|:-------------------|
+| `&amp;`     | `&`                | `&quot;`   | `"`                |
+| `&lt;`      | `<`                | `&apos;`   | `'`                |
+| `&gt;`      | `>`                | `&#39;`    | `'`                |
+| `&nbsp;`    | non-breaking space | `&#8217;`  | right curly quote  |
+| `&mdash;`   | em dash (U+2014)   | `&#8212;`  | em dash (U+2014)   |
+
+```python
+import html
+value = html.unescape(value)
+```
+
+### Unicode Normalization
+
+```python
+import unicodedata
+value = unicodedata.normalize('NFKC', value)
+```
+
+This handles:
+- Ligatures -> separate characters (e.g. `ﬁ` -> `fi`)
+- Full-width characters -> standard (e.g. `Ａ` -> `A`)
+- Superscript/subscript numbers -> regular numbers
+- Non-breaking spaces -> regular spaces
+
+Curly quotes are not affected by NFKC and need a separate mapping.
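
One way to fold typographic quotes to ASCII alongside NFKC is an explicit translation table. This is a minimal sketch: `QUOTE_MAP` and `straighten_quotes` are illustrative names, not part of the skill, and the mapping covers only the four most common curly quotes.

```python
import unicodedata

# Hypothetical mapping: typographic quotes -> ASCII equivalents
QUOTE_MAP = str.maketrans({
    '\u2018': "'", '\u2019': "'",   # single curly quotes
    '\u201c': '"', '\u201d': '"',   # double curly quotes
})

def straighten_quotes(value):
    """Apply NFKC, then replace curly quotes that NFKC leaves untouched."""
    return unicodedata.normalize('NFKC', value).translate(QUOTE_MAP)
```

Skip this step when the original typography matters, for example in quoted article text.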
+
+### Empty Value Standardization
+
+| Input                   | Markdown Output | JSON Output |
+|:------------------------|:----------------|:------------|
+| `""` (empty string)     | `N/A`           | `null`      |
+| `"-"` or `"--"`         | `N/A`           | `null`      |
+| `"N/A"`, `"n/a"`, `"NA"`| `N/A`           | `null`      |
+| `"None"`, `"null"`      | `N/A`           | `null`      |
+| `"TBD"`, `"TBA"`        | `TBD`           | `"TBD"`     |
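
The table above could be applied with a small lookup such as the following sketch. The helper name, the `output` parameter, and the case-insensitive matching are assumptions made for illustration.

```python
# Hypothetical helper; the token sets mirror the table above
NULL_TOKENS = {'', '-', '--', 'n/a', 'na', 'none', 'null'}
PENDING_TOKENS = {'tbd', 'tba'}

def standardize_empty(value, output='json'):
    """Map placeholder strings to None (JSON) or 'N/A' (Markdown)."""
    if value is None:
        return None if output == 'json' else 'N/A'
    token = value.strip().lower()
    if token in NULL_TOKENS:
        return None if output == 'json' else 'N/A'
    if token in PENDING_TOKENS:
        return 'TBD'
    return value
```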
+
+---
+
+## Price Normalization
+
+Apply when extracting product, pricing, or financial data.
+
+### Extraction Pattern
+
+```python
+import re
+
+def normalize_price(raw):
+    if not raw:
+        return None
+    # Remove currency codes and prefixed symbols
+    cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw)
+    # Extract numeric value (handles 1,234.56 and 1.234,56 formats);
+    # it must start and end with a digit so stray punctuation is skipped
+    match = re.search(r'\d[\d.,]*\d|\d', cleaned)
+    if not match:
+        return None
+    num_str = match.group()
+    # Detect format: if last separator is comma with 2 digits after, it's decimal
+    if re.search(r',\d{2}$', num_str):
+        num_str = num_str.replace('.', '').replace(',', '.')
+    else:
+        num_str = num_str.replace(',', '')
+    try:
+        return float(num_str)
+    except ValueError:  # e.g. '1.2.3'
+        return None
+```
+
+### Currency Detection
+
+| Symbol/Code | Currency | Symbol/Code | Currency |
+|:------------|:---------|:------------|:---------|
+| `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real |
+| `€`, `EUR` | Euro     | `£`, `GBP`  | British Pound |
+| `¥`, `JPY` | Yen      | `₹`, `INR`  | Indian Rupee  |
+| `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar |
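
A lookup over the symbols and codes in the table might look like the sketch below. The marker list and function name are illustrative. Treating a bare `$` as USD and `¥` as JPY follows the table, but both are assumptions that a locale hint should override when one is available.

```python
# Ordered so multi-character markers are tested before the bare "$"
CURRENCY_MARKERS = [
    ('US$', 'USD'), ('R$', 'BRL'), ('C$', 'CAD'), ('A$', 'AUD'),
    ('USD', 'USD'), ('EUR', 'EUR'), ('GBP', 'GBP'), ('BRL', 'BRL'),
    ('JPY', 'JPY'), ('INR', 'INR'), ('CAD', 'CAD'), ('AUD', 'AUD'),
    ('€', 'EUR'), ('£', 'GBP'), ('¥', 'JPY'), ('₹', 'INR'), ('$', 'USD'),
]

def detect_currency(raw):
    """Return an ISO 4217 code for the first marker found, else None."""
    if not raw:
        return None
    text = raw.upper()
    for marker, code in CURRENCY_MARKERS:
        if marker in text:
            return code
    return None
```

Substring matching is deliberately simple, so it should be run on the raw price field only, not on surrounding prose.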
+
+### Output Format
+
+```json
+{
+  "price": 29.99,
+  "currency": "USD",
+  "rawPrice": "$29.99"
+}
+```
+
+For Markdown, show formatted: `$29.99` (right-aligned in table).
+
+---
+
+## Date Normalization
+
+Normalize all dates to ISO-8601 format.
+
+### Common Formats to Handle
+
+| Input Format            | Example              | Normalized         |
+|:------------------------|:---------------------|:-------------------|
+| Full text               | February 25, 2026    | 2026-02-25         |
+| Short text              | Feb 25, 2026         | 2026-02-25         |
+| US numeric              | 02/25/2026           | 2026-02-25         |
+| EU numeric              | 25/02/2026           | 2026-02-25         |
+| ISO already             | 2026-02-25           | 2026-02-25         |
+| Relative                | 3 days ago           | (compute from now) |
+| Relative                | Yesterday            | (compute from now) |
+| Timestamp               | 1740441600           | 2025-02-25         |
+| With time               | 2026-02-25T14:30:00Z | 2026-02-25 14:30   |
+
+### Ambiguous Dates
+
+When format is ambiguous (e.g. `03/04/2026`):
+- Default to US format (MM/DD/YYYY) unless site is clearly non-US
+- Check page `lang` attribute or URL TLD for locale hints
+- Note ambiguity in delivery notes
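
The default described above could be sketched as follows. `parse_numeric_date` and its `day_first` flag are hypothetical names, and the fallback to the other reading when the preferred one is impossible is an assumption, not something the skill prescribes.

```python
from datetime import datetime

def parse_numeric_date(text, day_first=False):
    """Parse NN/NN/YYYY; day_first=True selects DD/MM/YYYY."""
    preferred = '%d/%m/%Y' if day_first else '%m/%d/%Y'
    fallback = '%m/%d/%Y' if day_first else '%d/%m/%Y'
    for fmt in (preferred, fallback):
        try:
            return datetime.strptime(text.strip(), fmt).strftime('%Y-%m-%d')
        except ValueError:
            # This reading is impossible (e.g. month 25); try the next one
            continue
    return None
```

Set `day_first` from the page `lang` attribute or TLD, and record in the delivery notes whenever the fallback reading was used.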
+
+### Relative Date Resolution
+
+```python
+from datetime import datetime, timedelta
+import re
+
+def resolve_relative_date(text):
+    normalized = text.lower().strip()
+    today = datetime.now()
+
+    if 'today' in normalized: return today.strftime('%Y-%m-%d')
+    if 'yesterday' in normalized: return (today - timedelta(days=1)).strftime('%Y-%m-%d')
+
+    match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', normalized)
+    if match:
+        n, unit = int(match.group(1)), match.group(2)
+        # Months and years are approximated as 30 and 365 days
+        days = {'day': n, 'week': n*7, 'month': n*30, 'year': n*365}
+        delta = timedelta(hours=n) if unit == 'hour' else timedelta(days=days[unit])
+        return (today - delta).strftime('%Y-%m-%d')
+
+    return text  # Return the original string if it can't be parsed
+```
+
+---
+
+## URL Resolution
+
+Convert relative URLs to absolute.
+
+### Patterns
+
+| Input                    | Base URL                    | Resolved                              |
+|:-------------------------|:----------------------------|:--------------------------------------|
+| `/products/item-1`       | `https://example.com/shop`  | `https://example.com/products/item-1` |
+| `item-1`                 | `https://example.com/shop/` | `https://example.com/shop/item-1`     |
+| `//cdn.example.com/img`  | `https://example.com`       | `https://cdn.example.com/img`         |
+| `https://other.com/page` | (any)                       | `https://other.com/page` (absolute)   |
+
+### JavaScript Resolution
+
+```javascript
+function resolveUrl(relative, base) {
+  try { return new URL(relative, base || window.location.href).href; }
+  catch { return relative; }
+}
+```
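
When resolution happens outside the browser, the standard library's `urllib.parse.urljoin` applies the same rules as the table above. The wrapper name below is illustrative.

```python
from urllib.parse import urljoin

def resolve_url(relative, base):
    """Resolve `relative` against `base`; absolute URLs pass through."""
    return urljoin(base, relative)
```

Note the argument order: `urljoin` takes the base first, the reverse of the JavaScript `URL` constructor.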
+
+---
+
+## Phone Normalization
+
+For contact mode extraction.
+
+### Pattern
+
+```python
+import re
+
+def normalize_phone(raw):
+    if not raw:
+        return None
+    raw = raw.strip()
+    has_plus = raw.startswith('+')
+    # Keep digits only; a "+" is meaningful only as the leading character
+    digits = re.sub(r'\D', '', raw)
+    if len(digits) < 7:
+        return None
+    # Add + prefix if already international or long enough to look it
+    if has_plus or len(digits) >= 11:
+        return '+' + digits
+    return digits
+```
+
+### Format by Context
+
+| Context          | Format Example       |
+|:-----------------|:---------------------|
+| JSON output      | `"+5511999998888"`   |
+| Markdown table   | `+55 11 99999-8888`  |
+| CSV output       | `"+5511999998888"`   |
+
+---
+
+## Deduplication
+
+### Exact Deduplication
+
+```python
+import json
+
+def deduplicate(records, key_fields=None):
+    """Remove exact duplicate records.
+    If key_fields provided, deduplicate by those fields only.
+    Returns (unique_list, removed_count).
+    """
+    seen = set()
+    unique = []
+    for record in records:
+        if key_fields:
+            key = tuple(record.get(f) for f in key_fields)
+        else:
+            # Serialize so records with list/dict values still hash
+            key = json.dumps(record, sort_keys=True, default=str)
+        if key not in seen:
+            seen.add(key)
+            unique.append(record)
+    return unique, len(records) - len(unique)
+```
+
+### Near-Duplicate Detection
+
+When records share key fields but differ in details:
+1. Group by key fields (e.g. product name + source)
+2. For each group, keep the record with fewest null values
+3. If tie, keep the first occurrence
+4. Report in notes: "Merged N near-duplicate records"
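
The four steps above can be sketched as a single pass over the records. The function name is hypothetical, and the sketch assumes the key field values are hashable (strings or numbers).

```python
def merge_near_duplicates(records, key_fields):
    """Keep, per key, the record with the fewest None values (first wins ties)."""
    best = {}  # key -> (null_count, record); dicts preserve insertion order
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        nulls = sum(1 for v in record.values() if v is None)
        # Strict "<" keeps the earlier record when null counts tie
        if key not in best or nulls < best[key][0]:
            best[key] = (nulls, record)
    kept = [record for _, record in best.values()]
    return kept, len(records) - len(kept)
```

The second return value is the N for the "Merged N near-duplicate records" note.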
+
+### Dedup Key Selection by Mode
+
+| Mode     | Key Fields                        |
+|:---------|:----------------------------------|
+| product  | name + source (or name + brand)   |
+| contact  | name + email (or name + org)      |
+| jobs     | title + company + location        |
+| events   | title + date + location           |
+| table    | all fields (exact match)          |
+| list     | first 2-3 identifying fields      |
+
+---
+
+## Text Cleaning
+
+### Remove Noise
+
+Common noise patterns to strip from extracted text:
+
+| Pattern                            | Action                    |
+|:-----------------------------------|:--------------------------|
+| `\[edit\]`, `\[citation needed\]`  | Remove (Wikipedia)        |
+| `Read more...`, `See more`         | Remove (truncation markers)|
+| `Sponsored`, `Ad`, `Promoted`      | Remove or flag            |
+| Cookie consent text                | Remove                    |
+| Navigation breadcrumbs             | Remove                    |
+| Footer boilerplate                 | Remove                    |
+
+### Case Normalization
+
+When extracting ALL-CAPS or inconsistent-case text:
+
+```python
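+def normalize_case(text):
+    if text.isupper() and len(text) > 3:
+        # ALL CAPS -> Title Case; capitalize only after spaces so
+        # "DON'T" becomes "Don't" (str.title() would give "Don'T")
+        return ' '.join(w.capitalize() for w in text.split(' '))
+    return text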
+```
+
+Apply only when the field is clearly ALL-CAPS in the source (common on
+older sites), the user requests it, or mixed casing makes the output inconsistent.
+
+---
+
+## Data Type Coercion
+
+### Automatic Type Detection
+
+| Raw Value     | Detected Type | Coerced Value     |
+|:--------------|:--------------|:------------------|
+| `"123"`       | integer       | `123`             |
+| `"12.99"`     | float         | `12.99`           |
+| `"true"`      | boolean       | `true`            |
+| `"false"`     | boolean       | `false`           |
+| `"2026-02-25"`| date string   | `"2026-02-25"`    |
+| `"$29.99"`    | price         | `29.99` + currency|
+| `"4.5/5"`     | rating        | `4.5`             |
+| `"1,234"`     | integer       | `1234`            |
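
A best-effort coercion for the simple rows of this table might look like the sketch below. `coerce_value` is an illustrative name. Prices, ratings, and dates are left as strings here on the assumption that the dedicated normalizers handle them.

```python
import re

def coerce_value(raw):
    """Best-effort coercion of a scraped string to bool, int, or float."""
    if not isinstance(raw, str):
        return raw
    text = raw.strip()
    if text.lower() in ('true', 'false'):
        return text.lower() == 'true'
    # Integers, optionally with comma thousands separators: "1,234"
    if re.fullmatch(r'-?\d{1,3}(,\d{3})+|-?\d+', text):
        return int(text.replace(',', ''))
    if re.fullmatch(r'-?\d+\.\d+', text):
        return float(text)
    return raw  # dates, prices, ratings go to their dedicated normalizers
```

Identifier-like fields with leading zeros (ZIP codes, SKUs) would lose them under this rule, so those columns should probably be excluded from coercion.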
+
+### Rating Normalization
+
+```python
+import re
+
+def normalize_rating(raw):
+    if not raw:
+        return None
+    # A number, optionally followed by "/ max" (e.g. "4.5/5", "8 / 10")
+    match = re.search(r'(\d+(?:\.\d+)?)\s*(?:/\s*(\d+(?:\.\d+)?))?', str(raw))
+    if not match:
+        return None
+    score = float(match.group(1))
+    max_score = float(match.group(2)) if match.group(2) else 5.0
+    if max_score == 0:
+        return None
+    return round(score / max_score * 5, 1)  # Normalize to /5 scale
+```
+
+---
+
+## Enrichment Patterns
+
+### Domain Extraction
+
+Add domain from full URLs:
+```python
+from urllib.parse import urlparse
+
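+def extract_domain(url):
+    try:
+        host = urlparse(url).hostname  # lowercased; excludes port and userinfo
+    except (TypeError, ValueError):  # e.g. malformed IPv6 literal
+        return None
+    if not host:
+        return None
+    # Strip only a leading "www.", not every occurrence
+    return host[4:] if host.startswith('www.') else host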
+```
+
+### Word Count
+
+For article mode:
+```python
+def word_count(text):
+    return len(text.split()) if text else 0
+```
+
+### Relative Time
+
+Add human-readable time since date:
+```python
+def time_since(date_str):
+    from datetime import datetime
+
+    def ago(n, unit):
+        return f"{n} {unit}{'' if n == 1 else 's'} ago"
+
+    try:
+        dt = datetime.fromisoformat(date_str)
+        # Use the timestamp's own timezone when it carries an offset
+        days = (datetime.now(dt.tzinfo) - dt).days
+    except (TypeError, ValueError):
+        return None
+    if days < 0: return None  # future date
+    if days == 0: return "Today"
+    if days == 1: return "Yesterday"
+    if days < 7: return ago(days, "day")
+    if days < 30: return ago(days // 7, "week")
+    if days < 365: return ago(days // 30, "month")
+    return ago(days // 365, "year")
+```
+
+---
+
+## Transform Pipeline Order
+
+Apply transforms in this sequence:
+
+1. **HTML entity decode** - raw text cleanup
+2. **Unicode normalization** - character standardization
+3. **Whitespace cleanup** - spacing normalization
+4. **Empty value standardization** - null/N/A handling
+5. **URL resolution** - relative to absolute
+6. **Data type coercion** - strings to numbers/dates
+7. **Price normalization** - if applicable
+8. **Date normalization** - if applicable
+9. **Phone normalization** - if applicable
+10. **Text cleaning** - noise removal
+11. **Deduplication** - remove duplicates
+12. **Sorting** - user-requested order
+13. **Enrichment** - domain, word count, etc.
+
+Not all steps apply to every extraction. Apply only what's relevant
+to the data type and extraction mode.
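
Steps 1-3 of the sequence can be chained as in the sketch below. The `clean_text` name is illustrative. Decoding entities first matters because `&nbsp;` only becomes collapsible whitespace once it has been turned into a real character.

```python
import html
import re
import unicodedata

def clean_text(value):
    """Steps 1-3: entity decode, Unicode normalization, whitespace cleanup."""
    value = html.unescape(value)
    value = unicodedata.normalize('NFKC', value)
    # Drop zero-width characters before collapsing whitespace
    value = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', value)
    return ' '.join(value.split())
```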
--- a/.claude/skills/web-scraper/references/extraction-patterns.md
+++ b/.claude/skills/web-scraper/references/extraction-patterns.md
@@ -0,0 +1,475 @@
+# Extraction Patterns Reference
+
+CSS selectors, JavaScript snippets, and domain-specific tips for
+common web scraping scenarios.
+
+---
+
+## CSS Selector Patterns
+
+### Tables
+
+```css
+/* Standard HTML tables */
+table                               /* All tables */
+table.data-table                    /* Class-based */
+table[id*="result"]                 /* ID contains "result" */
+table thead th                      /* Header cells */
+table tbody tr                      /* Data rows */
+table tbody tr td                   /* Data cells */
+table tbody tr td:nth-child(2)      /* Specific column (2nd) */
+
+/* Grid layouts acting as tables */
+[role="table"], [role="grid"]       /* ARIA table / grid roles */
+[role="row"]                        /* ARIA row */
+[role="cell"], [role="gridcell"]    /* ARIA cells (table / grid) */
+.table-responsive table             /* Bootstrap responsive wrapper */
+```
+
+### Product Listings
+
+```css
+/* E-commerce product grids */
+.product-card, .product-item, .product-tile
+[data-product-id]                   /* Data attribute markers */
+.product-name, .product-title, h2.title
+.price, .product-price, [data-price]
+.price--sale, .price--original      /* Sale vs original price */
+.rating, .stars, [data-rating]
+.availability, .stock-status
+.product-image img, .product-thumb img
+
+/* Common e-commerce patterns */
+.search-results .result-item
+.catalog-grid .catalog-item
+.listing .listing-item
+```
+
+### Search Results
+
+```css
+/* Generic search result patterns */
+.search-result, .result-item, .search-entry
+.result-title a, .result-link
+.result-snippet, .result-description
+.result-url, .result-source
+.result-date, .result-timestamp
+.pagination a, .page-numbers a, [aria-label="Next"]
+```
+
+### Contact / Directory
+
+```css
+/* People and contact cards */
+.team-member, .staff-card, .person, .contact-card
+.member-name, .person-name, h3.name
+.member-title, .job-title, .role
+.member-email a[href^="mailto:"]
+.member-phone a[href^="tel:"]
+.member-bio, .person-description
+.vcard                              /* hCard microformat */
+```
+
+### FAQ / Accordion
+
+```css
+/* FAQ and accordion patterns */
+.faq-item, .accordion-item, [itemtype*="FAQPage"] [itemprop="mainEntity"]
+.faq-question, .accordion-header, [itemprop="name"], summary
+.faq-answer, .accordion-body, .accordion-content, [itemprop="acceptedAnswer"]
+details, details > summary          /* Native HTML accordion */
+[role="tabpanel"]                   /* Tab-based FAQ */
+```
+
+### Pricing Tables
+
+```css
+/* SaaS pricing page patterns */
+.pricing-table, .pricing-card, .plan-card, .pricing-tier
+.plan-name, .tier-name, .pricing-title
+.plan-price, .pricing-amount, .price-value
+.plan-period, .billing-cycle        /* monthly/annually */
+.plan-features li, .feature-list li
+.plan-cta, .pricing-button
+[class*="popular"], [class*="recommended"], [class*="featured"]  /* highlighted plan */
+```
+
+### Job Listings
+
+```css
+/* Job board patterns */
+.job-listing, .job-card, .job-posting, [itemtype*="JobPosting"]
+.job-title, [itemprop="title"]
+.company-name, [itemprop="hiringOrganization"]
+.job-location, [itemprop="jobLocation"]
+.job-salary, [itemprop="baseSalary"]
+.job-type, .employment-type
+.job-date, [itemprop="datePosted"]
+```
+
+### Events
+
+```css
+/* Event listing patterns */
+.event-card, .event-item, [itemtype*="Event"]
+.event-title, [itemprop="name"]
+.event-date, [itemprop="startDate"], time[datetime]
+.event-location, [itemprop="location"]
+.event-description, [itemprop="description"]
+.event-speaker, .speaker-name
+```
+
+### Navigation / Pagination
+
+```css
+/* Pagination controls */
+.pagination, .pager, nav[aria-label*="pagination" i]
+.pagination .next, a[rel="next"]
+.pagination .prev, a[rel="prev"]
+.page-numbers, .page-link
+button[data-page], a[data-page]
+.load-more, button.show-more
+```
+
+### Articles / Blog Posts
+
+```css
+/* Article content */
+article, .post, .entry, .article-content
+article h1, .post-title, .entry-title
+.author, .byline, [rel="author"]
+time, .date, .published, .post-date
+.post-content, .entry-content, .article-body
+.tags a, .categories a, .post-tags a
+```
+
+---
+
+## JavaScript Extraction Snippets
+
+### Generic Table Extractor
+
+```javascript
+function extractTable(selector) {
+  const table = document.querySelector(selector || 'table');
+  if (!table) return { error: 'No table found' };
+
+  // Prefer <thead>; otherwise treat the table's first row as the header
+  const allRows = Array.from(table.rows);
+  const headerRow = table.tHead?.rows[0] || allRows[0];
+  const headers = headerRow
+    ? Array.from(headerRow.cells).map(el => el.textContent.trim())
+    : [];
+
+  // Data rows: everything except the header row and other <thead> rows
+  const rows = allRows
+    .filter(tr => tr !== headerRow && tr.parentElement.tagName !== 'THEAD')
+    .map(tr => Array.from(tr.cells).map(td => td.textContent.trim()))
+    .filter(cells => cells.length > 0);
+
+  return { headers, rows, rowCount: rows.length };
+}
+JSON.stringify(extractTable());
+```
+
+### Multi-Table Extractor
+
+```javascript
+function extractAllTables() {
+  const tables = document.querySelectorAll('table');
+  return Array.from(tables).map((table, idx) => {
+    const caption = table.querySelector('caption')?.textContent?.trim()
+      || table.getAttribute('aria-label') || `Table ${idx + 1}`;
+    const headers = Array.from(
+      table.querySelectorAll('thead th, tr:first-child th')
+    ).map(el => el.textContent.trim());
+    const rows = Array.from(table.querySelectorAll('tbody tr'))
+      .map(tr => Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim()))
+      .filter(r => r.length > 0);
+    return { caption, headers, rows, rowCount: rows.length };
+  });
+}
+JSON.stringify(extractAllTables());
+```
+
+### Generic List Extractor
+
+```javascript
+function extractList(containerSelector, itemSelector, fieldMap) {
+  // fieldMap: { fieldName: { selector: 'CSS', attr: 'href'|'src'|null } }
+  const container = document.querySelector(containerSelector);
+  if (!container) return { error: 'Container not found' };
+
+  const items = Array.from(container.querySelectorAll(itemSelector));
+  const data = items.map(item => {
+    const record = {};
+    for (const [key, config] of Object.entries(fieldMap)) {
+      const sel = typeof config === 'string' ? config : config.selector;
+      const attr = typeof config === 'object' ? config.attr : null;
+      const el = item.querySelector(sel);
+      if (!el) { record[key] = null; continue; }
+      record[key] = attr ? el.getAttribute(attr) : el.textContent.trim();
+    }
+    return record;
+  });
+  return { data, itemCount: data.length };
+}
+
+// Example usage:
+JSON.stringify(extractList('.results', '.result-item', {
+  title: '.result-title',
+  description: '.result-snippet',
+  url: { selector: '.result-title a', attr: 'href' },
+  date: '.result-date'
+}));
+```
+
+### JSON-LD Structured Data Extractor
+
+Many pages embed structured data that's easier to parse than DOM:
+
+```javascript
+function extractJsonLd(targetType) {
+  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
+  const allData = Array.from(scripts).map(s => {
+    try { return JSON.parse(s.textContent); } catch { return null; }
+  }).filter(Boolean);
+
+  // Flatten top-level arrays and @graph arrays
+  const flat = allData
+    .flatMap(d => (Array.isArray(d) ? d : [d]))
+    .flatMap(d => d['@graph'] || [d]);
+
+  if (targetType) {
+    return flat.filter(d =>
+      d['@type'] === targetType ||
+      (Array.isArray(d['@type']) && d['@type'].includes(targetType))
+    );
+  }
+  return flat;
+}
+// Extract products: extractJsonLd('Product')
+// Extract articles: extractJsonLd('Article')
+// Extract all: extractJsonLd()
+JSON.stringify(extractJsonLd());
+```
+
+Common JSON-LD types and their useful fields:
+- `Product`: name, offers.price, offers.priceCurrency, aggregateRating, brand.name
+- `Article`: headline, author.name, datePublished, description, wordCount
+- `Organization`: name, address, telephone, email, url
+- `BreadcrumbList`: itemListElement[].name (navigation path)
+- `FAQPage`: mainEntity[].name (question), mainEntity[].acceptedAnswer.text
+- `JobPosting`: title, hiringOrganization.name, jobLocation, baseSalary
+- `Event`: name, startDate, endDate, location, performer
+
+### OpenGraph / Meta Tag Extractor
+
+```javascript
+function extractMeta() {
+  const meta = {};
+  document.querySelectorAll('meta[property^="og:"], meta[name^="twitter:"]')
+    .forEach(el => {
+      const key = el.getAttribute('property') || el.getAttribute('name');
+      meta[key] = el.getAttribute('content');
+    });
+  meta.title = document.title;
+  meta.description = document.querySelector('meta[name="description"]')
+    ?.getAttribute('content');
+  meta.canonical = document.querySelector('link[rel="canonical"]')
+    ?.getAttribute('href');
+  return meta;
+}
+JSON.stringify(extractMeta());
+```
+
+### Pricing Plan Extractor
+
+```javascript
+function extractPricingPlans() {
+  const cards = document.querySelectorAll(
+    '.pricing-card, .plan-card, .pricing-tier, [class*="pricing"] [class*="card"]'
+  );
+  return Array.from(cards).map(card => ({
+    name: card.querySelector('[class*="name"], [class*="title"], h2, h3')
+      ?.textContent?.trim() || null,
+    price: card.querySelector('[class*="price"], [class*="amount"]')
+      ?.textContent?.trim() || null,
+    period: card.querySelector('[class*="period"], [class*="billing"]')
+      ?.textContent?.trim() || null,
+    features: Array.from(card.querySelectorAll('[class*="feature"] li, ul li'))
+      .map(li => li.textContent.trim()),
+    highlighted: card.matches('[class*="popular"], [class*="recommended"], [class*="featured"]'),
+    ctaText: card.querySelector('a, button')?.textContent?.trim() || null,
+    ctaUrl: card.querySelector('a')?.href || null,
+  }));
+}
+JSON.stringify(extractPricingPlans());
+```
+
+### FAQ Extractor
+
+```javascript
+function extractFAQ() {
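+  // Try JSON-LD first (requires extractJsonLd from the section above)
+  const ldFaq = extractJsonLd('FAQPage');
+  if (ldFaq.length > 0 && ldFaq[0].mainEntity) {
+    // mainEntity may be a single object rather than an array
+    return [].concat(ldFaq[0].mainEntity).map(q => ({
+      question: q.name,
+      answer: q.acceptedAnswer?.text || null
+    }));
+  }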
+
+  // Try <details>/<summary> pattern
+  const details = document.querySelectorAll('details');
+  if (details.length > 0) {
+    return Array.from(details).map(d => ({
+      question: d.querySelector('summary')?.textContent?.trim() || null,
+      answer: Array.from(d.children).filter(c => c.tagName !== 'SUMMARY')
+        .map(c => c.textContent.trim()).join(' ')
+    }));
+  }
+
+  // Try accordion pattern
+  const items = document.querySelectorAll(
+    '.faq-item, .accordion-item, [class*="faq"] [class*="item"]'
+  );
+  return Array.from(items).map(item => ({
+    question: item.querySelector(
+      '[class*="question"], [class*="header"], [class*="title"], h3, h4'
+    )?.textContent?.trim() || null,
+    answer: item.querySelector(
+      '[class*="answer"], [class*="body"], [class*="content"], p'
+    )?.textContent?.trim() || null
+  }));
+}
+JSON.stringify(extractFAQ());
+```
+
+### Link Extractor
+
+```javascript
+function extractLinks(scope) {
+  const container = scope ? document.querySelector(scope) : document;
+  if (!container) return { error: 'Scope not found' };
+  const links = Array.from(container.querySelectorAll('a[href]'))
+    .map(a => ({
+      text: a.textContent.trim(),
+      href: a.href,
+      title: a.title || null
+    }))
+    .filter(l => l.text && l.href && !l.href.startsWith('javascript:'));
+  return { links, count: links.length };
+}
+JSON.stringify(extractLinks());
+```
+
+### Image Extractor
+
+```javascript
+function extractImages(scope) {
+  const container = scope ? document.querySelector(scope) : document;
+  if (!container) return { error: 'Scope not found' };
+  const images = Array.from(container.querySelectorAll('img'))
+    .map(img => ({
+      src: img.src,
+      alt: img.alt || null,
+      width: img.naturalWidth,
+      height: img.naturalHeight
+    }))
+    .filter(i => i.src && !i.src.includes('data:image/gif'));
+  return { images, count: images.length };
+}
+JSON.stringify(extractImages());
+```
+
+### Scroll-and-Collect Pattern
+
+For pages with lazy-loaded content, use this pattern with Browser automation:
+
+```javascript
+// Count items before scroll
+function countItems(selector) {
+  return document.querySelectorAll(selector).length;
+}
+```
+
+Then in the workflow:
+1. `javascript_tool`: `countItems('.item')` -> get initial count
+2. `computer(action="scroll", scroll_direction="down")`
+3. `computer(action="wait", duration=2)`
+4. `javascript_tool`: `countItems('.item')` -> get new count
+5. If new count > old count, repeat from step 2
+6. If count unchanged after 2 scrolls, all items loaded
+7. Extract all items at once
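
The stopping rule in steps 1-6 can be sketched as a loop. `count_items` and `scroll_down` are hypothetical callables standing in for the `javascript_tool` and `computer` calls. The `max_scrolls` cap is an added safety assumption, not part of the workflow above.

```python
import time

def scroll_until_stable(count_items, scroll_down, wait_seconds=2,
                        stable_rounds=2, max_scrolls=50, sleep=time.sleep):
    """Scroll until the item count stops growing for `stable_rounds` scrolls."""
    count = count_items()
    unchanged = 0
    for _ in range(max_scrolls):
        scroll_down()
        sleep(wait_seconds)  # give lazy-loaded content time to render
        new_count = count_items()
        if new_count > count:
            count, unchanged = new_count, 0
        else:
            unchanged += 1
            if unchanged >= stable_rounds:
                break
    return count
```

Passing `sleep` as a parameter keeps the loop testable without real delays.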
+
+---
+
+## Domain-Specific Tips
+
+### E-Commerce Sites
+- Check for JSON-LD `Product` schema first - often has cleaner data than DOM
+- Prices may have hidden original/sale price elements
+- Availability often encoded in data attributes (`data-available="true"`)
+- Product variants (size, color) may require click interactions
+- Review data often loaded lazily - scroll to reviews section first
+- Some sites expose internal JSON APIs (e.g. `/api/products`) - check the Network tab for such requests
+
+### Wikipedia
+- Tables use class `.wikitable` - always prefer this selector
+- Infoboxes use class `.infobox`
+- References in `<sup class="reference">` - exclude from text extraction
+- Table cells may contain complex nested HTML - use `.textContent.trim()`
+- Sortable tables have class `.sortable` with sort buttons in headers
+
+### News Sites
+- Article body often in `<article>` or `[itemprop="articleBody"]`
+- Paywall indicators: `.paywall`, `.subscribe-wall`, truncated with "Read more"
+- Publication date in `<time>` element or `[itemprop="datePublished"]`
+- Author in `[itemprop="author"]` or `.byline`
+- JSON-LD `NewsArticle` often has complete metadata
+
+### Government / Data Portals
+- Often use HTML tables without JavaScript
+- May have download links for CSV/Excel - check for `.csv`, `.xlsx` links
+- Data dictionaries may be on separate pages
+- Look for API endpoints in page source (`/api/`, `.json` links)
+- CORS may block direct API access; use Bash curl instead
+
+### Social Media (Public Profiles)
+- Content is almost always JS-rendered - use Browser automation
+- Rate limiting is aggressive - keep requests minimal
+- Infinite scroll is the norm - set clear item limits
+- Structure changes frequently - prefer text extraction over selectors
+
+### SaaS Pricing Pages
+- Pricing often changes dynamically (monthly vs annual toggle)
+- May need to click "Annual" toggle to see annual prices
+- Feature comparison tables often use checkmarks (Unicode or SVG)
+- Check for hidden elements toggled by billing period selector
+
+### Job Boards
+- Many embed JSON-LD `JobPosting` schema - check it before parsing the DOM
+- Salary ranges often hidden behind "View salary" buttons
+- Location may include remote/hybrid indicators
+- Filters are URL-parameter based - useful for pagination
+
+---
+
+## Anti-Patterns to Avoid
+
+| Anti-Pattern | Why It Fails | Better Approach |
+|:-------------|:-------------|:----------------|
+| Selectors with generated hashes (`.css-1a2b3c`) | Change on every deploy | Use semantic selectors, ARIA roles, data attributes |
+| Deeply nested paths (`div > div > div > span`) | Fragile on layout changes | Use closest meaningful class or attribute |
+| Index-based (`:nth-child(3)`) for dynamic lists | Order may change | Use content-based identification |
+| Selecting by inline styles | Presentation, not semantics | Use classes, IDs, or data attributes |
+| Hardcoded wait times for JS content | Too short or too long | Check for content presence in a loop |
+| Single selector for variant pages | Different pages differ | Test selector on multiple pages first |
+
+## Robust Selector Priority
+
+Prefer selectors in this order (most stable to least):
+
+1. `[data-testid="..."]`, `[data-id="..."]` - test/data attributes
+2. `#unique-id` - unique IDs
+3. `[role="..."]`, `[aria-label="..."]` - ARIA attributes
+4. `[itemprop="..."]`, `[itemtype="..."]` - microdata / schema.org
+5. `.semantic-class` - meaningful class names
+6. `tag.class` - element type + class
+7. Structural selectors - last resort
--- a/.claude/skills/web-scraper/references/output-templates.md
+++ b/.claude/skills/web-scraper/references/output-templates.md
@@ -0,0 +1,481 @@
+# Output Templates Reference
+
+Complete formatting templates for all supported output formats.
+Every output must be wrapped in a delivery envelope with metadata.
+
+---
+
+## Delivery Envelope (Required)
+
+Every extraction result MUST include this metadata wrapper,
+regardless of output format:
+
+```markdown
+## Extraction Results
+
+**Source:** [Page Title](https://example.com/page)
+**Date:** 2026-02-25 14:30 UTC
+**Items:** 47 records
+**Confidence:** HIGH
+**Format:** Markdown Table
+
+---
+
+[DATA GOES HERE]
+
+---
+
+**Notes:**
+- Any gaps, anomalies, or observations
+- Filters or sorts applied
+- Pages scraped (if paginated)
+```
+
+---
+
+## Markdown Table Format
+
+### Standard Table
+
+```markdown
+| Name           | Price    | Rating | Availability |
+|:---------------|---------:|:------:|:-------------|
+| Product Alpha  |   $29.99 |  4.5   | In Stock     |
+| Product Beta   |   $49.99 |  4.2   | In Stock     |
+| Product Gamma  |  $119.00 |  4.8   | Pre-order    |
+| Product Delta  |   $15.50 |  3.9   | Out of Stock |
+```
+
+### Alignment Rules
+
+| Data Type    | Alignment | Markdown Syntax |
+|:-------------|:----------|:----------------|
+| Text         | Left      | `:---`          |
+| Numbers      | Right     | `---:`          |
+| Centered     | Center    | `:---:`         |
+| Mixed/Status | Left      | `:---`          |
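
These alignment rules could be applied by a small renderer such as the sketch below. The function name and the `numeric_columns` parameter (a set of column indexes to right-align) are assumptions. Centered columns are omitted for brevity.

```python
def render_markdown_table(headers, rows, numeric_columns=()):
    """Render rows as a Markdown table; numeric columns are right-aligned."""
    cells = [[str(v) for v in row] for row in rows]
    # Column width: widest of header, cells, and a 4-char minimum for the rule
    widths = [max(len(h), 4, *(len(r[i]) for r in cells))
              for i, h in enumerate(headers)]

    def fmt(values):
        parts = [v.rjust(w) if i in numeric_columns else v.ljust(w)
                 for i, (v, w) in enumerate(zip(values, widths))]
        return '| ' + ' | '.join(parts) + ' |'

    rule = ['-' * (w - 1) + ':' if i in numeric_columns else ':' + '-' * (w - 1)
            for i, w in enumerate(widths)]
    lines = [fmt(headers), '| ' + ' | '.join(rule) + ' |']
    lines += [fmt(r) for r in cells]
    return '\n'.join(lines)
```

For example, `render_markdown_table(['Name', 'Price'], rows, numeric_columns={1})` left-aligns the names and right-aligns the prices.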
+
+### Table with Summary Row
+
+```markdown
+| Product        | Units Sold | Revenue    |
+|:---------------|----------:|-----------:|
+| Widget A       |     1,234 |  $12,340   |
+| Widget B       |       567 |   $8,505   |
+| Widget C       |     2,890 |  $57,800   |
+| **Total**      | **4,691** | **$78,645**|
+```
+
+### Wide Data (Split Tables)
+
+When data has more than 10 columns, split into logical groups:
+
+```markdown
+### Basic Information
+
+| Name    | Category | Brand   | SKU      |
+|:--------|:---------|:--------|:---------|
+| Item A  | Tools    | Acme    | ACM-001  |
+
+### Pricing and Availability
+
+| Name    | Price   | Sale Price | Stock | Ships In |
+|:--------|--------:|-----------:|:------|:---------|
+| Item A  | $49.99  |    $39.99  | 142   | 2 days   |
+```
+
+### Multi-URL Comparison Table
+
+```markdown
+| Source       | Product    | Price   | Rating |
+|:-------------|:-----------|--------:|:------:|
+| store-a.com  | Laptop X   | $999    |  4.3   |
+| store-b.com  | Laptop X   | $949    |  4.5   |
+| store-c.com  | Laptop X   | $1,029  |  4.1   |
+```
+
+### Truncation Rules
+
+For values exceeding 60 characters:
+```markdown
+| Title                                                       | Author  |
+|:------------------------------------------------------------|:--------|
+| Introduction to Advanced Machine Learning Techni...         | J. Smith|
+```
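
A truncation helper for this rule might look like the following sketch. Whether the `...` marker counts toward the 60-character limit is not specified above. This version assumes it does, and the function name is illustrative.

```python
def truncate(value, limit=60, marker='...'):
    """Shorten value to at most `limit` characters, ending with `marker`."""
    if len(value) <= limit:
        return value
    # Reserve room for the marker and avoid a dangling space before it
    return value[:limit - len(marker)].rstrip() + marker
```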
+
+---
+
+## JSON Format
+
+### Standard JSON Output
+
+```json
+{
+  "metadata": {
+    "source": "https://example.com/products",
+    "title": "Product Catalog - Example Store",
+    "extractedAt": "2026-02-25T14:30:00Z",
+    "itemCount": 3,
+    "confidence": "HIGH",
+    "fields": ["name", "price", "rating", "availability"],
+    "notes": []
+  },
+  "data": [
+    {
+      "name": "Product Alpha",
+      "price": 29.99,
+      "currency": "USD",
+      "rating": 4.5,
+      "availability": "In Stock"
+    },
+    {
+      "name": "Product Beta",
+      "price": 49.99,
+      "currency": "USD",
+      "rating": 4.2,
+      "availability": "In Stock"
+    },
+    {
+      "name": "Product Gamma",
+      "price": 119.00,
+      "currency": "USD",
+      "rating": 4.8,
+      "availability": "Pre-order"
+    }
+  ]
+}
+```
+
+### JSON Key Naming
+
+| Rule                   | Example                           |
+|:-----------------------|:----------------------------------|
+| camelCase              | `productName`, `unitPrice`        |
+| Numbers stay numeric   | `29.99` not `"29.99"`             |
+| Booleans stay boolean  | `true` not `"true"`               |
+| Missing = null         | `null` not `""` or `"N/A"`        |
+| Arrays for multiples   | `"tags": ["sale", "new"]`         |
+| ISO-8601 for dates     | `"2026-02-25T14:30:00Z"`          |
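+
+One way the first four rules might be applied to a scraped record,
+as a sketch only. It assumes values arrive as strings and that `""`
+and `"N/A"` are the only missing-value markers; `to_camel_case`,
+`coerce_value`, and `normalize_record` are hypothetical helpers.
+Arrays and date parsing are not covered here.
+
+```python
+import json
+import re
+
+
+def to_camel_case(label):
+    """'Unit Price' -> 'unitPrice'; labels already in camelCase pass through."""
+    words = [w.lower() if w.isupper() else w
+             for w in re.findall(r'[A-Za-z0-9]+', label)]
+    if not words:
+        return label
+    head = words[0][:1].lower() + words[0][1:]
+    return head + ''.join(w[:1].upper() + w[1:] for w in words[1:])
+
+
+def coerce_value(value):
+    """Map scraped strings to JSON-native types; leave other values as-is."""
+    if not isinstance(value, str):
+        return value
+    text = value.strip()
+    if text.lower() in ('', 'n/a'):
+        return None                      # missing -> null
+    if text.lower() in ('true', 'false'):
+        return text.lower() == 'true'    # booleans stay boolean
+    if re.fullmatch(r'[-+]?[0-9]+', text):
+        return int(text)                 # numbers stay numeric
+    if re.fullmatch(r'[-+]?[0-9]*\.[0-9]+', text):
+        return float(text)
+    return text
+
+
+def normalize_record(record):
+    """Apply key naming and value coercion to one flat record."""
+    return {to_camel_case(k): coerce_value(v) for k, v in record.items()}
+
+
+raw = {'Product Name': 'Widget', 'Unit Price': '29.99',
+       'In Stock': 'true', 'SKU': 'N/A'}
+print(json.dumps(normalize_record(raw)))
+```
+
+Values that still carry a currency symbol, such as `$29.99`, stay
+strings in this sketch and need price parsing first.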
+
+### Nested JSON (Product with Details)
+
+```json
+{
+  "metadata": { "..." : "..." },
+  "data": [
+    {
+      "name": "Laptop Pro X",
+      "brand": "TechCo",
+      "pricing": {
+        "current": 999.99,
+        "original": 1299.99,
+        "currency": "USD",
+        "discount": "23%"
+      },
+      "rating": {
+        "score": 4.5,
+        "count": 1234
+      },
+      "specifications": {
+        "processor": "M3 Pro",
+        "ram": "16 GB",
+        "storage": "512 GB SSD",
+        "display": "14.2 inch Retina"
+      },
+      "availability": {
+        "inStock": true,
+        "shipsIn": "2-3 business days"
+      }
+    }
+  ]
+}
+```
+
+### Multi-URL JSON
+
+```json
+{
+  "metadata": {
+    "sources": [
+      "https://store-a.com/laptop-x",
+      "https://store-b.com/laptop-x"
+    ],
+    "extractedAt": "2026-02-25T14:30:00Z",
+    "itemCount": 2,
+    "confidence": "HIGH"
+  },
+  "data": [
+    {
+      "source": "store-a.com",
+      "name": "Laptop X",
+      "price": 999,
+      "currency": "USD",
+      "rating": 4.3
+    },
+    {
+      "source": "store-b.com",
+      "name": "Laptop X",
+      "price": 949,
+      "currency": "USD",
+      "rating": 4.5
+    }
+  ]
+}
+```
+
+---
+
+## CSV Format
+
+### Standard CSV
+
+```csv
+# Source: https://example.com/products
+# Extracted: 2026-02-25 14:30 UTC
+# Items: 3 | Confidence: HIGH
+name,price,currency,rating,availability
+"Product Alpha",29.99,USD,4.5,"In Stock"
+"Product Beta",49.99,USD,4.2,"In Stock"
+"Product Gamma",119.00,USD,4.8,"Pre-order"
+```
+
+### CSV Rules
+
+| Rule                                 | Example                        |
+|:-------------------------------------|:-------------------------------|
+| Always include header row            | `name,price,rating`            |
+| Quote fields with commas             | `"Smith, John"`                |
+| Quote fields with quotes (escape)    | `"He said ""hello"""`          |
+| Quote fields with newlines           | `"Line 1\nLine 2"`             |
+| UTF-8 encoding with BOM              | `\xEF\xBB\xBF` prefix          |
+| Comma delimiter (standard)           | `,`                            |
+| Metadata as comments (# prefix)      | `# Source: URL`                |
+| null/missing as empty field          | `field1,,field3`               |
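+
+A sketch of these rules using Python's standard `csv` module, under
+the assumption that records are flat dicts. `write_csv` and `read_csv`
+are illustrative names. Note that `#` comment lines are a convention
+of this format and not part of standard CSV, so a consumer has to skip
+them before parsing, as `read_csv` does.
+
+```python
+import csv
+import io
+import itertools
+
+
+def write_csv(records, fields, metadata):
+    """Serialize records to UTF-8 bytes with a BOM and '#' metadata lines."""
+    buf = io.StringIO(newline='')
+    for label, value in metadata.items():
+        buf.write(f'# {label}: {value}\r\n')
+    # QUOTE_MINIMAL quotes only fields containing commas, quotes, or newlines
+    # and doubles embedded quotes.
+    writer = csv.DictWriter(buf, fieldnames=fields, quoting=csv.QUOTE_MINIMAL)
+    writer.writeheader()
+    for rec in records:
+        # null/missing becomes an empty field
+        writer.writerow({f: '' if rec.get(f) is None else rec[f]
+                         for f in fields})
+    return buf.getvalue().encode('utf-8-sig')  # 'utf-8-sig' prepends the BOM
+
+
+def read_csv(payload):
+    """Parse bytes from write_csv, skipping the leading '#' lines."""
+    stream = io.StringIO(payload.decode('utf-8-sig'), newline='')
+    body = itertools.dropwhile(lambda ln: ln.startswith('#'), stream)
+    return list(csv.DictReader(body))
+
+
+payload = write_csv(
+    [{'name': 'Smith, John', 'price': 29.99, 'note': 'He said "hello"'},
+     {'name': 'Beta', 'price': None, 'note': 'Line 1\nLine 2'}],
+    fields=['name', 'price', 'note'],
+    metadata={'Source': 'https://example.com/products'},
+)
+rows = read_csv(payload)
+```
+
+Only the leading comment lines are dropped, so a quoted multi-line
+field whose continuation happens to start with `#` is left intact.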
+
+### Multi-URL CSV
+
+```csv
+# Sources: store-a.com, store-b.com
+# Extracted: 2026-02-25 14:30 UTC
+source,name,price,currency,rating
+"store-a.com","Laptop X",999,USD,4.3
+"store-b.com","Laptop X",949,USD,4.5
+```
+
+---
+
+## Summary Statistics Template
+
+When extracted data contains numeric fields, include a summary block:
+
+```markdown
+### Summary Statistics
+
+| Metric    | Price     | Rating |
+|:----------|----------:|-------:|
+| Count     |        47 |     47 |
+| Min       |    $12.99 |    2.1 |
+| Max       |   $299.99 |    5.0 |
+| Average   |    $67.42 |    4.1 |
+| Median    |    $54.99 |    4.3 |
+```
+
+Include only when:
+- Data has numeric columns
+- More than 5 items extracted
+- User would likely benefit from aggregate view (prices, ratings, quantities)
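+
+The first two conditions and the metrics in the table can be sketched
+with the standard `statistics` module. This assumes numeric fields are
+already stored as numbers; `summarize` and `min_items` are
+illustrative names, and whether the user benefits from the summary
+remains a judgment call outside the code.
+
+```python
+import statistics
+
+
+def _is_number(value):
+    """Accept ints and floats but not booleans."""
+    return isinstance(value, (int, float)) and not isinstance(value, bool)
+
+
+def summarize(records, fields, min_items=5):
+    """Return {field: metrics}, or None when the summary should be omitted."""
+    if len(records) <= min_items:       # needs more than 5 items
+        return None
+    summary = {}
+    for field in fields:
+        values = [r[field] for r in records if _is_number(r.get(field))]
+        if not values:                  # not a numeric column
+            continue
+        summary[field] = {
+            'count': len(values),
+            'min': min(values),
+            'max': max(values),
+            'average': round(statistics.mean(values), 2),
+            'median': statistics.median(values),
+        }
+    return summary or None
+
+
+records = [{'price': p, 'rating': 4.0} for p in (10, 20, 30, 40, 50, 60)]
+stats = summarize(records, ['price', 'rating'])
+```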
+
+---
+
+## Contact Data Template
+
+```markdown
+| Name           | Title              | Email                | Phone          |
+|:---------------|:-------------------|:---------------------|:---------------|
+| Jane Smith     | CEO                | jane@example.com     | +1-555-0101    |
+| John Doe       | CTO                | john@example.com     | +1-555-0102    |
+| Alice Johnson  | VP Engineering     | alice@example.com    | N/A            |
+```
+
+---
+
+## Article Extraction Template
+
+```markdown
+## Article: [Title]
+
+**Author:** Author Name
+**Published:** YYYY-MM-DD
+**Source:** [Site Name](URL)
+
+### Summary
+[2-3 sentence summary of the article content]
+
+### Key Data Points
+- [Factual data point 1]
+- [Factual data point 2]
+- [Statistical finding]
+
+### Tags
+`tag1` `tag2` `tag3`
+```
+
+Note: Summarize article content. Do not reproduce full article text
+due to copyright.
+
+---
+
+## FAQ Extraction Template
+
+```markdown
+### FAQ: [Page Title]
+
+**Source:** [Site Name](URL)
+**Items:** 12 questions
+
+| # | Question | Answer (excerpt) |
+|--:|:---------|:-----------------|
+| 1 | How do I reset my password? | Navigate to Settings > Security and click "Reset..." |
+| 2 | What payment methods do you accept? | We accept Visa, Mastercard, PayPal, and bank transfer... |
+```
+
+Or as JSON (default for FAQ mode):
+```json
+{
+  "metadata": { "source": "URL", "itemCount": 12, "confidence": "HIGH" },
+  "data": [
+    { "question": "How do I reset my password?", "answer": "Navigate to...", "category": "Account" },
+    { "question": "What payment methods do you accept?", "answer": "We accept...", "category": "Billing" }
+  ]
+}
+```
+
+---
+
+## Pricing Plans Template
+
+```markdown
+### Pricing: [Product Name]
+
+**Source:** [Site Name](URL)
+**Plans:** 3 tiers
+
+| Plan        | Monthly   | Annual    | Highlighted |
+|:------------|----------:|----------:|:-----------:|
+| Starter     |    $9/mo  |   $7/mo   |             |
+| Pro         |   $29/mo  |  $24/mo   |     *       |
+| Enterprise  |  Custom   |  Custom   |             |
+
+#### Feature Comparison
+
+| Feature               | Starter | Pro   | Enterprise |
+|:----------------------|:-------:|:-----:|:----------:|
+| Users                 | 1       | 10    | Unlimited  |
+| Storage               | 5 GB    | 50 GB | Unlimited  |
+| API Access            | N/A     | Yes   | Yes        |
+| Priority Support      | N/A     | N/A   | Yes        |
+```
+
+---
+
+## Job Listings Template
+
+```markdown
+| Title              | Company     | Location       | Salary          | Type      | Posted     |
+|:-------------------|:------------|:---------------|:----------------|:----------|:-----------|
+| Senior Engineer    | TechCo      | Remote, US     | $150k - $200k   | Full-time | 2026-02-20 |
+| Product Manager    | StartupXYZ  | San Francisco  | $130k - $160k   | Full-time | 2026-02-18 |
+| Data Analyst       | DataCorp    | London, UK     | GBP 55k - 70k   | Contract  | 2026-02-22 |
+```
+
+---
+
+## Events Template
+
+```markdown
+| Event                  | Date       | Time    | Location          | Speakers       |
+|:-----------------------|:-----------|:--------|:------------------|:---------------|
+| Opening Keynote        | 2026-03-15 | 09:00   | Main Hall         | J. Smith       |
+| Workshop: AI Basics    | 2026-03-15 | 14:00   | Room 201          | A. Johnson     |
+| Networking Reception   | 2026-03-15 | 18:00   | Rooftop Lounge    | N/A            |
+```
+
+---
+
+## Differential (Diff) Output Template
+
+When comparing current extraction with a previous run:
+
+```markdown
+## Extraction Results (Diff)
+
+**Source:** [Page Title](URL)
+**Date:** 2026-02-25 14:30 UTC
+**Compared to:** 2026-02-20 10:00 UTC
+**Changes:** +5 new, -2 removed, 2 modified (3 field changes)
+
+---
+
+### New Items (+5)
+
+| Name           | Price   | Rating |
+|:---------------|--------:|:------:|
+| Product Eta    |  $39.99 |  4.6   |
+| Product Theta  |  $24.99 |  4.1   |
+| ...            |         |        |
+
+### Removed Items (-2)
+
+| Name              | Price      | Rating  |
+|:------------------|-----------:|:-------:|
+| ~~Product Alpha~~ | ~~$29.99~~ | ~~4.5~~ |
+| ~~Product Beta~~  | ~~$49.99~~ | ~~4.2~~ |
+
+### Modified Items (2)
+
+| Name           | Field   | Was          | Now        |
+|:---------------|:--------|:-------------|:-----------|
+| Product Gamma  | Price   | $119.00      | $109.00    |
+| Product Gamma  | Rating  | 4.8          | 4.9        |
+| Product Delta  | Stock   | Out of Stock | In Stock   |
+
+---
+
+**Summary:**
+- 5 new products added since last extraction
+- 2 products removed (possibly discontinued)
+- Product Gamma had a price drop of $10 and rating increase
+- Product Delta is back in stock
+```
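+
+The three sections of the diff template could be computed along these
+lines. This is a sketch that assumes each record is a flat dict and
+that one field (here `name`) uniquely identifies an item across runs;
+`diff_runs` is a hypothetical helper.
+
+```python
+def diff_runs(previous, current, key='name'):
+    """Return (added, removed, modified) between two extraction runs.
+
+    `modified` holds one (item, field, was, now) tuple per changed field,
+    matching the rows of the Modified Items table.
+    """
+    old = {rec[key]: rec for rec in previous}
+    new = {rec[key]: rec for rec in current}
+    added = [new[k] for k in new if k not in old]
+    removed = [old[k] for k in old if k not in new]
+    modified = []
+    for k, rec in new.items():
+        if k not in old:
+            continue
+        for field, now in rec.items():
+            was = old[k].get(field)
+            if field != key and was != now:
+                modified.append((k, field, was, now))
+    return added, removed, modified
+
+
+previous = [
+    {'name': 'Product Alpha', 'price': 29.99, 'rating': 4.5},
+    {'name': 'Product Gamma', 'price': 119.00, 'rating': 4.8},
+]
+current = [
+    {'name': 'Product Gamma', 'price': 109.00, 'rating': 4.9},
+    {'name': 'Product Eta', 'price': 39.99, 'rating': 4.6},
+]
+added, removed, modified = diff_runs(previous, current)
+```
+
+Fields that exist only in the previous run are ignored by this
+sketch, so a dropped column would need separate handling.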
+
+---
+
+## Error / Partial Result Template
+
+When extraction partially fails:
+
+```markdown
+## Extraction Results (Partial)
+
+**Source:** [Page Title](URL)
+**Date:** 2026-02-25 14:30 UTC
+**Items:** 23 of ~50 expected records
+**Confidence:** LOW
+**Strategy:** A (WebFetch) -> escalated to B (Browser)
+
+---
+
+[PARTIAL DATA]
+
+---
+
+**Issues:**
+- 27 items could not be extracted (content behind JS rendering)
+- Price field missing for 5 items (marked N/A)
+- Auto-escalation from WebFetch to Browser recovered 15 additional items
+
+**Suggestions:**
+- Re-run with explicit Browser automation for complete results
+- Check if site has an API endpoint for direct data access
+- Try at a different time if rate-limited
+```