chore: ai agent config

Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
2026-04-21 20:19:05 -04:00
parent ffc4a2c5c5
commit 7cf21546e2
65 changed files with 10076 additions and 133 deletions
--- a/.claude/skills/web-scraper/references/data-transforms.md
+++ b/.claude/skills/web-scraper/references/data-transforms.md
@@ -0,0 +1,397 @@
+# Data Transforms Reference
+
+Patterns for cleaning, normalizing, deduplicating, and enriching
+extracted web data. Apply these transforms in Phase 5 (Transform)
+between extraction and validation.
+
+---
+
+## Automatic Transforms
+
+Always apply these to every extraction result.
+
+### Whitespace Cleanup
+
+```python
+# Remove leading/trailing whitespace, collapse internal whitespace
+value = ' '.join(value.split())
+
+# Remove zero-width characters
+import re
+value = re.sub(r'[\u200b\u200c\u200d\ufeff\u00a0]', ' ', value).strip()
+```
+
+Patterns to handle:
+- `\n`, `\r`, `\t` inside cell values -> single space
+- Multiple consecutive spaces -> single space
+- Non-breaking spaces (`&nbsp;`, `\u00a0`) -> regular space
+- Zero-width characters -> remove
+
+### HTML Entity Decode
+
+| Entity      | Character | Entity     | Character |
+|:------------|:----------|:-----------|:----------|
+| `&amp;`     | `&`       | `&quot;`   | `"`       |
+| `&lt;`      | `<`       | `&apos;`   | `'`       |
+| `&gt;`      | `>`       | `&#39;`    | `'`       |
+| `&nbsp;`    | ` `       | `&#8217;`  | (curly ')  |
+| `&mdash;`   | `--`      | `&#8212;`  | `--`      |
+
+```python
+import html
+value = html.unescape(value)
+```
+
+### Unicode Normalization
+
+```python
+import unicodedata
+value = unicodedata.normalize('NFKC', value)
+```
+
+This handles:
+- Fancy quotes -> standard quotes
+- Ligatures -> separate characters (e.g. `ﬁ` -> `fi`)
+- Full-width characters -> standard (e.g. `Ａ` -> `A`)
+- Superscript/subscript numbers -> regular numbers
+
+### Empty Value Standardization
+
+| Input                   | Markdown Output | JSON Output |
+|:------------------------|:----------------|:------------|
+| `""` (empty string)     | `N/A`           | `null`      |
+| `"-"` or `"--"`         | `N/A`           | `null`      |
+| `"N/A"`, `"n/a"`, `"NA"`| `N/A`           | `null`      |
+| `"None"`, `"null"`      | `N/A`           | `null`      |
+| `"TBD"`, `"TBA"`        | `TBD`           | `"TBD"`     |
+
+---
+
+## Price Normalization
+
+Apply when extracting product, pricing, or financial data.
+
+### Extraction Pattern
+
+```python
+import re
+
+def normalize_price(raw):
+    if not raw:
+        return None
+    # Remove currency words
+    cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw)
+    # Extract numeric value (handles 1,234.56 and 1.234,56 formats)
+    match = re.search(r'[\d.,]+', cleaned)
+    if not match:
+        return None
+    num_str = match.group()
+    # Detect format: if last separator is comma with 2 digits after, it's decimal
+    if re.search(r',\d{2}$', num_str):
+        num_str = num_str.replace('.', '').replace(',', '.')
+    else:
+        num_str = num_str.replace(',', '')
+    return float(num_str)
+```
+
+### Currency Detection
+
+| Symbol/Code | Currency | Symbol/Code | Currency |
+|:------------|:---------|:------------|:---------|
+| `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real |
+| `€`, `EUR` | Euro     | `£`, `GBP`  | British Pound |
+| `¥`, `JPY` | Yen      | `₹`, `INR`  | Indian Rupee  |
+| `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar |
+
+### Output Format
+
+```json
+{
+  "price": 29.99,
+  "currency": "USD",
+  "rawPrice": "$29.99"
+}
+```
+
+For Markdown, show formatted: `$29.99` (right-aligned in table).
+
+---
+
+## Date Normalization
+
+Normalize all dates to ISO-8601 format.
+
+### Common Formats to Handle
+
+| Input Format            | Example              | Normalized         |
+|:------------------------|:---------------------|:-------------------|
+| Full text               | February 25, 2026    | 2026-02-25         |
+| Short text              | Feb 25, 2026         | 2026-02-25         |
+| US numeric              | 02/25/2026           | 2026-02-25         |
+| EU numeric              | 25/02/2026           | 2026-02-25         |
+| ISO already             | 2026-02-25           | 2026-02-25         |
+| Relative                | 3 days ago           | (compute from now) |
+| Relative                | Yesterday            | (compute from now) |
+| Timestamp               | 1740441600           | 2025-02-25         |
+| With time               | 2026-02-25T14:30:00Z | 2026-02-25 14:30   |
+
+### Ambiguous Dates
+
+When format is ambiguous (e.g. `03/04/2026`):
+- Default to US format (MM/DD/YYYY) unless site is clearly non-US
+- Check page `lang` attribute or URL TLD for locale hints
+- Note ambiguity in delivery notes
+
+### Relative Date Resolution
+
+```python
+from datetime import datetime, timedelta
+import re
+
+def resolve_relative_date(text):
+    text = text.lower().strip()
+    today = datetime.now()
+
+    if 'today' in text: return today.strftime('%Y-%m-%d')
+    if 'yesterday' in text: return (today - timedelta(days=1)).strftime('%Y-%m-%d')
+
+    match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', text)
+    if match:
+        n, unit = int(match.group(1)), match.group(2)
+        deltas = {'hour': 0, 'day': n, 'week': n*7, 'month': n*30, 'year': n*365}
+        return (today - timedelta(days=deltas.get(unit, 0))).strftime('%Y-%m-%d')
+
+    return text  # Return as-is if can't parse
+```
+
+---
+
+## URL Resolution
+
+Convert relative URLs to absolute.
+
+### Patterns
+
+| Input                    | Base URL                    | Resolved                              |
+|:-------------------------|:----------------------------|:--------------------------------------|
+| `/products/item-1`       | `https://example.com/shop`  | `https://example.com/products/item-1` |
+| `item-1`                 | `https://example.com/shop/` | `https://example.com/shop/item-1`     |
+| `//cdn.example.com/img`  | `https://example.com`       | `https://cdn.example.com/img`         |
+| `https://other.com/page` | (any)                       | `https://other.com/page` (absolute)   |
+
+### JavaScript Resolution
+
+```javascript
+function resolveUrl(relative, base) {
+  try { return new URL(relative, base || window.location.href).href; }
+  catch { return relative; }
+}
+```
+
+---
+
+## Phone Normalization
+
+For contact mode extraction.
+
+### Pattern
+
+```python
+import re
+
+def normalize_phone(raw):
+    if not raw:
+        return None
+    # Remove all non-digit chars except leading +
+    digits = re.sub(r'[^\d+]', '', raw)
+    if not digits or len(digits) < 7:
+        return None
+    # Add + prefix if looks international
+    if len(digits) >= 11 and not digits.startswith('+'):
+        digits = '+' + digits
+    return digits
+```
+
+### Format by Context
+
+| Context          | Format Example       |
+|:-----------------|:---------------------|
+| JSON output      | `"+5511999998888"`   |
+| Markdown table   | `+55 11 99999-8888`  |
+| CSV output       | `"+5511999998888"`   |
+
+---
+
+## Deduplication
+
+### Exact Deduplication
+
+```python
+def deduplicate(records, key_fields=None):
+    """Remove exact duplicate records.
+    If key_fields provided, deduplicate by those fields only.
+    """
+    seen = set()
+    unique = []
+    for record in records:
+        if key_fields:
+            key = tuple(record.get(f) for f in key_fields)
+        else:
+            key = tuple(sorted(record.items()))
+        if key not in seen:
+            seen.add(key)
+            unique.append(record)
+    return unique, len(records) - len(unique)  # returns (unique_list, removed_count)
+```
+
+### Near-Duplicate Detection
+
+When records share key fields but differ in details:
+1. Group by key fields (e.g. product name + source)
+2. For each group, keep the record with fewest null values
+3. If tie, keep the first occurrence
+4. Report in notes: "Merged N near-duplicate records"
+
+### Dedup Key Selection by Mode
+
+| Mode     | Key Fields                        |
+|:---------|:----------------------------------|
+| product  | name + source (or name + brand)   |
+| contact  | name + email (or name + org)      |
+| jobs     | title + company + location        |
+| events   | title + date + location           |
+| table    | all fields (exact match)          |
+| list     | first 2-3 identifying fields      |
+
+---
+
+## Text Cleaning
+
+### Remove Noise
+
+Common noise patterns to strip from extracted text:
+
+| Pattern                            | Action                    |
+|:-----------------------------------|:--------------------------|
+| `\[edit\]`, `\[citation needed\]`  | Remove (Wikipedia)        |
+| `Read more...`, `See more`         | Remove (truncation markers)|
+| `Sponsored`, `Ad`, `Promoted`      | Remove or flag            |
+| Cookie consent text                | Remove                    |
+| Navigation breadcrumbs             | Remove                    |
+| Footer boilerplate                 | Remove                    |
+
+### Sentence Case Normalization
+
+When extracting ALL-CAPS or inconsistent-case text:
+
+```python
+def normalize_case(text):
+    if text.isupper() and len(text) > 3:
+        return text.title()  # ALL CAPS -> Title Case
+    return text
+```
+
+Only apply when: field is clearly ALL-CAPS input (common in older sites),
+user requests it, or data looks better normalized.
+
+---
+
+## Data Type Coercion
+
+### Automatic Type Detection
+
+| Raw Value     | Detected Type | Coerced Value     |
+|:--------------|:--------------|:------------------|
+| `"123"`       | integer       | `123`             |
+| `"12.99"`     | float         | `12.99`           |
+| `"true"`      | boolean       | `true`            |
+| `"false"`     | boolean       | `false`           |
+| `"2026-02-25"`| date string   | `"2026-02-25"`    |
+| `"$29.99"`    | price         | `29.99` + currency|
+| `"4.5/5"`     | rating        | `4.5`             |
+| `"1,234"`     | integer       | `1234`            |
+
+### Rating Normalization
+
+```python
+import re
+
+def normalize_rating(raw):
+    if not raw:
+        return None
+    match = re.search(r'([\d.]+)\s*(?:/\s*([\d.]+))?', str(raw))
+    if match:
+        score = float(match.group(1))
+        max_score = float(match.group(2)) if match.group(2) else 5.0
+        return round(score / max_score * 5, 1)  # Normalize to /5 scale
+    return None
+```
+
+---
+
+## Enrichment Patterns
+
+### Domain Extraction
+
+Add domain from full URLs:
+```python
+from urllib.parse import urlparse
+
+def extract_domain(url):
+    try:
+        parsed = urlparse(url)
+        domain = parsed.netloc.replace('www.', '')
+        return domain
+    except:
+        return None
+```
+
+### Word Count
+
+For article mode:
+```python
+def word_count(text):
+    return len(text.split()) if text else 0
+```
+
+### Relative Time
+
+Add human-readable time since date:
+```python
+def time_since(date_str):
+    from datetime import datetime
+    try:
+        dt = datetime.fromisoformat(date_str)
+        delta = datetime.now() - dt
+        if delta.days == 0: return "Today"
+        if delta.days == 1: return "Yesterday"
+        if delta.days < 7: return f"{delta.days} days ago"
+        if delta.days < 30: return f"{delta.days // 7} weeks ago"
+        if delta.days < 365: return f"{delta.days // 30} months ago"
+        return f"{delta.days // 365} years ago"
+    except:
+        return None
+```
+
+---
+
+## Transform Pipeline Order
+
+Apply transforms in this sequence:
+
+1. **HTML entity decode** - raw text cleanup
+2. **Unicode normalization** - character standardization
+3. **Whitespace cleanup** - spacing normalization
+4. **Empty value standardization** - null/N/A handling
+5. **URL resolution** - relative to absolute
+6. **Data type coercion** - strings to numbers/dates
+7. **Price normalization** - if applicable
+8. **Date normalization** - if applicable
+9. **Phone normalization** - if applicable
+10. **Text cleaning** - noise removal
+11. **Deduplication** - remove duplicates
+12. **Sorting** - user-requested order
+13. **Enrichment** - domain, word count, etc.
+
+Not all steps apply to every extraction. Apply only what's relevant
+to the data type and extraction mode.