chore: ai agent config
Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
This commit is contained in:
397
.claude/skills/web-scraper/references/data-transforms.md
Normal file
397
.claude/skills/web-scraper/references/data-transforms.md
Normal file
@@ -0,0 +1,397 @@
|
||||
# Data Transforms Reference
|
||||
|
||||
Patterns for cleaning, normalizing, deduplicating, and enriching
|
||||
extracted web data. Apply these transforms in Phase 5 (Transform)
|
||||
between extraction and validation.
|
||||
|
||||
---
|
||||
|
||||
## Automatic Transforms
|
||||
|
||||
Always apply these to every extraction result.
|
||||
|
||||
### Whitespace Cleanup
|
||||
|
||||
```python
|
||||
# Remove leading/trailing whitespace, collapse internal whitespace
|
||||
value = ' '.join(value.split())
|
||||
|
||||
# Remove zero-width characters
|
||||
import re
|
||||
value = re.sub(r'[\u200b\u200c\u200d\ufeff\u00a0]', ' ', value).strip()
|
||||
```
|
||||
|
||||
Patterns to handle:
|
||||
- `\n`, `\r`, `\t` inside cell values -> single space
|
||||
- Multiple consecutive spaces -> single space
|
||||
- Non-breaking spaces (` `, `\u00a0`) -> regular space
|
||||
- Zero-width characters -> remove
|
||||
|
||||
### HTML Entity Decode
|
||||
|
||||
| Entity | Character | Entity | Character |
|
||||
|:------------|:----------|:-----------|:----------|
|
||||
| `&` | `&` | `"` | `"` |
|
||||
| `<` | `<` | `'` | `'` |
|
||||
| `>` | `>` | `'` | `'` |
|
||||
| ` ` | ` ` | `’` | (curly ') |
|
||||
| `—` | `--` | `—` | `--` |
|
||||
|
||||
```python
|
||||
import html
|
||||
value = html.unescape(value)
|
||||
```
|
||||
|
||||
### Unicode Normalization
|
||||
|
||||
```python
|
||||
import unicodedata
|
||||
value = unicodedata.normalize('NFKC', value)
|
||||
```
|
||||
|
||||
This handles:
|
||||
- Fancy quotes -> standard quotes
|
||||
- Ligatures -> separate characters (e.g. `fi` -> `fi`)
|
||||
- Full-width characters -> standard (e.g. `A` -> `A`)
|
||||
- Superscript/subscript numbers -> regular numbers
|
||||
|
||||
### Empty Value Standardization
|
||||
|
||||
| Input | Markdown Output | JSON Output |
|
||||
|:------------------------|:----------------|:------------|
|
||||
| `""` (empty string) | `N/A` | `null` |
|
||||
| `"-"` or `"--"` | `N/A` | `null` |
|
||||
| `"N/A"`, `"n/a"`, `"NA"`| `N/A` | `null` |
|
||||
| `"None"`, `"null"` | `N/A` | `null` |
|
||||
| `"TBD"`, `"TBA"` | `TBD` | `"TBD"` |
|
||||
|
||||
---
|
||||
|
||||
## Price Normalization
|
||||
|
||||
Apply when extracting product, pricing, or financial data.
|
||||
|
||||
### Extraction Pattern
|
||||
|
||||
```python
|
||||
import re
|
||||
|
||||
def normalize_price(raw):
|
||||
if not raw:
|
||||
return None
|
||||
# Remove currency words
|
||||
cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw)
|
||||
# Extract numeric value (handles 1,234.56 and 1.234,56 formats)
|
||||
match = re.search(r'[\d.,]+', cleaned)
|
||||
if not match:
|
||||
return None
|
||||
num_str = match.group()
|
||||
# Detect format: if last separator is comma with 2 digits after, it's decimal
|
||||
if re.search(r',\d{2}$', num_str):
|
||||
num_str = num_str.replace('.', '').replace(',', '.')
|
||||
else:
|
||||
num_str = num_str.replace(',', '')
|
||||
return float(num_str)
|
||||
```
|
||||
|
||||
### Currency Detection
|
||||
|
||||
| Symbol/Code | Currency | Symbol/Code | Currency |
|
||||
|:------------|:---------|:------------|:---------|
|
||||
| `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real |
|
||||
| `€`, `EUR` | Euro | `£`, `GBP` | British Pound |
|
||||
| `¥`, `JPY` | Yen | `₹`, `INR` | Indian Rupee |
|
||||
| `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar |
|
||||
|
||||
### Output Format
|
||||
|
||||
```json
|
||||
{
|
||||
"price": 29.99,
|
||||
"currency": "USD",
|
||||
"rawPrice": "$29.99"
|
||||
}
|
||||
```
|
||||
|
||||
For Markdown, show formatted: `$29.99` (right-aligned in table).
|
||||
|
||||
---
|
||||
|
||||
## Date Normalization
|
||||
|
||||
Normalize all dates to ISO-8601 format.
|
||||
|
||||
### Common Formats to Handle
|
||||
|
||||
| Input Format | Example | Normalized |
|
||||
|:------------------------|:---------------------|:-------------------|
|
||||
| Full text | February 25, 2026 | 2026-02-25 |
|
||||
| Short text | Feb 25, 2026 | 2026-02-25 |
|
||||
| US numeric | 02/25/2026 | 2026-02-25 |
|
||||
| EU numeric | 25/02/2026 | 2026-02-25 |
|
||||
| ISO already | 2026-02-25 | 2026-02-25 |
|
||||
| Relative | 3 days ago | (compute from now) |
|
||||
| Relative | Yesterday | (compute from now) |
|
||||
| Timestamp | 1740441600 | 2025-02-25 |
|
||||
| With time | 2026-02-25T14:30:00Z | 2026-02-25 14:30 |
|
||||
|
||||
### Ambiguous Dates
|
||||
|
||||
When format is ambiguous (e.g. `03/04/2026`):
|
||||
- Default to US format (MM/DD/YYYY) unless site is clearly non-US
|
||||
- Check page `lang` attribute or URL TLD for locale hints
|
||||
- Note ambiguity in delivery notes
|
||||
|
||||
### Relative Date Resolution
|
||||
|
||||
```python
|
||||
from datetime import datetime, timedelta
|
||||
import re
|
||||
|
||||
def resolve_relative_date(text):
|
||||
text = text.lower().strip()
|
||||
today = datetime.now()
|
||||
|
||||
if 'today' in text: return today.strftime('%Y-%m-%d')
|
||||
if 'yesterday' in text: return (today - timedelta(days=1)).strftime('%Y-%m-%d')
|
||||
|
||||
match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', text)
|
||||
if match:
|
||||
n, unit = int(match.group(1)), match.group(2)
|
||||
deltas = {'hour': 0, 'day': n, 'week': n*7, 'month': n*30, 'year': n*365}
|
||||
return (today - timedelta(days=deltas.get(unit, 0))).strftime('%Y-%m-%d')
|
||||
|
||||
return text # Return as-is if can't parse
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## URL Resolution
|
||||
|
||||
Convert relative URLs to absolute.
|
||||
|
||||
### Patterns
|
||||
|
||||
| Input | Base URL | Resolved |
|
||||
|:-------------------------|:----------------------------|:--------------------------------------|
|
||||
| `/products/item-1` | `https://example.com/shop` | `https://example.com/products/item-1` |
|
||||
| `item-1` | `https://example.com/shop/` | `https://example.com/shop/item-1` |
|
||||
| `//cdn.example.com/img` | `https://example.com` | `https://cdn.example.com/img` |
|
||||
| `https://other.com/page` | (any) | `https://other.com/page` (absolute) |
|
||||
|
||||
### JavaScript Resolution
|
||||
|
||||
```javascript
|
||||
function resolveUrl(relative, base) {
|
||||
try { return new URL(relative, base || window.location.href).href; }
|
||||
catch { return relative; }
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phone Normalization
|
||||
|
||||
For contact mode extraction.
|
||||
|
||||
### Pattern
|
||||
|
||||
```python
|
||||
import re
|
||||
|
||||
def normalize_phone(raw):
|
||||
if not raw:
|
||||
return None
|
||||
# Remove all non-digit chars except leading +
|
||||
digits = re.sub(r'[^\d+]', '', raw)
|
||||
if not digits or len(digits) < 7:
|
||||
return None
|
||||
# Add + prefix if looks international
|
||||
if len(digits) >= 11 and not digits.startswith('+'):
|
||||
digits = '+' + digits
|
||||
return digits
|
||||
```
|
||||
|
||||
### Format by Context
|
||||
|
||||
| Context | Format Example |
|
||||
|:-----------------|:---------------------|
|
||||
| JSON output | `"+5511999998888"` |
|
||||
| Markdown table | `+55 11 99999-8888` |
|
||||
| CSV output | `"+5511999998888"` |
|
||||
|
||||
---
|
||||
|
||||
## Deduplication
|
||||
|
||||
### Exact Deduplication
|
||||
|
||||
```python
|
||||
def deduplicate(records, key_fields=None):
|
||||
"""Remove exact duplicate records.
|
||||
If key_fields provided, deduplicate by those fields only.
|
||||
"""
|
||||
seen = set()
|
||||
unique = []
|
||||
for record in records:
|
||||
if key_fields:
|
||||
key = tuple(record.get(f) for f in key_fields)
|
||||
else:
|
||||
key = tuple(sorted(record.items()))
|
||||
if key not in seen:
|
||||
seen.add(key)
|
||||
unique.append(record)
|
||||
return unique, len(records) - len(unique) # returns (unique_list, removed_count)
|
||||
```
|
||||
|
||||
### Near-Duplicate Detection
|
||||
|
||||
When records share key fields but differ in details:
|
||||
1. Group by key fields (e.g. product name + source)
|
||||
2. For each group, keep the record with fewest null values
|
||||
3. If tie, keep the first occurrence
|
||||
4. Report in notes: "Merged N near-duplicate records"
|
||||
|
||||
### Dedup Key Selection by Mode
|
||||
|
||||
| Mode | Key Fields |
|
||||
|:---------|:----------------------------------|
|
||||
| product | name + source (or name + brand) |
|
||||
| contact | name + email (or name + org) |
|
||||
| jobs | title + company + location |
|
||||
| events | title + date + location |
|
||||
| table | all fields (exact match) |
|
||||
| list | first 2-3 identifying fields |
|
||||
|
||||
---
|
||||
|
||||
## Text Cleaning
|
||||
|
||||
### Remove Noise
|
||||
|
||||
Common noise patterns to strip from extracted text:
|
||||
|
||||
| Pattern | Action |
|
||||
|:-----------------------------------|:--------------------------|
|
||||
| `\[edit\]`, `\[citation needed\]` | Remove (Wikipedia) |
|
||||
| `Read more...`, `See more` | Remove (truncation markers)|
|
||||
| `Sponsored`, `Ad`, `Promoted` | Remove or flag |
|
||||
| Cookie consent text | Remove |
|
||||
| Navigation breadcrumbs | Remove |
|
||||
| Footer boilerplate | Remove |
|
||||
|
||||
### Sentence Case Normalization
|
||||
|
||||
When extracting ALL-CAPS or inconsistent-case text:
|
||||
|
||||
```python
|
||||
def normalize_case(text):
|
||||
if text.isupper() and len(text) > 3:
|
||||
return text.title() # ALL CAPS -> Title Case
|
||||
return text
|
||||
```
|
||||
|
||||
Only apply when: field is clearly ALL-CAPS input (common in older sites),
|
||||
user requests it, or data looks better normalized.
|
||||
|
||||
---
|
||||
|
||||
## Data Type Coercion
|
||||
|
||||
### Automatic Type Detection
|
||||
|
||||
| Raw Value | Detected Type | Coerced Value |
|
||||
|:--------------|:--------------|:------------------|
|
||||
| `"123"` | integer | `123` |
|
||||
| `"12.99"` | float | `12.99` |
|
||||
| `"true"` | boolean | `true` |
|
||||
| `"false"` | boolean | `false` |
|
||||
| `"2026-02-25"`| date string | `"2026-02-25"` |
|
||||
| `"$29.99"` | price | `29.99` + currency|
|
||||
| `"4.5/5"` | rating | `4.5` |
|
||||
| `"1,234"` | integer | `1234` |
|
||||
|
||||
### Rating Normalization
|
||||
|
||||
```python
|
||||
import re
|
||||
|
||||
def normalize_rating(raw):
|
||||
if not raw:
|
||||
return None
|
||||
match = re.search(r'([\d.]+)\s*(?:/\s*([\d.]+))?', str(raw))
|
||||
if match:
|
||||
score = float(match.group(1))
|
||||
max_score = float(match.group(2)) if match.group(2) else 5.0
|
||||
return round(score / max_score * 5, 1) # Normalize to /5 scale
|
||||
return None
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Enrichment Patterns
|
||||
|
||||
### Domain Extraction
|
||||
|
||||
Add domain from full URLs:
|
||||
```python
|
||||
from urllib.parse import urlparse
|
||||
|
||||
def extract_domain(url):
|
||||
try:
|
||||
parsed = urlparse(url)
|
||||
domain = parsed.netloc.replace('www.', '')
|
||||
return domain
|
||||
except:
|
||||
return None
|
||||
```
|
||||
|
||||
### Word Count
|
||||
|
||||
For article mode:
|
||||
```python
|
||||
def word_count(text):
|
||||
return len(text.split()) if text else 0
|
||||
```
|
||||
|
||||
### Relative Time
|
||||
|
||||
Add human-readable time since date:
|
||||
```python
|
||||
def time_since(date_str):
|
||||
from datetime import datetime
|
||||
try:
|
||||
dt = datetime.fromisoformat(date_str)
|
||||
delta = datetime.now() - dt
|
||||
if delta.days == 0: return "Today"
|
||||
if delta.days == 1: return "Yesterday"
|
||||
if delta.days < 7: return f"{delta.days} days ago"
|
||||
if delta.days < 30: return f"{delta.days // 7} weeks ago"
|
||||
if delta.days < 365: return f"{delta.days // 30} months ago"
|
||||
return f"{delta.days // 365} years ago"
|
||||
except:
|
||||
return None
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Transform Pipeline Order
|
||||
|
||||
Apply transforms in this sequence:
|
||||
|
||||
1. **HTML entity decode** - raw text cleanup
|
||||
2. **Unicode normalization** - character standardization
|
||||
3. **Whitespace cleanup** - spacing normalization
|
||||
4. **Empty value standardization** - null/N/A handling
|
||||
5. **URL resolution** - relative to absolute
|
||||
6. **Data type coercion** - strings to numbers/dates
|
||||
7. **Price normalization** - if applicable
|
||||
8. **Date normalization** - if applicable
|
||||
9. **Phone normalization** - if applicable
|
||||
10. **Text cleaning** - noise removal
|
||||
11. **Deduplication** - remove duplicates
|
||||
12. **Sorting** - user-requested order
|
||||
13. **Enrichment** - domain, word count, etc.
|
||||
|
||||
Not all steps apply to every extraction. Apply only what's relevant
|
||||
to the data type and extraction mode.
|
||||
Reference in New Issue
Block a user