chore: ai agent config

Signed-off-by: Dmytro Stanchiev <git@dmytros.dev>
2026-04-21 20:19:05 -04:00
parent ffc4a2c5c5
commit 7cf21546e2
65 changed files with 10076 additions and 133 deletions
--- a/.claude/skills/web-scraper/SKILL.md
+++ b/.claude/skills/web-scraper/SKILL.md
@@ -0,0 +1,757 @@
+---
+name: web-scraper
+description: Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.
+risk: safe
+source: community
+date_added: '2026-03-06'
+author: renat
+tags:
+- scraping
+- data-extraction
+- automation
+- csv
+tools:
+- claude-code
+- antigravity
+- cursor
+- gemini-cli
+- codex-cli
+---
+
+# Web Scraper
+
+## Overview
+
+Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV/JSON.
+
+## When to Use This Skill
+
+- When the user mentions "scraper" or related topics
+- When the user mentions "scraping" or related topics
+- When the user mentions "extrair dados web" or related topics
+- When the user mentions "web scraping" or related topics
+- When the user mentions "raspar dados" or related topics
+- When the user mentions "coletar dados site" or related topics
+
+## Do Not Use This Skill When
+
+- The task is unrelated to web scraper
+- A simpler, more specific tool can handle the request
+- The user needs general-purpose assistance without domain expertise
+
+## How It Works
+
+Execute phases in strict order. Each phase feeds the next.
+
+```
+1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT
+```
+
+Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
+
+**Fast path**: If user provides URL + clear data target + the request is simple
+(single page, one data type), compress Phases 1-3 into a single action:
+fetch, classify, and extract in one WebFetch call. Still validate and format.
+
+---
+
+## Capabilities
+
+- **Multi-strategy**: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
+- **Extraction modes**: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
+- **Output formats**: Markdown tables (default), JSON, CSV
+- **Pagination**: auto-detect and follow (page numbers, infinite scroll, load-more)
+- **Multi-URL**: extract same structure across sources with comparison and diff
+- **Validation**: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
+- **Auto-escalation**: WebFetch fails silently -> automatic Browser fallback
+- **Data transforms**: cleaning, normalization, deduplication, enrichment
+- **Differential mode**: detect changes between scraping runs
+
+## Web Scraper
+
+Multi-strategy web data extraction with intelligent approach selection,
+automatic fallback escalation, data transformation, and structured output.
+
+## Phase 1: Clarify
+
+Establish extraction parameters before touching any URL.
+
+## Required Parameters
+
+| Parameter     | Resolve                              | Default        |
+|:--------------|:-------------------------------------|:---------------|
+| Target URL(s) | Which page(s) to scrape?             | *(required)*   |
+| Data Target   | What specific data to extract?       | *(required)*   |
+| Output Format | Markdown table, JSON, CSV, or text?  | Markdown table |
+| Scope         | Single page, paginated, or multi-URL?| Single page    |
+
+## Optional Parameters
+
+| Parameter     | Resolve                                | Default      |
+|:--------------|:---------------------------------------|:-------------|
+| Pagination    | Follow pagination? Max pages?          | No, 1 page   |
+| Max Items     | Maximum number of items to collect?    | Unlimited    |
+| Filters       | Data to exclude or include?            | None         |
+| Sort Order    | How to sort results?                   | Source order  |
+| Save Path     | Save to file? Which path?              | Display only |
+| Language      | Respond in which language?             | User's lang  |
+| Diff Mode     | Compare with previous run?             | No           |
+
+## Clarification Rules
+
+- If user provides a URL and clear data target, proceed directly to Phase 2.
+  Do NOT ask unnecessary questions.
+- If request is ambiguous (e.g. "scrape this site"), ask ONLY:
+  "What specific data do you want me to extract from this page?"
+- Default to Markdown table output. Mention alternatives only if relevant.
+- Accept requests in any language. Always respond in the user's language.
+- If user says "everything" or "all data", perform recon first, then present
+  what's available and let user choose.
+
+## Discovery Mode
+
+When user has a topic but no specific URL:
+1. Use WebSearch to find the most relevant pages
+2. Present top 3-5 URLs with descriptions
+3. Let user choose which to scrape, or scrape all
+4. Proceed to Phase 2 with selected URL(s)
+
+Example: "find and extract pricing data for CRM tools"
+-> WebSearch("CRM tools pricing comparison 2026")
+-> Present top results -> User selects -> Extract
+
+---
+
+## Phase 2: Reconnaissance
+
+Analyze the target page before extraction.
+
+## Step 2.1: Initial Fetch
+
+Use WebFetch to retrieve and analyze the page structure:
+
+```
+WebFetch(
+  url = TARGET_URL,
+  prompt = "Analyze this page structure and report:
+    1. Page type: article, product listing, search results, data table,
+       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
+    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
+       accordion/collapsible sections, tabs
+    3. Approximate number of distinct data items visible
+    4. JavaScript rendering indicators: empty containers, loading spinners,
+       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
+    5. Pagination: next/prev links, page numbers, load-more buttons,
+       infinite scroll indicators, total results count
+    6. Data density: how much structured, extractable data exists
+    7. List the main data fields/columns available for extraction
+    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
+    9. Available download links: CSV, Excel, PDF, API endpoints"
+)
+```
+
+## Step 2.2: Evaluate Fetch Quality
+
+| Signal                                      | Interpretation                    | Action                    |
+|:--------------------------------------------|:----------------------------------|:--------------------------|
+| Rich content with data clearly visible      | Static page                       | Strategy A (WebFetch)     |
+| Empty containers, "loading...", minimal text | JS-rendered                       | Strategy B (Browser)      |
+| Login wall, CAPTCHA, 403/401 response       | Blocked                           | Report to user            |
+| Content present but poorly structured       | Needs precision                   | Strategy B (Browser)      |
+| JSON or XML response body                   | API endpoint                      | Strategy C (Bash/curl)    |
+| Download links for CSV/Excel available      | Direct data file                  | Strategy C (download)     |
+
+## Step 2.3: Content Classification
+
+Classify into an extraction mode:
+
+| Mode       | Indicators                                 | Examples                          |
+|:-----------|:-------------------------------------------|:----------------------------------|
+| `table`    | HTML `<table>`, grid layout with headers   | Price comparison, statistics, specs|
+| `list`     | Repeated similar elements, card grids      | Search results, product listings  |
+| `article`  | Long-form text with headings/paragraphs    | Blog post, news article, docs     |
+| `product`  | Product name, price, specs, images, rating | E-commerce product page           |
+| `contact`  | Names, emails, phones, addresses, roles    | Team page, staff directory        |
+| `faq`      | Question-answer pairs, accordions          | FAQ page, help center             |
+| `pricing`  | Plan names, prices, features, tiers        | SaaS pricing page                 |
+| `events`   | Dates, locations, titles, descriptions     | Event listings, conferences       |
+| `jobs`     | Titles, companies, locations, salaries     | Job boards, career pages          |
+| `custom`   | User specified CSS selectors or fields     | Anything not matching above       |
+
+Record: **page type**, **extraction mode**, **JS rendering needed (yes/no)**,
+**available fields**, **structured data present (JSON-LD etc.)**.
+
+If user asked for "everything", present the available fields and let them choose.
+
+---
+
+## Phase 3: Strategy Selection
+
+Choose the extraction approach based on recon results.
+
+## Decision Tree
+
+```
+Structured data (JSON-LD, microdata) has what we need?
+ |
+ +-- YES --> STRATEGY E: Extract structured data directly
+ |
+ +-- NO: Content fully visible in WebFetch?
+      |
+      +-- YES: Need precise element targeting?
+      |    |
+      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
+      |    +-- YES --> STRATEGY B: Browser automation
+      |
+      +-- NO: JavaScript rendering detected?
+           |
+           +-- YES --> STRATEGY B: Browser automation
+           +-- NO:  API/JSON/XML endpoint or download link?
+                |
+                +-- YES --> STRATEGY C: Bash (curl + jq)
+                +-- NO  --> Report access issue to user
+```
+
+## Strategy A: Webfetch With Ai Extraction
+
+**Best for**: Static pages, articles, simple tables, well-structured HTML.
+
+Use WebFetch with a targeted extraction prompt tailored to the mode:
+
+```
+WebFetch(
+  url = URL,
+  prompt = "Extract [DATA_TARGET] from this page.
+    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
+    Rules:
+    - If a value is missing or unclear, use 'N/A'
+    - Do not include navigation, ads, footers, or unrelated content
+    - Preserve original values exactly (numbers, currencies, dates)
+    - Include ALL matching items, not just the first few
+    - For each item, also extract the URL/link if available"
+)
+```
+
+**Auto-escalation**: If WebFetch returns suspiciously few items (less than
+50% of expected from recon), or mostly empty fields, automatically escalate
+to Strategy B without asking user. Log the escalation in notes.
+
+## Strategy B: Browser Automation
+
+**Best for**: JS-rendered pages, SPAs, interactive content, lazy-loaded data.
+
+Sequence:
+1. Get tab context: `tabs_context_mcp(createIfEmpty=true)` -> get tabId
+2. Navigate to URL: `navigate(url=TARGET_URL, tabId=TAB)`
+3. Wait for content to load: `computer(action="wait", duration=3, tabId=TAB)`
+4. Check for cookie/consent banners: `find(query="cookie consent or accept button", tabId=TAB)`
+   - If found, dismiss it (prefer privacy-preserving option)
+5. Read page structure: `read_page(tabId=TAB)` or `get_page_text(tabId=TAB)`
+6. Locate target elements: `find(query="[DESCRIPTION]", tabId=TAB)`
+7. Extract with JavaScript for precise data via `javascript_tool`
+
+```javascript
+// Table extraction
+const rows = document.querySelectorAll('TABLE_SELECTOR tr');
+const data = Array.from(rows).map(row => {
+  const cells = row.querySelectorAll('td, th');
+  return Array.from(cells).map(c => c.textContent.trim());
+});
+JSON.stringify(data);
+```
+
+```javascript
+// List/card extraction
+const items = document.querySelectorAll('ITEM_SELECTOR');
+const data = Array.from(items).map(item => ({
+  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
+  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
+  link: item.querySelector('a')?.href || null,
+}));
+JSON.stringify(data);
+```
+
+8. For lazy-loaded content, scroll and re-extract:
+   `computer(action="scroll", scroll_direction="down", tabId=TAB)`
+   then `computer(action="wait", duration=2, tabId=TAB)`
+
+## Strategy C: Bash (Curl + Jq)
+
+**Best for**: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.
+
+```bash
+
+## Json Api
+
+curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'
+
+## Csv Download
+
+curl -s "CSV_URL" -o /tmp/scraped_data.csv
+
+## Xml Parsing
+
+curl -s "XML_URL" | python3 -c "
+import xml.etree.ElementTree as ET, json, sys
+tree = ET.parse(sys.stdin)
+
+## ... Parse And Output Json
+
+"
+```
+
+## Strategy D: Hybrid
+
+When a single strategy is insufficient, combine:
+1. WebSearch to discover relevant URLs
+2. WebFetch for initial content assessment
+3. Browser automation for JS-heavy sections
+4. Bash for post-processing (jq, python for data cleaning)
+
+## Strategy E: Structured Data Extraction
+
+When JSON-LD, microdata, or OpenGraph is present:
+1. Use Browser `javascript_tool` to extract structured data:
+```javascript
+const scripts = document.querySelectorAll('script[type="application/ld+json"]');
+const data = Array.from(scripts).map(s => {
+  try { return JSON.parse(s.textContent); } catch { return null; }
+}).filter(Boolean);
+JSON.stringify(data);
+```
+2. This often provides cleaner, more reliable data than DOM scraping
+3. Fall back to DOM extraction only for fields not in structured data
+
+## Pagination Handling
+
+When pagination is detected and user wants multiple pages:
+
+**Page-number pagination (any strategy):**
+1. Extract data from current page
+2. Identify URL pattern (e.g. `?page=N`, `/page/N`, `&offset=N`)
+3. Iterate through pages up to user's max (default: 5 pages)
+4. Show progress: "Extracting page 2/5..."
+5. Concatenate all results, deduplicate if needed
+
+**Infinite scroll (Browser only):**
+1. Extract currently visible data
+2. Record item count
+3. Scroll down: `computer(action="scroll", scroll_direction="down", tabId=TAB)`
+4. Wait: `computer(action="wait", duration=2, tabId=TAB)`
+5. Extract newly loaded data
+6. Compare count - if no new items after 2 scrolls, stop
+7. Repeat until no new content or max iterations (default: 5)
+
+**"Load More" button (Browser only):**
+1. Extract currently visible data
+2. Find button: `find(query="load more button", tabId=TAB)`
+3. Click it: `computer(action="left_click", ref=REF, tabId=TAB)`
+4. Wait and extract new content
+5. Repeat until button disappears or max iterations reached
+
+---
+
+## Phase 4: Extract
+
+Execute the selected strategy using mode-specific patterns.
+See [references/extraction-patterns.md](references/extraction-patterns.md)
+for CSS selectors and JavaScript snippets.
+
+## Table Mode
+
+WebFetch prompt:
+```
+"Extract ALL rows from the table(s) on this page.
+Return as a markdown table with exact column headers.
+Include every row - do not truncate or summarize.
+Preserve numeric precision, currencies, and units."
+```
+
+## List Mode
+
+WebFetch prompt:
+```
+"Extract each [ITEM_TYPE] from this page.
+For each item, extract: [FIELD_LIST].
+Return as a JSON array of objects with these keys: [KEY_LIST].
+Include ALL items, not just the first few. Include link/URL for each item if available."
+```
+
+## Article Mode
+
+WebFetch prompt:
+```
+"Extract article metadata:
+- title, author, date, tags/categories, word count estimate
+- Key factual data points, statistics, and named entities
+Return as structured markdown. Summarize the content; do not reproduce full text."
+```
+
+## Product Mode
+
+WebFetch prompt:
+```
+"Extract product data with these exact fields:
+- name, brand, price, currency, originalPrice (if discounted),
+  availability, description (first 200 chars), rating, reviewCount,
+  specifications (as key-value pairs), productUrl, imageUrl
+Return as JSON. Use null for missing fields."
+```
+
+Also check for JSON-LD `Product` schema (Strategy E) first.
+
+## Contact Mode
+
+WebFetch prompt:
+```
+"Extract contact information for each person/entity:
+- name, title, role, email, phone, address, organization, website, linkedinUrl
+Return as a markdown table. Only extract real contacts visible on the page."
+```
+
+## Faq Mode
+
+WebFetch prompt:
+```
+"Extract all question-answer pairs from this page.
+For each FAQ item extract:
+- question: the exact question text
+- answer: the answer text (first 300 chars if long)
+- category: the section/category if grouped
+Return as a JSON array of objects."
+```
+
+## Pricing Mode
+
+WebFetch prompt:
+```
+"Extract all pricing plans/tiers from this page.
+For each plan extract:
+- planName, monthlyPrice, annualPrice, currency
+- features (array of included features)
+- limitations (array of limits or excluded features)
+- ctaText (call-to-action button text)
+- highlighted (true if marked as recommended/popular)
+Return as JSON. Use null for missing fields."
+```
+
+## Events Mode
+
+WebFetch prompt:
+```
+"Extract all events/sessions from this page.
+For each event extract:
+- title, date, time, endTime, location, description (first 200 chars)
+- speakers (array of names), category, registrationUrl
+Return as JSON. Use null for missing fields."
+```
+
+## Jobs Mode
+
+WebFetch prompt:
+```
+"Extract all job listings from this page.
+For each job extract:
+- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
+- postedDate, description (first 200 chars), applyUrl, tags
+Return as JSON. Use null for missing fields."
+```
+
+## Custom Mode
+
+When user provides specific selectors or field descriptions:
+- Use Browser automation with `javascript_tool` and user's CSS selectors
+- Or use WebFetch with a prompt built from user's field descriptions
+- Always confirm extracted schema with user before proceeding to multi-URL
+
+## Multi-Url Extraction
+
+When extracting from multiple URLs:
+1. Extract from the **first URL** to establish the data schema
+2. Show user the first results and confirm the schema is correct
+3. Extract from remaining URLs using the same schema
+4. Add a `source` column/field to every record with the origin URL
+5. Combine all results into a single output
+6. Show progress: "Extracting 3/7 URLs..."
+
+---
+
+## Phase 5: Transform
+
+Clean, normalize, and enrich extracted data before validation.
+See [references/data-transforms.md](references/data-transforms.md) for patterns.
+
+## Automatic Transforms (Always Apply)
+
+| Transform              | Action                                               |
+|:-----------------------|:-----------------------------------------------------|
+| Whitespace cleanup     | Trim, collapse multiple spaces, remove `\n` in cells |
+| HTML entity decode     | `&amp;` -> `&`, `&lt;` -> `<`, `&#39;` -> `'`       |
+| Unicode normalization  | NFKC normalization for consistent characters          |
+| Empty string to null   | `""` -> `null` (for JSON), `""` -> `N/A` (for tables)|
+
+## Conditional Transforms (Apply When Relevant)
+
+| Transform             | When                         | Action                                  |
+|:----------------------|:-----------------------------|:----------------------------------------|
+| Price normalization   | Product/pricing modes        | Extract numeric value + currency symbol |
+| Date normalization    | Any dates found              | Normalize to ISO-8601 (YYYY-MM-DD)      |
+| URL resolution        | Relative URLs extracted      | Convert to absolute URLs                |
+| Phone normalization   | Contact mode                 | Standardize to E.164 format if possible |
+| Deduplication         | Multi-page or multi-URL      | Remove exact duplicate rows             |
+| Sorting               | User requested or natural    | Sort by user-specified field            |
+
+## Data Enrichment (Only When Useful)
+
+| Enrichment             | When                         | Action                                |
+|:-----------------------|:-----------------------------|:--------------------------------------|
+| Currency conversion    | User asks for single currency| Note original + convert (approximate) |
+| Domain extraction      | URLs in data                 | Add domain column from full URLs      |
+| Word count             | Article mode                 | Count words in extracted text         |
+| Relative dates         | Dates present                | Add "X days ago" column if useful     |
+
+## Deduplication Strategy
+
+When combining data from multiple pages or URLs:
+1. Exact match: rows with identical values in all fields -> keep first
+2. Near match: rows with same key fields (name+source) but different details
+   -> keep most complete (fewer nulls), flag in notes
+3. Report: "Removed N duplicate rows" in delivery notes
+
+---
+
+## Phase 6: Validate
+
+Verify extraction quality before delivering results.
+
+## Validation Checks
+
+| Check                | Action                                              |
+|:---------------------|:----------------------------------------------------|
+| Item count           | Compare extracted count to expected count from recon |
+| Empty fields         | Count N/A or null values per field                   |
+| Data type consistency| Numbers should be numeric, dates parseable           |
+| Duplicates           | Flag exact duplicate rows (post-dedup)               |
+| Encoding             | Check for HTML entities, garbled characters           |
+| Completeness         | All user-requested fields present in output          |
+| Truncation           | Verify data wasn't cut off (check last items)        |
+| Outliers             | Flag values that seem anomalous (e.g. $0.00 price)  |
+
+## Confidence Rating
+
+Assign to every extraction:
+
+| Rating     | Criteria                                                        |
+|:-----------|:----------------------------------------------------------------|
+| **HIGH**   | All fields populated, count matches expected, no anomalies      |
+| **MEDIUM** | Minor gaps (<10% empty fields) or count slightly differs        |
+| **LOW**    | Significant gaps (>10% empty), structural issues, partial data  |
+
+Always report confidence with specifics:
+> Confidence: **HIGH** - 47 items extracted, all 6 fields populated,
+> matches expected count from page analysis.
+
+## Auto-Recovery (Try Before Reporting Issues)
+
+| Issue              | Auto-Recovery Action                                  |
+|:-------------------|:------------------------------------------------------|
+| Missing data       | Re-attempt with Browser if WebFetch was used          |
+| Encoding problems  | Apply HTML entity decode + unicode normalization      |
+| Incomplete results | Check for pagination or lazy-loading, fetch more      |
+| Count mismatch     | Scroll/paginate to find remaining items               |
+| All fields empty   | Page likely JS-rendered, switch to Browser strategy   |
+| Partial fields     | Try JSON-LD extraction as supplement                  |
+
+Log all recovery attempts in delivery notes.
+Inform user of any irrecoverable gaps with specific details.
+
+---
+
+## Phase 7: Format And Deliver
+
+Structure results according to user preference.
+See [references/output-templates.md](references/output-templates.md)
+for complete formatting templates.
+
+## Delivery Envelope
+
+ALWAYS wrap results with this metadata header:
+
+```markdown
+
+## Extraction Results
+
+**Source:** [Page Title](http://example.com)
+**Date:** YYYY-MM-DD HH:MM UTC
+**Items:** N records (M fields each)
+**Confidence:** HIGH | MEDIUM | LOW
+**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
+**Format:** Markdown Table | JSON | CSV
+
+---
+
+[DATA HERE]
+
+---
+
+**Notes:**
+- [Any gaps, issues, or observations]
+- [Transforms applied: deduplication, normalization, etc.]
+- [Pages scraped if paginated: "Pages 1-5 of 12"]
+- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]
+```
+
+## Markdown Table Rules
+
+- Left-align text columns (`:---`), right-align numbers (`---:`)
+- Consistent column widths for readability
+- Include summary row for numeric data when useful (totals, averages)
+- Maximum 10 columns per table; split wider data into multiple tables
+  or suggest JSON format
+- Truncate long cell values to 60 chars with `...` indicator
+- Use `N/A` for missing values, never leave cells empty
+- For multi-page results, show combined table (not per-page)
+
+## Json Rules
+
+- Use camelCase for keys (e.g. `productName`, `unitPrice`)
+- Wrap in metadata envelope:
+  ```json
+  {
+    "metadata": {
+      "source": "URL",
+      "title": "Page Title",
+      "extractedAt": "ISO-8601",
+      "itemCount": 47,
+      "fieldCount": 6,
+      "confidence": "HIGH",
+      "strategy": "A",
+      "transforms": ["deduplication", "priceNormalization"],
+      "notes": []
+    },
+    "data": [ ... ]
+  }
+  ```
+- Pretty-print with 2-space indentation
+- Numbers as numbers (not strings), booleans as booleans
+- null for missing values (not empty strings)
+
+## Csv Rules
+
+- First row is always headers
+- Quote any field containing commas, quotes, or newlines
+- UTF-8 encoding with BOM for Excel compatibility
+- Use `,` as delimiter (standard)
+- Include metadata as comments: `# Source: URL`
+
+## File Output
+
+When user requests file save:
+- Markdown: `.md` extension
+- JSON: `.json` extension
+- CSV: `.csv` extension
+- Confirm path before writing
+- Report full file path and item count after saving
+
+## Multi-Url Comparison Format
+
+When comparing data across multiple sources:
+- Add `Source` as the first column/field
+- Use short identifiers for sources (domain name or user label)
+- Group by source or interleave based on user preference
+- Highlight differences if user asks for comparison
+- Include summary: "Best price: $X at store-b.com"
+
+## Differential Output
+
+When user requests change detection (diff mode):
+- Compare current extraction with previous run
+- Mark new items with `[NEW]`
+- Mark removed items with `[REMOVED]`
+- Mark changed values with `[WAS: old_value]`
+- Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
+
+---
+
+## Rate Limiting
+
+- Maximum 1 request per 2 seconds for sequential page fetches
+- For multi-URL jobs, process sequentially with pauses
+- If a site returns 429 (Too Many Requests), stop and report to user
+
+## Access Respect
+
+- If a page blocks access (403, CAPTCHA, login wall), report to user
+- Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
+- Do NOT scrape behind authentication unless user explicitly provides access
+- Respect robots.txt directives when known
+
+## Copyright
+
+- Do NOT reproduce large blocks of copyrighted article text
+- For articles: extract factual data, statistics, and structured info;
+  summarize narrative content
+- Always include source attribution (http://example.com) in output
+
+## Data Scope
+
+- Extract ONLY what the user explicitly requested
+- Warn user before collecting potentially sensitive data at scale
+  (emails, phone numbers, personal information)
+- Do not store or transmit extracted data beyond what the user sees
+
+## Failure Protocol
+
+When extraction fails or is blocked:
+1. Explain the specific reason (JS rendering, bot detection, login, etc.)
+2. Suggest alternatives (different URL, API if available, manual approach)
+3. Never retry aggressively or escalate access attempts
+
+---
+
+## Quick Reference: Mode Cheat Sheet
+
+| User Says...                         | Mode      | Strategy  | Output Default   |
+|:-------------------------------------|:----------|:----------|:-----------------|
+| "extract the table"                  | table     | A or B    | Markdown table   |
+| "get all products/prices"            | product   | E then A  | Markdown table   |
+| "scrape the listings"                | list      | A or B    | Markdown table   |
+| "extract contact info / team page"   | contact   | A         | Markdown table   |
+| "get the article data"               | article   | A         | Markdown text    |
+| "extract the FAQ"                    | faq       | A or B    | JSON             |
+| "get pricing plans"                  | pricing   | A or B    | Markdown table   |
+| "scrape job listings"                | jobs      | A or B    | Markdown table   |
+| "get event schedule"                 | events    | A or B    | Markdown table   |
+| "find and extract [topic]"           | discovery | WebSearch | Markdown table   |
+| "compare prices across sites"        | multi-URL | A or B    | Comparison table |
+| "what changed since last time"       | diff      | any       | Diff format      |
+
+---
+
+## References
+
+- **Extraction patterns**: [references/extraction-patterns.md](references/extraction-patterns.md)
+  CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.
+
+- **Output templates**: [references/output-templates.md](references/output-templates.md)
+  Markdown, JSON, CSV templates with complete examples.
+
+- **Data transforms**: [references/data-transforms.md](references/data-transforms.md)
+  Cleaning, normalization, deduplication, enrichment patterns.
+
+## Best Practices
+
+- Provide clear, specific context about your project and requirements
+- Review all suggestions before applying them to production code
+- Combine with other complementary skills for comprehensive analysis
+
+## Common Pitfalls
+
+- Using this skill for tasks outside its domain expertise
+- Applying recommendations without understanding your specific context
+- Not providing enough project context for accurate analysis
+
+## Limitations
+- Use this skill only when the task clearly matches the scope described above.
+- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
+- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.