| name | x-sql |
| description | Queries the web with SQL — loads pages, selects DOM elements with CSS selectors, and extracts structured data (text, HTML, attributes, links, images, numbers) into tables and charts. Use when the user asks to scrape, extract, crawl, or monitor web content, or when they ask to convert a web page into structured data. |
| metadata | {"displayName":"X-SQL Web Query","version":"1.0.0","author":"Browser4","tags":"scraping, extraction, web, sql, dom, crawling, data-mining","dependencies":""} |
X-SQL Skill
Description
X-SQL is a SQL dialect for querying the web. It extends H2 SQL with User-Defined Functions (UDFs) that load web pages, traverse the DOM, select elements with CSS selectors, and extract structured data — all within a single declarative SQL statement.
Use this skill whenever you need to turn a web page into structured data: product listings, news headlines, search results, pricing tables, article metadata, or any other content that can be targeted with CSS selectors.
When to Use This Skill
Activate this skill when the user asks to:
- Scrape or extract data from a web page ("extract all products from...", "get the headlines from...")
- Crawl linked pages ("follow all links on the blog index and extract article titles")
- Monitor content changes ("check the price of...", "track when... changes")
- Convert a web page into a table, CSV, or structured JSON
- Query the web with SQL-like syntax ("show me all images larger than 400px on...")
- Analyze page structure ("find the most common CSS classes on...", "list all links on...")
Core Mental Model
Think of X-SQL as a pipeline where each SQL clause maps to a step in the scraping process:
FROM → Load the page and select row-level elements
SELECT → Extract fields from each row element using DOM functions
WHERE → Filter rows by extracted values
JOIN → Combine data from nested elements or linked pages
Every X-SQL query starts with a table function in the FROM clause that loads a web page and produces rows. Each row contains a DOM column (the selected element) and optionally a DOC column (the full document). You then apply DOM functions in the SELECT clause to extract specific values from each row.
Key Rules for Writing X-SQL
Rule 1: Every query starts with a table function in FROM
The entry point is always a table function. Use load_and_select for most cases:
SELECT ... FROM load_and_select('<url>', '<cssSelector>' [, offset [, limit]]);
offset is 1-based (the first element is 1, not 0)
limit controls how many rows to return
Rule 2: The DOM column is your row-level element
load_and_select returns two columns: DOM (each matched element) and DOC (the full document). Almost all DOM functions take dom as their first argument.
SELECT
dom_first_text(dom, 'h2') AS title,
dom_first_text(dom, '.price') AS price
FROM load_and_select('https://example.com/products', '.product-item');
Rule 3: Use dom_first_* for scalar values, dom_all_* for arrays
DOM selection functions follow a consistent naming pattern:
| Prefix | Returns | Use case |
|---|
dom_first_* | Single value | One field per row |
dom_all_* | Array of values | Multiple values per row (use with explode) |
dom_nth_* | Single value from the nth match | When you need a specific sibling |
Rule 4: Function names are case-insensitive and ignore underscores
All of these are identical: dom_first_text, DOM_FIRST_TEXT, domFirstText, dOm___FiRsT___tExT.
The canonical form (used in this spec) is lowercase with underscores.
Rule 5: Namespaces can be omitted for shortcut functions
Functions declared with hasShortcut = true can be called without their namespace prefix. For example, load_and_select belongs to the DOM namespace but the dom_ prefix is optional.
Rule 6: URLs can carry load options
Append load options to the URL to control caching and fetch behavior:
'https://example.com/page -i 1d -njr 3'
Common options: -i <duration> (cache duration), -njr (no JavaScript rendering).
Core Table Functions
Choose the right table function for your task:
load_and_select — Single page, element rows
The workhorse. Load one page, select matching elements as rows.
SELECT dom_first_text(dom, 'h2 a') AS title
FROM load_and_select('https://news.ycombinator.com', 'tr.athing', 1, 30);
load_all — Multiple known URLs
Load several pages in parallel. The urls parameter is an array.
SELECT dom_base_uri(dom) AS url, dom_first_text(dom, 'title') AS title
FROM load_all(ARRAY('https://example.com/1', 'https://example.com/2'));
load_out_pages — Portal → linked pages → rows
Load a portal page, extract all outgoing links, then load each linked page as a row. Ideal for crawling blog indexes, product listings with detail pages, or search result pages.
SELECT dom_base_uri(dom) AS url, dom_first_text(dom, 'h1') AS title
FROM load_out_pages('https://example.com/blog', 'main', 1, 20);
load_out_pages_and_select — Portal → linked pages → selected elements
Like load_out_pages but selects elements from each linked page.
SELECT dom_base_uri(dom) AS url, dom_first_text(dom, 'p') AS paragraph
FROM load_out_pages_and_select('https://example.com/articles', 'main', 1, 10, 'p.content');
load_and_get_links — Just extract links
SELECT * FROM load_and_get_links('https://example.com', 'nav', 1, 100);
select — Two-level iteration (parent → children)
Used as a joined table to iterate over child elements:
SELECT
dom_first_text(cat.dom, 'h3') AS category,
dom_first_text(item.dom, '.name') AS product,
dom_first_float(item.dom, '.price', 0.0) AS price
FROM load_and_select('https://shop.example.com', '.category') cat
JOIN select(cat.dom, '.product-item') item;
Core DOM Functions
Content Extraction (most commonly used)
| Function | Returns | Use for |
|---|
dom_first_text(dom, css) | String | Visible text of the first matched element |
dom_all_texts(dom, css) | Array | Visible text of all matched elements |
dom_first_own_text(dom, css) | String | Text of the element itself, excluding children |
dom_first_slim_html(dom, css) | String | Cleaned HTML of the first match |
dom_first_whole_text(dom, css) | String | Text including hidden nodes |
dom_first_integer(dom, css [, default]) | Int | Parse first integer from matched text |
dom_first_float(dom, css [, default]) | Float | Parse first float from matched text |
Attribute & Media Extraction
| Function | Returns | Use for |
|---|
dom_first_attr(dom, css, attr) | String | Any attribute value |
dom_first_href(dom [, css]) | String | First <a> href (absolute URL) |
dom_all_hrefs(dom [, css]) | Array | All <a> hrefs |
dom_first_img(dom [, css]) | String | First <img> src (absolute URL) |
dom_all_imgs(dom [, css]) | Array | All <img> srcs |
Element Properties
| Function | Returns | Use for |
|---|
dom_base_uri(dom) | String | The page URL |
dom_tag_name(dom) | String | HTML tag name |
dom_attr(dom, name) | String | Named attribute of the row element |
dom_id(dom) | String | Element's id |
dom_class_name(dom) | String | Element's class |
dom_has_class(dom, name) | Boolean | Check if element has a CSS class |
dom_text(dom) | String | Full text of the row element |
dom_outer_html(dom) | String | Outer HTML of the row element |
dom_slim_html(dom) | String | Slimmed HTML |
Position & Size
| Function | Returns | Use for |
|---|
dom_width(dom) | Double | Element width in pixels |
dom_height(dom) | Double | Element height in pixels |
dom_area(dom) | Double | width × height |
dom_top(dom) | Double | Y-coordinate |
dom_left(dom) | Double | X-coordinate |
Regex Extraction
| Function | Returns | Use for |
|---|
dom_re1(dom, regex) | String | First capture group |
dom_first_re1(dom, css, regex) | String | First capture group after CSS selection |
dom_first_re1(dom, css, regex, group) | String | Nth capture group after CSS selection |
dom_re2(dom, regex) | Array | Two capture groups as (key, value) |
Core String Functions (STR namespace)
Use these to clean and transform extracted text. All are null-safe (null input → null output).
| Function | Use for |
|---|
str_substring_after(text, sep) | Extract text after a delimiter |
str_substring_before(text, sep) | Extract text before a delimiter |
str_substring_between(text, open, close) | Extract text between delimiters |
str_trim(text) | Remove leading/trailing whitespace |
str_normalize_space(text) | Collapse whitespace |
str_first_integer(text, default) | Parse first integer |
str_first_float(text, default) | Parse first float |
str_replace_chars(text, search, replacement) | Replace characters |
str_delete_whitespace(text) | Remove all whitespace |
str_is_blank(text) / str_is_not_blank(text) | Check for content |
str_lower_case(text) / str_upper_case(text) | Change case |
str_split(text, sep) | Split into array |
str_length(text) | String length |
LLM Functions
When CSS selectors alone aren't enough, use the LLM to extract structured data from page content.
| Function | Use for |
|---|
llm_chat(prompt) | Ask the LLM a question |
llm_chat(dom, prompt) | Ask the LLM about a DOM element's content |
llm_extract(dom, rules) | Extract structured fields as JSON |
SELECT llm_extract(dom,
'name: the product name, ' ||
'price: the price as a number, ' ||
'rating: the average rating out of 5'
) AS product_info
FROM load_and_select('https://example.com/product/123', 'body');
Array Helpers
| Function | Use for |
|---|
explode(array [, col]) | Turn an array into rows (use in FROM) |
posexplode(array [, col]) | Like explode but includes 1-based position |
array_join_to_string(array, sep) | Join array elements into a string |
array_first_not_blank(array) | First non-blank value |
System Helpers
| Function | Use for |
|---|
loadOptions() | List all URL load options |
xsqlHelp() | List all registered X-SQL functions |
map(k1, v1, k2, v2, ...) | Create a key-value ResultSet |
Query Patterns
Pattern 1: Simple extraction (single page, one row per item)
The most common pattern. Each matched element becomes a row.
SELECT
dom_first_text(dom, '.title') AS title,
dom_first_href(dom, '.title a') AS link,
dom_first_float(dom, '.price', 0.0) AS price,
dom_first_img(dom, 'img.thumb') AS thumbnail
FROM load_and_select('https://example.com/products', '.product-card', 1, 50);
Pattern 2: Flatten an array into rows with explode
When a single page element contains a list, use dom_all_* + explode.
SELECT col AS image_url
FROM load_and_select('https://example.com/product/123', 'body') t
JOIN explode(dom_all_imgs(dom, '#gallery img')) img;
Pattern 3: Two-level nested extraction (categories → products)
SELECT
dom_first_text(cat.dom, 'h3.category-name') AS category,
dom_first_text(item.dom, '.product-name') AS product,
dom_first_float(item.dom, '.product-price', 0.0) AS price
FROM load_and_select('https://shop.example.com', '.category-section') cat
JOIN select(cat.dom, '.product-item') item;
Pattern 4: Crawl and extract from linked pages
SELECT
dom_base_uri(dom) AS article_url,
dom_first_text(dom, 'h1') AS title,
dom_first_text(dom, 'article p:first-child') AS lede,
dom_first_text(dom, 'time.publish-date') AS published
FROM load_out_pages('https://example.com/blog/index', 'main', 1, 10);
Pattern 5: Regex extraction from element text
SELECT
dom_first_text(dom, '.specs') AS raw_specs,
dom_first_re1(dom, '.specs', '(\d+)\s*GB') AS memory_gb,
dom_first_re1(dom, '.specs', '(\d+)\s*MP', 1) AS camera_mp
FROM load_and_select('https://example.com/phones', '.phone-specs');
Pattern 6: Filter rows with string checks
SELECT dom_first_text(dom, 'h2') AS headline
FROM load_and_select('https://news.example.com', 'article')
WHERE str_is_not_blank(dom_first_text(dom, 'h2'))
AND str_contains_any(str_lower_case(dom_first_text(dom, 'h2')), 'ai ml artificial');
Pattern 7: LLM-powered extraction for unstructured content
SELECT
dom_base_uri(dom) AS url,
llm_extract(dom,
'title: the article headline, ' ||
'author: the author name, ' ||
'summary: a 2-sentence summary, ' ||
'sentiment: positive, negative, or neutral'
) AS extracted
FROM load_and_select('https://example.com/news/article-42', 'body');
CSS Selector Tips
X-SQL uses Jsoup's CSS selector engine. Here are the most useful selectors:
| Selector | Meaning | Example |
|---|
tag | Element by tag | h1, article |
.class | Element by class | .product, .price |
#id | Element by ID | #main-content |
[attr] | Has attribute | [data-price] |
[attr=value] | Attribute equals | [data-category=electronics] |
[attr^=value] | Attribute starts with | [href^=/products/] |
[attr$=value] | Attribute ends with | [src$=.jpg] |
[attr*=value] | Attribute contains | [class*=product] |
parent > child | Direct child | ul > li |
ancestor descendant | Any descendant | article p |
prev + next | Adjacent sibling | h2 + p |
prev ~ siblings | All following siblings | h2 ~ p |
:contains(text) | Contains text | a:contains(Read more) |
:matches(regex) | Text matches regex | td:matches(^\d+\.\d+$), span:matches(Price) |
:has(selector) | Contains matching child | div:has(.price) |
:not(selector) | Negation | li:not(.sold-out) |
:nth-child(n) | Nth child | tr:nth-child(1) |
:first-child / :last-child | First/last child | li:first-child |
:expr(js-expression) | X-SQL extension — evaluate a JS expression | img:expr(width > 400) |
Selector Strategy
- Be specific but not brittle —
.product-title is better than div:nth-child(3) > span:nth-child(1)
- Use semantic selectors first —
[data-testid], [itemprop], [property] are more stable than layout classes
- Use
:expr() for computed properties — img:expr(width > 400) filters by actual rendered size, not just markup
- Use
:matches() for text patterns — td:matches(^\$\d+\.\d{2}) matches currency-formatted cells
- Combine with
:has() for context — div:has(.price) selects containers that contain a price
Step-by-Step Approach for Any Scraping Task
When given a scraping task, follow this process:
Step 1: Understand the target
Ask yourself:
- What is the page URL?
- What constitutes a "row" (product, article, search result)?
- What fields need to be extracted per row?
Step 2: Identify the row selector
Choose the CSS selector that matches each item. Look for repeating container elements:
- Product list:
.product-card, [data-product], li.product
- News feed:
article, tr.athing, .post
- Search results:
.g, .result, [data-result]
Step 3: Identify field selectors within each row
For each field, determine the CSS selector relative to the row:
- Title:
h2, .title, a:first-child
- Price:
.price, [data-price], span:contains($)
- Link:
a (use dom_first_href)
- Image:
img (use dom_first_img)
- Date:
time, .date, [datetime]
Step 4: Choose the right table function
- Single page, list of items →
load_and_select(url, rowSelector)
- Multiple known URLs →
load_all(ARRAY(url1, url2, ...))
- Crawl from a portal/index page →
load_out_pages(portalUrl, linkAreaCss)
- Two-level nesting →
load_and_select + JOIN select()
Step 5: Compose the SELECT clause
Map each desired field to a DOM function:
- Text →
dom_first_text(dom, css)
- Number →
dom_first_float(dom, css, default) or dom_first_integer(dom, css, default)
- Link →
dom_first_href(dom, css)
- Image →
dom_first_img(dom, css)
- Attribute →
dom_first_attr(dom, css, attrName)
- HTML →
dom_first_slim_html(dom, css)
Step 6: Add cleanup with STR functions
Post-process extracted values:
str_substring_after(price, '$') to strip currency symbols
str_trim(title) to remove extra whitespace
str_first_float(text, 0.0) to parse numbers from free text
Step 7: Add WHERE clause (optional)
Filter rows by business rules:
WHERE str_is_not_blank(title)
WHERE price > 0
WHERE str_contains_any(str_lower_case(title), 'keyword')
Parameters
| Parameter | Type | Required | Default | Description |
|---|
| query | String | Yes | - | The X-SQL query to execute |
| url | String | No | - | Convenience: URL to scrape (used to construct a simple query) |
| selector | String | No | - | Convenience: CSS row selector (used with url to construct a query) |
| fields | Map<String, String> | No | - | Convenience: Field name → CSS selector mapping for quick extraction |
Return Value
Returns a SkillResult with the following structure:
{
"query": "the executed X-SQL query",
"columns": ["col1", "col2", "..."],
"rows": [
{"col1": "value1", "col2": "value2"},
{"col1": "value3", "col2": "value4"}
],
"rowCount": 42,
"truncated": false
}
Usage Examples
Execute a raw X-SQL query
val result = registry.execute(
skillId = "x-sql",
context = context,
params = mapOf(
"query" to """
SELECT
dom_first_text(dom, 'h2 a') AS title,
dom_first_href(dom, 'h2 a') AS link,
dom_first_text(dom, '.score') AS score
FROM load_and_select('https://news.ycombinator.com', 'tr.athing', 1, 30)
""".trimIndent()
)
)
Quick extraction with field mapping
val result = registry.execute(
skillId = "x-sql",
context = context,
params = mapOf(
"url" to "https://books.toscrape.com",
"selector" to "article.product_pod",
"fields" to mapOf(
"title" to "h3 a",
"price" to ".price_color",
"image" to "img",
"link" to "h3 a"
)
)
)
Crawl and extract from linked pages
val result = registry.execute(
skillId = "x-sql",
context = context,
params = mapOf(
"query" to """
SELECT
dom_base_uri(dom) AS article_url,
dom_first_text(dom, 'h1') AS title,
dom_first_text(dom, 'article .content') AS body
FROM load_out_pages('https://example.com/blog', 'main a.post-link', 1, 10)
""".trimIndent()
)
)
LLM-based extraction
val result = registry.execute(
skillId = "x-sql",
context = context,
params = mapOf(
"query" to """
SELECT llm_extract(dom,
'name: product name, ' ||
'price: price as number, ' ||
'features: key features as a list'
) AS product_data
FROM load_and_select('https://example.com/product/42', 'body')
""".trimIndent()
)
)
Error Handling
The skill returns a failure result in the following cases:
| Error | Cause |
|---|
Missing required parameter query | Neither query nor (url + selector) provided |
| Invalid URL | URL format is malformed or unsupported |
| CSS selector matches nothing | The row selector returned zero elements (page structure may have changed) |
| Page load failure | Network error, timeout, or the target server rejected the request |
| SQL syntax error | Malformed X-SQL query |
| Namespace or function not found | Referenced a non-existent function |
Lifecycle Hooks
onBeforeExecute
Validates that at least query or (url + selector) is present. If url + selector + fields are provided, auto-generates the X-SQL query.
onAfterExecute
Records execution metrics (query duration, row count) and caches page content if caching is enabled.
validate
Checks that the X-SQL engine is available and the H2 database is initialized.
Implementation Notes
- X-SQL is built on the H2 database engine; queries execute within an H2 SQL session
- Page loading is handled by PulsarSession, which manages caching, fetch policies, and rate limiting
- DOM parsing uses Jsoup under the hood
- The
:expr() pseudo-class evaluates JavaScript expressions on each element's computed properties (size, position, styles)
- All DOM functions are null-safe: if a selector matches nothing, they return empty strings, 0, or NIL rather than throwing
- Parallel page loads (
load_all, load_out_pages) respect the session's concurrency limits
- LLM functions require an LLM backend to be configured in the session
Best Practices
- Be specific with CSS selectors —
.product-title is more maintainable than div > span:nth-child(3)
- Limit rows during development — start with
limit=5 to test your selector logic before scaling up
- Use cache options for repeated queries — append
-i 1d to the URL to cache pages for a day
- Prefer
dom_first_* over dom_all_* + explode — unless you genuinely need array flattening
- Extract once, clean with STR functions — use
str_trim, str_substring_after, etc. rather than complex regex in selectors
- Use
:matches() for content-based filtering — td:matches(^Price) is clearer than extracting then filtering with WHERE
- Handle missing data with defaults —
dom_first_float(dom, '.price', 0.0) returns 0.0 rather than NULL
- Prefer LLM extraction only when CSS selectors aren't practical — CSS is faster, cheaper, and deterministic
See Also