dom-extraction-lxml-xpath
// XPath 1.0 structured extraction with lxml.etree — namespaces, compiled expressions, EXSLT regex, and smart strings
| name | dom-extraction-lxml-xpath |
| description | XPath 1.0 structured extraction with lxml.etree — namespaces, compiled expressions, EXSLT regex, and smart strings |
| tech_stack | ["web"] |
| language | ["python"] |
| capability | ["http-client"] |
| version | lxml 6.1.0 |
| collected_at | "2026-04-17T00:00:00.000Z" |
Source: https://lxml.de/xpathxslt.html, https://lxml.de/tutorial.html, https://lxml.de/
lxml.etree provides full XPath 1.0 querying backed by libxml2/libxslt C libraries — fast, feature-complete, and Pythonic. It handles namespaces, compiled expressions, variables, EXSLT regex extensions, and smart string introspection that tracks text origin in the DOM.
Use XPath when you need axes (ancestor::, following-sibling::, etc.).
For simple CSS-style selection, use lxml.cssselect or selectolax instead.
from lxml import etree
xml = '<foo><bar>Text</bar></foo>'
tree = etree.fromstring(xml) # for XML; use etree.HTML() for HTML
# Absolute path
tree.xpath('/foo/bar') # [<Element bar>]
tree.xpath('/foo/bar/text()') # ['Text']
# Relative path from current element
tree.xpath('bar') # [<Element bar>]
# String value
tree.xpath('string(/foo/bar)') # 'Text'
# ElementPath (faster, simpler, incremental):
tree.find('bar') # first match
tree.findall('bar') # all matches (list)
tree.iterfind('bar') # incremental iterator
# XPath (full power, conditions, functions, axes):
tree.xpath('//bar[@id="main"]') # conditional
tree.xpath('//foo/ancestor::*') # axes
Rule: Use .find*() for simple tag paths. Use .xpath() when you need conditions, functions, axes, or text/attribute extraction in one expression.
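As a rough illustration of that rule, the sketch below runs both styles against a small made-up document (the `id` values are assumptions for the example, not part of the original snippet).

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring(
    '<foo><bar id="main">A</bar><bar id="side">B</bar></foo>'
)

# ElementPath: simple tag paths, returns elements only.
first_bar = root.find('bar')
all_bars = root.findall('bar')

# XPath: predicate plus text extraction in a single expression.
main_text = root.xpath('bar[@id="main"]/text()')
bar_count = root.xpath('count(bar)')
```

Here `find()` and `findall()` are enough to fetch the elements, while the predicate, the `text()` step, and `count()` need `.xpath()`.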
| Method | Description |
|---|---|
| `elem.xpath(expr, namespaces=..., smart_strings=..., **vars)` | One-shot XPath evaluation |
| `etree.XPath(expr, namespaces=..., regexp=..., smart_strings=...)` | Compiled XPath: returns a callable; compile once, evaluate many times |
| `etree.XPathEvaluator(elem)` | Efficient evaluator for multiple different XPaths on the same element |
| `etree.ETXPath(expr)` | XPath with Clark notation `{ns}name`; no prefix mapping needed |
| `tree.getpath(elem)` | Generate a structural absolute XPath to an element (`tree` must be an ElementTree, e.g. `elem.getroottree()`) |
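A minimal sketch of how these entry points fit together, assuming a small invented document. Note that `getpath()` is called on the ElementTree obtained through `getroottree()`, not on the element itself.

```python
from lxml import etree

# Hypothetical document for illustration.
root = etree.fromstring('<root><a><b>1</b></a><a><b>2</b></a></root>')

# Compiled XPath: compile once, call on any element or tree.
find_b = etree.XPath('//b')
b_elems = find_b(root)

# XPathEvaluator: one evaluator, several different expressions.
evaluate = etree.XPathEvaluator(root)
n_a = evaluate('count(/root/a)')
texts = evaluate('//b/text()')

# getpath() lives on the ElementTree, so go through getroottree().
path = root.getroottree().getpath(b_elems[1])
same_elem = root.xpath(path)[0]   # the generated path resolves back to the element
```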
| Expression yields | Python type |
|---|---|
| Boolean (true()/false()) | True / False |
| Number (count(), position()) | float |
| String (string(), concat()) | string with no origin (getparent() returns None) |
| Text nodes (text(), //text()) | list of 'smart' strings with getparent(), is_text, is_tail |
| Attribute values (@attr) | list of 'smart' strings with getparent(), is_attribute |
| Nodes (//element) | list of Element objects |
| Namespace declarations | list of (prefix, URI) tuples |
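The mapping above can be checked with a short sketch like the following. The document and expressions are invented for the example, and the default `smart_strings=True` is assumed.

```python
from lxml import etree

# Hypothetical document for illustration.
root = etree.fromstring('<doc><item id="1">Hello</item><item id="2"/></doc>')

has_items = root.xpath('boolean(//item)')     # XPath boolean -> bool
n_items = root.xpath('count(//item)')         # XPath number  -> float
joined = root.xpath('concat("a", "b")')       # XPath string  -> str
text_nodes = root.xpath('//item/text()')      # text nodes    -> list of strings
ids = root.xpath('//item/@id')                # attributes    -> list of strings
elems = root.xpath('//item')                  # elements      -> list of Elements

# Attribute results remember the element that carries them.
owner = ids[0].getparent()
```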
texts = tree.xpath('//text()')
texts[0] # 'Hello'
texts[0].getparent().tag # owning element's tag
texts[0].is_text # True
texts[0].is_tail # False
# Disable to reduce memory (no parent references kept):
tree.xpath('//text()', smart_strings=False)
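To make the text/tail distinction concrete, here is a small sketch over an assumed mixed-content paragraph. It records where each string came from, then repeats the query with `smart_strings=False` to show that the origin information is gone.

```python
from lxml import etree

# Hypothetical mixed-content element for illustration.
root = etree.fromstring('<p>Hello<b>bold</b>tail text</p>')

# (string, tag of owning element, is it a tail?)
origins = [
    (str(s), s.getparent().tag, s.is_tail)
    for s in root.xpath('//text()')
]

# Plain strings: no getparent(), no reference back into the tree.
plain = root.xpath('//text()', smart_strings=False)
has_origin = hasattr(plain[0], 'getparent')
```

For tail text, `getparent()` returns the element the text follows (`b` here), not the enclosing `p`.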
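# This finds nothing: XPath 1.0 has no default namespace, so an unprefixed name only matches un-namespaced elements
xml = '<root xmlns="http://example.com/ns"><child/></root>'
root = etree.fromstring(xml)
root.xpath('/root') # [] (no error, just no match)
# An empty prefix is rejected: namespaces={None: ...} raises TypeError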
# Correct: define explicit prefix
root.xpath('/n:root', namespaces={'n': 'http://example.com/ns'})
xml = '<a:foo xmlns:a="http://example.com/ns1"><b:bar xmlns:b="http://example.com/ns2">Text</b:bar></a:foo>'
doc = etree.fromstring(xml)
# Your prefixes are YOUR choice — only URIs must match:
doc.xpath('/x:foo/y:bar', namespaces={
'x': 'http://example.com/ns1', # document uses 'a', we use 'x'
'y': 'http://example.com/ns2' # document uses 'b', we use 'y'
})[0].text # 'Text'
# ETXPath: use {uri}tagname directly
etree.ETXPath('//{http://example.com/ns}bar')(root)
# Also visible in tag names after parsing:
doc.xpath('/x:foo')[0].tag # '{http://example.com/ns1}foo'
# Match by local name only, ignoring namespace:
tree.xpath('//*[local-name() = "bar"]')
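The namespace options above can be compared side by side. This is a sketch using the same example URI; the variable names are illustrative only.

```python
from lxml import etree

NS = 'http://example.com/ns'
root = etree.fromstring(f'<root xmlns="{NS}"><child/></root>')

unprefixed = root.xpath('/root')   # unprefixed name: no match
prefixed = root.xpath('/n:root/n:child', namespaces={'n': NS})
clark = etree.ETXPath(f'//{{{NS}}}child')(root)   # '//{uri}child'
by_local = root.xpath('//*[local-name() = "child"]')

# An empty prefix cannot be mapped for XPath.
empty_prefix_error = None
try:
    root.xpath('/root', namespaces={None: NS})
except TypeError as exc:
    empty_prefix_error = exc
```

The three working variants select the same element. Which one to prefer mostly depends on whether you want to maintain a prefix mapping.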
find_b = etree.XPath("//b")
find_b(root) # callable — evaluate many times
# With namespaces:
find_ns = etree.XPath("//n:b", namespaces={'n': 'http://example.com/ns'})
# Dynamic element matching:
count = etree.XPath("count(//*[local-name() = $name])")
count(root, name="foo") # 1.0
count(root, name="bar") # 2.0
# String variable:
tree.xpath("$text", text="Hello World!") # 'Hello World!'
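One practical use of variables is passing values that would otherwise need quoting inside the expression. The sketch below assumes a made-up `users` document; the name containing an apostrophe is bound as a variable rather than spliced into the XPath string.

```python
from lxml import etree

# Hypothetical document for illustration.
root = etree.fromstring(
    '<users><user name="O\'Brien"/><user name="Smith"/></users>'
)

find_user = etree.XPath('//user[@name = $name]')

# The value is bound as an XPath variable, so the quote needs no escaping.
matches = find_user(root, name="O'Brien")

# Numbers can be passed the same way.
second = root.xpath('//user[position() = $n]/@name', n=2)
```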
root = etree.fromstring(xml)
records = []
for product in root.xpath('//product'):
    records.append({
        'sku': product.get('sku'),
        'name': product.xpath('string(name)'),
        'price': float(product.xpath('string(price)')),
        'categories': product.xpath('categories/category/text()'),
    })
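A self-contained variant of this loop is sketched below, with an invented catalog. Since `string(price)` yields an empty string when the element is missing, a bare `float()` would raise `ValueError`; the hypothetical `to_float` helper is one way to handle that.

```python
from lxml import etree

# Hypothetical catalog; element and attribute names mirror the loop above.
xml = """
<catalog>
  <product sku="A1">
    <name>Widget</name>
    <price>9.50</price>
    <categories><category>tools</category><category>home</category></categories>
  </product>
  <product sku="B2">
    <name>Gadget</name>
    <categories/>
  </product>
</catalog>
""".strip()

def to_float(text):
    """Return float(text), or None when the field is missing or malformed."""
    try:
        return float(text)
    except ValueError:
        return None

root = etree.fromstring(xml)
records = [
    {
        'sku': p.get('sku'),
        'name': p.xpath('string(name)'),
        'price': to_float(p.xpath('string(price)')),
        'categories': [str(c) for c in p.xpath('categories/category/text()')],
    }
    for p in root.xpath('//product')
]
```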
from lxml import html
root = html.fromstring("<html><body>Hello<br/>World</body></html>")
root.xpath('//text()') # ['Hello', 'World'] — separate chunks
root.xpath('string()') # 'HelloWorld' — concatenated
root.xpath('//item/@id') # all ids
root.xpath('//item[@id="2"]/@name') # filtered
root.xpath('//item/@id')[0].getparent() # owner element
find = etree.XPath(
    "//*[re:test(., '^abc$', 'i')]",
    namespaces={'re': 'http://exslt.org/regular-expressions'}
)
# Matches elements whose text matches regex (case-insensitive)
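A runnable sketch of the same idea, assuming a small list of items. It uses the EXSLT namespace URI shown above and tries both the compiled form and the one-shot `.xpath()` form.

```python
from lxml import etree

RE_NS = {'re': 'http://exslt.org/regular-expressions'}

# Hypothetical document for illustration.
root = etree.fromstring(
    '<items><item>ABC</item><item>abcd</item><item>abc</item></items>'
)

# Compiled: exact match on 'abc', ignoring case.
find = etree.XPath("//item[re:test(., '^abc$', 'i')]", namespaces=RE_NS)
exact = [el.text for el in find(root)]

# One-shot: prefix match, same flags.
prefix = [
    el.text
    for el in root.xpath("//item[re:test(., '^abc', 'i')]", namespaces=RE_NS)
]
```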
root.xpath('//title/ancestor::chapter') # up the tree
root.xpath('//title/following-sibling::para') # siblings after
root.xpath('//title/parent::*') # immediate parent
root.xpath('//chapter/descendant::*') # all descendants
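These axes can be tried on a tiny invented book structure. The sketch starts from a `title` element and walks up, sideways, and down.

```python
from lxml import etree

# Hypothetical document for illustration.
root = etree.fromstring(
    '<book><chapter id="c1"><title>Intro</title>'
    '<para>One</para><para>Two</para></chapter></book>'
)

title = root.xpath('//title')[0]
chapter_ids = title.xpath('ancestor::chapter/@id')            # up the tree
later_paras = title.xpath('following-sibling::para/text()')   # siblings after
parent_tag = title.xpath('parent::*')[0].tag                  # immediate parent
descendants = [el.tag for el in root.xpath('//chapter/descendant::*')]
```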
Smart strings keep the XML tree alive via getparent(). For large documents where you only need string values, always use smart_strings=False. Functions string() and concat() return plain strings — safe for memory.
Unlike iterfind(), which yields matches incrementally, .xpath() builds the complete result list in memory, so it is not suitable for streaming huge documents.
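The difference in result shape can be seen in a short sketch (the row count is arbitrary). Note that the tree itself is still fully parsed in both cases; only the way results are produced differs.

```python
from lxml import etree

# Hypothetical document with many sibling elements.
root = etree.fromstring('<r>' + '<row/>' * 1000 + '</r>')

all_rows = root.xpath('row')       # full list, built before returning
lazy_rows = root.iterfind('row')   # iterator, elements produced on demand
first_three = [next(lazy_rows) for _ in range(3)]
```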
lxml implements XPath 1.0 only, so XPath 2.0 features are unavailable: no fn:matches() (use EXSLT re:test()), no for expressions, no sequences. EXSLT regex support is enabled by default (regexp=True on the XPath constructor).
# XPath class: compile-time vs runtime errors separated
try:
    compiled = etree.XPath(expr) # XPathSyntaxError here
    compiled(root) # XPathEvalError here
except etree.XPathError: # catch-all for both
    ...
# xpath() method: all errors are XPathEvalError
tree.xpath(expr) # XPathEvalError for everything
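One possible way to wrap this pattern is sketched below. The `safe_xpath` helper is hypothetical (not part of lxml) and returns the error instead of raising it, so callers can tell a malformed expression from a failed evaluation.

```python
from lxml import etree

root = etree.fromstring('<root/>')

def safe_xpath(node, expr, **variables):
    """Hypothetical helper: return (result, None) or (None, error)."""
    try:
        return etree.XPath(expr)(node, **variables), None
    except etree.XPathError as exc:   # covers syntax and evaluation errors
        return None, exc

ok, err = safe_xpath(root, 'count(//root)')
bad, bad_err = safe_xpath(root, '//*[')            # malformed expression
undef, undef_err = safe_xpath(root, '$missing')    # variable never supplied
```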
# For HTML (malformed, unclosed tags, etc.):
from lxml import html
root = html.fromstring(html_str) # handles real-world HTML
root = etree.HTML(html_str) # similar, but always returns the <html> root as a plain etree element
# For well-formed XML:
root = etree.fromstring(xml_str) # strict XML parsing
root = etree.XML(xml_str) # equivalent
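The difference between the two parsers shows up on malformed input. In this sketch the `broken` string is an invented example: the HTML parser recovers from the unclosed tags, while the XML parser raises `XMLSyntaxError`.

```python
from lxml import etree, html

broken = '<div><p>one<p>two</div>'   # unclosed <p> tags

# The HTML parser recovers and closes the tags.
frag = html.fromstring(broken)
paras = frag.xpath('.//p/text()')

# The XML parser rejects the same input.
xml_error = None
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as exc:
    xml_error = exc
```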
<xsl:strip-space elements="*"/> can crash due to a libxslt bug — avoid it.
- Parse with lxml.html.fromstring(), query with .xpath(), fall back to .cssselect() for simple class/id selection
- Use ETXPath with Clark notation to avoid maintaining prefix mappings
- For many different queries on the same element: use XPathEvaluator; for the same query across documents: use compiled XPath
- Combine xpath() for node selection with string() or text() for extracting values in one pass
- Select record nodes with xpath(), then use relative xpath() calls on each element for fields