It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector.
It's based on the great and simple scraping tool written by Jeroen Janssens.
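For a quick first taste, pipe any page into scrape (example.com is used here just as a convenient test page):

```bash
# Extract every <h1> element from a page read on stdin
curl -s https://example.com | scrape -e "h1"
```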
You can install scrape-cli using several methods:
With pipx:

```bash
pipx install scrape-cli
```

With uv:

```bash
# Install as a global CLI tool (recommended)
uv tool install scrape-cli

# Or install with uv pip
uv pip install scrape-cli

# Or run temporarily without installing
uvx scrape-cli --help
```

With pip:

```bash
pip install scrape-cli
```

Or install from source:

```bash
git clone https://github.com/aborruso/scrape-cli
cd scrape-cli
pip install -e .
```

Requirements:

- Python >=3.6
- requests
- lxml
- cssselect
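Once installed, a quick sanity check is to print the built-in help:

```bash
scrape --help
```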
In the resources directory you'll find a test.html file that you can use to test various scraping scenarios.
Note: You can also test directly from the URL without cloning the repository:

```bash
scrape -e "h1" https://raw.githubusercontent.com/aborruso/scrape-cli/refs/heads/master/resources/test.html
```

Here are some examples:
- Extract all table data:

```bash
# CSS
scrape -e "table.data-table td" resources/test.html

# XPath
scrape -e "//table[contains(@class, 'data-table')]//td" resources/test.html
```

- Get all list items:
```bash
# CSS
scrape -e "ul.items-list li" resources/test.html

# XPath
scrape -e "//ul[contains(@class, 'items-list')]/li" resources/test.html
```

- Extract specific attributes:
```bash
# CSS
scrape -e "a.external-link" -a href resources/test.html

# XPath
scrape -e "//a[contains(@class, 'external-link')]/@href" resources/test.html
```

- Check if an element exists:
```bash
# CSS
scrape -e "#main-title" --check-existence resources/test.html

# XPath
scrape -e "//h1[@id='main-title']" --check-existence resources/test.html
```

- Extract nested elements:
```bash
# CSS
scrape -e ".nested-elements p" resources/test.html

# XPath
scrape -e "//div[contains(@class, 'nested-elements')]//p" resources/test.html
```

- Get elements with specific attributes:
```bash
# CSS
scrape -e "[data-test]" resources/test.html

# XPath
scrape -e "//*[@data-test]" resources/test.html
```

- Additional XPath examples:
```bash
# Get all links with href attribute
scrape -e "//a[@href]" resources/test.html

# Get checked input elements
scrape -e "//input[@checked]" resources/test.html

# Get elements with multiple classes
scrape -e "//div[contains(@class, 'class1') and contains(@class, 'class2')]" resources/test.html

# Get text content of specific element
scrape -e "//h1[@id='main-title']/text()" resources/test.html
```

A CSS selector query like this
```bash
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -be 'table.wikitable > tbody > tr > td > b > a'
```

Note: When combining the `-b` and `-e` options, they must be written in the order `-be` (body first, then expression). Because `-e` takes the selector as its argument, it has to come last in the combined flag group, so `-eb` will not parse correctly.
or an XPath query like this one:
```bash
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -be "//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a"
```

gives you back:
```html
<html>
<head>
</head>
<body>
<a href="/wiki/Afghanistan" title="Afghanistan">
Afghanistan
</a>
<a href="/wiki/Albania" title="Albania">
Albania
</a>
<a href="/wiki/Algeria" title="Algeria">
Algeria
</a>
<a href="/wiki/Andorra" title="Andorra">
Andorra
</a>
<a href="/wiki/Angola" title="Angola">
Angola
</a>
<a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
Antigua and Barbuda
</a>
<a href="/wiki/Argentina" title="Argentina">
Argentina
</a>
<a href="/wiki/Armenia" title="Armenia">
Armenia
</a>
...
...
</body>
</html>
```

You can extract only the text content (without HTML tags) using the `-t` option, which is particularly useful for LLMs and text processing:
```bash
# Extract all text content from a page
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -t

# Extract text from specific elements
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -te 'table.wikitable td'

# Extract text from headings only
scrape -te 'h1, h2, h3' resources/test.html
```

The `-t` option automatically excludes text from `<script>` and `<style>` tags and cleans up whitespace for better readability.
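A quick way to see that behavior is with a throwaway inline snippet (the HTML here is invented purely for illustration):

```bash
# Per the description above, the script content should be dropped
# and the extra whitespace collapsed
echo '<html><body><h1>Title</h1><script>var x = 1;</script><p>Some   spaced   text</p></body></html>' \
  | scrape -t
```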
You can integrate scrape-cli with xq (part of yq) to convert HTML output to structured JSON:
```bash
# Extract and convert to JSON (requires -b for complete HTML)
scrape -be "a.external-link" resources/test.html | xq .
```

Output:
```json
{
  "html": {
    "body": {
      "a": {
        "@href": "https://example.com",
        "@class": "external-link",
        "#text": "Example Link"
      }
    }
  }
}
```

Table extraction example:
scrape -be "table.data-table td" resources/test.html | xq .Output:
```json
{
  "html": {
    "body": {
      "td": [
        "1",
        "John Doe",
        "john@example.com",
        "2",
        "Jane Smith",
        "jane@example.com"
      ]
    }
  }
}
```

Note: The `-b` flag is mandatory here to produce valid HTML with `<html>`, `<head>` and `<body>` tags.
This makes the output handy for JSON-based pipelines, APIs, databases, and downstream processing with jq or DuckDB.
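For example, here is a sketch of a jq-style pipeline over the link-extraction output shown earlier. It assumes that, as in the `td` example above, multiple matches are rendered by xq as a JSON array:

```bash
# Collect every country link's href into a single JSON array
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -be "//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a" \
  | xq '[.html.body.a[]["@href"]]'
```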
Some notes on the commands:

- `-e` to set the query
- `-b` to add `<html>`, `<head>` and `<body>` tags to the HTML output
- `-t` to extract only text content (useful for LLMs and text processing)
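To close, here is the same selector run with the two output styles; both invocations use only flags and files shown above:

```bash
# Well-formed HTML document wrapping the matches
scrape -be "ul.items-list li" resources/test.html

# Plain text of the same matches
scrape -te "ul.items-list li" resources/test.html
```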