Parsel is a BSD-licensed Python library to extract data from HTML, JSON, and XML documents.
It supports:
- CSS and XPath expressions for HTML and XML documents
- JMESPath expressions for JSON documents
- Regular expressions
Find the Parsel online documentation at https://parsel.readthedocs.org.
Example:
>>> from parsel import Selector
>>> text = """
... <html>
... <body>
... <h1>Hello, Parsel!</h1>
... <ul>
... <li><a href="http://example.com">Link 1</a></li>
... <li><a href="http://scrapy.org">Link 2</a></li>
... </ul>
... <script type="application/json">{"a": ["b", "c"]}</script>
... </body>
... </html>"""
>>> selector = Selector(text=text)
>>> selector.css("h1::text").get()
'Hello, Parsel!'
>>> selector.xpath("//h1/text()").re(r"\w+")
['Hello', 'Parsel']
>>> for li in selector.css("ul > li"):
...     print(li.xpath(".//@href").get())
...
http://example.com
http://scrapy.org
>>> selector.css("script::text").jmespath("a").get()
'b'
>>> selector.css("script::text").jmespath("a").getall()
['b', 'c']