Unicode astral character confusing pyquery #46

@kesinger

This is more of a pyquery bug but I found it while using tapas-dl.

The comments on the first installment of
https://tapas.io/series/talesofthehangman
contain a "🤩" character (U+1F929, outside the Basic Multilingual Plane), and pyquery fails while parsing the page:


  File "/Users/jake/Library/Caches/pypoetry/virtualenvs/tapas-comic-downloader-Iag5BTTj-py3.9/lib/python3.9/site-packages/pyquery/pyquery.py", line 57, in fromstring
    result = getattr(etree, meth)(context)
  File "src/lxml/etree.pyx", line 3254, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 2
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 2, column 1

The workaround I found is to replace pq(pageReqest.text) with the following, which drops every non-ASCII character before parsing:

    prt  = "".join([x for x in pageReqest.text if ord(x) < 128])
    page = pq(prt)
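
The filter above also removes accented letters and other non-ASCII text that may be harmless. A narrower variant, sketched below, drops only astral code points (above U+FFFF) and keeps everything else. The helper name `strip_astral` and the sample string are hypothetical, and I have not confirmed that this narrower filter avoids the lxml error on the page in question.

```python
def strip_astral(text: str) -> str:
    """Return text without code points above U+FFFF (astral planes).

    Hypothetical helper: unlike the ord(x) < 128 filter, this keeps
    accented letters and other Basic Multilingual Plane characters.
    """
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


# Sample input standing in for pageReqest.text (assumed, not real page data).
sample = "<p>caf\u00e9 \U0001F929 ok</p>"
cleaned = strip_astral(sample)
```

In tapas-dl this would be used as `page = pq(strip_astral(pageReqest.text))` in place of the ASCII-only filter.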
