Text nodes with only numeric characters are treated as JSON nodes (despite not being valid JSON and being a child node of valid HTML) #310

@neirar

Description

Thanks in advance for looking into this. I ran into this issue and couldn't find an existing report for it. Also, apologies if I'm not using the standard terminology, as I'm new to Python and Scrapy. I'm a software engineer, though.

While scraping the text out of a webpage, I use the function:
node.xpath('name()').get()  # Returns None for text nodes
to check if it's just text or another HTML node. This function call is part of a recursive function to iterate through a section of a webpage and extract just the text.

However, for certain HTML nodes, such as nodes that only contain numeric characters, the internal parser interprets them as JSON and the call to xpath() raises an error: ValueError: Cannot use xpath on a Selector of type 'json'.

The node that is interpreted as JSON is not JSON (unless you consider it a JSON fragment; like an integer value without a key).
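
For context, a bare number is a complete JSON document under RFC 8259, and Python's standard json module accepts it. The sketch below uses only the standard library to show how the two strings from the reproduction differ in that respect. Whether Scrapy or parsel actually decides the selector type this way is an assumption, not something confirmed here, and parses_as_json is a hypothetical helper name.

```python
import json


def parses_as_json(text: str) -> bool:
    """Return True if the standard-library parser accepts `text` as a complete JSON document.

    Hypothetical helper for illustration; it is not part of Scrapy or parsel.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# A bare number is accepted as a top-level JSON value.
print(parses_as_json("20"))
# Trailing non-JSON text makes the whole string invalid JSON.
print(parses_as_json("20 hello"))
```

If the type detection does rely on a check like this, it would explain why '20' and '20 hello' end up with different types.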

Also, this node came while parsing a parent node that is HTML, so I'm not sure how the parser arrived at the conclusion that it is JSON. Shouldn't it get a hint about the type from the parent node? E.g. my parent is HTML so I'm likely HTML.
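
Until the detection changes, one possible workaround in the recursive function is to check the selector's type before calling xpath(). This is a minimal sketch, assuming only that the node exposes .type and .xpath() as in the snippets in this report; safe_node_name is a hypothetical helper, not part of Scrapy or parsel.

```python
def safe_node_name(node):
    """Return the element name, or None when the node should be treated as plain text.

    `node` is assumed to be a Scrapy/parsel Selector (an object with `.type`
    and `.xpath()`). This helper is hypothetical and only illustrates the guard.
    """
    # Selectors typed as 'json' raise ValueError on .xpath(), so skip them up front.
    if getattr(node, "type", None) not in ("html", "xml"):
        return None
    try:
        # Normalise an empty or missing result to None.
        return node.xpath("name()").get() or None
    except ValueError:
        # Fallback in case the type check above misses a non-XPath selector.
        return None
```

With this guard, a numeric-only text node that was typed as json is handled like any other text node instead of raising an error.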

Steps to Reproduce

from scrapy import Selector
text_marked_as_json = Selector(text='20')
print(f'text_marked_as_json.type = {text_marked_as_json.type}')
# Prints: text_marked_as_json.type = json  (Why is this type json?)

text_marked_as_html = Selector(text='20 hello')
print(f'text_marked_as_html.type = {text_marked_as_html.type}')
# Prints: text_marked_as_html.type = html

Expected behavior:
If an HTML element contains only numeric characters, its type should be HTML.

Actual behavior:
If an HTML element contains only numeric characters, its type is JSON, even though it is not a valid JSON string.

Reproduces how often:
All the time.

Versions

Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.13.1 (v3.13.1:06714517797, Dec 3 2024, 14:00:22) [Clang 15.0.0 (clang-1500.3.9.4)]
pyOpenSSL : 24.3.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform : macOS-15.1.1-arm64-arm-64bit-Mach-O

Additional context

Screenshot from Google Colab:
Screenshot 2024-12-29 at 6 11 16 PM

Output of !scrapy version --verbose in Google Colab (from the screenshot):

Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]
pyOpenSSL : 24.2.1 (OpenSSL 3.3.2 3 Sep 2024)
cryptography : 43.0.3
Platform : Linux-6.1.85+-x86_64-with-glibc2.35
