Description
Thanks in advance for looking into this. I ran into this issue and couldn't find an existing report for it. Also, apologies if I'm not using the standard terminology, as I'm new to Python and Scrapy. I'm a software engineer though.
While scraping the text out of a webpage, I use the function:
node.xpath('name()').get(): # Returns None for text nodes
to check if it's just text or another HTML node. This function call is part of a recursive function to iterate through a section of a webpage and extract just the text.
However, for certain HTML nodes, such as nodes that contain only numeric characters, the internal parser interprets them as JSON, and the call to xpath() raises an error: ValueError: Cannot use xpath on a Selector of type 'json'.
The node that is interpreted as JSON is not JSON (unless you consider it a JSON fragment; like an integer value without a key).
Also, this node came while parsing a parent node that is HTML, so I'm not sure how the parser arrived at the conclusion that it is JSON. Shouldn't it get a hint about the type from the parent node? E.g. my parent is HTML so I'm likely HTML.
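A possible explanation, stated here as an assumption and not verified against the parsel source, is that the type is guessed by first trying to parse the text as JSON. Python's standard json module accepts a bare number as a complete document, so a check of that kind would classify '20' as JSON and '20 hello' as not JSON. The sketch below uses only the standard library; looks_like_json is a hypothetical helper written for illustration and is not part of Scrapy or parsel.

```python
import json


def looks_like_json(text: str) -> bool:
    """Return True if Python's json module parses `text` as a complete document.

    Hypothetical helper for illustration only; not a Scrapy or parsel API.
    """
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# A bare number parses; trailing text or markup does not.
for sample in ("20", "20 hello", "<p>20</p>"):
    print(f"{sample!r}: looks_like_json = {looks_like_json(sample)}")
```

If the detection does work this way, any text node whose content happens to be a complete JSON value (a number, true, null, a quoted string) could be affected, not only numeric ones.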
Steps to Reproduce
from scrapy import Selector
text_marked_as_json = Selector(text='20')
print(f'text_marked_as_json.type = {text_marked_as_json.type}')
# Prints: text_marked_as_json.type = json. # Why is this type json?
text_marked_as_html = Selector(text='20 hello')
print(f'text_marked_as_html.type = {text_marked_as_html.type}')
# Prints: text_marked_as_html.type = html
Expected behavior:
If an HTML element contains only numeric characters, its type should be HTML.
Actual behavior:
If an HTML element contains only numeric characters, its type is JSON, even though the content is just a bare number and not a JSON object or array.
Reproduces how often:
All the time.
Versions
Please paste here the output of executing scrapy version --verbose in the command line.
Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.13.1 (v3.13.1:06714517797, Dec 3 2024, 14:00:22) [Clang 15.0.0 (clang-1500.3.9.4)]
pyOpenSSL : 24.3.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform : macOS-15.1.1-arm64-arm-64bit-Mach-O
Additional context
Output of !scrapy version --verbose in Google Colab (from the screenshot):
Scrapy : 2.12.0
lxml : 5.3.0.0
libxml2 : 2.12.9
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]
pyOpenSSL : 24.2.1 (OpenSSL 3.3.2 3 Sep 2024)
cryptography : 43.0.3
Platform : Linux-6.1.85+-x86_64-with-glibc2.35