Parse partial HTML as-is? #160

maxcorbeau · 2025-04-19T10:06:08Z

When I try to parse HTML, selectolax does a few extra things (not shocking for an HTML parser):

adds extra tags (html/head/body from what I can see)
strips invalid tags (e.g. a <tr> encountered outside a <table>)

from selectolax.parser import HTMLParser, parse_fragment

sample = "<tr><i>foo</i><i>bar</i></tr>"

print(f"{HTMLParser(sample).html=}")
# => <html><head></head><body><i>foo</i><i>bar</i></body></html>
# <tr> is stripped because it is not inside a <table>

print(f"{[x.html for x in parse_fragment(sample)]=}")
# => ['<i>foo</i>', '<i>bar</i>']
# <tr> is lost, and we get a flat list of nodes instead of a tree

Is there a way to use selectolax in loose mode (i.e. don't remove/add any tags)?

The reason I wanted to use selectolax is speed (in my tests it is roughly 3x to 4x faster than lxml and ~20x faster than bs4).

If selectolax can't do this, I think I'll end up using a pure-Rust XML parser instead.
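To illustrate what the XML-parser fallback would look like, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree` as a stand-in for a Rust parser. It assumes the fragments are well-formed XML, which holds for the sample above but not for HTML in general. It is not a selectolax feature and says nothing about relative speed.

```python
import xml.etree.ElementTree as ET

sample = "<tr><i>foo</i><i>bar</i></tr>"

# An XML parser applies no HTML tree-construction rules, so <tr> is kept
# as the root element and no html/head/body wrappers are added.
root = ET.fromstring(sample)

# Serialize back to a string to confirm the structure is unchanged.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# The result is still a tree: the <i> elements are children of <tr>.
child_texts = [child.text for child in root]
print(child_texts)

# Caveat: input that is valid HTML but not well-formed XML is rejected.
try:
    ET.fromstring("<tr><br></tr>")  # <br> is never closed
except ET.ParseError as exc:
    print(f"not well-formed: {exc}")
```

The trade-off is strictness: nothing is added or dropped, but unclosed tags, unquoted attributes, and similar HTML conveniences raise an error instead of being repaired.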

Replies: 0 comments
