
Commit 8a0efce

More docs on html-py-ever
1 parent 799f0e9 commit 8a0efce

4 files changed: 95 additions and 8 deletions


html-py-ever/README.md

Lines changed: 77 additions & 1 deletion
@@ -1,7 +1,83 @@
 # html-py-ever

-Using [html5ever](https://github.com/servo/html5ever) through [kuchiki](https://github.com/kuchiki-rs/kuchiki) to speed up html parsing and css-selecting.
+Demoing how to use [html5ever](https://github.com/servo/html5ever) through [kuchiki](https://github.com/kuchiki-rs/kuchiki) to speed up html parsing and css-selecting.
+
+## Usage
+
+`parse_file` and `parse_text` return a parsed `Document`, which then lets you select elements by css selectors using the `select` method. All elements are returned as strings.

 ## Benchmarking

 Create a python 3.6+ venv and activate it. Install html-py-ever in there (`python setup.py install`). To get a readable benchmark, run `test/run_all.py`. To get a real benchmark, run `pytest test_parsing.py` or `pytest test_selector.py`. Both have a `--benchmark-histogram` option.
+
+## Example benchmark results
+
+Running on Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz with python 3.6 and rustc 1.30.0-nightly (aaa170beb 2018-08-31)
+
+**run_all.py**
+
+```
+monty-python.html 1400
+Parse lxml 0.013675s 0.114107s 8.344x
+Parse py 0.013675s 0.191262s 13.986x
+Select lxml 0.004283s 0.001122s 3.818x
+Select py 0.004047s 0.001122s 3.608x
+empty.html 0
+Parse lxml 0.000050s 0.000250s 5.027x
+Parse py 0.000050s 0.000091s 1.834x
+Select lxml 0.000047s 0.000011s 4.452x
+Select py 0.000034s 0.000011s 3.263x
+small.html 0
+Parse lxml 0.000050s 0.000408s 8.221x
+Parse py 0.000050s 0.000341s 6.860x
+Select lxml 0.000048s 0.000006s 7.700x
+Select py 0.000116s 0.000006s 18.739x
+rust.html 733
+Parse lxml 0.034088s 0.269182s 7.897x
+Parse py 0.034088s 0.423923s 12.436x
+Select lxml 0.006814s 0.004962s 1.373x
+Select py 0.006792s 0.004962s 1.369x
+python.html 1518
+Parse lxml 0.134979s 1.440968s 10.675x
+Parse py 0.134979s 2.271023s 16.825x
+Select lxml 0.036732s 0.006711s 5.474x
+Select py 0.036882s 0.006711s 5.496x
+```
+
+**test_parsing.py**
+
+```
+------------------------------------------------------------------------------------------------------------------- benchmark: 10 tests -------------------------------------------------------------------------------------------------------------------
+Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+test_bench_parsing_rust[empty.html] 6.1110 (1.0) 513.7940 (1.0) 7.4792 (1.0) 9.5990 (1.0) 6.3950 (1.0) 0.2948 (1.0) 649;4746 133,704.3206 (1.0) 27203 1
+test_bench_parsing_rust[small.html] 19.3520 (3.17) 788.8010 (1.54) 22.1472 (2.96) 16.8692 (1.76) 19.8700 (3.11) 0.5373 (1.82) 393;1818 45,152.4211 (0.34) 16177 1
+test_bench_parsing_python[empty.html] 57.6250 (9.43) 38,060.2320 (74.08) 72.3809 (9.68) 457.3842 (47.65) 59.6890 (9.33) 3.0377 (10.31) 11;948 13,815.7902 (0.10) 6939 1
+test_bench_parsing_python[small.html] 290.9070 (47.60) 2,750.8890 (5.35) 345.1972 (46.15) 178.1737 (18.56) 301.0480 (47.08) 26.8838 (91.21) 103;362 2,896.8951 (0.02) 2477 1
+test_bench_parsing_rust[monty-python.html] 12,943.2440 (>1000.0) 21,217.3930 (41.30) 13,930.9700 (>1000.0) 1,687.9115 (175.84) 13,393.0260 (>1000.0) 493.4407 (>1000.0) 6;7 71.7825 (0.00) 65 1
+test_bench_parsing_rust[rust.html] 27,254.8300 (>1000.0) 44,283.6160 (86.19) 29,939.0300 (>1000.0) 3,770.0365 (392.75) 28,366.1800 (>1000.0) 2,199.8490 (>1000.0) 4;4 33.4012 (0.00) 30 1
+test_bench_parsing_rust[python.html] 117,097.9310 (>1000.0) 139,946.1370 (272.38) 124,982.5736 (>1000.0) 7,679.8512 (800.07) 124,375.9720 (>1000.0) 10,055.3265 (>1000.0) 2;0 8.0011 (0.00) 8 1
+test_bench_parsing_python[monty-python.html] 181,122.6270 (>1000.0) 221,371.7280 (430.86) 191,845.8776 (>1000.0) 16,849.9999 (>1000.0) 186,777.4470 (>1000.0) 15,766.5518 (>1000.0) 1;1 5.2125 (0.00) 5 1
+test_bench_parsing_python[rust.html] 384,658.8340 (>1000.0) 423,217.7400 (823.71) 406,878.9022 (>1000.0) 17,625.0831 (>1000.0) 413,173.2850 (>1000.0) 31,943.3840 (>1000.0) 1;0 2.4577 (0.00) 5 1
+test_bench_parsing_python[python.html] 2,195,261.3770 (>1000.0) 2,249,598.2990 (>1000.0) 2,221,196.6530 (>1000.0) 23,091.9237 (>1000.0) 2,212,574.4390 (>1000.0) 38,692.2310 (>1000.0) 2;0 0.4502 (0.00) 5 1
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+```
+
+**test_selector.py**
+
+```
+------------------------------------------------------------------------------------------------------------ benchmark: 10 tests -------------------------------------------------------------------------------------------------------------
+Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+test_bench_selector_rust[empty.html] 1.3180 (1.0) 63.8790 (1.0) 1.5361 (1.0) 0.9402 (1.0) 1.4420 (1.0) 0.0630 (1.0) 1884;6084 651,005.8803 (1.0) 84775 1
+test_bench_selector_rust[small.html] 1.5300 (1.16) 112.5220 (1.76) 1.7647 (1.15) 1.0319 (1.10) 1.6590 (1.15) 0.0630 (1.00) 2215;7135 566,666.9515 (0.87) 96507 1
+test_bench_selector_python[empty.html] 20.1260 (15.27) 532.0720 (8.33) 22.9150 (14.92) 12.8876 (13.71) 20.8190 (14.44) 0.5280 (8.38) 818;1965 43,639.4426 (0.07) 18434 1
+test_bench_selector_python[small.html] 26.5540 (20.15) 890.5700 (13.94) 29.7362 (19.36) 14.8236 (15.77) 27.4300 (19.02) 0.7265 (11.53) 762;2109 33,629.0076 (0.05) 17413 1
+test_bench_selector_rust[monty-python.html] 691.8140 (524.90) 2,925.7400 (45.80) 851.7575 (554.50) 222.7539 (236.93) 802.9160 (556.81) 79.2970 (>1000.0) 43;69 1,174.0430 (0.00) 843 1
+test_bench_selector_rust[rust.html] 1,220.5940 (926.10) 6,789.2340 (106.28) 1,509.8102 (982.90) 540.7908 (575.20) 1,352.9600 (938.25) 361.6030 (>1000.0) 8;6 662.3349 (0.00) 240 1
+test_bench_selector_python[monty-python.html] 3,851.9600 (>1000.0) 8,077.7510 (126.45) 4,260.0542 (>1000.0) 675.4977 (718.48) 4,063.3380 (>1000.0) 216.4488 (>1000.0) 20;26 234.7388 (0.00) 245 1
+test_bench_selector_python[rust.html] 6,437.3910 (>1000.0) 11,348.6070 (177.66) 7,033.6536 (>1000.0) 1,050.6394 (>1000.0) 6,739.6810 (>1000.0) 363.3680 (>1000.0) 12;13 142.1736 (0.00) 151 1
+test_bench_selector_rust[python.html] 6,504.3130 (>1000.0) 12,934.9650 (202.49) 7,557.5249 (>1000.0) 1,398.7101 (>1000.0) 6,976.7700 (>1000.0) 965.8090 (>1000.0) 17;16 132.3185 (0.00) 143 1
+test_bench_selector_python[python.html] 36,145.0260 (>1000.0) 46,582.5100 (729.23) 38,058.3009 (>1000.0) 2,960.4055 (>1000.0) 36,630.3450 (>1000.0) 1,389.9710 (>1000.0) 4;5 26.2755 (0.00) 23 1
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+```
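
The Usage section added to the README can be illustrated with a short sketch. It assumes the extension has already been built and installed (for example with `python setup.py install`, as the Benchmarking section describes) and uses only the names the README mentions: `parse_text`, `Document`, and `select`. The helper name `select_links` and the `a[href]` selector are illustrative choices, not part of this commit.

```python
def select_links(html: str) -> list:
    """Parse an HTML string and return every matching element as a string."""
    # Imported lazily so the helper can be defined without the extension
    # present; html_py_ever must be installed before the helper is called.
    import html_py_ever

    # parse_text() returns a parsed Document, per the README.
    document = html_py_ever.parse_text(html)

    # select() takes a CSS selector; the README states that the matched
    # elements come back as strings. "a[href]" is an illustrative selector.
    return document.select("a[href]")
```

`parse_file` should work the same way when given a path instead of markup; either way, the returned `Document` is what exposes `select`.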

html-py-ever/html_py_ever/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-from .html_py_ever import *
+from .html_py_ever import *

html-py-ever/requirements-dev.txt

Lines changed: 2 additions & 1 deletion
@@ -3,4 +3,5 @@ setuptools-rust
 wheel
 pytest-benchmark[histogram]
 pytest
-
+beautifulsoup4
+lxml

html-py-ever/test/run_all.py

Lines changed: 15 additions & 5 deletions
@@ -19,11 +19,11 @@ def rust(filename: str) -> Tuple[int, float, float]:
     return len(links), end_load - start_load, end_search - start_search


-def python(filename: str) -> Tuple[int, float, float]:
+def python(filename: str, parser: str) -> Tuple[int, float, float]:
     start_load = perf_counter()
     with open(filename) as fp:
         text = fp.read()
-    soup = BeautifulSoup(text, "html.parser")
+    soup = BeautifulSoup(text, parser)

     end_load = perf_counter()
     start_search = perf_counter()
@@ -37,11 +37,21 @@ def python(filename: str) -> Tuple[int, float, float]:
 def main():
     for filename in glob("*.html"):
         count_rs, parse_rs, select_rs = rust(filename)
-        count_py, parse_py, select_py = python(filename)
+        count_lxml, parse_lxml, select_lxml = python(filename, "lxml")
+        count_py, parse_py, select_py = python(filename, "html.parser")
+        assert count_rs == count_lxml
         assert count_rs == count_py
         print(f"{filename} {count_rs}")
-        print(f"Parse {parse_rs:6f}s {parse_py:6f}s {parse_py/parse_rs:6.3f}x")
-        print(f"Select {select_py:6f}s {select_rs:6f}s {select_py/select_rs:6.3f}x")
+        print(
+            f"Parse lxml {parse_rs:6f}s {parse_lxml:6f}s {parse_lxml/parse_rs:6.3f}x"
+        )
+        print(f"Parse py {parse_rs:6f}s {parse_py:6f}s {parse_py/parse_rs:6.3f}x")
+        print(
+            f"Select lxml {select_lxml:6f}s {select_rs:6f}s {select_lxml/select_rs:6.3f}x"
+        )
+        print(
+            f"Select py {select_py:6f}s {select_rs:6f}s {select_py/select_rs:6.3f}x"
+        )


 if __name__ == "__main__":
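
Reading the script in this diff, every printed row ends with a ratio: the Python parser's time divided by the Rust binding's time. The Parse rows print the Rust time first, while the Select rows print the Python time first. The measurement pattern can be sketched as below. `timed` and `speedup` are hypothetical helper names that do not exist in the repository, and the sample figures are copied from the `monty-python.html` "Parse lxml" row of the README's example output.

```python
from time import perf_counter
from typing import Any, Callable, Tuple


def timed(func: Callable[[], Any]) -> Tuple[Any, float]:
    """Run func once and return its result with the elapsed seconds."""
    start = perf_counter()
    result = func()
    return result, perf_counter() - start


def speedup(python_seconds: float, rust_seconds: float) -> float:
    """Ratio shown in the last column: Python time over Rust time."""
    return python_seconds / rust_seconds


# Figures from the "Parse lxml" row for monty-python.html.
ratio = speedup(python_seconds=0.114107, rust_seconds=0.013675)
print(f"{ratio:6.3f}x")

# timed() accepts any zero-argument callable; a trivial one is used here.
total, elapsed = timed(lambda: sum(range(1000)))
```

A single timed run like this is noisy, which is why the README points to the pytest-benchmark tests for real measurements.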
