GitHub - fancyspeed/sf-extractor: Html content extractor: cx-extractor in python and sf-extractor

This project contains two HTML content extractors:

1. cx-extractor in Python. Reference: https://github.com/amumu/cx-extractor
2. sf-extractor, a new extractor based on dynamic block segmentation and statistics. Steps:

2.1. remove newline characters

2.2. remove tags, replace with newlines

2.3. get blocks

2.4. evaluate blocks via text/stopword/link/punctuation densities

2.5. get the best block

2.6. merge its neighbours iteratively
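
The steps above can be sketched roughly as follows. This is only an illustrative sketch, not the code in sf_extractor.py: the function names (`to_lines`, `get_blocks`, `score`, `extract`), the tiny stopword list, the `max_gap` and `keep_ratio` parameters, and the scoring formula are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "on"}
PUNCTUATION = ",.;:!?"

def to_lines(html):
    """Steps 2.1 and 2.2: return (text, in_link) pairs, one per text run."""
    html = re.sub(r"[\r\n]+", " ", html)                      # 2.1 remove newlines
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    lines, in_link = [], False
    # re.split with a capturing group alternates text (even) and tags (odd).
    for i, piece in enumerate(re.split(r"(<[^>]+>)", html)):
        if i % 2:                                             # 2.2 a tag acts as a line break
            if re.match(r"(?i)<a\b", piece):
                in_link = True
            elif re.match(r"(?i)</a\b", piece):
                in_link = False
        else:
            lines.append((piece.strip(), in_link))
    return lines

def get_blocks(lines, max_gap=1):
    """Step 2.3: split wherever more than max_gap blank lines separate text."""
    blocks, current, gap = [], [], 0
    for text, in_link in lines:
        if not text:
            gap += 1
            continue
        if current and gap > max_gap:
            blocks.append(current)
            current = []
        current.append((text, in_link))
        gap = 0
    if current:
        blocks.append(current)
    return blocks

def block_text(block):
    return " ".join(text for text, _ in block)

def score(block):
    """Step 2.4: assumed formula combining the four densities."""
    text = block_text(block)
    n = max(len(text), 1)
    words = re.findall(r"\w+", text.lower())
    stop = sum(w in STOPWORDS for w in words) / max(len(words), 1)
    punct = sum(c in PUNCTUATION for c in text) / n
    link = sum(len(t) for t, in_link in block if in_link) / n
    return len(text) * (1.0 - link) * (1.0 + stop + punct)

def extract(html, max_gap=1, keep_ratio=0.3):
    blocks = get_blocks(to_lines(html), max_gap)
    if not blocks:
        return ""
    scores = [score(b) for b in blocks]
    best = max(range(len(blocks)), key=scores.__getitem__)    # 2.5 best block
    lo = hi = best
    threshold = keep_ratio * scores[best]
    grown = True
    while grown:                                              # 2.6 merge neighbours
        grown = False
        if lo > 0 and scores[lo - 1] >= threshold:
            lo, grown = lo - 1, True
        if hi < len(blocks) - 1 and scores[hi + 1] >= threshold:
            hi, grown = hi + 1, True
    return "\n".join(block_text(b) for b in blocks[lo:hi + 1])

SAMPLE_HTML = """<html><head><title>T</title><script>var x = 1;</script></head><body>
<div><a href="/">Home</a><a href="/about">About</a><a href="/contact">Contact</a></div>
<div><p>The extractor is a small tool that finds the main text of a page, and it is simple.</p>
<p>It works on the idea that the body of an article is dense in words, and light in links.</p></div>
<div><a href="/x">Privacy</a></div>
</body></html>"""

if __name__ == "__main__":
    print(extract(SAMPLE_HTML))
```

In this sketch a neighbouring block is merged only while its score stays above a fraction (`keep_ratio`) of the best block's score, so link-heavy navigation and footer blocks next to the article tend to be left out. The real extractor may use different thresholds and a different merge rule.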

