fancyspeed/sf-extractor

This project contains two HTML content extractors:

  1. cx-extractor in Python. Reference: https://github.com/amumu/cx-extractor

  2. sf-extractor, a new extractor based on dynamic block segmentation and statistics. Steps:

2.1. remove newline characters

2.2. remove tags, replacing each with a newline

2.3. get blocks

2.4. evaluate blocks via text/stopword/link/punctuation densities

2.5. get the best block

2.6. merge its neighbours iteratively
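
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names (`strip_tags`, `score`, `extract`), the tiny stopword list, the scoring formula, and the `keep_ratio` merge threshold are all assumptions made for the example. It also assumes that a block is a run of text separated from the next by at least one empty line after tag removal.

```python
import re

# Hypothetical, deliberately small stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "was", "this"}
PUNCTUATION = set(",.;:!?")
LINK_OPEN, LINK_CLOSE = "\x00", "\x01"  # private markers around anchor text


def strip_tags(html):
    """Steps 2.1 and 2.2: drop newlines, then turn every tag into a newline."""
    html = re.sub(r"[\r\n]+", " ", html)
    # Scripts, styles, and comments never contain article text.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    # Mark anchor text so that link density can be measured per block.
    html = re.sub(
        r"(?is)<a\b[^>]*>(.*?)</a\s*>",
        lambda m: LINK_OPEN + re.sub(r"<[^>]+>", " ", m.group(1)) + LINK_CLOSE,
        html,
    )
    return re.sub(r"<[^>]+>", "\n", html)


def clean(block):
    """Remove link markers and collapse whitespace."""
    return " ".join(re.sub(r"[\x00\x01]", "", block).split())


def score(block):
    """Step 2.4: combine text length with stopword, punctuation, link densities."""
    text = clean(block)
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    link_chars = sum(len(" ".join(t.split()))
                     for t in re.findall(r"(?s)\x00(.*?)\x01", block))
    link_density = min(1.0, link_chars / len(text))
    stop_density = sum(w in STOPWORDS for w in words) / len(words)
    punct_density = sum(ch in PUNCTUATION for ch in text) / len(words)
    # Assumed formula: long, prose-like, link-poor blocks score highest.
    return len(text) * (1 + stop_density + punct_density) * (1 - link_density)


def extract(html, keep_ratio=0.3):
    """Steps 2.3, 2.5, 2.6: split into blocks, pick the best, grow outwards."""
    blocks = [b for b in re.split(r"\n\s*\n", strip_tags(html)) if clean(b)]
    if not blocks:
        return ""
    scores = [score(b) for b in blocks]
    best = max(range(len(blocks)), key=scores.__getitem__)
    if scores[best] == 0:
        return ""
    threshold = keep_ratio * scores[best]
    lo = hi = best
    grown = True
    while grown:  # keep absorbing adjacent blocks that score well enough
        grown = False
        if lo > 0 and scores[lo - 1] >= threshold:
            lo, grown = lo - 1, True
        if hi < len(blocks) - 1 and scores[hi + 1] >= threshold:
            hi, grown = hi + 1, True
    return "\n".join(clean(b) for b in blocks[lo:hi + 1])
```

Calling `extract(page_html)` on a page whose navigation bar and footer consist mostly of links should return only the article paragraphs, because the link-density factor pushes link-heavy blocks below the merge threshold. A real implementation would likely use a proper HTML parser and a full stopword list for the target language.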
