Skip to content

A web-scraping tool to match and score queries with blocks of text from webpages using the Python Natural Language Tool Kit.

Notifications You must be signed in to change notification settings

neeldatta/DocSim-NLTK-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Project:

  1. JupiterOne policies to Cyber Asset semantic similarity matching. Main files are:
  • Policy Matching Tool.ipynb => NLTK tool that takes in xml sheet of URLs and tries to match and score all the text from each website to a specific query/queries.

  • DocSim vs. TF-IDF.ipynb => Tested both NLTK tools to see which gave more accurate results.

  • 3 Queries + 3 Cyber .ipynb => Tested Tf-IDF tool on three different example cases to make sure the scores were accurate across different queries.

  • 10,000 AWS.ipynb => Tested tool on an xml with 10,000 sub-links and visualized data to find the cut off for scores with meaningful correlations to the queries.

About

A web-scraping tool to match and score queries with blocks of text from webpages using the Python Natural Language Tool Kit.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages