content-extraction

Use the following research paper to understand and implement the task below: Carey, Howard & Manic, Milos. (2016). HTML web content extraction using paragraph tags. pp. 1099-1105. doi:10.1109/ISIE.2016.7745047.

Extracting HTML web content using paragraph tags and natural language processing:

python wikipedia_nlp.py

Wikipedia Natural Language Processing

This Python script extracts and analyzes the text of the Wikipedia page on natural language processing. It downloads the page's HTML, extracts the text of each paragraph with BeautifulSoup, and then uses spaCy to identify named entities of organization type.
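The paragraph-extraction step can be sketched as follows. This is an illustration, not the script's own code: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs, and it parses an inline HTML string instead of downloading the page. The names ParagraphExtractor, extract_paragraphs, and SAMPLE_HTML are hypothetical, and the sample sentences are made up.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._in_p = False     # True while inside a <p> element
        self._chunks = []      # text pieces of the current paragraph

    def _flush(self):
        # Join the collected pieces and normalize runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:               # skip empty paragraphs
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:     # an unclosed <p> ends when the next one starts
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <a>) is kept; markup is dropped.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the text of every non-empty paragraph in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<div>
  <p>Natural language processing is a subfield of
     <a href="#">computer science</a>.</p>
  <p>   </p>
  <p>IBM took part in early machine translation work.</p>
  <script>ignored()</script>
</div>
"""

for paragraph in extract_paragraphs(SAMPLE_HTML):
    print(paragraph)
```

Each returned string could then be passed to an NLP pipeline such as spaCy, which is the step the script uses to find organizations.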

Requirements

To run the script, you will need to have the following libraries installed:

requests
beautifulsoup4
nltk
spacy

You will also need to download the following NLTK resources using the nltk.download() method:

stopwords
punkt
averaged_perceptron_tagger
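One way to fetch these resources is sketched below. It uses nltk.download(), which returns True on success, and assumes network access is available. The function name download_nltk_resources and the constant NLTK_RESOURCES are hypothetical, not part of the script.

```python
# Resource names exactly as listed above.
NLTK_RESOURCES = ("stopwords", "punkt", "averaged_perceptron_tagger")


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each NLTK resource and return the names that failed."""
    # Imported here so this module can be loaded before NLTK is installed.
    import nltk

    failed = []
    for name in resources:
        # quiet=True suppresses the progress output of the downloader.
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed
```

Calling download_nltk_resources() once before the first run should be enough, since NLTK stores the downloaded data locally and reuses it.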

Usage

To use the script, simply run the following command in your terminal:

python wikipedia_nlp.py

The script prints information about the named entities of organization type found in each paragraph of the Wikipedia page on natural language processing. Specifically, it prints the following:

Text: The original text content of each paragraph that contains named entities of organization type.
Organizations: The named entities of organization type that were identified in each paragraph.
Named Entities: The named entities that were identified in each paragraph, regardless of their type.
POS tags: The part-of-speech tags for the cleaned tokens in each paragraph.
NER tags: The named entity recognition tags for the cleaned tokens in each paragraph.
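The five fields above could be held and printed per paragraph as in the sketch below. The README does not show the script's exact output layout, so the formatting here is an assumption, and ParagraphReport and format_report are hypothetical names. The sample values are written by hand for illustration and are not real model output.

```python
from dataclasses import dataclass


@dataclass
class ParagraphReport:
    """One paragraph's results (hypothetical container)."""
    text: str
    organizations: list    # entity strings of organization type
    named_entities: list   # (entity text, label) pairs, any type
    pos_tags: list         # (token, POS tag) pairs for cleaned tokens
    ner_tags: list         # (token, NER tag) pairs for cleaned tokens


def _pairs(pairs):
    # Render [("IBM", "ORG")] as "IBM/ORG".
    return " ".join(f"{token}/{tag}" for token, tag in pairs)


def format_report(report):
    """Build the five labeled output lines for one paragraph."""
    return "\n".join([
        f"Text: {report.text}",
        f"Organizations: {', '.join(report.organizations)}",
        f"Named Entities: {_pairs(report.named_entities)}",
        f"POS tags: {_pairs(report.pos_tags)}",
        f"NER tags: {_pairs(report.ner_tags)}",
    ])


# Hand-written sample values, for illustration only.
sample = ParagraphReport(
    text="IBM demonstrated machine translation in 1954.",
    organizations=["IBM"],
    named_entities=[("IBM", "ORG"), ("1954", "DATE")],
    pos_tags=[("IBM", "NNP"), ("demonstrated", "VBD"), ("machine", "NN"),
              ("translation", "NN"), ("1954", "CD")],
    ner_tags=[("IBM", "ORG"), ("demonstrated", "O"), ("machine", "O"),
              ("translation", "O"), ("1954", "DATE")],
)

print(format_report(sample))
```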

Output

The output shows which organizations are mentioned in the Wikipedia page on natural language processing and which named entities occur most often across the page. The script can also be extended to report additional information, such as the frequency of each named entity or the co-occurrence of named entities with certain keywords.
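The two extensions mentioned above, entity frequency and keyword co-occurrence, might look like the following sketch. It assumes the entities have already been extracted into one list per paragraph. The function names and the sample data are hypothetical, and the keyword match is a plain case-insensitive substring test.

```python
from collections import Counter


def entity_frequencies(paragraph_entities):
    """Count how often each entity appears across all paragraphs."""
    return Counter(ent for ents in paragraph_entities for ent in ents)


def keyword_cooccurrence(paragraphs, paragraph_entities, keywords):
    """Count (entity, keyword) pairs that share a paragraph."""
    counts = Counter()
    for text, ents in zip(paragraphs, paragraph_entities):
        lowered = text.lower()
        present = [kw for kw in keywords if kw.lower() in lowered]
        for ent in set(ents):          # count each entity once per paragraph
            for kw in present:
                counts[(ent, kw)] += 1
    return counts


# Made-up sample data, for illustration only.
paragraphs = [
    "IBM and Georgetown worked on machine translation.",
    "Later, IBM developed statistical translation models.",
    "Google released neural models.",
]
entities = [["IBM", "Georgetown"], ["IBM"], ["Google"]]

freq = entity_frequencies(entities)
cooc = keyword_cooccurrence(paragraphs, entities, ["translation", "neural"])

print(freq.most_common(1))             # → [('IBM', 2)]
print(cooc[("IBM", "translation")])    # → 2
```

Substring matching is the simplest choice here. Matching on tokens instead would avoid counting a keyword that only appears inside a longer word.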
