An Information Extraction system based on the Google Custom Search API, which uses SpanBERT and the Google Gemini API to extract tuples with the Iterative Set Expansion (ISE) algorithm.
- Abhishek Paul (ap4623)
- Puja Singla (ps3467)
- pytorch_pretrained_bert --> Folder that contains helper methods for the SpanBert model
- transcripts --> Folder that contains the transcript text files for the two given test cases (one using SpanBERT, one using Gemini)
- spanbert.txt --> Transcript for seed query "bill gates microsoft" over relation "Work_For" with a confidence threshold of 0.7 and k-value of 10 (using the spanbert model)
- gemini.txt --> Transcript for seed query "bill gates microsoft" over relation "Work_For" and k-value of 10 (using the gemini model)
- .gitignore --> gitignore file for python projects
- download_finetuned.sh --> shell script to install the pretrained SpanBert model
- extract_gemini.py --> helper file to perform information extraction using gemini
- extract_spanbert.py --> helper file to perform information extraction using spanbert
- google_search.py --> sub-routine for performing the Google search
- LICENSE --> MIT License
- main.py --> the main entry point of the application
- README.md --> Project Readme file
- requirements.txt --> list of dependencies and external libraries used
- scrape_text.py --> sub-routine for scraping text for information extraction from the URLs returned by the Google search
- spacy_help_functions.py --> helper file to use spaCy for text processing
- spanbert.py --> Code related to the SpanBert Model
You can clone the repo using the command given below,
git clone https://github.com/abhishekpaul11/COMS-E6111-information-extraction.git
or download a zip file of this repo.
Once you're in the repository in your terminal, type the following commands:
Note:
The project uses the spaCy NLP library, which is most compatible with Python 3.12.9. If you run into installation issues, switch to Python 3.12.9.
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
You might need to install the wget tool on your system before doing this.
bash download_finetuned.sh
python3 -m spacy download en_core_web_lg
python3 main.py [-spanbert|-gemini] <r> <t> <q> <k>
where,
[-spanbert|-gemini]
is one of "-spanbert" or "-gemini" indicating the method you'd like to use for Information Extraction,
<r>
is an integer between 1 and 4, indicating the relation to extract: 1 is for Schools_Attended, 2 is for Work_For, 3 is for Live_In, and 4 is for Top_Member_Employees,
<t>
is a real number between 0 and 1, indicating the "extraction confidence threshold", which is the minimum extraction confidence that we request for the tuples in the output; t is ignored if we are specifying -gemini,
<q>
is a "seed query", which is a list of words in double quotes corresponding to a plausible tuple for the relation to extract (e.g., "bill gates microsoft" for relation Work_For),
<k>
is an integer greater than 0, indicating the number of tuples that we request in the output.
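For example, the provided transcripts correspond to invocations like the following (with -gemini, the <t> argument still occupies its position but is ignored):
python3 main.py -spanbert 2 0.7 "bill gates microsoft" 10
python3 main.py -gemini 2 0.7 "bill gates microsoft" 10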
Note
- The above instructions are for macOS / Linux systems. Run the equivalent commands if you are using a Windows system.
- You need to add your API Keys for Google Custom Search, Google Programmable Search Engine and Google Gemini in main.py for the project to be functional.
Assumption
You have python3 and pip3 available on your system. If not, you can get them from Python Downloads and Pip Installation respectively.
This is the entry point of the application.
- It parses the CLI arguments and performs the Google search by calling the subroutine in google_search.py.
- After filtering out the non-HTML results, it sends the URL from each result to scrape_text.py.
- scrape_text.py returns the pre-processed and cleaned text extracted from the URL.
- This extracted text chunk is sent to spaCy to select the sentences that contain the entities of interest for the relation the user has provided in the CLI prompt (for example, PERSON and ORGANISATION in the case of Work_For).
- If the chosen method is 'spanbert', these valid sentences are then sent to extract_spanbert.py to get the tuples with the required relation and confidence level above the threshold, the details of which can be found in the next section.
- If the chosen method is 'gemini', these valid sentences are then sent to extract_gemini.py to get the tuples with the required relation, the details of which can be found in the next section.
- These new tuples are added to a set after removing the duplicates, and a new query is chosen to repeat the process. The specifics can be found in the next section.
It terminates the application under these scenarios:
- Stalling - If no unused queries are left for the iterative set expansion process to continue.
- 'k' unique tuples (with confidence level above the threshold in case of spanbert) have been found.
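A rough sketch of this iterative loop is shown below. The injected callables and data structures are illustrative assumptions about how main.py is organised, not its exact code.

```python
# Illustrative sketch of the Iterative Set Expansion loop driven by main.py.
# The injected callables (search, scrape, extract, pick_next) stand in for the
# project's real helpers; their names and signatures are assumptions.

def iterative_set_expansion(search, scrape, extract, pick_next, seed_query, k):
    extracted = {}        # (subject, object) -> confidence (1.0 when using Gemini)
    used_queries = set()  # queries already issued to Google
    seen_urls = set()     # URLs already processed
    query = seed_query

    while query is not None:
        used_queries.add(query)
        for result in search(query):               # top Google results for the query
            url = result.get("link")
            if url is None or url in seen_urls:    # skip non-HTML / repeated URLs
                continue
            seen_urls.add(url)
            text = scrape(url)                     # first 10,000 cleaned characters
            for tup, conf in extract(text):
                # de-duplicate, keeping the higher confidence
                extracted[tup] = max(conf, extracted.get(tup, 0.0))

        if len(extracted) >= k:                    # k tuples reached -> stop
            break
        query = pick_next(extracted, used_queries) # None means the process stalled

    return extracted
```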
This simply uses the Google Custom Search JSON API key and the Programmable Search Engine ID to perform the Google search on the given query and returns the results with appropriate error handling.
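A minimal sketch of such a call against the Custom Search JSON API, using the requests library (variable names are placeholders; google_search.py may be structured differently):

```python
import requests

def google_search(query, api_key, engine_id, num_results=10):
    """Return the top search results for `query` from the Custom Search JSON API (sketch)."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": engine_id, "q": query, "num": num_results},
        timeout=10,
    )
    response.raise_for_status()             # surface HTTP errors to the caller
    return response.json().get("items", [])
```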
Given a URL, it extracts the text content from it, cleans it up (removes unnecessary whitespace, newline characters and non-printable characters) and returns the first 10,000 characters.
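A minimal sketch of this step, assuming requests and BeautifulSoup are used for fetching and text extraction (the actual cleaning rules live in scrape_text.py):

```python
import re
import requests
from bs4 import BeautifulSoup

def scrape_text(url, max_chars=10_000):
    """Fetch a page and return its cleaned plain text, truncated to max_chars (sketch)."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    text = "".join(ch for ch in text if ch.isprintable())  # drop non-printable characters
    text = re.sub(r"\s+", " ", text)                       # collapse whitespace and newlines
    return text[:max_chars].strip()
```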
Helper method that makes the Gemini API call to the Gemini 2.0 Flash Model for extracting tuples from a sentence, given a relation.
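A minimal sketch of such a call, assuming the google-generativeai client (the exact prompt and response parsing are defined in extract_gemini.py):

```python
import google.generativeai as genai

def gemini_extract(sentence, prompt, api_key):
    """Ask Gemini 2.0 Flash to list (subject, object) tuples found in a sentence (sketch)."""
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-2.0-flash")
    # `prompt` is the one-shot in-context learning prompt described later,
    # with the target sentence appended to it.
    response = model.generate_content(prompt + "\nSentence: " + sentence)
    return response.text   # the caller parses this into a list of tuples
```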
Helper method that returns tuples along with their confidence levels from a sentence for a given relation by invoking the SpanBERT model.
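Roughly, each candidate entity pair is packaged with its tokenized sentence and passed to the model; the constructor argument, example format and prediction shape below are assumptions, so defer to spanbert.py and spacy_help_functions.py for the authoritative versions.

```python
# Illustrative sketch only; see spanbert.py for the real wrapper.
from spanbert import SpanBERT

spanbert = SpanBERT("./pretrained_spanbert")            # model path is an assumption

def spanbert_extract(candidate_pairs, target_relation, threshold):
    """Return {(subject, object): confidence} for pairs whose predicted relation
    matches target_relation and clears the confidence threshold (sketch)."""
    results = {}
    predictions = spanbert.predict(candidate_pairs)      # assumed: [(relation, confidence), ...]
    for pair, (relation, confidence) in zip(candidate_pairs, predictions):
        if relation == target_relation and confidence >= threshold:
            key = (pair["subj"][0], pair["obj"][0])      # assumed subject / object text
            results[key] = max(confidence, results.get(key, 0.0))
    return results
```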
Helper file with sub-routines for performing IE tasks on text.
Returns all named entities in a sentence after named entity tagging.
Returns true if the passed sentence contains all the entities based on the relation the user has provided in the CLI prompt. For example, for Work_For, it returns true if the sentence contains a PERSON and an ORGANISATION entity.
Returns a list of all tuples from the provided text for the given relation whose confidence level is above the threshold.
It accepts a sentence and returns all possible combinations of pairs of entities of our interest (as explained above) present in the sentence.
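Simplified, illustrative versions of these helpers are sketched below (the relation-to-entity mapping and function names are assumptions; note that spaCy's native tag for an organisation is ORG):

```python
import spacy
from itertools import permutations

# Illustrative mapping from relation to required spaCy entity labels.
RELATION_ENTITIES = {"Work_For": {"subj": {"PERSON"}, "obj": {"ORG"}}}

nlp = spacy.load("en_core_web_lg")

def get_entities(sentence):
    """All named entities in a spaCy sentence as (text, label) pairs."""
    return [(ent.text, ent.label_) for ent in sentence.ents]

def has_required_entities(sentence, relation):
    """True if the sentence contains at least one subject and one object entity type."""
    labels = {ent.label_ for ent in sentence.ents}
    required = RELATION_ENTITIES[relation]
    return bool(labels & required["subj"]) and bool(labels & required["obj"])

def candidate_pairs(sentence, relation):
    """All ordered (subject, object) entity pairs of the required types."""
    required = RELATION_ENTITIES[relation]
    return [(s, o) for s, o in permutations(sentence.ents, 2)
            if s.label_ in required["subj"] and o.label_ in required["obj"]]

# Example usage on a scraped text chunk:
# doc = nlp(text)
# valid_sentences = [s for s in doc.sents if has_required_entities(s, "Work_For")]
```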
Code related to the SpanBert Model
- The text content is extracted from each URL (which has not been processed before) returned by the Google search performed on the seed query given by the user in the CLI prompt for the initial iteration (or on the generated query for subsequent iterations).
- The text content is cleaned by removing unnecessary whitespace, newline characters and non-printable characters. It is then trimmed to the first 10,000 characters, which are used for IE going forward.
- Using spaCy, the text chunk is broken into individual sentences. For each sentence, spaCy returns the list of named entities it contains. We are only interested in those sentences that contain all the entity types involved in the relation specified by the user in the CLI prompt. We keep these valid sentences and carry on our process with them.
- If the user has specified spanbert as the method for information extraction, we try to generate all possible combinations of pairs of entities for the relation in the valid sentence.
- We then keep only those pairs that exactly match the relation type. For example, for Work_For the subject should be a PERSON and the object an ORGANISATION, so all (PERSON, ORGANISATION) entity pairs are kept while all other pairs, such as (PERSON, PERSON), (ORGANISATION, PERSON) and (ORGANISATION, ORGANISATION), are removed.
- The filtered entity pairs are then sent to the SpanBert model along with the parent sentence.
- The model returns the predicted relation for each pair along with its confidence level.
- We then keep only those predictions that match the user-specified relation and have a confidence level above the threshold. If duplicates are found, we keep the one with the higher confidence level.
- After repeating the process for every valid sentence in every url, we get a bunch of relevant tuples. Again, if duplicates are found, we keep the one with the higher confidence level.
- The program terminates if the number of accumulated tuples from the above step has reached <k> as specified by the user.
- If not, we pick the tuple with the highest confidence level (such that this tuple has not been used as a prior query) as the query for the next iteration and repeat the steps from the top (see the selection sketch after this list).
- In case we are left with no tuples which have not been used as a query before, it means the process has stalled. The program terminates in that case returning the tuples that have been gathered so far.
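A small sketch of the query-selection step referenced above (storing the extracted tuples as a dict from (subject, object) to confidence is an assumption, not necessarily how main.py stores them):

```python
def pick_next_query(extracted, used_queries):
    """Return the highest-confidence unused tuple as the next query, or None if stalled (sketch)."""
    for (subject, obj), confidence in sorted(extracted.items(), key=lambda kv: -kv[1]):
        query = f"{subject} {obj}"
        if query not in used_queries:
            return query
    return None   # every extracted tuple has already been used as a query
```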
- If the user has specified gemini as the method for information extraction, we use the Gemini API to invoke the Gemini 2.0 Flash model.
- We have fabricated a one-shot in-context learning prompt to extract a list of all tuples from a sentence for a specific relation, which can be found in extract_gemini.py. We use this prompt to perform the IE task (an illustrative sketch of the prompt's shape appears after this list).
- We pass every valid sentence in a URL to Gemini, and it returns a list of (subject, object) tuples that satisfy the given relation in the sentence. We remove the duplicate tuples, keeping only one copy of each.
- After repeating the process for every url, we get a bunch of relevant tuples. Again, if duplicates are found, we keep only one copy of them, removing the rest.
- The program terminates if the number of accumulated tuples from the above step has reached <k> as specified by the user.
- If not, we arbitrarily pick a tuple (such that this tuple has not been used as a prior query) as the query for the next iteration and repeat the steps from the top.
- In case we are left with no tuples which have not been used as a query before, it means the process has stalled. The program terminates in that case returning the tuples that have been gathered so far.
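An illustrative shape of such a one-shot prompt is shown below; it is a paraphrase for the Work_For relation, not the exact wording used in extract_gemini.py.

```python
# Paraphrased, illustrative one-shot prompt; the real prompt is in extract_gemini.py.
ONE_SHOT_PROMPT = """Extract all (subject, object) pairs for the relation Work_For
(a PERSON who works for an ORGANISATION) from the given sentence.

Example sentence: "Bill Gates co-founded Microsoft."
Example output: [("Bill Gates", "Microsoft")]

Return a list of tuples in the same format, or [] if the sentence contains none."""
```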
To perform the Search Engine API call and get the results.
To scrape text from the URLs returned by the Google search.
To perform NLP Pre-processing tasks (tokenization and Named Entity Tagging) on the plain text before Information Extraction.
To predict the relation between two entities in a sentence along with its confidence level.
To extract tuples for a given relation from a sentence.
We have decided to ignore non-HTML files in the information extraction analysis.