Coleridge Initiative - Show US the Data

This repo contains a copy of the Kaggle notebook.

Approach

Since research papers are long texts, I first wrote heuristics, using spaCy's text matching and regular expressions, to find paragraphs that might contain dataset mentions. On those paragraphs, I applied a fine-tuned RoBERTa model to extract a custom dataset entity.
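The filtering step can be sketched roughly as below. This is a minimal illustration using only the standard-library `re` module rather than spaCy's matcher, and the cue patterns (`KEYWORD_RE`, `ACRONYM_RE`) and function names are assumptions made for the example, not the rules used in the notebook.

```python
import re

# Hypothetical cue patterns; the notebook's actual heuristics are not listed here.
KEYWORD_RE = re.compile(
    r"\b(?:data\s?sets?|databases?|surveys?|census|cohorts?)\b", re.IGNORECASE
)
# Two or more capitalized words followed by an acronym in parentheses,
# e.g. "National Education Longitudinal Study (NELS)".
ACRONYM_RE = re.compile(r"(?:[A-Z][A-Za-z]+\s+){2,}\([A-Z]{2,}\)")


def split_paragraphs(text):
    """Split raw text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def candidate_paragraphs(text):
    """Keep only paragraphs that match at least one cue pattern."""
    return [
        p
        for p in split_paragraphs(text)
        if KEYWORD_RE.search(p) or ACRONYM_RE.search(p)
    ]
```

Only the paragraphs returned by `candidate_paragraphs` would then be passed to the NER model, which keeps inference on long papers cheap.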

The work involved converting the annotations to CoNLL-U format and training the transformer for multiple epochs.
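As a rough illustration of the annotation conversion, the sketch below tags a known dataset mention inside a tokenized sentence with BIO labels and writes one token per line. It assumes a simplified two-column, CoNLL-style layout rather than the full ten-column CoNLL-U format, and the `DATASET` label name and both function names are hypothetical.

```python
def bio_tag(tokens, mention_tokens, label="DATASET"):
    """Return BIO tags marking every occurrence of mention_tokens in tokens."""
    tags = ["O"] * len(tokens)
    n = len(mention_tokens)
    if n == 0:
        return tags
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in mention_tokens]
    for i in range(len(tokens) - n + 1):
        # Case-insensitive match that does not overlap an earlier match.
        if lowered[i:i + n] == target and all(t == "O" for t in tags[i:i + n]):
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"
    return tags


def to_conll_lines(tokens, tags):
    """Serialize tokens and tags as tab-separated lines, one token per line."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags))
```

For example, tagging the mention `Baltimore Longitudinal Study of Aging` in a sentence marks its first token `B-DATASET`, the remaining four `I-DATASET`, and everything else `O`.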

To avoid overfitting, only a randomly selected subset of the labels was used, along with randomly selected negative examples.
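One possible reading of this sampling step is sketched below: pick a random subset of dataset labels, keep the positive examples for those labels, and mix in a random sample of negative paragraphs. The function name, its parameters, and the `(text, label)` data layout are assumptions for illustration; the README does not state the sizes or the seed that were used.

```python
import random


def build_training_set(positives, negatives, n_labels, n_negatives, seed=0):
    """Sample a subset of labels plus random negatives.

    positives: list of (text, label) pairs
    negatives: list of texts with no dataset mention
    Returns a shuffled list of (text, label_or_None) pairs.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible

    # Group positive texts by their dataset label.
    by_label = {}
    for text, label in positives:
        by_label.setdefault(label, []).append(text)

    # Randomly keep only some labels so the model cannot memorize all of them.
    kept = rng.sample(sorted(by_label), min(n_labels, len(by_label)))
    examples = [(t, lab) for lab in kept for t in by_label[lab]]

    # Add randomly chosen negative examples, marked with label None.
    chosen = rng.sample(negatives, min(n_negatives, len(negatives)))
    examples += [(t, None) for t in chosen]

    rng.shuffle(examples)
    return examples
```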

This approach was good enough for a top-100 rank. The winning solution trained another model for the initial paragraph-filtering step instead of relying on heuristics.

Trained model inference example:
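The original example is not reproduced here, so the following is only a minimal sketch of what inference with a trained spaCy pipeline might look like. It relies on the standard spaCy interface, where calling the pipeline on a string returns a `Doc` whose `ents` carry `text` and `label_`; the `DATASET` label, the function name, and the model path in the usage comment are placeholders rather than values taken from the repo.

```python
def extract_dataset_mentions(nlp, paragraphs, label="DATASET"):
    """Run a spaCy-style pipeline over paragraphs and collect entity texts.

    nlp: a callable returning an object with an `ents` iterable, where each
         entity exposes `text` and `label_` (as a loaded spaCy pipeline does).
    """
    mentions = []
    for paragraph in paragraphs:
        doc = nlp(paragraph)
        for ent in doc.ents:
            if ent.label_ == label:
                mentions.append(ent.text)
    # Remove duplicates while preserving first-seen order.
    return list(dict.fromkeys(mentions))


# Usage with a real pipeline (the path is a placeholder, not from the repo):
# import spacy
# nlp = spacy.load("path/to/trained-model")
# print(extract_dataset_mentions(nlp, ["Data come from the ADNI study."]))
```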
