Shivam-Miglani/Coleridge-Show-US-the-Data

Coleridge Initiative - Show US the Data

Competition link | Kaggle notebook

This repo contains a copy of the Kaggle notebook.

Approach

Since research papers are long texts, I first created heuristics, using spaCy's text matching and regular expressions, to find paragraphs that might contain dataset mentions. On those paragraphs, I applied a fine-tuned RoBERTa model to extract a custom dataset entity.
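A minimal sketch of the paragraph-filtering step, for illustration only. It uses Python's standard `re` module rather than spaCy's matcher, and the cue words, the capitalised-phrase pattern, and the function name `candidate_paragraphs` are all assumptions, not the heuristics used in the notebook.

```python
import re

# Hypothetical cue words that often accompany dataset mentions.
CUE_PATTERN = re.compile(
    r"\b(?:data\s?set|database|survey|study|cohort|census)s?\b",
    re.IGNORECASE,
)
# Two or more consecutive capitalised words, optionally followed by an
# acronym in parentheses, e.g. "National Health Survey (NHS)".
TITLE_RUN = re.compile(
    r"(?:[A-Z][\w'-]+\s+){1,}[A-Z][\w'-]+(?:\s+\([A-Z]{2,}\))?"
)


def candidate_paragraphs(text):
    """Return paragraphs containing both a cue word and a capitalised phrase."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        p for p in paragraphs
        if CUE_PATTERN.search(p) and TITLE_RUN.search(p)
    ]
```

Only the paragraphs returned by such a filter would then be passed to the entity-extraction model, which keeps inference time manageable on long papers.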

The work involved

  • converting the annotations to CoNLL-U format, and
  • training the transformer for multiple epochs.
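The annotation conversion can be sketched roughly as follows. This is a simplified illustration, not the notebook's code: it writes a two-column, CoNLL-style `token<TAB>tag` layout with BIO tags rather than the full CoNLL-U column set, and the tokenizer, the `DATASET` label, and the function name `to_conll_lines` are assumptions.

```python
import re


def to_conll_lines(text, mentions, label="DATASET"):
    """Tag each token as B-/I-<label> inside a mention and O elsewhere."""
    # Simple tokenizer (assumption): runs of word characters, or single
    # punctuation marks, each kept with its character offsets.
    tokens = [
        (m.group(), m.start(), m.end())
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]
    tags = ["O"] * len(tokens)
    for mention in mentions:
        for match in re.finditer(re.escape(mention), text):
            # Tokens that lie fully inside this occurrence of the mention.
            inside = [
                i for i, (_, start, end) in enumerate(tokens)
                if start >= match.start() and end <= match.end()
            ]
            for rank, i in enumerate(inside):
                tags[i] = ("B-" if rank == 0 else "I-") + label
    return [f"{tok}\t{tag}" for (tok, _, _), tag in zip(tokens, tags)]
```

For example, `to_conll_lines("We use the Census of Agriculture data.", ["Census of Agriculture"])` tags `Census` as `B-DATASET`, `of` and `Agriculture` as `I-DATASET`, and every other token as `O`.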

To avoid overfitting, only a randomly selected subset of labels was used, along with randomly selected negative examples.
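One way such a sampling step might look is sketched below. The fraction of labels kept, the ratio of negatives to positives, and the function name `sample_training_set` are hypothetical; the notebook's actual values are not stated here.

```python
import random


def sample_training_set(positives, negatives, label_fraction=0.5,
                        neg_per_pos=1.0, seed=0):
    """Keep examples for a random subset of labels plus random negatives.

    positives: list of (text, label) pairs.
    negatives: list of texts without any dataset mention.
    """
    rng = random.Random(seed)
    labels = sorted({label for _, label in positives})
    # Keep at least one label when any exist (assumed policy).
    k = min(len(labels), max(1, int(len(labels) * label_fraction)))
    kept = set(rng.sample(labels, k))
    pos = [ex for ex in positives if ex[1] in kept]
    # Draw negatives in proportion to the positives that were kept.
    n_neg = min(len(negatives), int(len(pos) * neg_per_pos))
    neg = rng.sample(negatives, n_neg)
    data = pos + [(text, None) for text in neg]
    rng.shuffle(data)
    return data
```

Holding out some labels entirely pushes the model to rely on context rather than memorising specific dataset names, which is presumably the motivation for the subset.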

The results of this approach were good enough for a top-100 rank. For the first step of filtering paragraphs, the winning solution trained another model instead of applying heuristics.

Trained model inference example:

[Image: trained model inference example]
