V1.3

@Daniele-Gregori Daniele-Gregori released this 08 Mar 06:58
· 7 commits to main since this release
e0fc554

This release improves the topic classifier, which assigns a paper to an arXiv category or to a hep-th subcategory.

In particular, proper NNs are now built (rather than just using the Classify super-function), with a Long Short-Term Memory (LSTM) layer that analyses abstracts and/or titles. These texts enter the network as either SciBERT or CONCEPT embeddings.
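To make the role of the LSTM layer concrete, here is a minimal pure-Python sketch of the standard LSTM recurrence applied to a sequence of embedding vectors. It is not the network defined in this release: the function names, the toy 2-dimensional embeddings and the constant weights from the hypothetical `constant_params` helper are all assumptions made for illustration, and a real classifier would use trained weights and a framework layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, vector):
    """Return weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h, c, params):
    """One LSTM step: gates decide what the cell state forgets, stores and emits."""
    z = x + h  # concatenate the input with the previous hidden state
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # input gate
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["Wg"], params["bg"], z)]  # candidate
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def encode_sequence(vectors, params, hidden_size):
    """Run the LSTM over the whole sequence and return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in vectors:
        h, c = lstm_step(x, h, c, params)
    return h

def constant_params(value, input_size, hidden_size):
    """Hypothetical stand-in for trained weights: every weight equals `value`."""
    width = input_size + hidden_size
    params = {}
    for gate in "fiog":
        params["W" + gate] = [[value] * width for _ in range(hidden_size)]
        params["b" + gate] = [0.0] * hidden_size
    return params

# Three made-up 2-dimensional word embeddings standing in for a short title.
title_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = encode_sequence(title_vectors, constant_params(0.1, 2, 2), hidden_size=2)
print(encoded)
```

The final hidden state summarises the whole title or abstract in a fixed-size vector, which a classifier head (for example a linear layer followed by a softmax) can then map to category probabilities.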

[Figure: NN classifier]
  • SciBERT is a well-known network that can act as an embedding layer specific to scientific texts. For our purposes, however, it is still somewhat generic (not specific to theoretical physics, hep-th) and very demanding in resources (RAM).

  • CONCEPT is a new embedding built here from scratch, essentially by looking at common combinations of title words that are identified as actual hep-th concepts. It is much lighter on resources and makes it possible to tackle larger datasets.
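
The idea behind a concept-style embedding can be sketched as follows. This is only an illustration under stated assumptions, not the CONCEPT construction itself: it treats a "concept" as a pair of adjacent title words that recurs across titles, uses a simple frequency threshold (`min_count`) as a stand-in for whatever identification step the release actually applies, and encodes a title as a multi-hot vector. All function names and the example titles are invented.

```python
from collections import Counter

def title_bigrams(title):
    """Lower-case a title and return its adjacent word pairs."""
    words = title.lower().split()
    return list(zip(words, words[1:]))

def build_concept_vocab(titles, min_count=2):
    """Keep word pairs that occur in at least `min_count` titles (assumed criterion)."""
    counts = Counter()
    for title in titles:
        counts.update(set(title_bigrams(title)))  # count each pair once per title
    frequent = sorted(pair for pair, n in counts.items() if n >= min_count)
    return {pair: index for index, pair in enumerate(frequent)}

def embed_title(title, vocab):
    """Multi-hot vector: 1.0 wherever a vocabulary concept appears in the title."""
    vector = [0.0] * len(vocab)
    for pair in title_bigrams(title):
        if pair in vocab:
            vector[vocab[pair]] = 1.0
    return vector

# Invented example titles, not taken from the actual dataset.
titles = [
    "Black hole entropy in string theory",
    "String theory and black hole microstates",
    "Conformal field theory on the torus",
    "Entanglement in conformal field theory",
]
vocab = build_concept_vocab(titles)
print(vocab)
print(embed_title("Black hole entropy", vocab))  # [1.0, 0.0, 0.0, 0.0]
```

Because the vocabulary holds only recurring domain-specific combinations, the resulting vectors are far smaller than the output of a large pretrained language model, which is consistent with the lower memory footprint described above.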

We test and compare the performance of the classifiers on both properly distinct category classes and mixed classes within hep-th. In the former case, we find good accuracy with the SciBERT embedding and perfect accuracy with the CONCEPT embedding.

[Figure: confusion matrix for the properly distinct categories]

The latter, mixed classification problem (in terms of so-called hep-th cross-list categories) is somewhat ill-posed, as researchers themselves often do not bother to mark their papers as cross-listed. Accordingly, we find a peculiar confusion matrix in which, besides the central diagonal, a vertical and a horizontal line are also highlighted, corresponding to the actual and predicted main hep-th category.

[Figure: confusion matrix for the hep-th cross-list categories]
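
The cross-shaped pattern can be reproduced with a small sketch. The labels and predictions below are invented toy data, not results from this release, and the convention that rows are actual classes and columns are predicted classes is an assumption. When papers are frequently confused with the main hep-th class in both directions, the hep-th row and the hep-th column both fill up, in addition to the diagonal.

```python
def confusion_matrix(actual, predicted, labels):
    """Count matrix with rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

# Toy labels: the main class plus two hypothetical cross-list classes.
labels = ["hep-th", "gr-qc", "math-ph"]
actual = ["hep-th", "hep-th", "hep-th", "gr-qc", "gr-qc", "math-ph", "math-ph"]
predicted = ["hep-th", "gr-qc", "math-ph", "gr-qc", "hep-th", "math-ph", "hep-th"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:8s}", row)
print("accuracy:", accuracy(matrix))
```

In this toy matrix the first row (actual hep-th) and the first column (predicted hep-th) contain no zeros, which gives the horizontal and vertical lines crossing the diagonal.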

Finally, a proof-of-concept recommendation program is set up. However, it does not perform very well, and it probably requires a new NN based on citations rather than just abstracts and titles.