Release V1.3 · Daniele-Gregori/ArXiv-Hepth-Data-Analysis

This release improves the topic classifier, which assigns papers to an ArXiv category or to a hep-th subcategory.

In particular, dedicated neural networks (NNs) are built (rather than just using the Classify super-function), with a Long Short-Term Memory (LSTM) layer that analyses abstracts and/or titles. These texts enter the network as either SciBERT or CONCEPT embeddings.

SciBERT is a well-known net that can act as an embedding layer specific to scientific texts. However, for our purposes it is still somewhat generic, since it is not specific to theoretical physics (hep-th), and it is very demanding in resources (RAM).
CONCEPT is a new embedding built here from scratch, essentially by looking at common combinations of title words that are identified as actual hep-th concepts. It is much lighter on resources and makes it possible to tackle larger datasets.
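
To make the idea of an embedding built from title word combinations more concrete, here is a minimal Python sketch. It is not the CONCEPT implementation from this repository: the function names, the sample titles, the restriction to adjacent word pairs, and the frequency threshold `min_count` are all assumptions made for illustration. In particular, a plain frequency cut stands in for whatever step identifies a word combination as an actual hep-th concept.

```python
from collections import Counter

def tokenize(title):
    # Lowercase the title and strip basic punctuation from each word.
    words = (w.strip(".,:;()").lower() for w in title.split())
    return [w for w in words if w]

def bigrams(tokens):
    # Adjacent word pairs, e.g. ("black", "hole").
    return list(zip(tokens, tokens[1:]))

def build_vocabulary(titles, min_count=2):
    # Hypothetical concept selection: keep word pairs that occur
    # in at least `min_count` distinct titles.
    counts = Counter()
    for title in titles:
        counts.update(set(bigrams(tokenize(title))))
    frequent = sorted(pair for pair, n in counts.items() if n >= min_count)
    return {pair: i for i, pair in enumerate(frequent)}

def embed(title, vocab):
    # Binary vector with one slot per retained word pair.
    vec = [0] * len(vocab)
    for pair in bigrams(tokenize(title)):
        if pair in vocab:
            vec[vocab[pair]] = 1
    return vec

# Invented sample titles, not taken from the dataset.
titles = [
    "Black hole entropy in string theory",
    "String theory and conformal field theory",
    "Entropy of a black hole",
    "Conformal field theory on the torus",
]
vocab = build_vocabulary(titles)
print(sorted(vocab))
print(embed(titles[0], vocab))
```

In a sketch like this the vector length equals the number of retained word pairs, so memory use is governed by the size of the concept vocabulary rather than by a large pretrained model.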

We test and compare the classifiers on two problems: properly distinct category classes, and mixed classes within the hep-th category. In the former case, we find good accuracy with the SciBERT embedding and perfect accuracy with the CONCEPT embedding.

The latter, mixed classification problem (in terms of so-called hep-th cross-list categories) is somewhat ill-posed, as researchers themselves often do not bother to label their papers as such. Accordingly, we find a peculiar confusion matrix in which, besides the main diagonal, a vertical and a horizontal line are also highlighted, corresponding to the actual and predicted main hep-th category.
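
The shape of such a confusion matrix can be reproduced with a toy example. The Python sketch below uses invented labels and predictions, not results from this release; the category names, the sample data, and the convention that rows are actual classes and columns are predicted classes are assumptions for illustration. It shows how confusion concentrated on the main category fills one row and one column in addition to the diagonal.

```python
def confusion_matrix(actual, predicted, labels):
    # Rows index the actual class, columns the predicted class.
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Toy data: "hep-th" plays the role of the main category,
# the other two stand in for cross-list categories.
labels = ["hep-th", "gr-qc", "math-ph"]
actual = ["hep-th", "hep-th", "hep-th", "gr-qc", "gr-qc", "math-ph", "math-ph"]
predicted = ["hep-th", "gr-qc", "math-ph", "gr-qc", "hep-th", "math-ph", "hep-th"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:8s}", row)
```

Here the first row collects main-category papers predicted as cross-lists, and the first column collects cross-list papers predicted as the main category, while the cross-list classes are not confused with each other.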

Finally, a proof-of-concept recommendation program is set up. However, it does not perform very well, and it probably needs a new NN based on citations rather than just abstracts and titles.
