Step towards AI Physicist

In this repository, I have undertaken a baby step towards creating a language model that can process and generalize semantic structures and tokens correlated through mathematical and physical reasoning, in addition to the usual natural-language structures, as would be required for an AI model equipped with a chain of thought similar to that of human physicists.

As a first step, I want to emphasize the importance of creating clean data for fine-tuning such models. Without high-quality data, these models cannot reliably learn logical reasoning in terms of semantic structures and token embeddings. So far, I have a baby data model for that, which already highlights how human thinking and scientific writing show up as high-dimensional correlations between linguistic tokens and mathematical symbols. To equip language models of any size with robust logical reasoning and the ability to generalize it, e.g. through in-context learning, we need to better understand how to map these correlations between the natural-language and mathematical-symbol parts of scientific papers onto multi-dimensional manifolds. Here, I have limited the scope of this data model to physics papers alone, i.e. to Greek symbols and LaTeX formatting, but it can easily be generalized to other fields. I have demonstrated the model on a simple, recent 7-page physics paper. There are still bugs that degrade the quality of the data and of the correlations among symbols and letters; in a future iteration, I will attempt to fix as many of them as possible.

Given perfect data extracted from any scientific paper, a sufficiently large number of papers, and enough compute, state-of-the-art language models should develop some level of in-context-learning (ICL) ability to decipher how humans think about science, i.e. how mathematical symbols become correlated in accordance with scientific concepts, and how those concepts break down into a logical flow among the different parts of each paragraph on a page. That would be the ideal scenario.

Code description:

  • html_to_tex_converter.py: This class converts any HTML webpage to a .txt file that can be easily embedded as tokens in the train/test dataset of a Transformer-based language model. While the conversion is nearly foolproof for the natural-language parts, a few visible bugs remain in the mathematical-symbol and semantics parts; fixing those is the goal of a future debugging iteration. For now, we can assume that, in the limit of training a Transformer-based language model on thousands or even millions of such research papers, fine-tuning would wash out the influence of minor symbolic bugs in any single paper: the train/test data in that ideal case would naturally carry some covariance (effectively, a standard error around the mean) that absorbs these bugs and uncertainties within the multi-dimensional token space. Having said that, such bugs may also lead to hallucinations, and it is up to us to determine how these effects propagate through the model during training.
  • data_generator.py: This block demonstrates how to generate .txt datasets from any physics research paper available online -- in particular, we choose "https://arxiv.org/html/2506.14609v1" from arXiv in its HTML version to demonstrate the usefulness of this block. This lets us inspect the sequencing of (and interplay between) the natural-language and mathematical-symbol blocks that present the written form of human chains of thought in physics research. A combined sketch of the converter and this data-generation step follows the list below.
  • train_model.py: This block evaluates a simple Transformer-based language model, DistilGPT-2 (a distilled version of OpenAI's GPT-2), on the dataset generated via html_to_tex_converter.py. The tokenizer used is AutoTokenizer.from_pretrained("distilgpt2"); the model was run on a Google Colab CPU, with a runtime of about 2 minutes. The performance of the model is poor, and the reason could be (A) bugs in the dataset, (B) the choice of tokenizer, especially for mathematical symbols and semantics, or (C) the choice of model. A sketch of this evaluation also follows the list.
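
Below is a minimal sketch of the conversion and data-generation pipeline described in the first two bullets. It is an approximation, not the repository's exact implementation: it assumes that arXiv's HTML renders (produced by LaTeXML) expose each equation as a <math> element whose alttext attribute carries the original LaTeX, and the output file name paper_dataset.txt is an illustrative choice.

```python
# Minimal sketch of the html_to_tex_converter.py + data_generator.py pipeline.
# Assumption: arXiv HTML renders mark equations as <math alttext="..."> (LaTeXML output).
import requests
from bs4 import BeautifulSoup

ARXIV_HTML_URL = "https://arxiv.org/html/2506.14609v1"


def html_to_text(html: str) -> str:
    """Flatten an HTML page to plain text, replacing rendered math with its
    LaTeX source so that symbol tokens survive the conversion."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop scripts, styles, and navigation chrome that carry no paper content.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    # Replace each <math> node with its LaTeX alttext, wrapped in $...$,
    # so natural-language tokens and math tokens stay interleaved in order.
    for math_tag in soup.find_all("math"):
        latex = math_tag.get("alttext", "")
        math_tag.replace_with(f" ${latex}$ " if latex else " ")

    # Collapse surrounding whitespace and drop empty lines.
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    html = requests.get(ARXIV_HTML_URL, timeout=30).text
    text = html_to_text(html)
    with open("paper_dataset.txt", "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Wrote {len(text)} characters to paper_dataset.txt")
```

Keeping the LaTeX source inline, delimited by $...$, is what preserves the interleaving of linguistic and mathematical tokens that the rest of this README is concerned with.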
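
The next sketch shows, under the same caveat, how the evaluation in train_model.py could look with the Hugging Face transformers library: load AutoTokenizer.from_pretrained("distilgpt2") and the matching causal language model, score the generated .txt file in fixed-size windows, and report perplexity. The window size of 512 tokens and the file name are illustrative assumptions, not values taken from the repository.

```python
# Minimal sketch of evaluating distilgpt2 on the generated .txt dataset.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def evaluate_perplexity(dataset_path: str = "paper_dataset.txt") -> float:
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")
    model.eval()

    with open(dataset_path, encoding="utf-8") as f:
        text = f.read()

    # Tokenize the whole paper and score it in non-overlapping 512-token windows.
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    window, losses = 512, []
    with torch.no_grad():
        for start in range(0, len(ids), window):
            chunk = ids[start : start + window].unsqueeze(0)
            if chunk.shape[1] < 2:  # a single trailing token cannot be scored
                break
            out = model(chunk, labels=chunk)  # causal LM loss = mean next-token NLL
            losses.append(out.loss.item())

    avg_loss = sum(losses) / len(losses)
    return math.exp(avg_loss)  # perplexity


if __name__ == "__main__":
    print(f"distilgpt2 perplexity: {evaluate_perplexity():.1f}")
```

A very high perplexity here would be consistent with the failure modes listed above: noisy math extraction, a tokenizer that fragments Greek letters and LaTeX commands, and a small general-purpose model.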
