This repository contains the code and datasets for a research project on the statistical analysis of Hindi and Sanskrit. The study examines the linguistic structure of both languages and explores its relationship with culture and society. The findings have practical applications in fields such as cryptanalysis, machine translation, natural language processing, and sentiment analysis.
- Dataset Selection: Meticulous selection and evaluation of datasets for both Hindi and Sanskrit.
- Linguistic Aspects Explored:
  - Frequency Analysis
  - Character Grouping
  - Digrams and Trigrams
  - Average Word Length
  - Zipf’s Law
  - Word Entropy
  - N-gram Entropy
- Encouraging Results:
  - Distinct patterns in character occurrences
  - Structural complexities
  - Adherence to Zipf’s Law in both languages
  - Balanced mix of structured and variable word usage based on Word Entropy analysis
- Comparisons with English:
  - N-gram Entropy comparisons with English for insights into symbol relationships.
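As a rough illustration of the measures listed above, the sketch below computes character n-gram counts, Shannon entropy, average word length, and a rank-frequency table of the kind used to check Zipf's Law. It is not the code from analysis.ipynb: the function names and the sample sentence are hypothetical, and it treats every Unicode code point as one symbol, so Devanagari vowel signs and the virama are counted separately. The notebooks may group characters differently.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping character n-grams (code points, not grapheme clusters)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_frequency(words):
    """Return (rank, word, count) triples, most frequent first, for a Zipf plot."""
    ranked = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


# Tiny illustrative sample, not taken from the project's datasets.
sample = "राम वन गया राम घर आया"
words = sample.split()
chars = sample.replace(" ", "")  # assumption: spaces dropped, so digrams span word boundaries

unigram_entropy = shannon_entropy(ngram_counts(chars, 1))
digram_entropy = shannon_entropy(ngram_counts(chars, 2))
word_entropy = shannon_entropy(Counter(words))
average_word_length = sum(len(w) for w in words) / len(words)
zipf_table = rank_frequency(words)
```

Under Zipf's Law, plotting log(count) against log(rank) from `zipf_table` should give an approximately straight line on a large corpus. A sample this small only shows the shape of the computation.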
The repository contains two main folders: hindi and sanskrit. Each folder contains analysis.ipynb to generate the results. Results are in the form of CSV files and images.
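Once the notebooks have run, the generated files can be located with a short helper such as the one below. This is a minimal sketch: `list_outputs` is a hypothetical name, the image suffixes are assumptions, and the exact output file names depend on the notebooks.

```python
from pathlib import Path


def list_outputs(repo_root="."):
    """Map each language folder to the CSV and image files found inside it."""
    image_suffixes = {".png", ".jpg", ".jpeg", ".svg"}  # assumed image formats
    outputs = {}
    for folder in ("hindi", "sanskrit"):
        # Search the folder recursively; a missing folder simply yields no files.
        files = sorted(p for p in Path(repo_root, folder).rglob("*") if p.is_file())
        outputs[folder] = {
            "csv": [p for p in files if p.suffix.lower() == ".csv"],
            "images": [p for p in files if p.suffix.lower() in image_suffixes],
        }
    return outputs
```

Calling `list_outputs()` from the repository root returns a dictionary keyed by folder name, which can then be passed to a CSV reader or an image viewer of your choice.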
To reproduce the results of the research, follow these steps:
- Clone the repository:
    git clone https://github.com/guptalab/hindisanskritstat.git
- Install the required packages:
    pip install -r requirements.txt
- Run the analysis.ipynb file in both folders to generate the results.