Project अर्थ (Artha)

See this Loom for a demo.

Discovered Challenges:

Text clean-up instructions:
a. ✅ Strings like "Himālayas" become "हिमालयस्" whereas they should be "हिमालय".
b. Apostrophes like "Rāvaṇa's" become "रावणऽस्"; they should be "रावण 's" so that reading them in English still makes sense.
c. Stray blanks can also cause issues, e.g. "श्रीदत्त ," should be "श्रीदत्त,".
d. Page numbers leak into the text, e.g. in the "SŪRYA" text "[Page 772b]" should be removed.
e. Clean up the English text, e.g. "Viz. " to "which is ".
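Items c, d and e above are mechanical enough to sketch with regular expressions. The snippet below is a minimal illustration, not the project's actual cleaner: the function name `clean_text` and the exact patterns are assumptions inferred from the examples in the list. Items a and b need knowledge of English suffixes before transliteration and are not covered here.

```python
import re

# Hypothetical patterns, inferred only from the examples in the list above.
PAGE_MARKER = re.compile(r"\s*\[Page\s+\d+[a-z]?\]")   # e.g. "[Page 772b]"
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")       # e.g. "श्रीदत्त ,"
VIZ = re.compile(r"\bviz\.\s*", re.IGNORECASE)          # e.g. "Viz. "


def clean_text(text: str) -> str:
    """Apply the page-marker, abbreviation and spacing clean-ups in order."""
    text = PAGE_MARKER.sub("", text)            # drop page markers
    text = VIZ.sub("which is ", text)           # expand the abbreviation
    text = SPACE_BEFORE_PUNCT.sub(r"\1", text)  # remove blanks before punctuation
    return re.sub(r"[ \t]{2,}", " ", text).strip()


print(clean_text("श्रीदत्त ,"))
print(clean_text("The sun god. [Page 772b] He rides a chariot."))
```

Note that removing blanks before punctuation would also change token strings such as "T . Pratiśravas", so a step like this probably belongs before tagging rather than after it.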
Reference structures can be polluted:
a. In places, the references are translated instead of being picked up as a whole.
b. References can also be longer text like "(See under the word AMṚTAM )", not just the Shloka and the like. It is the job of the ingesting system to figure out the types.
c. References are split incorrectly. For example, for the string "A king of the Yayāti dynasty. (Bhāgavata, 9th Skandha)." the predicted output is "L (", "A Bhāgavata", "T , 9th Skandha", "L )." whereas it should be "R ( Bhāgavata, 9th Skandha )".
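One way to keep a reference whole is to segment the string on parentheses before any token-level tagging runs. The sketch below assumes the tag names "T" (plain text) and "R" (reference) from the example above, and the function name `segment_references` is hypothetical. It treats every parenthetical as a reference candidate, which is a simplification: a parenthetical such as "(The sixty-three's)" is not a reference, so a real ingester would still need a classification step.

```python
import re

# Matches one non-nested parenthetical; nested parentheses are not handled.
PAREN = re.compile(r"\(([^()]*)\)")


def segment_references(text: str) -> list[tuple[str, str]]:
    """Split text into ("T", plain text) and ("R", reference) segments."""
    segments = []
    cursor = 0
    for match in PAREN.finditer(text):
        if match.start() > cursor:
            segments.append(("T", text[cursor:match.start()]))
        # Emit the whole parenthetical as a single reference segment.
        segments.append(("R", f"( {match.group(1).strip()} )"))
        cursor = match.end()
    if cursor < len(text):
        segments.append(("T", text[cursor:]))
    return segments


for tag, value in segment_references(
    "A king of the Yayāti dynasty. (Bhāgavata, 9th Skandha)."
):
    print(tag, repr(value))
```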
Polluted keys in the database, such as:
a. "parrot (parrot)", which is English.
b. "śaṅkhaparvata (mountain)", part of which is English, becoming "शङ्खपर्वत (मोउन्तैन्)".
c. Numbers may appear, e.g. "subāhu xii" for "SUBĀHU XII".
d. "SAṂSĀRA" is getting indexed as "saṂsāra", which means the transliteration is not working correctly.
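A possible normalization step for keys is sketched below. It assumes that a trailing parenthetical is an English qualifier and that a trailing run of Roman-numeral letters is an ordinal; both are heuristics inferred from the examples above, and the names `Key` and `normalize_key` are hypothetical. Unicode-aware lowercasing after NFC normalization yields "saṃsāra" for "SAṂSĀRA", so only the base would then be passed to transliteration.

```python
import re
import unicodedata
from typing import NamedTuple, Optional


class Key(NamedTuple):
    base: str                  # IAST part, safe to transliterate
    qualifier: Optional[str]   # English qualifier, e.g. "mountain"
    ordinal: Optional[int]     # e.g. 12 for "XII"


QUALIFIER = re.compile(r"\s*\(([^()]*)\)\s*$")
ROMAN = re.compile(r"\s+([ivxlcdm]+)$")
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral by scanning right to left."""
    total, prev = 0, 0
    for ch in reversed(numeral.lower()):
        value = ROMAN_VALUES[ch]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


def normalize_key(raw: str) -> Key:
    """Lowercase a headword and split off its qualifier and ordinal."""
    text = unicodedata.normalize("NFC", raw).strip().lower()
    qualifier = None
    match = QUALIFIER.search(text)
    if match:
        qualifier = match.group(1).strip()
        text = text[:match.start()]
    ordinal = None
    match = ROMAN.search(text)
    if match:
        ordinal = roman_to_int(match.group(1))
        text = text[:match.start()]
    return Key(text, qualifier, ordinal)


print(normalize_key("SUBĀHU XII"))
print(normalize_key("śaṅkhaparvata (mountain)"))
```

The Roman-numeral rule would misfire on a multi-word key whose last word happens to use only those letters, so a real implementation may want to check it against the headword list.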
Missing Akshara:
a. In strings like "T . Pratiśravas had a son named Pratīpa. ", which contains two Akshara tokens.
b. In key "ŚĀNTANAVA", the string "T . He has written a book called 'Phiṭsūtra' about the " contains an Akshara token.
c. The absence of IAST tokens in a string does not mean it is not an Akshara, e.g. "Upamanyu", "Brahman".
d. "A sub Parva in Vana Parva of महाभारत comprising of chapters 165 to 175 ." should have Parva and Vana Parva as Akshara tokens.
Incorrect things getting tagged as Akshara:
a. In key "ARUPATTIMŪVAR", text "(The sixty-three's)", the section "The sixty-three's" is being tagged as Akshara.
b. Numbers should not be tagged as Akshara, e.g. " १२थ् century." is wrong.
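For item b, a cheap guard is to reject any token that contains a digit before it is considered for Akshara tagging. This is only a sketch under that assumption; the function name `may_be_akshara` is hypothetical, and passing the guard does not make a token an Akshara, it only rules out the numeric cases.

```python
def may_be_akshara(token: str) -> bool:
    """Return False for tokens containing any digit, in any script."""
    # str.isdigit() is Unicode-aware, so Devanagari digits are caught too.
    return not any(ch.isdigit() for ch in token)


for token in ["12th", "१२थ्", "Pratīpa", "Upamanyu"]:
    print(token, may_be_akshara(token))
```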
Standardization of text:
a. Bullet markers such as (a), (1), 6), etc. should be removed.
b. Shlokas are not being picked up correctly. A complete shloka should be picked up as a whole.
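Item a can be approximated with a pattern anchored at the start of a line. The sketch below covers only the marker shapes named above, "(a)", "(1)" and "6)", plus short parenthesised Roman numerals as an assumption. The name `strip_bullet` is hypothetical. Longer parentheticals such as references are left alone because the pattern only allows a single letter, a short numeral, or up to three digits inside the parentheses.

```python
import re

# Leading "(a)", "(1)", "(iv)" or "6)" followed by optional whitespace.
BULLET = re.compile(r"^\s*(?:\((?:\d{1,3}|[a-z]|[ivx]{1,4})\)|\d{1,3}\))\s*")


def strip_bullet(line: str) -> str:
    """Remove one leading bullet marker from a line, if present."""
    return BULLET.sub("", line, count=1)


for line in ["(a) He was a king.", "6) The sixth son.", "(Bhāgavata, 9th Skandha)"]:
    print(repr(strip_bullet(line)))
```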
No concept of paragraphs in the page: the book contains paragraphs, but they are not being picked up (see key "RATI").

OpenAI Batch IDs:

batch_687e9f63013481908320f4085d263a7f


yashbonde/artha
