See this Loom for a demo.
Discovered Challenges:
- Text cleanup instructions:
  a. ✅ Strings like "Himālayas" become "हिमालयस्", whereas they should be "हिमालय".
  b. Apostrophes: "Rāvaṇa's" becomes "रावणऽस्"; it should be "रावण 's" so that it still makes sense when read in English.
  c. Stray blanks can also cause issues: "श्रीदत्त ," should be "श्रीदत्त,".
  d. Page numbers leak into the text; for example, in the "SŪRYA" text, "[Page 772b]" should be removed.
  e. Clean up English text, e.g. "Viz. " to "which is ".
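Items b to e above could be handled with a small regex pass before transliteration. The following is a minimal sketch only: `clean_source_text` and the pattern names are hypothetical, not part of the existing pipeline, and the plural "s" case from item a is deliberately left out because it needs word-level knowledge.

```python
import re

# Hypothetical cleanup helper; names and patterns are assumptions for illustration.
PAGE_MARKER = re.compile(r"\s*\[Page\s+\d+[a-z]?\]")   # e.g. "[Page 772b]"
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,;])")          # e.g. "श्रीदत्त ," -> "श्रीदत्त,"
POSSESSIVE = re.compile(r"(\w)['’]s\b")                # e.g. "Rāvaṇa's" -> "Rāvaṇa 's"
ABBREVIATIONS = {"Viz. ": "which is "}                 # extend as more cases are found


def clean_source_text(text: str) -> str:
    """Apply the cleanup rules in a fixed order and return the cleaned text."""
    text = PAGE_MARKER.sub("", text)
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Detach the possessive so the name can be transliterated on its own.
    text = POSSESSIVE.sub(r"\1 's", text)
    text = SPACE_BEFORE_PUNCT.sub(r"\1", text)
    return text
```

Running the possessive rule before transliteration keeps the "'s" as plain English, which is the behaviour asked for in item b.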
- Reference structures can be polluted:
  a. In places, references are translated instead of being picked up as a whole.
  b. References can also be longer text such as "(See under the word AMṚTAM )", not just Shloka citations and the like. It is the job of the ingesting system to figure out the types.
  c. References are split incorrectly. For example, for the string "A king of the Yayāti dynasty. (Bhāgavata, 9th Skandha)." the predicted output is
     "L (", "A Bhāgavata", "T , 9th Skandha", "L )."
     whereas it should be "R ( Bhāgavata, 9th Skandha )".
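One way to avoid splitting a reference is to isolate each parenthesised span as a single unit before any token-level tagging runs. The sketch below assumes references are never nested; the labels `TEXT` and `PAREN` are placeholders, not the pipeline's own L/A/T/R tags. Deciding whether a `PAREN` span is really a reference is still left to the ingesting system.

```python
import re

# Parenthesised span with no nested parentheses (assumption: references do not nest).
PAREN_SPAN = re.compile(r"\([^()]*\)")


def split_reference_spans(text: str) -> list[tuple[str, str]]:
    """Split text into ("TEXT", ...) and ("PAREN", ...) segments, in order."""
    segments = []
    position = 0
    for match in PAREN_SPAN.finditer(text):
        if match.start() > position:
            segments.append(("TEXT", text[position:match.start()]))
        # Keep the whole parenthesised span together so it is never tagged piecewise.
        segments.append(("PAREN", match.group()))
        position = match.end()
    if position < len(text):
        segments.append(("TEXT", text[position:]))
    return segments
```

Not every parenthesised span is a reference, so a classification step after this split is still required.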
- Polluted keys in the database, for example:
  a. "parrot (parrot)", which is English.
  b. "śaṅkhaparvata (mountain)", which is partly English and becomes "शङ्खपर्वत (मोउन्तैन्)".
  c. Numbers can appear, e.g. "subāhu xii", which comes from "SUBĀHU XII".
  d. "SAṂSĀRA" is indexed as "saṂsāra" (the "Ṃ" is not lowercased), which means the transliteration is not working correctly.
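A key-normalisation step could separate the headword from its English qualifier and numeric suffix before transliteration. This is a rough sketch under assumptions: `normalize_key` and its return shape are hypothetical, and the Roman-numeral rule is a heuristic that could misfire on a genuine trailing word made only of the letters i, v, x, l, c, d, m. Python's built-in `str.lower()` is Unicode-aware, so it lowercases "Ṃ" as well.

```python
import re
import unicodedata

# Trailing "(...)" qualifier, e.g. "(mountain)".
QUALIFIER = re.compile(r"\s*\(([^()]*)\)\s*$")
# Trailing Roman numeral, e.g. " xii" (heuristic, applied after lowercasing).
ROMAN_SUFFIX = re.compile(r"\s+([ivxlcdm]+)$")


def normalize_key(raw: str) -> dict:
    """Return the lowercased headword plus any qualifier and ordinal found."""
    # NFC first so precomposed and decomposed diacritics compare equal.
    key = unicodedata.normalize("NFC", raw).strip().lower()
    qualifier = None
    match = QUALIFIER.search(key)
    if match:
        qualifier = match.group(1).strip()
        key = key[:match.start()]
    ordinal = None
    match = ROMAN_SUFFIX.search(key)
    if match:
        ordinal = match.group(1)
        key = key[:match.start()]
    return {"key": key, "qualifier": qualifier, "ordinal": ordinal}
```

Only the `key` field would then be transliterated, which keeps English qualifiers such as "mountain" out of the Devanagari output. Detecting keys that are entirely English, such as "parrot", is outside the scope of this sketch.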
- Missing Akshara:
  a. Strings like "T . Pratiśravas had a son named Pratīpa. " contain two Akshara tokens.
  b. In key "ŚĀNTANAVA", the string "T . He has written a book called 'Phiṭsūtra' about the " contains an Akshara token.
  c. The absence of IAST tokens in a string does not mean it is not an Akshara, e.g. "Upamanyu", "Brahman".
  d. "A sub Parva in Vana Parva of महाभारत comprising of chapters 165 to 175 ." should have Parva and Vana Parva as Akshara tokens.
- Incorrect things getting tagged as Akshara:
  a. In key "ARUPATTIMŪVAR", in the text "(The sixty-three's)", the section "The sixty-three's" is being tagged as Akshara.
  b. Numbers should not be tagged as Akshara, e.g. " १२थ् century." is wrong.
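For item b, a cheap pre-filter could reject any token that contains a digit before it is considered for Akshara tagging. The helper name `is_akshara_candidate` is hypothetical. The check only rules tokens out; it does not decide whether a remaining word really is an Akshara.

```python
def is_akshara_candidate(token: str) -> bool:
    """Return False for empty tokens and tokens containing any digit.

    str.isdigit() is Unicode-aware, so it covers ASCII digits ("12th")
    as well as Devanagari digits ("१२थ्").
    """
    return bool(token) and not any(character.isdigit() for character in token)
```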
- Standardization of text:
  a. Bullet markers such as (a), (1), 6), etc. should be removed.
  b. Shlokas are not being picked up correctly. A complete shloka should be picked up as a whole.
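The bullet markers in item a could be stripped with a single anchored pattern. A minimal sketch, assuming markers only need to be removed at the start of a line; `strip_list_marker` is a hypothetical name, and the pattern covers only the three shapes listed above plus Roman numerals in parentheses.

```python
import re

# Leading "(a)", "(1)", "(iv)" or "6)" followed by whitespace.
LIST_MARKER = re.compile(
    r"^\s*(?:\(\s*(?:[a-z]|\d{1,2}|[ivx]+)\s*\)|\d{1,2}\))\s+",
    re.IGNORECASE,
)


def strip_list_marker(line: str) -> str:
    """Remove one leading list marker, leaving other parentheses untouched."""
    return LIST_MARKER.sub("", line, count=1)
```

Because the pattern requires a single letter, a short number, or a Roman numeral inside the parentheses, longer parenthesised text such as a reference is left alone.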
- No concept of paragraphs in the page. The book contains paragraphs, but they are not being picked up; see key "RATI".
OpenAI Batch IDs:
- batch_687e9f63013481908320f4085d263a7f