What benchmark was used for the evaluation against alternative models? Please provide details on the dataset, evaluation criteria, and methodology.

<img width="852" alt="Image" src="https://github.com/user-attachments/assets/8116cae2-1117-4e8d-bbf7-6fdab24cf094" />