Description
Describe the solution you'd like
In issue #1, we evaluated several Python libraries for comparing markdown (MD) files against a "golden" MD, with the goal of measuring accuracy and finding the best tool for converting PDFs to MD. Those libraries compare sentences or paragraphs and use a diff to calculate a similarity percentage. The problem with this approach is that different tools convert the same PDF to MD with different formatting, even when the content is identical. Formatting differences therefore lower the score and skew the accuracy measured against the golden MD. We need a comparison method that measures content similarity rather than formatting differences.
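To illustrate the problem, here is a minimal sketch using Python's standard `difflib` as a stand-in for the diff-based libraries from issue #1 (which library was actually used is not stated here, so this is an assumption). The helper name `diff_similarity` and the two sample strings are made up for illustration. Both strings carry the same content, but one uses an ATX heading and bold markers while the other uses a Setext heading and no emphasis.

```python
import difflib


def diff_similarity(a, b):
    """Return difflib's character-level similarity ratio, between 0 and 1."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Same content, different markdown formatting (hypothetical sample data).
golden = "# Results\n\nThe model reached **95%** accuracy.\n"
converted = "Results\n=======\n\nThe model reached 95% accuracy.\n"

score = diff_similarity(golden, converted)
# The score comes out below 1.0 even though the wording is identical.
print(f"diff similarity: {score:.2f}")
```

The shortfall from 1.0 comes entirely from the heading syntax and the bold markers, which is the skew described above.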
Additional context
Based on discussion, we can break the MD content into paragraphs, convert each paragraph into an embedding, and treat each embedding as a point in a vector space. Once both the golden MD and the converted MD are embedded, we calculate the distance between their points (for example, cosine distance) to measure how similar they are.
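The proposed pipeline could be sketched roughly as follows. This is only an illustration under several assumptions: `embed` is a toy hashed bag-of-words function standing in for a real sentence-embedding model, paragraphs are split on blank lines, cosine similarity is used as the distance measure, and each golden paragraph is matched to its closest converted paragraph. None of these choices come from the discussion itself, and all function names are hypothetical.

```python
import hashlib
import math
import re

DIM = 256  # size of the toy embedding vectors (arbitrary assumption)


def split_paragraphs(md_text):
    """Split markdown text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", md_text) if p.strip()]


def embed(paragraph):
    """Toy stand-in for a real embedding model (assumption).

    Hashes each lowercase alphanumeric token into one of DIM buckets and
    L2-normalises the counts. Swap in a real model for actual evaluation.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", paragraph.lower()):
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(x * y for x, y in zip(a, b))


def md_similarity(golden_md, converted_md):
    """Average, over golden paragraphs, of the best match in the converted MD."""
    golden = [embed(p) for p in split_paragraphs(golden_md)]
    converted = [embed(p) for p in split_paragraphs(converted_md)]
    if not golden or not converted:
        return 0.0
    best = [max(cosine(g, c) for c in converted) for g in golden]
    return sum(best) / len(best)


# Hypothetical sample data: same content, different formatting.
golden_md = "# Results\n\nThe model reached **95%** accuracy."
converted_md = "Results\n=======\n\nThe model reached 95% accuracy."
print(f"content similarity: {md_similarity(golden_md, converted_md):.2f}")
```

Note that the toy `embed` ignores markdown punctuation only because its tokenizer drops non-alphanumeric characters. A real embedding model would be expected to capture meaning rather than just shared words, and whether to match paragraphs by position or by nearest neighbour is a design choice still open for discussion.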