Smarter conversion accuracy

**Describe the solution you'd like**

In issue #1, we evaluated a few Python libraries that compare a converted markdown (MD) file against a "golden" MD, with the goal of measuring accuracy and finding the best tool for converting PDFs to MDs. Those libraries compare sentences or paragraphs and use a diff to calculate a similarity percentage. The problem with this approach is that different tools convert the same PDF to MD with different formatting, even when the content is identical, and a diff counts those formatting differences as mismatches. This skews the accuracy score against the golden MD. So we need a comparison method that focuses on content similarity rather than formatting differences.
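To make the problem concrete, here is a minimal sketch using Python's standard `difflib` as a stand-in for the libraries from issue #1 (which ones were tested is not restated here, so this is an assumption). The two sample documents and the helper name `line_diff_similarity` are made up for illustration. Both documents carry the same content, but the heading and bold syntax differ.

```python
import difflib

# Hypothetical sample data: same content, different markdown syntax.
golden = "# Results\n\nThe model reached **95%** accuracy.\n"
converted = "Results\n=======\n\nThe model reached __95%__ accuracy.\n"


def line_diff_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on a line-by-line diff."""
    matcher = difflib.SequenceMatcher(None, a.splitlines(), b.splitlines())
    return matcher.ratio()


score = line_diff_similarity(golden, converted)
print(f"line-diff similarity: {score:.2f}")
```

The score comes out well below 1.0 even though a reader would consider the two files equivalent, which is the skew described above.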


**Additional context**
Based on discussion, we can split the MD content into paragraphs, convert each paragraph into an embedding, and place it in a vector space. Once both the golden MD and the converted MD are embedded, we calculate the distance between their vectors to measure how similar the content is.
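A rough sketch of that pipeline, under several assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model (it only keeps the code dependency-free), cosine similarity is used as the distance measure, and each golden paragraph is matched to its closest converted paragraph before averaging. None of these choices were decided in the discussion; all function names and sample strings are hypothetical.

```python
import math
import re
import zlib


def split_paragraphs(md: str) -> list[str]:
    """Split markdown text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", md) if p.strip()]


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Stand-in embedding: hashed word counts. Swap in a real model here."""
    vec = [0.0] * dim
    # Keeping only alphanumeric tokens drops markdown markers such as #, *, _.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_similarity(golden_md: str, converted_md: str, embed=toy_embed) -> float:
    """Average, over golden paragraphs, of the best match in the converted MD."""
    golden_vecs = [embed(p) for p in split_paragraphs(golden_md)]
    converted_vecs = [embed(p) for p in split_paragraphs(converted_md)]
    if not golden_vecs or not converted_vecs:
        return 0.0
    best = [max(cosine(g, c) for c in converted_vecs) for g in golden_vecs]
    return sum(best) / len(best)


# Hypothetical sample data: same content with different formatting,
# and an unrelated document for contrast.
golden = "# Results\n\nThe model reached **95%** accuracy.\n"
converted = "Results\n=======\n\nThe model reached __95%__ accuracy.\n"
unrelated = "Bananas are yellow.\n"

same_score = content_similarity(golden, converted)
other_score = content_similarity(golden, unrelated)
print(f"reformatted: {same_score:.2f}, unrelated: {other_score:.2f}")
```

Because `embed` is a parameter, a real embedding model can be passed in without changing the rest of the scoring. Note that this best-match average only checks that golden content appears in the converted MD; it does not penalize extra or reordered paragraphs, which may or may not be what we want.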






Smarter conversion accuracy #3
