Is it possible to fine tune with our own datasets? #413

ninedesu · 2024-11-22T02:01:18Z

I want to know if we can use our own dataset to fine-tune the OCR.

PeterStaar-IBM · 2024-11-22T05:03:53Z

Maintainer

@ninedesu This is an excellent question, and yes, we plan to build a community where people can contribute data for fine-tuning. At the moment, we are gathering all our internal and external datasets (e.g., https://huggingface.co/datasets/ds4sd/DocLayNet) and preparing them so we can share them all on Hugging Face!

With regard to OCR, we still have a bit of work to do and are currently relying on third-party OCR.

4 replies

bit-scientist Dec 13, 2024

@PeterStaar-IBM, is there any update on custom training guidelines?

MengFoong Feb 6, 2025

@PeterStaar-IBM I would be glad to hear any updates.

PeterStaar-IBM Feb 6, 2025
Maintainer

We just released the first test dataset (https://huggingface.co/datasets/ds4sd/docling-dpbench). More will come soon.

FrankFacundo Apr 11, 2025

@PeterStaar-IBM I made some scripts to fine-tune the layout model. Which directory could I add them to, if possible?
