DataCatalogue

GitHub organization for the research project DataCatalogue (Inria - BnF - INHA).

📜 Presentation of the Project

DataCatalogue is a research project jointly led by Inria Paris' ALMAnaCH research team, the Bibliothèque nationale de France (BnF), and the Institut national d'histoire de l'art (INHA). It is funded by Inria and the French Ministry of Culture. After an experimental phase in 2021-2022, the project has been renewed for a second phase in 2023-2024.

Our corpus consists of a sample of the sales catalogs from the collections of the BnF and the INHA (over 280,000 documents in total). The 713 catalogs in our sampled corpus are representative in terms of time periods (18th to 21st centuries) and types of sales (numismatics, books, antiquities, works of art, furniture, etc.). The vast majority of the catalogs are in French, but some are in English or German. We aim to design a complete and mostly automated workflow for processing sales catalogs, from their digitization to their online publication as augmented documents that can be queried like a database.

🐈 The DataCatalogue Pipeline

[DIAGRAM COMING SOON]

📂 Repositories

  • .github → README for the DataCatalogue GitHub organization
  • datacat-object-detection-dataset → Development of the object detection model with YOLOv8
  • datacat-tei → TEI customization for sales catalogs
  • extraction-internship → Internship on information extraction with GROBID (Abdel Farhi, 2022)
  • grobid-datacat → GROBID module for catalogs
  • grobid-datacat-TrainingData → Training datasets for the GROBID "catalogues" module
  • publication-internship → Internship on publication with TEI Publisher (Jules Nuguet, 2022)

📝 Bibliography

  • Hugo Scheithauer, Sarah Bénière, Jean-Philippe Moreux, & Laurent Romary. (2023, November 29). DataCatalogue : rétro-structuration automatique des catalogues de vente. Webinaire Culture Inria. https://hal.science/hal-04360229.
  • Thibault Clérice, Juliette Janès, Hugo Scheithauer, Sarah Bénière, Laurent Romary, & Benoît Sagot. (2024, August 6-9). Layout Analysis Dataset with SegmOnto. DH 2024 - Annual Conference of the Alliance of Digital Humanities Organizations, Washington, D.C., United States. https://inria.hal.science/hal-04513725.
  • Hugo Scheithauer, Sarah Bénière, & Laurent Romary. (2024, August 6-9). Automatic Retro-Structuration of Auction Sales Catalogs Layout and Content. DH 2024 - Annual Conference of the Alliance of Digital Humanities Organizations, Washington, D.C., United States. https://hal.science/hal-04547239.

🖌️ Credits

Logo by Alix Chagué, inspired by Loading Artist.
