Extend `PdfContentImporter` to extract information from bibliographical pages in books

**Is your suggestion for improvement related to a problem? Please describe.**

Might be a possible GSoC project!

Many books have a special page with a lot of bibliographical and publishing information (typically the second or third page, after any blank ones; sometimes one of the last pages).

What if JabRef could extract information from these pages? After all, this is the purpose of such pages - to contain bibliographical information.

Of course, many of them are different, and it's hard (impossible) to make a universal extraction algorithm. But! This project would be very beneficial to the Ukrainian community (and others)!

In Ukraine, each book has a special page with TONS of bibliographical information, and it's highly standardized! We also include a **full** citation in our **single** standard. The citation is typically followed by an abstract. I attached a screenshot in *Additional context*.

**Describe the solution you'd like**

One could improve `PdfContentImporter` to extract information from these pages, as they are rich in bibliographic metadata.
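To make the idea a bit more concrete, here is a rough sketch of how such a page could be located and mined. It is not JabRef's actual API: the class name, method names, scoring weights, and regular expressions are all hypothetical, and it assumes the text of each PDF page has already been extracted into a `List<String>` (one entry per page). It only looks at pages near the front and the end of the book, scores them by typical markers (ISBN, copyright year), and pulls out two fields.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JabRef. Works on already-extracted page text.
final class BibliographicPageSketch {

    // Assumed marker: "ISBN" followed by 10 or 13 digits with optional hyphens/spaces.
    private static final Pattern ISBN =
            Pattern.compile("ISBN(?:-1[03])?:?\\s*([0-9][0-9Xx\\- ]{8,16}[0-9Xx])");

    // Assumed marker: a copyright sign or word followed shortly by a four-digit year.
    private static final Pattern COPYRIGHT_YEAR =
            Pattern.compile("(?:©|\\(c\\)|Copyright)\\D{0,20}((?:19|20)\\d{2})",
                    Pattern.CASE_INSENSITIVE);

    // Arbitrary weights: an ISBN is treated as a stronger hint than a copyright line.
    static int score(String pageText) {
        int score = 0;
        if (ISBN.matcher(pageText).find()) {
            score += 2;
        }
        if (COPYRIGHT_YEAR.matcher(pageText).find()) {
            score += 1;
        }
        return score;
    }

    // Considers only the first five and the last two pages; returns the best-scoring one, if any.
    static Optional<String> findBibliographicPage(List<String> pages) {
        String best = null;
        int bestScore = 0;
        int n = pages.size();
        for (int i = 0; i < n; i++) {
            boolean nearFront = i < 5;
            boolean nearEnd = i >= n - 2;
            if (!nearFront && !nearEnd) {
                continue;
            }
            int s = score(pages.get(i));
            if (s > bestScore) {
                bestScore = s;
                best = pages.get(i);
            }
        }
        return Optional.ofNullable(best);
    }

    // Returns the ISBN with hyphens and spaces removed.
    static Optional<String> extractIsbn(String pageText) {
        Matcher m = ISBN.matcher(pageText);
        return m.find() ? Optional.of(m.group(1).replaceAll("[\\s-]", "")) : Optional.empty();
    }

    static Optional<String> extractYear(String pageText) {
        Matcher m = COPYRIGHT_YEAR.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample pages for illustration only.
        List<String> pages = List.of(
                "Some Book Title",
                "",
                "Copyright © 2010 Example Press\nISBN 978-3-16-148410-0\n",
                "Chapter 1");
        findBibliographicPage(pages).ifPresent(page -> {
            System.out.println("ISBN: " + extractIsbn(page).orElse("?"));
            System.out.println("Year: " + extractYear(page).orElse("?"));
        });
    }
}
```

A real implementation would need per-country (or per-publisher) rules on top of this, e.g. a parser for the standardized Ukrainian citation block, which this sketch does not attempt.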

**Additional context**

- Ukraine - "Collection of physics problems for 8th grade":

![Image](https://github.com/user-attachments/assets/e8a72141-d07c-4994-879e-26670f66f16f)

Yes, every book in Ukraine has this 😄. Well, IDK about fiction or modern literature, but scientific literature is like this.

(*This book is a classic. My generation and several before/after have done problems from this book*. Abstract available online too.)

- Pearson - "Artificial Intelligence: A Modern Approach" 3rd ed.:

![Image](https://github.com/user-attachments/assets/2050d669-d6af-476c-9aa4-4d86135f90de)

Not much information here.

- O'Reilly - "Natural Language Processing with Python":

![Image](https://github.com/user-attachments/assets/5f37bd55-9704-4586-af72-5757842fe8b8)

Not much information here either.

- "The formal semantics of programming languages: an introduction":

![Image](https://github.com/user-attachments/assets/3b882298-03b5-4559-a00e-a0514b0845dc)

One of the few where there is a citation. However, there are many foreign citation styles. In Ukraine, there is only a single one, so it's simpler to improve `PdfContentImporter`.
