
Extend PdfContentImporter to extract information from bibliographical pages in books #12874

@InAnYan

Is your suggestion for improvement related to a problem? Please describe.

Might be a possible GSoC project!

Many books have a special page with a lot of bibliographical and publishing information (typically the second or third page, after the blank ones; sometimes one of the last pages).

What if JabRef could extract information from these pages? After all, this is the purpose of such pages - to contain bibliographical information.

Of course, these pages differ a lot from book to book, and it's hard (maybe impossible) to make a universal extraction algorithm. But! This project would be very beneficial to the Ukrainian community (and others)!

In Ukraine, each book has a special page with TONS of bibliographical information, and it's highly standardized! It also includes a full citation in our single national standard, and the citation is typically followed by an abstract. I attached a screenshot in Additional context.

Describe the solution you'd like

One could improve PdfContentImporter to extract information from these pages, since they are so rich in bibliographical data.
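To make the idea a bit more concrete, here is a minimal sketch of what such extraction could look like once the text of the page is available as a string. It is not JabRef code: the class `ImprintPageSketch`, the method `extractFields`, and both regular expressions are assumptions made up for illustration, and it only covers two fields (ISBN and copyright year). A real extension of PdfContentImporter would need country-specific patterns, for example for the standardized Ukrainian page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of JabRef.
final class ImprintPageSketch {

    // The literal "ISBN", an optional "-10"/"-13" suffix and colon, then an
    // ISBN-10 or ISBN-13 whose groups may be separated by hyphens or spaces.
    // The check digit is NOT validated here.
    private static final Pattern ISBN = Pattern.compile(
            "ISBN(?:-1[03])?:?\\s*"
            + "((?:97[89][- ]?)?\\d{1,5}[- ]?\\d{1,7}[- ]?\\d{1,7}[- ]?[\\dXx])");

    // A copyright marker (the sign U+00A9, "(c)", or the word "Copyright")
    // followed within a few non-digit characters by a four-digit year.
    private static final Pattern YEAR = Pattern.compile(
            "(?:\u00A9|\\(c\\)|Copyright)\\D{0,20}((?:1[5-9]|20)\\d{2})",
            Pattern.CASE_INSENSITIVE);

    private ImprintPageSketch() {
    }

    // Returns the fields that could be found; missing fields are simply absent.
    static Map<String, String> extractFields(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();

        Matcher isbn = ISBN.matcher(pageText);
        if (isbn.find()) {
            fields.put("isbn", isbn.group(1));
        }

        Matcher year = YEAR.matcher(pageText);
        if (year.find()) {
            fields.put("year", year.group(1));
        }

        return fields;
    }
}
```

Calling `ImprintPageSketch.extractFields(text)` on the text of a publishing page would return a map with whatever was recognized, which an importer could then merge into the entry it already builds from the rest of the PDF.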

Additional context

  • Ukraine - "Collection of physics problems for 8th grade":

Image

Yes, every book in Ukraine has this 😄. Well, IDK about fiction or modern literature, but scientific literature is like this.

(This book is a classic. My generation and several before and after it have solved problems from this book. The abstract is available online too.)

  • Pearson - "Artificial Intelligence: A Modern Approach" 3rd ed.:

Image

Not much information here.

  • O'Reilly - "Natural Language Processing with Python":

Image

Not much information here either.

  • "The formal semantics of programming languages: an introduction":

Image

One of the few that includes a citation. However, there are many different citation styles abroad. In Ukraine there is only a single one, so it's simpler to improve PdfContentImporter for that case.
