Doc to Mongo

A simple Spring Boot app that lets you upload a document file (e.g. PDF, Word). The backend then extracts the data and stores it in MongoDB.

📋 Details

This is a small project that extracts data from documents such as PDF, Word and txt files, and makes it queryable and searchable in MongoDB. When combined with Atlas Search, you can perform Lucene-backed full-text search on the documents.

Since 0.2, it is possible to extract entities from the documents.
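To give a feel for how extracted entities could be queried, here is a minimal sketch. The v0.2 notes describe person, location and organisation entities stored as an embedded object of arrays, but the actual field names are not documented here, so the `entities.person`, `entities.location` and `entities.organisation` names, the sample document and the `entityFilter` helper are all assumptions for illustration. Note that entity extraction is off by default since v0.3.

```javascript
// Hypothetical document shape: the 'entities' field and its keys are
// assumptions, not taken from the service's code.
const exampleDoc = {
  filename: 'uploaded-docs/report.txt', // made-up filename
  content: 'Ada Lovelace visited London with the Royal Society',
  entities: {
    person: ['Ada Lovelace'],
    location: ['London'],
    organisation: ['Royal Society']
  }
};

// Build a find() filter that matches documents mentioning the given entity.
// Matching a scalar against an array field matches any element of the array.
function entityFilter(type, value) {
  const allowed = ['person', 'location', 'organisation'];
  if (!allowed.includes(type)) {
    throw new Error(`Unknown entity type: ${type}`);
  }
  return { [`entities.${type}`]: value };
}

// In mongosh (with entity extraction enabled):
//   db.docs.find(entityFilter('person', 'Ada Lovelace'))
```

If the real field names differ, only the key template inside `entityFilter` needs to change.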

NB: this is designed as an example of what you can do with MongoDB, Atlas etc. It is not a production system, and it has not been tested against every document type or for performance.

📦 Installation

Requirements

Java 11
Maven 3
A MongoDB Atlas instance

Project Structure

This repo contains 2 main folders:

doc-to-mongo-service - a Spring Boot Java application that lets you upload a file, extracts information from it (via Apache Tika) and then stores the result in a MongoDB instance
documentation - supporting documentation for the project

💻 How to use

Make sure you have a MongoDB Atlas cluster running. Atlas is not strictly required, as you can run MongoDB locally, but you won't then be able to use the Atlas Search features.
Start the doc-to-mongo-service instance. See the project README for more details.
Navigate to localhost:9090 (the default port when running locally) and select a PDF or txt document to upload.

Upload the document. If the upload succeeds, you should get back the committed ObjectId.
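If you would rather script the upload than use the web page, something along these lines may work. This is a sketch only: the `/upload` path and the `file` form field name are assumptions (the real endpoint is defined in the service code), and it relies on the global `fetch`, `FormData` and `Blob` available in Node.js 18 or later.

```javascript
// Assumptions (not from the project): the '/upload' path and the 'file'
// form field name are hypothetical; check the service for the real ones.

// Wrap raw file bytes in a multipart form under the assumed 'file' field.
function buildUploadForm(bytes, filename) {
  const form = new FormData();
  form.append('file', new Blob([bytes]), filename);
  return form;
}

// POST the form to the service and return the response body as text.
async function uploadDocument(form, baseUrl = 'http://localhost:9090') {
  const response = await fetch(`${baseUrl}/upload`, { method: 'POST', body: form });
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
  return response.text();
}

// Example (needs the service running):
//   const bytes = require('fs').readFileSync('sample1.txt');
//   uploadDocument(buildUploadForm(bytes, 'sample1.txt')).then(console.log);
```

Building the form separately from sending it keeps the request easy to inspect before anything goes over the network.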

Navigate to your Atlas cluster via Compass, the Atlas UI or simply mongosh
Run queries against your documents

If you have enabled Atlas Search on the documents, you can run full-text search on your uploaded documents, e.g.:

db.docs.aggregate([
  {
    $search: {
      index: 'default',
      text: {
        query: 'sample',
        path: 'content'
      }
    }
  }
])

This returns:

{ 
  _id: ObjectId("6233278f9cef7f7c64f2d317"),
  filename: 'uploaded-docs/sample1.txt',
  timestamp: 2022-03-17T12:20:31.506Z,
  content: 'this is the first sample doc'
}
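The same search can be wrapped in a small helper so that the query text, index and result count are easy to vary. This is a sketch rather than part of the project: `buildTextSearch` and its option names are made up here, while `default` and `content` come from the example above. It also projects the relevance score via `$meta: 'searchScore'`.

```javascript
// Build an Atlas Search pipeline for a plain text query.
// 'buildTextSearch' and its options are illustrative names, not project code.
function buildTextSearch(query, options = {}) {
  const { index = 'default', path = 'content', limit = 10 } = options;
  return [
    // Full-text match on the chosen field using the chosen search index.
    { $search: { index, text: { query, path } } },
    // Cap the number of results returned.
    { $limit: limit },
    // Return the stored fields plus the relevance score of each hit.
    {
      $project: {
        filename: 1,
        timestamp: 1,
        content: 1,
        score: { $meta: 'searchScore' }
      }
    }
  ];
}

// In mongosh:
//   db.docs.aggregate(buildTextSearch('sample'))
//   db.docs.aggregate(buildTextSearch('sample', { limit: 5 }))
```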

Run more complicated queries as required
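As one example of a more involved query, the sketch below combines the text match with a date filter using the Atlas Search `compound` and `range` operators. It assumes `timestamp` is stored as a BSON date and is covered by the search index (for instance through dynamic mappings); `buildFilteredSearch` is a made-up helper name.

```javascript
// Text search on 'content', restricted to documents uploaded in [from, to).
// Assumes 'timestamp' is a BSON date that the search index covers.
function buildFilteredSearch(query, from, to) {
  return [
    {
      $search: {
        index: 'default',
        compound: {
          // 'must' clauses contribute to the relevance score.
          must: [{ text: { query, path: 'content' } }],
          // 'filter' clauses restrict results without affecting the score.
          filter: [{ range: { path: 'timestamp', gte: from, lt: to } }]
        }
      }
    },
    { $limit: 10 }
  ];
}

// In mongosh:
//   db.docs.aggregate(buildFilteredSearch(
//     'sample',
//     new Date('2022-03-17T00:00:00Z'),
//     new Date('2022-03-18T00:00:00Z')
//   ))
```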

Changelog

v0.3 (2022-04-03)

Improve README with clearer details on how to run the application
Remove the character limit when parsing large files
Make entity extraction optional - it is off by default
Fix broken tests

v0.2 (2022-03-17)

Perform named entity extraction on documents using the Apache Tika OpenNLP libraries. For now, we extract person, location and organisation entities from documents and add them as an embedded object of arrays.

v0.1 (2022-03-16)

Initial version. Uploads files to a MongoDB instance, where they can be searched via Atlas Search. Text is extracted using Apache Tika.
