Gleaner

NOTE: This repo is archived, and all Gleaner work has been moved into one central repo for both Nabu and Gleaner.

This repo is a heavily modified fork of the gleanerio/gleaner project, used under the Apache 2.0 license. It has been modified to be easier to test for the Geoconnex project.

About

Gleaner is a tool for extracting JSON-LD from web pages. You provide Gleaner a list of sites to index, and it accesses and retrieves pages based on the sitemap.xml of each domain. Gleaner can then check documents for well-formed and valid structure. The products of Gleaner runs can then be used to build Knowledge Graphs, Full-Text Indexes, Semantic Indexes, Spatial Indexes, or other products that drive discovery and use.
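The harvesting steps described above can be illustrated with a minimal, standard-library-only sketch. It is not Gleaner's actual implementation: the names `sitemapURLs` and `extractJSONLD` are hypothetical, and the regular expression stands in for proper HTML parsing to keep the example dependency-free.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
	"regexp"
	"strings"
)

// urlSet mirrors the <urlset><url><loc> layout of a sitemap.xml file.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// sitemapURLs (hypothetical helper) returns every <loc> entry in a sitemap.
func sitemapURLs(sitemap []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(sitemap, &set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		locs = append(locs, strings.TrimSpace(u.Loc))
	}
	return locs, nil
}

// jsonldScript matches <script type="application/ld+json"> blocks.
// Assumption: a regexp is good enough for a sketch; a real harvester
// would more likely walk a parsed HTML tree.
var jsonldScript = regexp.MustCompile(
	`(?is)<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>`)

// extractJSONLD (hypothetical helper) returns the embedded JSON-LD blocks
// that are at least syntactically valid JSON.
func extractJSONLD(page string) []string {
	var docs []string
	for _, m := range jsonldScript.FindAllStringSubmatch(page, -1) {
		body := strings.TrimSpace(m[1])
		if json.Valid([]byte(body)) { // skip blocks that are not valid JSON
			docs = append(docs, body)
		}
	}
	return docs
}

func main() {
	sitemap := []byte(`<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
</urlset>`)
	urls, err := sitemapURLs(sitemap)
	if err != nil {
		panic(err)
	}
	page := `<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example"}
</script></head></html>`
	fmt.Println(urls, len(extractJSONLD(page)))
}
```

Note that `json.Valid` only checks that a block is well-formed JSON. Checking that it is valid JSON-LD would need a JSON-LD processor, which this sketch leaves out.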

The image below gives an overview of the basic workflow of Gleaner.

This image shows that the product of Gleaner is really a populated data warehouse (document warehouse), where the documents are either the harvested JSON-LD structured data documents or the provenance graphs generated by Gleaner during harvesting.

Gleaner talks to an S3-compliant object store as part of its configuration. This can be AWS S3, Google Cloud Storage (GCS), or another S3-compliant object store. A typical setup might use the open source Minio package in this role.
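To make the object-store role concrete, here is a small sketch of writing harvested documents to a bucket-like store. Everything in it is an assumption for illustration: the `ObjectStore` interface, the in-memory `memStore` stand-in, and the content-hash key layout are hypothetical and are not taken from Gleaner's code or from any S3 client library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ObjectStore is a hypothetical, minimal view of an S3-style bucket.
type ObjectStore interface {
	Put(key string, body []byte) error
}

// memStore is an in-memory stand-in for a real bucket, used only so the
// sketch runs without a network or credentials.
type memStore map[string][]byte

// Put stores a copy of body under key, overwriting any previous object.
func (m memStore) Put(key string, body []byte) error {
	m[key] = append([]byte(nil), body...)
	return nil
}

// objectKey derives a key from the source name and the document's SHA-256
// hash. The "<source>/<hash>.jsonld" layout is an assumption.
func objectKey(source string, doc []byte) string {
	sum := sha256.Sum256(doc)
	return source + "/" + hex.EncodeToString(sum[:]) + ".jsonld"
}

// storeDoc writes one harvested document and returns the key it used.
func storeDoc(s ObjectStore, source string, doc []byte) (string, error) {
	key := objectKey(source, doc)
	return key, s.Put(key, doc)
}

func main() {
	store := memStore{}
	key, err := storeDoc(store, "example", []byte(`{"@type":"Dataset"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(key, len(store))
}
```

Keying by content hash means that harvesting the same document twice overwrites one object instead of creating a duplicate. That is a design choice of this sketch, not a documented Gleaner behavior.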

Note also the use of headless Chrome in this diagram. A headless Chrome instance is used when the resources to be harvested place their JSON-LD documents into the document object model (DOM) dynamically. In that case, headless Chrome renders the page and runs the JavaScript to produce the rendered HTML document, which can then be parsed for the JSON-LD.
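One way to picture the headless step is as a fallback: use the static HTML when it already carries JSON-LD, and render the page only when it does not. The sketch below assumes that strategy. Whether Gleaner chooses per page or per source is not stated here, and the `Renderer` type, `hasJSONLD`, and `pageForParsing` are hypothetical names. The renderer is a plain function so that no browser is needed to run the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Renderer is a hypothetical hook that returns a page's HTML after its
// JavaScript has run, for example by driving a headless Chrome instance.
type Renderer func(url string) (string, error)

// hasJSONLD reports whether the HTML mentions a JSON-LD script type.
// This is a cheap textual check, not a full parse.
func hasJSONLD(html string) bool {
	return strings.Contains(strings.ToLower(html), "application/ld+json")
}

// pageForParsing returns the static HTML when it already contains JSON-LD
// and otherwise falls back to the rendered document.
func pageForParsing(url, static string, render Renderer) (string, error) {
	if hasJSONLD(static) {
		return static, nil
	}
	return render(url)
}

func main() {
	// fakeRender simulates a page whose JSON-LD is injected by JavaScript.
	fakeRender := func(url string) (string, error) {
		return `<script type="application/ld+json">{"@type":"Dataset"}</script>`, nil
	}
	html, err := pageForParsing("https://example.org/app", `<div id="app"></div>`, fakeRender)
	if err != nil {
		panic(err)
	}
	fmt.Println(hasJSONLD(html))
}
```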
