Indexify Extractors

Overview

Extractors are modules that give Indexify data processing capabilities, such as metadata or embedding extraction from documents, videos, and audio. This repository hosts a collection of extractors for Indexify.

For the main Indexify project, visit: Indexify Main Repository.

Available Extractors

We have built some extractors based on demand from our users. You can also write a new or custom extractor for your use case; instructions for writing new extractors are below.

Usage

Install

pip install indexify-extractor-sdk

List Available Extractors

indexify-extractor list

Download an Extractor

Find the name of the extractor you want.

indexify-extractor download hub://embedding/minilm-l6

Load and Run in Notebook or Python Applications

from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("minilm-l6.minilm_l6:MiniLML6Extractor")
content = Content.from_text("hello world")
out = extractor.extract(content)

Extractors can be parameterized when they are called. The input parameters are Pydantic models. Inspect the config class programmatically or in the docs of the corresponding extractor:

ex, config = load_extractor("chunking.chunk_extractor:ChunkExtractor")
config.schema()
#{'properties': {'overlap': {'default': 0, 'title': 'Overlap', 'type': 'integer'}, 'chunk_size': {'default': 100, 'title': 'Chunk Size', 'type': 'integer'}, 'text_splitter': {'default': 'recursive', 'enum': ['char', 'recursive', 'markdown', 'html'], 'title': 'Text Splitter', 'type': 'string'}, 'headers_to_split_on': {'default': [], 'items': {'type': 'string'}, 'title': 'Headers To Split On', 'type': 'array'}}, 'title': 'ChunkExtractionInputParams', 'type': 'object'}
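The schema is a plain dictionary, so parameter names, types, and defaults can be listed with a few lines of Python. The following is a minimal sketch: the `schema` literal is copied from the output above, and `summarize_params` is a hypothetical helper for illustration, not part of the SDK.

```python
# Schema copied from the config.schema() output shown above.
schema = {
    "properties": {
        "overlap": {"default": 0, "title": "Overlap", "type": "integer"},
        "chunk_size": {"default": 100, "title": "Chunk Size", "type": "integer"},
        "text_splitter": {
            "default": "recursive",
            "enum": ["char", "recursive", "markdown", "html"],
            "title": "Text Splitter",
            "type": "string",
        },
        "headers_to_split_on": {
            "default": [],
            "items": {"type": "string"},
            "title": "Headers To Split On",
            "type": "array",
        },
    },
    "title": "ChunkExtractionInputParams",
    "type": "object",
}


def summarize_params(schema: dict) -> dict:
    """Map each parameter name to a (type, default, allowed values) triple.

    Hypothetical helper for illustration; not part of indexify_extractor_sdk.
    """
    summary = {}
    for name, spec in schema.get("properties", {}).items():
        summary[name] = (spec.get("type"), spec.get("default"), spec.get("enum"))
    return summary


# Print one line per parameter, listing allowed values where the schema has an enum.
for name, (type_, default, choices) in summarize_params(schema).items():
    suffix = f", one of {choices}" if choices else ""
    print(f"{name}: {type_}, default={default!r}{suffix}")
```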

Extract Locally from the Shell

indexify-extractor run-local minilm_l6:MiniLML6Extractor --text "hello world"  # or --file

Run Extractors as a Service for Continuous Extraction and Indexing with Indexify Server

To run the extractor with Indexify's control plane so that it can continuously extract from content:

indexify-extractor join-server --coordinator-addr localhost:8950 --ingestion-addr localhost:8900

The coordinator-addr and ingestion-addr values above are the default addresses exposed by the Indexify server for fetching extraction instructions and uploading extracted data. They can be changed in the server configuration.
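Before joining, it can help to confirm that both addresses accept TCP connections. The sketch below uses only the Python standard library. `parse_addr` and `is_reachable` are hypothetical helper names, and the addresses are the defaults mentioned above; a successful TCP connection only shows that something is listening, not that it is an Indexify server.

```python
import socket


def parse_addr(addr: str) -> tuple:
    """Split a 'host:port' string into a (host, port) pair."""
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return host, int(port)


def is_reachable(addr: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to addr succeeds within timeout seconds."""
    try:
        with socket.create_connection(parse_addr(addr), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    defaults = [("coordinator", "localhost:8950"), ("ingestion", "localhost:8900")]
    for label, addr in defaults:
        status = "reachable" if is_reachable(addr) else "not reachable"
        print(f"{label} {addr}: {status}")
```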

Build a new Extractor

If you want to give Indexify new data processing capabilities, you can write a new extractor by cloning the template repository: https://github.com/tensorlakeai/indexify-extractor-template

Clone the template

git clone https://github.com/tensorlakeai/indexify-extractor-template.git

Implement the extractor interface

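import json
from typing import List

from pydantic import BaseModel

# Assumes Extractor and Feature are exported by the SDK alongside Content.
from indexify_extractor_sdk import Content, Extractor, Feature


class InputParams(BaseModel):
    # Placeholder parameter model; add the fields your extractor needs.
    pass


class MyExtractor(Extractor):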
    input_mime_types = ["text/plain", "application/pdf", "image/jpeg"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: InputParams) -> List[Content]:
        return [
            Content.from_text(
                text="Hello World",
                features=[
                    Feature.embedding(values=[1, 2, 3]),
                    Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
                ],
                labels={"url": "test.com"},
            ),
            Content.from_text(
                text="Pipe Baz",
                features=[Feature.embedding(values=[1, 2, 3])],
                labels={"url": "test.com"},
            ),
        ]

    def sample_input(self) -> Content:
        return Content.from_text("hello world")

Once you have developed the extractor, you can test it locally by running the indexify-extractor run-local command as described above.

Test and Deploy the extractor

First, test your extractor:

ex, config = load_extractor("my_extractor:MyExtractor")
config.schema()
ex.extract(Content(...), config(...))  # omit config(...) if your extractor takes no parameters

Run the extractor from the shell:

indexify-extractor run-local my_extractor:MyExtractor --text "hello world"  # or --file <path to file>

When you are ready to deploy the extractor in production, package it, deploy as many instances as you want on your cluster for parallelism, and point them to the Indexify server.

indexify-extractor join-server --coordinator-addr localhost:8950 --ingestion-addr localhost:8900
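For parallelism on a single machine, several instances could be started from one script. This is only a sketch: `join_server_argv` and `launch_instances` are hypothetical helpers, it assumes `indexify-extractor` is on `PATH` and is invoked from the extractor's working directory, and it uses only the two flags shown above.

```python
import subprocess
from typing import List


def join_server_argv(coordinator_addr: str = "localhost:8950",
                     ingestion_addr: str = "localhost:8900") -> List[str]:
    """Build the join-server command line from the two documented flags."""
    return [
        "indexify-extractor", "join-server",
        "--coordinator-addr", coordinator_addr,
        "--ingestion-addr", ingestion_addr,
    ]


def launch_instances(count: int, **addrs: str) -> List[subprocess.Popen]:
    """Start `count` extractor processes and return their handles.

    The caller is responsible for waiting on or terminating the processes.
    """
    return [subprocess.Popen(join_server_argv(**addrs)) for _ in range(count)]
```

Calling `launch_instances(4)` would start four processes against the default addresses. On a cluster, a process supervisor or container orchestrator would normally take over this role.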

Package the Extractor

Once you have built and tested a new extractor and are ready to deploy it in production, you can build a container image with the extractor:

indexify-extractor package my_extractor:MyExtractor

If you want to package an extractor in a container that supports Nvidia CUDA GPUs, you can pass the --gpu flag to the package command.

Running Your Packaged Extractor

To run your packaged extractor image, use the following command:

docker run ExtractorImageName indexify-extractor join-server --coordinator-addr=host.docker.internal:8950 --ingestion-addr=host.docker.internal:8900

If you have a GPU-enabled extractor, you might need to set up your machine to run the container with the GPU. This might involve installing the Nvidia Container Toolkit and setting up the Nvidia runtime for Docker. You can find more information in the Nvidia Container Toolkit documentation.

Finally, to run your GPU-enabled extractor, add the --gpus all flag to the docker run command.
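The CPU and GPU variants of the docker run command differ only in the `--gpus all` flag, so the command can be assembled conditionally. A minimal sketch, where `docker_run_argv` is a hypothetical helper and `ExtractorImageName` is the placeholder image name used above:

```python
from typing import List


def docker_run_argv(image: str, gpu: bool = False,
                    coordinator_addr: str = "host.docker.internal:8950",
                    ingestion_addr: str = "host.docker.internal:8900") -> List[str]:
    """Build the docker run command line for a packaged extractor image."""
    argv = ["docker", "run"]
    if gpu:
        # Docker options must appear before the image name.
        argv += ["--gpus", "all"]
    argv += [
        image, "indexify-extractor", "join-server",
        f"--coordinator-addr={coordinator_addr}",
        f"--ingestion-addr={ingestion_addr}",
    ]
    return argv


# Show the GPU variant as a single shell-style line.
print(" ".join(docker_run_argv("ExtractorImageName", gpu=True)))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues.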
