Problem Description
We would like to ingest documentation for the tooling we use so that we can search it, chat with it, and integrate it into the AI assistant. The documentation is generally available in GitHub, but some of it is in formats that the connector cannot ingest.
Here are two examples of documentation available in GitHub that the connector can't ingest:
Keycloak documentation in AsciiDoc (.adoc) format: https://github.com/keycloak/keycloak/tree/a8225655cfc1d4d01d6cbeea70cf45e4958e36e8/docs/guides/getting-started
Octopus Deploy documentation in MDX (.mdx) format: https://github.com/OctopusDeploy/docs/blob/main/src/pages/docs.mdx
Given how easily Elastic ingests and searches raw text from Confluence, and given that Markdown ingested from GitHub is a similar blob of text, I can't see why the list of extensions ingested from GitHub is so strictly limited.
Proposed Solution
Extend the list of supported extensions to include additional formats that documentation is commonly written in.
Current configuration:
SUPPORTED_EXTENSION = [".markdown", ".md", ".rst"]
Proposed configuration or similar:
SUPPORTED_EXTENSION = [".markdown", ".md", ".rst", ".adoc", ".mdx"]
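To illustrate the effect of the longer list, here is a minimal sketch of an extension check. The `is_supported` helper is hypothetical and written only for this issue; it is an assumption about how the list might be used, not the connector's actual code.

```python
import os

# Proposed list from this issue: the three current extensions plus .adoc and .mdx.
SUPPORTED_EXTENSION = [".markdown", ".md", ".rst", ".adoc", ".mdx"]


def is_supported(path: str) -> bool:
    """Hypothetical helper: True if the file's extension is in the list.

    The comparison is case-insensitive, which is an assumption made for
    this sketch rather than documented connector behaviour.
    """
    _, ext = os.path.splitext(path)
    return ext.lower() in SUPPORTED_EXTENSION


if __name__ == "__main__":
    for candidate in ("docs/guides/getting-started/index.adoc",
                      "src/pages/docs.mdx",
                      "src/main.py"):
        print(candidate, is_supported(candidate))
```

With the current three-entry list, the first two paths above would be skipped; with the proposed list they would be picked up.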
A more complex solution would be to expose the list to the user via the advanced sync rules or other configuration.
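A rough sketch of what the configurable variant could look like, assuming the user supplies extra extensions as a comma-separated string. The `resolve_extensions` function and the input format are hypothetical; how this would map onto advanced sync rules or another configuration field is left open.

```python
from typing import List, Optional

# Current defaults, as quoted in this issue.
DEFAULT_SUPPORTED_EXTENSION = [".markdown", ".md", ".rst"]


def resolve_extensions(extra: Optional[str] = None) -> List[str]:
    """Hypothetical helper: merge user-supplied extensions with the defaults.

    `extra` is assumed to be a comma-separated string such as "adoc, .mdx".
    Entries are lower-cased, given a leading dot if missing, and
    de-duplicated while preserving order.
    """
    result = list(DEFAULT_SUPPORTED_EXTENSION)
    for raw in (extra or "").split(","):
        ext = raw.strip().lower()
        if not ext:
            continue  # ignore empty entries such as a trailing comma
        if not ext.startswith("."):
            ext = "." + ext
        if ext not in result:
            result.append(ext)
    return result


if __name__ == "__main__":
    print(resolve_extensions("adoc, .MDX, md"))
```

Keeping the defaults and only appending to them means existing deployments would behave the same unless the user opts in to more formats.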
Alternatives
I have considered using the web crawler (https://github.com/elastic/crawler), but ideally I would like to use the connector, given that it is available and this seems to be one of its primary use cases.
Additional Context
I would be happy to have a go at a PR if the proposal to extend the list is acceptable.