Best practices for extending/customizing the Haystack REST API? #3206

nickchomey · 2022-09-13T03:56:27Z


I would like to use the default REST API as a foundation for my application. By following tutorials, reading the documentation and just exploring the code, I more or less understand how to create custom endpoints, nodes, pipelines etc... in isolation. But, what I am uncertain about is how to actually insert them into the REST API in a way that avoids/minimizes conflicts when merging core Haystack code updates into my project code (the tutorials don't really seem to work/interface with the REST API).

I come from WordPress, where you NEVER touch any core code and instead use "hooks" to call/retrieve your custom code from the appropriate places within the core code, but I haven't yet found any analogue for this in Haystack's REST API. Therefore, I can't figure out how to achieve my goals without modifying core Haystack files, which will invite continual merge conflicts going forward...

I suspect that, at the very least, I could/should create my own my_application.py file that gets launched via gunicorn my_application:app... I could easily add and modify environment variables from there, prior to calling get_app(), get_pipeline(), etc...
e.g.

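import os

# Set the variable before importing rest_api, so it is already in place
# whenever rest_api reads its configuration
os.environ["PIPELINE_YAML_PATH"] = "path/to/custom/pipelines.haystack-pipeline.yml"
# Or use a `.env` file with `load_dotenv()`

from rest_api.utils import get_app, get_pipelines

app = get_app()
# app.include_router(my_router)  # register custom routers here (my_router is a placeholder)
pipelines = get_pipelines()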

But it isn't clear to me how I can avoid modifying or altogether replacing other Core files.

For example, let's say I want to have two search endpoints that use different query pipelines. Currently there's just one /query endpoint, which loads the pipeline that is set by QUERY_PIPELINE_NAME. In order to have two query endpoints, do I add an additional endpoint directly to rest_api/controller/search.py and have that point to a corresponding environment variable - e.g. QUERY_PIPELINE_NAME_2? Or do I create a file such as rest_api/controller/search2.py and then add that to rest_api/utils.py with router.include_router(search2.router...).

Either way, I'm modifying a Core Haystack file.

I suppose I could do something in my_application.py like

#retrieve all the default routers etc... from `rest_api.utils.py`
app = get_app() 

#add my own router defined in `path/search2.py`
app.include_router(search2.router, tags=["search2"])

But that just seems clunky. It would be best if I could just "hook into", or otherwise insert, search2 into the existing utils.py module without modifying the file.

Am I missing something very fundamental about how to work with a Python project in general, or Haystack in particular? Should I just be modifying the "core" files? Or should I be treating some (or any/all) of the core files - namely application.py and utils.py - as templates from which to create my own versions?

I expect that the answer to all of this is extremely basic/simple, so I would very much appreciate if someone could take a few minutes to point me in the right direction here. Once I can get an understanding of these conceptual/architectural things, I should be able to start making rapid progress with my application, as well as developing a Node for spaCy that I can contribute back to the Haystack project.

Thanks!

masci · 2022-09-14T12:37:10Z


Hi @nickchomey thanks for this discussion, there's plenty of interesting pointers here. Let's start with the simple ones.

Forking vs. Extending

This is not peculiar to Haystack, or even to Python: when you want to extend an existing project, either the project is flexible enough to let you do that (as in WordPress) or you have to fork it and make changes to the core components yourself, taking care to pull new code from upstream and ensure your changes keep working. While forking can be handy for a POC or a quick, dirty fix, I would generally leave it as the last resort.

In this case for example, FastAPI can be of great help as it natively supports adding new endpoints to existing applications (the support is limited though, so it might not work for you). If FastAPI is not enough for your use case, since Haystack is an open source project you can consider contributing the feature yourself. Forking should really come last in my opinion :)

Adding new endpoints to rest_api

| Disclaimer: I didn't try this code, if this looks interesting we can get deeper.

You could create a standalone FastAPI application in a different repo, like this:

# file: myapi.py
import uvicorn
from fastapi import FastAPI
from rest_api.utils import get_app

app = FastAPI()
haystack_app = get_app()

# Serve the stock rest_api endpoints under /haystack
app.mount("/haystack", haystack_app)


# Custom endpoint served from the root of the new application
@app.get("/search")
def index(pipeline_id: int = 0):
    return "Hello from custom search!"


if __name__ == "__main__":
    uvicorn.run("myapi:app", host="127.0.0.1", port=8000)

With your endpoints mounted on the root and the ones from rest_api under /haystack like this:

$ curl http://localhost:8000/search
"Hello from custom search!"
$ curl http://localhost:8000/haystack/health
{"version":"1.6.1rc0","cpu":{"used":0.0},"memory":{"used":1.79},"gpus":[]}

Serving multiple pipelines

Now this is interesting, and I believe the method above would only take you so far: rest_api makes several assumptions that exactly one pipeline is exposed. I think it's worth expanding on your use case to see whether it would be strategic for rest_api to serve more than one pipeline. The API could stay the same, but we might introduce URL params like curl http://localhost:8000/search?pipeline_id=0.
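To make the `pipeline_id` idea concrete, here is a rough sketch of the dispatch logic only, not something rest_api does today. `PIPELINES`, `search`, `keyword_pipeline` and `semantic_pipeline` are hypothetical names, and the two functions are stand-ins for pipelines that a real service would load once at startup.

```python
from typing import Any, Callable, Dict


# Hypothetical stand-ins for loaded pipelines.
def keyword_pipeline(query: str) -> Dict[str, Any]:
    return {"pipeline": "keyword", "query": query}


def semantic_pipeline(query: str) -> Dict[str, Any]:
    return {"pipeline": "semantic", "query": query}


# Registry mapping the pipeline_id URL parameter to a pipeline callable.
PIPELINES: Dict[int, Callable[[str], Dict[str, Any]]] = {
    0: keyword_pipeline,
    1: semantic_pipeline,
}


def search(query: str, pipeline_id: int = 0) -> Dict[str, Any]:
    """Dispatch a query to the pipeline selected by pipeline_id."""
    try:
        pipeline = PIPELINES[pipeline_id]
    except KeyError:
        raise ValueError(f"unknown pipeline_id: {pipeline_id}") from None
    return pipeline(query)
```

An endpoint handler would call `search(query, pipeline_id)` and translate the `ValueError` into a 4xx response, so the URL shape stays the same no matter how many pipelines are registered.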

Let me know if these points are enough to unblock you!

8 replies

nickchomey Sep 14, 2022
Author

Thanks very much! I very much appreciate the perspective and explanation, and understand that there was a misunderstanding between what I thought Haystack was and what it actually is (namely, it seems to be a fairly good framework for preprocessing, indexing, retrieving etc..., but doesn't handle the actual communication with the external application, in particular through a REST API).

Anyway, to clarify once again, I have no expectation for anyone to cater to my specific needs - certainly not immediately, or even ever. However, as you acknowledge, many people do seem to expect/need Haystack to work in the way that I expected (with a drop-in REST API). So, I am very grateful to the efforts that you have and will continue to make towards this end. I likewise hope that Deepset will recognize this deep need from users and provide you with assistance for this endeavour!

Whatever the case, I do think there would be value in giving more attention and priority to the already-acknowledged need for refactoring the pipelines and rest api. Anyway, I'm happy to leave this conversation for you folks to discuss, if it is something you think is worthwhile.

I suppose for now, in the context of this discussion that I started, I would really just appreciate if someone could confirm how I should be using Haystack. Is it correct to say that I should be creating my own custom FastAPI application that imports and uses Haystack modules, methods etc... as-needed, rather than trying to base my application off/extend the Haystack REST API?

If that's the case, again, I'll probably limit my explorations of Haystack to something only very superficial as I don't really want to have to learn to re-create the wheel for a FastAPI application that properly handles multithreading, security, etc... There are aspects of that already handled in the Haystack REST API, but, again, it seems crazy to be creating some sort of hybrid fork of Haystack, especially as it continues to evolve/change.

vblagoje Sep 15, 2022
Maintainer

@nickchomey, as it seems that you have immediate needs and goals to meet, I think it is better to create your custom FastAPI application that uses Haystack rather than extending Haystack REST API, waiting for integration of these extensions etc. By the time your app is ready to go to production, we (including @danielbichuetti) will likely expand Haystack support on that front. That will give you additional options for proceeding further (switch to Haystack or continue to use your custom component). We will likely exchange experiences and develop a better solution for all parties involved. However, this way, you don't depend on anyone's timeline. Developing these components isn't complicated and could be picked up from FastAPI examples and tutorials. I might be wrong, @danielbichuetti thoughts?

masci Sep 15, 2022

Ultimately, it seems to me that Haystack's current architecture isn't properly aligned with its core premise of being a framework for adding NLP to your application's search mechanism.

I don't 100% agree here but I see where the misunderstanding comes from. When I joined the project I was confused myself by the duality of Haystack: on one side we try to provide a turnkey solution (see all the dependencies that come along with a basic installation!), on the other side we want Haystack to be a framework, think about a box of Lego pieces.

I would like to state here that the direction I'm giving to the project points towards providing a pure NLP framework, I have the ambitious goal of making Haystack the Django of NLP :) but obviously we're not quite there.

Daniel explained better than I could the history of the rest_api package. Once again, on one side Haystack provides a handy tool that does 80% of the job (albeit making the remaining 20% impossible). On the other side, Haystack should be generic enough to be used beyond rest_api, but we're falling short in communicating that to users.

I'll probably limit my explorations of Haystack to something only very superficial as I don't really want to have to learn to re-create the wheel for a FastAPI application that properly handles multithreading, security, etc...

Let me set expectations right here: that's very correct. Being (or wanting to be) an NLP framework, Haystack's core will be focused on NLP features, delegating everything else (deployment, user authentication and permissions, etc...) to its "ecosystem". As Daniel said, rest_api remains strategic for my team and will keep evolving, just not as part of Haystack. As a matter of fact, as soon as I have some time I'll move the code away into its own repo (and pip package) to stress the concept that rest_api is not Haystack core. So bear with us, we're going to incorporate feedback like the one in this discussion.

Let me also stress that deepset will invest heavily on this "ecosystem" because even the best NLP framework is pointless if people can't use it, and if we get things right maybe more projects will spawn and maybe they will be better than our own rest_api, or our own Docker images, or our own Helm chart, and that will be a big win for Haystack!

Is it correct to say that I should be creating my own custom FastAPI application that imports and uses Haystack modules, methods etc... as-needed, rather than trying to base my application off/extend the Haystack REST API?

It depends on the amount of customisation you need and how much you want to invest in making rest_api work at the expense of (for example) more infra and setup work. For example, if you need two pipelines to be up and running, the options might be:

Deploy rest_api twice and expose the two URLs, or put a proxy in front of the two services to have a single URL
Fork rest_api and make the changes you need
Create your own API service with FastAPI, Flask, or any other Python web framework (this might be less work than it looks)
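For the first option, the routing a proxy would perform can be sketched in a few lines. This is only an illustration of the prefix-to-service mapping: `UPSTREAMS`, `resolve_upstream`, the `/keyword` and `/semantic` prefixes and the two local ports are all made up, and in practice the mapping would more likely live in the configuration of an off-the-shelf reverse proxy than in Python.

```python
from typing import Dict, Optional

# Hypothetical upstreams: two rest_api deployments, one per pipeline.
UPSTREAMS: Dict[str, str] = {
    "/keyword": "http://127.0.0.1:8001",
    "/semantic": "http://127.0.0.1:8002",
}


def resolve_upstream(path: str) -> Optional[str]:
    """Return the upstream URL for a public path, or None if no prefix matches."""
    for prefix, base_url in UPSTREAMS.items():
        # Match the prefix itself or anything below it, but not e.g. "/keywordx".
        if path == prefix or path.startswith(prefix + "/"):
            return base_url + path[len(prefix):]
    return None
```

With this mapping, a request for `/keyword/query` on the single public URL is forwarded to `/query` on the first deployment, and `/semantic/query` to the second.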

danielbichuetti Sep 15, 2022

@nickchomey, as it seems that you have immediate needs and goals to meet, I think it is better to create your custom FastAPI application that uses Haystack rather than extending Haystack REST API, waiting for integration of these extensions, etc.

@vblagoje Today, if someone needs a customized way of exposing Haystack to a third-party project, the best option, in my opinion, is FastAPI. I really think FastAPI provides ways to quickly develop customized endpoints for any need. You don't need to know much; it provides abstractions for everything.

I think there are two ideas that are being transmitted wrongly to newcomers. That REST API is the perfect Haystack deployment solution. And that it's the only way. This is getting into their heads, and they are trying to adapt everything to the REST API when they should be using the REST API as an example for simple deployment.

And the REST API was not tailored to be a solution for every scenario. 🦖 It's not thread-safe, some users get scared because of the transformer's multithreading issues with FastTokenizers, but it's not hard to solve it. The error is huge, but the solution can be some dozens of lines with multithreading or less than 8 lines with multiprocessing + gunicorn. But the idea of a standard deployment makes them worry about changing the REST API. And it shouldn't be this way.
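As one possible reading of the multithreading route (not necessarily the fix meant above), a non-thread-safe pipeline can be guarded with a lock so that only one request thread runs it at a time. `SerializedPipeline` and `fake_pipeline` are hypothetical names; the fake function stands in for a pipeline whose tokenizer misbehaves under concurrent calls.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


class SerializedPipeline:
    """Wrap a non-thread-safe callable so only one thread runs it at a time."""

    def __init__(self, run: Callable[[str], Any]) -> None:
        self._run = run
        self._lock = threading.Lock()

    def __call__(self, query: str) -> Any:
        # Other threads block here until the current call has finished.
        with self._lock:
            return self._run(query)


# Hypothetical stand-in for a pipeline that is not safe to call concurrently.
def fake_pipeline(query: str) -> str:
    return query.upper()


safe_pipeline = SerializedPipeline(fake_pipeline)

# Simulate four concurrent requests hitting the same pipeline.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(safe_pipeline, ["a", "b", "c", "d"]))
```

The trade-off is that requests to the same pipeline are processed one after another inside a process. The multiprocessing route avoids that by giving each worker process its own copy of the pipeline, at the cost of more memory.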

@masci Regarding the move out of the Haystack repo, I think it should be well-orchestrated, and happen in the future. If it happens without the ongoing support of a team behind it and enough contributors, you may be creating a phantom repo. And closing the door on the average user. 😨

Certainly, there is room for improvement. I'm sure deepset will continue investing heavily in Haystack. There are contributors that I'm sure will do their best to improve this.

Indeed, I'll focus on the REST API PRs. I haven't yet because, since last week, I've had to orchestrate my time between being a father, a husband, a founder of a small startup, and delivering the Final Paper the next couple of weeks for my two specializations. 🐫 @nickchomey I apologize for this delay. It would be a dream to have 28 hours a day to do what I love since I was a child (coding and taking challenges). But rest assured, this will be done shortly.

nickchomey Sep 17, 2022
Author

Sorry folks, somehow I missed the notifications here and only just saw all of your wonderful responses now!

Thanks very much everyone for your detailed clarifications, advice, and, most importantly, continued efforts for the project. I understand quite well now that the focus of Haystack is on building a great framework for NLP-driven search, and it seems to be succeeding greatly with that goal!

I think that all that needs to be said about this topic has been said already, but I'll just reiterate my core feedback once more. It seems to me that without an easy-to-use and robust API, you're going to be excluding many (probably most) potential users - be they open-source or commercial. Imagine if Elasticsearch (and the other DocumentStores) didn't do this - it would be like building a beautiful beachside resort but neglecting to build a road for people to actually access it, and then saying "yeah, but it's easy enough to charter a boat or helicopter...". That's how I more or less interpret the responses that "FastAPI is great and really easy to use - just build it yourself!"

Given that most people surely interact with ES via REST API and DSL queries and are probably not familiar with python, NLP, APIs etc.., it would be ideal if a DSL-receiving REST API could be a 1st class citizen within Haystack, so as to be as close to a drop-in experience as possible. That would make it FAR more likely for people to consider adopting Haystack than if they need to build some custom (likely clumsy) API along with a bunch of python code to handle their application's existing logic - this could only lead to greater success for the Haystack ecosystem and Deepset commercially.

Also, keep in mind that ElasticSearch 8.x has introduced quite a lot of ML and NLP capabilities (in their paid-plans). Perhaps Haystack is superior (though ES is surely putting ENORMOUS resources behind all of this), but if a company is faced with the choice between using what is built-in and seamless and trying to shoe-horn some custom tool into their process, which do you think they're going to go with? In fact, I only discovered Haystack after going on a lengthy search after I realized that ES' ML capabilities are only available with their $109+/month licenses... But that's surely a negligible amount for any potential commercial Deepset client, so they'd just stop their search with ES 8.x... So, your best chance at competing with that is making it as seamless as possible. Of course, you also have a significant advantage in that you interface with all sorts of other vector DocumentStores, but ES is where the majority of clients are, and those other DocumentStores probably are best used with a REST API as well...

Anyway, you have your vision, priorities and responsibilities, so I don't expect any of that to meaningfully change based on the ravings of a freeloading newbie. I just hope this will help you better shape your plans going forward. I really want Deepset/Haystack to succeed!

For the time being, as some have already suggested, I have proceeded with a very simple FastAPI application (that I'm quite sure is missing all sorts of safety and performance features that would be expected in a boat or helicopter...) and have been piecing together some basic endpoints, functions and pipelines. Once it is "working", I'll return my focus to my actual WordPress-based project.

I'm not nearly the developer/engineer that you folks are, but I hope that I'll be able to contribute some PRs for things like spaCy (which I'm still 100% convinced should be the primary preprocessor in Haystack, and I'll aim to provide some benchmark data to back this up) and other small improvements as I see opportunity for them. There might even be an opportunity for me to share my basic application as a template/tutorial for other clueless newcomers to learn from.

I'm glad that all of this sparked some positive discussion and reflection and eagerly look forward to whatever might evolve out of this! Thanks again for everything.

nickchomey · 2022-09-25T15:43:35Z

Author

Ps. I just stumbled upon this tool that might be worth considering for any REST API work that you do.

https://pinferencia.underneathall.app/0.2/

It uses fastapi and uvicorn, so it's the same foundation, and seems to be focused on serving inference models, so perhaps it would allow you to outsource this non-core feature to a purpose-built tool that you don't need to worry about developing or maintaining?

Or perhaps it's too simple/doesn't fit in with Haystack pipeline stuff. I'm not knowledgeable enough to answer that, but figured I'd at least share this with you folks.

0 replies
