Description
Currently, the requests queue is implemented using a Go channel. This causes some limitations:
Known limitation 1: max-loras is not taken into consideration when a worker pulls a request from the queue
The component that receives requests pushes each new request to a channel; workers wait on this channel and process requests as they arrive. Requests are always processed in arrival order.
In some scenarios this behavior leads to running requests in parallel with more LoRA adapters loaded than allowed by the max-loras parameter.
Example:
max-loras is defined as 2, the number of parallel requests is 3.
Queue contains: R1 (lora1), R2 (lora2), R3 (lora3), R4 (lora1)
The current implementation will pull three requests for processing: R1, R2, and R3.
Required behavior: pull R1, R2, and R4 (R3 cannot be sent for processing, since it would require loading more than 2 LoRAs).
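For illustration, a minimal Go sketch of the current channel-based flow under these assumptions (the request type and worker count are hypothetical, not taken from the actual code):

```go
package main

import "fmt"

type Request struct {
	ID   string
	Lora string
}

func main() {
	// Requests are queued in arrival order on a plain channel.
	reqChan := make(chan Request, 4)
	for _, r := range []Request{
		{"R1", "lora1"}, {"R2", "lora2"}, {"R3", "lora3"}, {"R4", "lora1"},
	} {
		reqChan <- r
	}

	// Three parallel workers each pull the next request by arrival time,
	// without checking how many distinct LoRAs are already in use.
	for worker := 0; worker < 3; worker++ {
		r := <-reqChan
		fmt.Printf("worker %d processing %s (%s)\n", worker, r.ID, r.Lora)
	}
	// With max-loras = 2 this is already a violation: lora1, lora2,
	// and lora3 are all loaded at once.
}
```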
Known limitation 2: extra entries in loraInfo metrics are reported
When a single request is received, it is pushed to the queue (the channel), which creates a metrics report stating that the request's LoRA is in the waiting list. This report does not affect the LoraAwareScorer, but it should be removed in a future version. Implementing the new queue will fix this behavior.
Solution:
Implement a queue class that exposes an API similar to a channel: workers will wait on it for a new request to process.
The queue will skip requests whose LoRAs cannot be loaded right now.
Design - TBD
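As a starting point for discussion, here is a rough sketch of what such a queue could look like. All names (LoraAwareQueue, Push, Pop, Done, maxLoras) and the structure are assumptions for illustration only, not the actual design:

```go
package queue

import "sync"

type Request struct {
	ID   string
	Lora string
}

// LoraAwareQueue is an illustrative sketch, not the real design.
// Pop blocks like a channel receive, but only returns a request whose
// LoRA is already loaded or can be loaded without exceeding maxLoras.
type LoraAwareQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	items    []*Request
	loaded   map[string]int // LoRA name -> number of running requests using it
	maxLoras int
}

func NewLoraAwareQueue(maxLoras int) *LoraAwareQueue {
	q := &LoraAwareQueue{loaded: make(map[string]int), maxLoras: maxLoras}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Push enqueues a request and wakes a waiting worker.
func (q *LoraAwareQueue) Push(r *Request) {
	q.mu.Lock()
	q.items = append(q.items, r)
	q.mu.Unlock()
	q.cond.Signal()
}

// Pop blocks until some queued request is schedulable, skipping requests
// whose LoRA cannot be loaded right now.
func (q *LoraAwareQueue) Pop() *Request {
	q.mu.Lock()
	defer q.mu.Unlock()
	for {
		for i, r := range q.items {
			_, alreadyLoaded := q.loaded[r.Lora]
			if alreadyLoaded || len(q.loaded) < q.maxLoras {
				q.items = append(q.items[:i], q.items[i+1:]...)
				q.loaded[r.Lora]++
				return r
			}
		}
		q.cond.Wait() // no schedulable request right now
	}
}

// Done is called by a worker when it finishes a request, freeing its LoRA slot.
func (q *LoraAwareQueue) Done(r *Request) {
	q.mu.Lock()
	q.loaded[r.Lora]--
	if q.loaded[r.Lora] == 0 {
		delete(q.loaded, r.Lora)
	}
	q.mu.Unlock()
	q.cond.Broadcast() // a LoRA slot may have opened up; waiters re-check
}
```

Applied to the example above with maxLoras = 2, Pop would return R1, R2, and then R4, leaving R3 queued until one of the two loaded LoRAs is released.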