Description
Currently, the requests queue is implemented using a Go channel. This causes some limitations:
Known limitation 1: max-loras is not taken into consideration when a worker pulls a request from the queue
The component that receives requests pushes each new request to a channel; workers wait on this channel and process requests as they arrive. Requests are always processed in arrival order.
In some scenarios this behavior leads to running requests in parallel with more LoRA adapters loaded than allowed by the max-loras parameter.
Example:
max-loras is defined as 2, the number of parallel requests is 3.
Queue contains: R1 (lora1), R2 (lora2), R3 (lora3), R4 (lora1)
The current implementation will pull three requests for processing: R1, R2, and R3.
Required behavior: pull R1, R2, and R4 (R3 cannot be sent for processing, since it would require loading more than 2 LoRAs).
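For illustration, a minimal Go sketch of the current channel-based flow under these assumptions (the request type and worker count are hypothetical, not taken from the actual code):

```go
package main

import "fmt"

type Request struct {
	ID   string
	Lora string
}

func main() {
	// Requests are queued in arrival order on a plain channel.
	reqChan := make(chan Request, 4)
	for _, r := range []Request{
		{"R1", "lora1"}, {"R2", "lora2"}, {"R3", "lora3"}, {"R4", "lora1"},
	} {
		reqChan <- r
	}

	// Three parallel workers each pull the next request by arrival time,
	// without checking how many distinct LoRAs are already in use.
	for worker := 0; worker < 3; worker++ {
		r := <-reqChan
		fmt.Printf("worker %d processing %s (%s)\n", worker, r.ID, r.Lora)
	}
	// With max-loras = 2 this is already a violation: lora1, lora2,
	// and lora3 are all loaded at once.
}
```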
Known limitation 2: extra entries in loraInfo metrics are reported
When a single request is received, it is pushed to the queue (the channel), which creates a metrics report stating that the request's LoRA is in the waiting list. This report does not affect the LoraAwareScorer, but it should be removed in a future version. Implementing the new queue will fix this behavior.
Solution:
Implement a queue class that exposes an API similar to a channel: workers will wait on it for a new request to process.
The queue will skip requests whose LoRAs cannot be loaded right now.
Design - TBD
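As a starting point for discussion, here is a rough sketch of what such a queue could look like. All names (LoraAwareQueue, Push, Pop, Done, maxLoras) and the structure are assumptions for illustration only, not the actual design:

```go
package queue

import "sync"

type Request struct {
	ID   string
	Lora string
}

// LoraAwareQueue is an illustrative sketch, not the real design.
// Pop blocks like a channel receive, but only returns a request whose
// LoRA is already loaded or can be loaded without exceeding maxLoras.
type LoraAwareQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	items    []*Request
	loaded   map[string]int // LoRA name -> number of running requests using it
	maxLoras int
}

func NewLoraAwareQueue(maxLoras int) *LoraAwareQueue {
	q := &LoraAwareQueue{loaded: make(map[string]int), maxLoras: maxLoras}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Push enqueues a request and wakes a waiting worker.
func (q *LoraAwareQueue) Push(r *Request) {
	q.mu.Lock()
	q.items = append(q.items, r)
	q.mu.Unlock()
	q.cond.Signal()
}

// Pop blocks until some queued request is schedulable, skipping requests
// whose LoRA cannot be loaded right now.
func (q *LoraAwareQueue) Pop() *Request {
	q.mu.Lock()
	defer q.mu.Unlock()
	for {
		for i, r := range q.items {
			_, alreadyLoaded := q.loaded[r.Lora]
			if alreadyLoaded || len(q.loaded) < q.maxLoras {
				q.items = append(q.items[:i], q.items[i+1:]...)
				q.loaded[r.Lora]++
				return r
			}
		}
		q.cond.Wait() // no schedulable request right now
	}
}

// Done is called by a worker when it finishes a request, freeing its LoRA slot.
func (q *LoraAwareQueue) Done(r *Request) {
	q.mu.Lock()
	q.loaded[r.Lora]--
	if q.loaded[r.Lora] == 0 {
		delete(q.loaded, r.Lora)
	}
	q.mu.Unlock()
	q.cond.Broadcast() // a LoRA slot may have opened up; waiters re-check
}
```

Applied to the example above with maxLoras = 2, Pop would return R1, R2, and then R4, leaving R3 queued until one of the two loaded LoRAs is released.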