Change requests queue implementation from a channel to more complex queue #172

@mayabar

Currently, the request queue is implemented using a Go channel. This causes some limitations:

Known limitation 1: max-loras is not taken into consideration when a worker pulls a request from the queue
The component that receives requests pushes each new request to a channel; workers wait on this channel and process the requests they receive.
Requests are always processed in order of arrival.
In some scenarios this leads to requests running in parallel with more LoRA adapters than the max-loras parameter allows.

Example:
max-loras is set to 2, and the number of parallel requests is 3
Queue contains: R1 (lora1), R2 (lora2), R3 (lora3), R4 (lora1)
The current implementation pulls three requests for processing: R1, R2, and R3.
Required behavior: pull R1, R2, and R4 (R3 cannot be sent for processing, since it would require loading more than 2 LoRAs)
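
To make the difference concrete, here is a rough sketch of the two selection rules applied to the queue from the example. The `Request` type and the `pickFIFO` / `pickLoRAAware` helpers are hypothetical names made up for illustration, not code from the project, and the sketch assumes every request names an adapter.

```go
package main

import "fmt"

// Request is a hypothetical, simplified request: an ID plus the LoRA adapter it needs.
type Request struct {
	ID   string
	LoRA string
}

// pickFIFO mimics the channel-based behavior: take the first n requests
// strictly by arrival order, ignoring which adapters they need.
func pickFIFO(queue []Request, n int) []Request {
	if n > len(queue) {
		n = len(queue)
	}
	return queue[:n]
}

// pickLoRAAware scans the queue in arrival order and skips any request whose
// adapter would push the number of distinct adapters above maxLoras.
func pickLoRAAware(queue []Request, n, maxLoras int) []Request {
	loaded := map[string]bool{}
	var picked []Request
	for _, r := range queue {
		if len(picked) == n {
			break
		}
		if !loaded[r.LoRA] && len(loaded) >= maxLoras {
			continue // would require loading one adapter too many
		}
		loaded[r.LoRA] = true
		picked = append(picked, r)
	}
	return picked
}

// ids extracts request IDs for printing.
func ids(rs []Request) []string {
	out := make([]string, 0, len(rs))
	for _, r := range rs {
		out = append(out, r.ID)
	}
	return out
}

func main() {
	queue := []Request{
		{"R1", "lora1"}, {"R2", "lora2"}, {"R3", "lora3"}, {"R4", "lora1"},
	}
	fmt.Println(ids(pickFIFO(queue, 3)))         // [R1 R2 R3]
	fmt.Println(ids(pickLoRAAware(queue, 3, 2))) // [R1 R2 R4]
}
```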

Known limitation 2: extra entries are reported in the loraInfo metrics
When a single request is received, it is pushed to the queue (the channel), which creates a metrics report stating that this request's LoRA is in the waiting list. This report does not affect the LoraAwareScorer, but it should be removed in a future version. Implementing the new queue will fix this behavior.

Solution:
Implement a queue class that exposes an API similar to a channel: workers wait for a new request to process.
The queue skips requests whose LoRAs cannot be loaded right now.
Design - TBD
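
Since the design is still open, the following is only one possible shape for such a queue, sketched with `sync.Mutex` and `sync.Cond` from the standard library. Everything here is an assumption for discussion: the `LoRAQueue` type, its `Enqueue` / `Dequeue` / `Done` methods, and the simplification that an adapter counts as loaded only while at least one request using it is running.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a hypothetical, simplified request.
type Request struct {
	ID   string
	LoRA string
}

// LoRAQueue is a hypothetical blocking queue that respects maxLoras.
type LoRAQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	pending  []Request
	running  map[string]int // adapter -> number of in-flight requests (assumed model of "loaded")
	maxLoras int
}

func NewLoRAQueue(maxLoras int) *LoRAQueue {
	q := &LoRAQueue{running: map[string]int{}, maxLoras: maxLoras}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Enqueue appends a request and wakes any waiting workers.
func (q *LoRAQueue) Enqueue(r Request) {
	q.mu.Lock()
	q.pending = append(q.pending, r)
	q.mu.Unlock()
	q.cond.Broadcast()
}

// Dequeue blocks, like a channel receive, until some pending request can run
// without exceeding maxLoras. It returns the earliest such request.
func (q *LoRAQueue) Dequeue() Request {
	q.mu.Lock()
	defer q.mu.Unlock()
	for {
		for i, r := range q.pending {
			if q.running[r.LoRA] > 0 || len(q.running) < q.maxLoras {
				q.pending = append(q.pending[:i], q.pending[i+1:]...)
				q.running[r.LoRA]++
				return r
			}
		}
		q.cond.Wait() // nothing runnable yet; wait for Enqueue or Done
	}
}

// Done must be called when a request finishes so its adapter slot can be freed.
func (q *LoRAQueue) Done(r Request) {
	q.mu.Lock()
	q.running[r.LoRA]--
	if q.running[r.LoRA] <= 0 {
		delete(q.running, r.LoRA)
	}
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := NewLoRAQueue(2)
	for _, r := range []Request{{"R1", "lora1"}, {"R2", "lora2"}, {"R3", "lora3"}, {"R4", "lora1"}} {
		q.Enqueue(r)
	}
	a, b, c := q.Dequeue(), q.Dequeue(), q.Dequeue()
	fmt.Println(a.ID, b.ID, c.ID) // R1 R2 R4
	q.Done(a)
	q.Done(c)
	fmt.Println(q.Dequeue().ID) // R3
}
```

One thing a real design would need to address: in this sketch a skipped request such as R3 can be starved if requests for already-loaded adapters keep arriving, so some fairness or aging rule is probably needed.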
