Description
app.py
get_model_size() runs several regex matches on every request. Pre-compiling the patterns and caching the size lookup per model name would avoid the repeated work.
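A minimal sketch of the pre-compile-and-cache idea. The pattern and the name get_model_size_cached are assumptions for illustration; the real get_model_size() in app.py may parse names differently.

```python
import re
from functools import lru_cache
from typing import Optional

# Hypothetical pattern: a parameter count such as "7b" or "13B" in the model name.
# Compiled once at import time instead of on every request.
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*b\b", re.IGNORECASE)


@lru_cache(maxsize=1024)
def get_model_size_cached(model_name: str) -> Optional[float]:
    """Return the size in billions of parameters, or None if no size is found."""
    match = _SIZE_RE.search(model_name)
    return float(match.group(1)) if match else None
```

Since the set of model names is small and repeats across requests, the cache should absorb nearly all calls after warm-up.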
augment_reply() is called multiple times per stream. Refactor so the common logic runs once per stream.
check_bill_usage() makes an HTTP request to bill on every completion.
- check_bill_usage is already asynchronous
- Exception handling could be improved
- Batching billing calls would cut the number of requests further
- Reusing a single HTTPX client avoids per-request connection setup
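The batching and exception-handling points could look roughly like the sketch below. BillingBatcher and send_batch are hypothetical names; in practice send_batch would presumably wrap one POST on a shared HTTPX client, but it is injected here so the sketch stays independent of the real billing API.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List

log = logging.getLogger(__name__)


class BillingBatcher:
    """Collect usage records and send them in batches (hypothetical helper)."""

    def __init__(self, send_batch: Callable[[List[dict]], Awaitable[None]],
                 max_batch: int = 50):
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._pending: List[dict] = []

    async def add(self, record: dict) -> None:
        """Queue a record; flush automatically once the batch is full."""
        self._pending.append(record)
        if len(self._pending) >= self._max_batch:
            await self.flush()

    async def flush(self) -> None:
        """Send everything queued so far in one call."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        try:
            await self._send_batch(batch)
        except Exception:
            # Log and requeue so a billing outage does not fail the completion.
            log.exception("billing batch failed; requeueing %d records", len(batch))
            self._pending = batch + self._pending
```

A periodic flush() from a background task, plus one at shutdown, would be needed so that a partially filled batch is not left unsent.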
alt_models() repeats the same regex searches on each call. Build the map once at startup instead.
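One way to build the map once, sketched under the assumption that alternatives are models sharing a base name after a quantisation or format suffix is stripped. The suffix pattern, base_name() and build_alt_map() are illustrative and not taken from app.py.

```python
import re
from collections import defaultdict

# Hypothetical suffix pattern: quantisation/format tags such as ".q4_0" or "-gptq".
_VARIANT_RE = re.compile(r"[.\-_](q\d\w*|gguf|ggml|awq|gptq)$", re.IGNORECASE)


def base_name(model: str) -> str:
    """Strip one trailing variant suffix and lowercase the name."""
    return _VARIANT_RE.sub("", model.lower())


def build_alt_map(models):
    """Run the regex once per model at startup; lookups afterwards are dict reads."""
    groups = defaultdict(list)
    for model in models:
        groups[base_name(model)].append(model)
    return {
        model: [alt for alt in groups[base_name(model)] if alt != model]
        for model in models
    }
```

The map would need rebuilding only when the model list changes.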
Redundant get_reg_mgr() calls: store the result in a local variable instead.
worker_stats() regenerates stats on every call. Cache the result and refresh it periodically.
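A small time-based cache would be enough for this. TimedCache is a hypothetical helper, and the 5-second default is an arbitrary placeholder rather than a measured value.

```python
import time


class TimedCache:
    """Recompute a value at most once per ttl_seconds (hypothetical helper)."""

    def __init__(self, compute, ttl_seconds=5.0, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._value = None
        self._expires_at = None      # None means "never computed"

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

worker_stats() would then return cache.get() and pay the regeneration cost only when the entry has expired.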
Lots of log.debug calls, which can be slowed down by I/O. Use them selectively.
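Two standard-library techniques help here: passing arguments lazily so the message is only formatted when DEBUG is enabled, and guarding any expensive preparation with isEnabledFor. The summarize() and handle() functions below are made up for illustration.

```python
import logging

log = logging.getLogger("app")


def summarize(payload):
    """Stand-in for an expensive formatting step."""
    return ", ".join(f"{k}={v}" for k, v in sorted(payload.items()))


def handle(payload):
    # Lazy %-style arguments: the string is only built if DEBUG is enabled.
    log.debug("handling request with %d fields", len(payload))
    # Guard work that is expensive even before formatting happens.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("payload: %s", summarize(payload))
    return len(payload)
```

With the logger at INFO or above, summarize() is never called on the request path.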
stats.py
StatsContainer recalculates stats on every call. Can cache results, rebuild on update, or lazy rebuild as needed.
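The lazy-rebuild option can be sketched with a stale marker that update() sets and the read path clears. LazyStats is a simplified stand-in; the actual fields of StatsContainer are not shown in this issue.

```python
class LazyStats:
    """Rebuild the summary only when data changed since the last read."""

    def __init__(self):
        self._samples = []
        self._summary = None   # None means "stale, rebuild on next read"
        self.rebuilds = 0      # counter kept only to make the behaviour visible

    def update(self, value):
        self._samples.append(value)
        self._summary = None   # invalidate; do not recompute yet

    def summary(self):
        if self._summary is None:
            n = len(self._samples)
            mean = sum(self._samples) / n if n else 0.0
            self._summary = {"count": n, "mean": mean}
            self.rebuilds += 1
        return self._summary
```

Many reads between updates cost one rebuild, and many updates between reads also cost one rebuild.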
pick_best()
sorts every time it is called.
- Store perf data separately from heap, to allow fast lookup in update step.
- For tied perf, use gpu info as tie-breaker for deterministic results.
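A possible shape for the two points above: perf data lives in a dict for O(1) lookup on update, while a heap orders candidates and stale heap entries are discarded lazily. WorkerRanking and its field names are assumptions, and using the GPU name as a plain string tie-breaker is only one possible reading of "gpu info".

```python
import heapq


class WorkerRanking:
    """Heap for ordering plus a dict for lookups (hypothetical sketch)."""

    def __init__(self):
        self._perf = {}   # worker_id -> (perf, gpu_name), the source of truth
        self._heap = []   # (-perf, gpu_name, worker_id); may hold stale entries

    def update(self, worker_id, perf, gpu_name):
        self._perf[worker_id] = (perf, gpu_name)
        heapq.heappush(self._heap, (-perf, gpu_name, worker_id))

    def remove(self, worker_id):
        self._perf.pop(worker_id, None)

    def pick_best(self):
        """Highest perf wins; ties break on gpu_name, then worker_id."""
        while self._heap:
            neg_perf, gpu_name, worker_id = self._heap[0]
            if self._perf.get(worker_id) == (-neg_perf, gpu_name):
                return worker_id
            heapq.heappop(self._heap)  # entry is outdated, discard it
        return None
```

Stale entries accumulate between picks, so a long-running process with frequent updates might rebuild the heap from the dict occasionally.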
StatsStore's separate thread may contend for the lock. Profile to check whether lock contention actually occurs.
Stats calculations under contention can use lock-free approaches (e.g. atomic counters).
General
Use a profiler to identify any other hot spots missed above.
Benchmark different approaches to quantify gains.
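As a starting point, the standard library's timeit and cProfile cover both steps. The workload below (compiled versus uncompiled regex over made-up model names) is only a placeholder for the real functions in app.py and stats.py.

```python
import cProfile
import io
import pstats
import re
import timeit

# Placeholder workload: made-up model names, not real data.
NAMES = [f"model-{n}b-chat" for n in range(1, 200)]
_COMPILED = re.compile(r"(\d+)b")


def uncompiled():
    return [re.search(r"(\d+)b", name).group(1) for name in NAMES]


def precompiled():
    return [_COMPILED.search(name).group(1) for name in NAMES]


def profile(func, top=5):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    for func in (uncompiled, precompiled):
        print(func.__name__, timeit.timeit(func, number=200))
    print(profile(uncompiled))
```

Timings vary by machine and Python version, so the comparison is worth repeating on the deployment hardware before committing to a change.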