
Rough runtime benchmarks across tasks and context lengths on any hardware setup #76

@girishbalaji

Description


Does anyone have rough numbers, on any hardware setup, for the inference runtime of standard models across tasks and various context lengths?

Using vLLM, for just 5 requests at the 131072 context length on NIAH_single_1, I'm currently seeing ~15 minutes on a single A100 for Llama 3.1 8B.
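
For concreteness, the rough shape of a timing harness for this looks like the sketch below. Treat it as a minimal sketch: the model id, the memory/length knobs, and the stand-in prompts are illustrative assumptions, not the exact NIAH_single_1 inputs.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed HF model id; swap in yours
    max_model_len=131072,                      # full 128K context window
    gpu_memory_utilization=0.95,               # assumption: tune for your GPU
    tensor_parallel_size=1,                    # 1 = single A100
)

# NIAH answers are short, so a small generation budget is enough.
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# Stand-in prompts: a real run would use the 5 NIAH_single_1 prompts.
# The actual token count of this filler depends on the tokenizer.
prompts = ["haystack " * 30_000] * 5

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start
print(f"{len(prompts)} requests in {elapsed / 60:.1f} min")
```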

While I actively experiment with parallelism configs (see the sketch below), I'm wondering whether other folks have seen rough order-of-magnitude runtime numbers across various tasks, context lengths, and hardware setups.
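
The parallelism knobs I mean are the standard vLLM engine arguments; a multi-GPU variant of the harness above might look like this (values are illustrative and require more than one GPU):

```python
from vllm import LLM

# Same harness as above, but sharded across GPUs (illustrative values).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,
    tensor_parallel_size=2,    # shard each layer's weights across 2 GPUs
    pipeline_parallel_size=1,  # keep all layers in a single pipeline stage
)
```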
