| Batch Size | Seq Length | New Tokens | Torch Compile | Implementation | Mean Generation Latency (ms) | Mean Prefill Latency (ms) | Mean Decode Latency (ms) | Peak Mem (MB) |
|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 16 | False | eager | 1,452 | 480.2 | 971.81 | 27,425.16 |
| 1 | 128 | 16 | max-autotune-no-cudagraphs | eager | 1,620.98 | 498 | 1,122.99 | 27,410.34 |
| 1 | 128 | 16 | False | grouped_mm | 850.87 | 76.51 | 774.35 | 27,425.4 |
| 1 | 128 | 16 | max-autotune-no-cudagraphs | grouped_mm | 492.87 | 76.79 | 416.08 | 27,425.4 |
| 1 | 128 | 16 | False | batched_mm | 815.11 | 316.56 | 498.56 | 35,866.47 |
| 1 | 128 | 16 | max-autotune | batched_mm | 412.98 | 316.33 | 96.65 | 35,866.48 |
| 1 | 128 | 16 | False | grouped_prefill+batched_decode | 588.87 | 77.15 | 511.72 | 27,470.51 |
| 1 | 128 | 16 | max-autotune | grouped_prefill+batched_decode | 181.79 | 76.78 | 105.01 | 27,424.49 |
| 1 | 128 | 64 | False | eager | 4,524.16 | 486.84 | 4,037.31 | 27,418.44 |
| 1 | 128 | 64 | max-autotune-no-cudagraphs | eager | 5,116.71 | 477.42 | 4,639.29 | 27,419.62 |
| 1 | 128 | 64 | False | grouped_mm | 3,327.45 | 76.46 | 3,250.98 | 27,434.67 |
| 1 | 128 | 64 | max-autotune-no-cudagraphs | grouped_mm | 1,824.7 | 76.55 | 1,748.15 | 27,433.49 |
| 1 | 128 | 64 | False | batched_mm | 2,411.23 | 316.18 | 2,095.05 | 35,875.48 |
| 1 | 128 | 64 | max-autotune | batched_mm | 707.73 | 316.24 | 391.49 | 35,875.48 |
| 1 | 128 | 64 | False | grouped_prefill+batched_decode | 2,219.34 | 76.89 | 2,142.45 | 27,479.51 |
| 1 | 128 | 64 | max-autotune | grouped_prefill+batched_decode | 489.06 | 76.88 | 412.18 | 27,433.5 |
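
As a reading aid, the following minimal Python sketch shows one way to compare the configurations. It copies the eight rows for 16 new tokens from the table and computes each row's speedup over the uncompiled `eager` row. The choice of baseline, the helper names `speedups` and `max_sum_gap`, and the check that generation latency is close to prefill plus decode latency are illustrative assumptions drawn from the numbers themselves, not something the dataset documents.

```python
# Rows copied from the table (batch size 1, sequence length 128, 16 new tokens).
# Tuple layout: (torch_compile, implementation, generation_ms, prefill_ms, decode_ms)
ROWS = [
    ("False", "eager", 1452.0, 480.2, 971.81),
    ("max-autotune-no-cudagraphs", "eager", 1620.98, 498.0, 1122.99),
    ("False", "grouped_mm", 850.87, 76.51, 774.35),
    ("max-autotune-no-cudagraphs", "grouped_mm", 492.87, 76.79, 416.08),
    ("False", "batched_mm", 815.11, 316.56, 498.56),
    ("max-autotune", "batched_mm", 412.98, 316.33, 96.65),
    ("False", "grouped_prefill+batched_decode", 588.87, 77.15, 511.72),
    ("max-autotune", "grouped_prefill+batched_decode", 181.79, 76.78, 105.01),
]


def speedups(rows, baseline=("False", "eager")):
    """Return generation-latency speedup of every row over the baseline row.

    The baseline (uncompiled eager) is an assumption made for illustration.
    """
    base_ms = next(r[2] for r in rows if (r[0], r[1]) == baseline)
    return {(r[0], r[1]): base_ms / r[2] for r in rows}


def max_sum_gap(rows):
    """Largest absolute difference between generation and prefill + decode."""
    return max(abs(r[2] - (r[3] + r[4])) for r in rows)


if __name__ == "__main__":
    # Print configurations from fastest to slowest relative to the baseline.
    for (mode, impl), s in sorted(speedups(ROWS).items(), key=lambda kv: -kv[1]):
        print(f"{impl:32s} compile={mode:28s} speedup={s:.2f}x")
    print(f"max |generation - (prefill + decode)| = {max_sum_gap(ROWS):.2f} ms")
```

For these rows, the reported generation latency matches the sum of prefill and decode latency to within a few hundredths of a millisecond, and the compiled `grouped_prefill+batched_decode` row comes out at roughly 8x the uncompiled `eager` baseline.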