vllm.model_executor.layers.quantization.utils.mxfp8_utils ¶
Mxfp8LinearOp ¶
This class executes an MXFP8 linear layer.
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
apply ¶
apply(
input: Tensor,
weight: Tensor,
weight_scale: Tensor,
out_dtype: dtype,
bias: Tensor | None = None,
) -> Tensor
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
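As a rough illustration of what an MXFP8 linear op computes, here is a dependency-free sketch: both input and weight are held as quantized elements plus one shared power-of-two scale per block along the reduction dimension, and the matmul folds the scales back in. The block size of 32, the scale layout, and the function name are assumptions for illustration, not vLLM's kernel.

```python
def mxfp8_linear(q_in, in_scales, q_w, w_scales, bias=None, block=32):
    # q_in: M x K quantized input elements (plain floats here for clarity)
    # in_scales: M x (K // block) per-block scales for the input
    # q_w: N x K quantized weight elements
    # w_scales: N x (K // block) per-block scales for the weight
    M, K, N = len(q_in), len(q_in[0]), len(q_w)
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                # Dequantize on the fly: element times its block's scale.
                a = q_in[m][k] * in_scales[m][k // block]
                b = q_w[n][k] * w_scales[n][k // block]
                acc += a * b
            out[m][n] = acc + (bias[n] if bias is not None else 0.0)
    return out
```

A real kernel accumulates in higher precision and applies the scales per block rather than per element, but the arithmetic is equivalent.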
_cast_mxfp8_scales_to_bf16 ¶
Cast MXFP8 scales from uint8 to BF16.

The scales are stored in uint8 format and need to be converted to BF16 by left-shifting by 7 bits (to form the exponent) and reinterpreting the result as bfloat16.

Args:
    scales: uint8 tensor containing MXFP8 scales

Returns:
    BF16 tensor with the converted scales
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
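The bit trick above can be checked without torch: bfloat16 is exactly the top 16 bits of an IEEE-754 float32, so shifting the uint8 exponent into bits 7..14 and padding with 16 zero bits yields the decoded value. A stdlib-only sketch; the helper name is illustrative:

```python
import struct

def e8m0_scale_to_float(e: int) -> float:
    # Place the 8-bit exponent into the bfloat16 exponent field
    # (bits 7..14, with sign = 0 and mantissa = 0), then widen to
    # float32 by appending 16 zero bits: bfloat16 is the top half
    # of an IEEE-754 float32.
    bits16 = (e & 0xFF) << 7
    bits32 = bits16 << 16
    return struct.unpack(">f", bits32.to_bytes(4, "big"))[0]
```

A biased exponent of 127 decodes to 2^0 = 1.0, so the shift reproduces the usual `2**(e - 127)` decoding without any floating-point math.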
dequant_mxfp8_to_bf16 ¶
Dequantize an MXFP8 tensor to BF16.

Args:
    x: FP8 E4M3 tensor to dequantize
    scales: uint8 tensor containing MXFP8 scales

Returns:
    BF16 dequantized tensor
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
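Dequantization itself is a per-block multiply: each contiguous group of elements shares one decoded scale. A minimal sketch assuming the standard MX block size of 32, with plain floats standing in for FP8 E4M3 values:

```python
def dequant_blocks(values, scales, block=32):
    # values: flat list of quantized elements (stand-ins for FP8 E4M3)
    # scales: one already-decoded scale per block of `block` elements
    assert len(values) == len(scales) * block
    return [v * scales[i // block] for i, v in enumerate(values)]
```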
mxfp8_e4m3_quantize ¶
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
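No docstring is attached here, but the MX convention (per the OCP Microscaling spec) is to derive one shared power-of-two scale per block from the block's largest magnitude and the element format's largest exponent, which is 8 for E4M3. A stdlib-only sketch of that scale selection; the helper name and the exact rounding choice are assumptions, not vLLM's kernel:

```python
import math

def mx_block_scale(block_vals, elem_emax=8):
    # One shared E8M0 (power-of-two) scale per block: pick the power
    # of two that maps the block's largest magnitude into the element
    # format's representable range (emax = 8 for FP8 E4M3).
    amax = max(abs(v) for v in block_vals)
    if amax == 0.0:
        return 1.0
    return 2.0 ** (math.floor(math.log2(amax)) - elem_emax)
```

Each element is then divided by this scale before being rounded to FP8 E4M3, and the scale's biased exponent is stored as the uint8 the decode helpers above consume.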
mxfp8_e4m3_quantize_fake ¶
Fake implementation for torch.compile tracing. Returns empty tensors with the correct shapes and dtypes.
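Fake (meta) implementations let torch.compile trace a custom op using only shapes and dtypes, never real data. The contract can be sketched without torch; the output layout here (one uint8 scale per 32-element block along the last dimension, FP8 E4M3 elements) is an assumption for illustration:

```python
def mxfp8_quantize_fake_shapes(shape, block=32):
    # A fake impl never computes values: it only reports what the real
    # kernel would return. Here: a quantized tensor with the same shape
    # as the input, plus one scale per block along the last dimension.
    *lead, k = shape
    assert k % block == 0, "last dim must be a multiple of the block size"
    q_meta = (tuple(shape), "float8_e4m3fn")
    s_meta = (tuple(lead) + (k // block,), "uint8")
    return q_meta, s_meta
```

In torch itself, such a function would be registered against the real op (e.g. via `torch.library.register_fake`) and return empty tensors with these shapes and dtypes.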