TensorFusion Docs

Terminology

This glossary explains the key terms used across the TensorFusion docs.

Basic concepts

  • TFLOPS: Trillions of floating-point operations per second; the core unit for compute allocation and scheduling. The system standardizes on dense FP16 TFLOPS.
  • VRAM: GPU/NPU memory (often referred to as GPU Mem). The system uses MiB as the minimum unit for accounting, allocation, and scheduling.
  • vGPU: A software-defined virtual GPU created by isolating and limiting GPU/NPU resources. From the application perspective, a vGPU behaves like a physical GPU.
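The two resource units above can be made concrete with a small sketch. This is illustrative only (the helper names are not part of TensorFusion's API): it rounds a byte count up to whole MiB, the minimum accounting unit, and computes TFLOPS from an operation count and elapsed time.

```python
import math

MIB = 1 << 20  # 1 MiB = 2**20 bytes

def vram_mib(n_bytes: int) -> int:
    """Round a byte count up to whole MiB, the minimum VRAM accounting unit."""
    return math.ceil(n_bytes / MIB)

def tflops(ops: float, seconds: float) -> float:
    """Trillions of floating-point operations per second."""
    return ops / seconds / 1e12

print(vram_mib(7 * (1 << 30)))  # 7 GiB of weights -> 7168 MiB
print(tflops(1e13, 0.1))        # 1e13 ops in 0.1 s -> 100.0 TFLOPS
```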

Model and inference terms

  • FP16: 16-bit floating-point precision (half precision), widely used for training and inference.
  • BF16 (BFloat16): 16-bit floating-point precision with the same exponent range as FP32, trading mantissa precision for better numerical stability in training.
  • INT8: 8-bit integer precision, commonly used for inference acceleration and lower memory usage via quantization.
  • KV Cache: The cache of attention keys/values used to speed up long-context or multi-turn inference; its size grows linearly with sequence length and batch size.
  • MoE: Mixture of Experts architecture that sparsely activates expert networks to scale parameters efficiently.
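A rough KV cache size estimate ties these terms together: keys and values are stored for every layer and token, so memory scales with sequence length, batch size, and the precision's bytes per element. A minimal sketch, using illustrative Llama-2-7B-like shapes (not figures from these docs):

```python
def kv_cache_mib(seq_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, dtype_bytes: int = 2, batch: int = 1) -> float:
    """Per-request KV cache size in MiB: keys + values for every layer/token."""
    n_bytes = 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes * batch
    return n_bytes / (1 << 20)

# Illustrative shapes: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes)
print(kv_cache_mib(4096, 32, 32, 128))  # 2048.0 MiB at a 4096-token context
```

Halving `dtype_bytes` (e.g. an INT8-quantized cache) halves the footprint, which is why precision choice matters for long-context serving.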
