
Workload Configuration

This doc explains how to allocate vGPU resources for your AI workloads using annotations and WorkloadProfile custom resources.

Add Pod Annotations

Add the following annotations to your Pod metadata to configure GPU workload requirements.

Note: If multiple containers in a Pod need GPUs, provide per-container GPU counts. Use tensor-fusion.ai/container-gpu-count to specify the GPU count for each container. tensor-fusion.ai/container-gpus is written back by the system to record the GPU IDs assigned to each container and should not normally be set manually.
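As a sketch, a two-container Pod could declare per-container counts like this (the container names `trainer` and `preprocessor` are illustrative, not from the docs):

```yaml
metadata:
  annotations:
    # Hypothetical container names; replace with your own.
    tensor-fusion.ai/inject-container: 'trainer,preprocessor'
    # JSON map of container name to GPU count per container.
    tensor-fusion.ai/container-gpu-count: '{"trainer":2,"preprocessor":1}'
```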

Enable TensorFusion via Label

tensor-fusion.ai/enabled is a label (not an annotation) used to explicitly enable or disable TensorFusion. It is typically set under metadata.labels:

metadata:
  labels:
    tensor-fusion.ai/enabled: "true"

Relationship with tensor-fusion.ai/enabled-replicas: enabled turns TensorFusion on or off at the workload level; enabled-replicas controls how many replicas are enabled during a canary rollout once it is on. When enabled is false, enabled-replicas has no effect.
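For example, a Deployment with several replicas could enable TensorFusion on a single canary replica first (a minimal sketch; the replica value is illustrative):

```yaml
metadata:
  labels:
    tensor-fusion.ai/enabled: "true"
  annotations:
    # Enable TensorFusion on only 1 replica first; raise this value
    # (or remove the annotation) to roll out to all replicas.
    tensor-fusion.ai/enabled-replicas: '1'
```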

Annotation Reference

Basic Annotations

| Annotation | Description | Example Value |
| --- | --- | --- |
| tensor-fusion.ai/tflops-request | Requested TFLOPs (FP16) per vGPU worker, per GPU device | '10' |
| tensor-fusion.ai/vram-request | Requested VRAM (video memory / frame buffer) per vGPU worker, per GPU device | 4Gi |
| tensor-fusion.ai/tflops-limit | Maximum TFLOPs (FP16) allowed per vGPU worker, per GPU device | '20' |
| tensor-fusion.ai/vram-limit | Maximum VRAM (video memory / frame buffer) allowed per vGPU worker, per GPU device | 4Gi |
| tensor-fusion.ai/inject-container | Container to inject GPU resources into; comma-separated for multiple containers | python |
| tensor-fusion.ai/qos | Quality-of-service level; one of low, medium, high, critical | medium |
| tensor-fusion.ai/is-local-gpu | Schedule the workload to the same GPU server that runs its vGPU worker for best performance; defaults to false | 'true' |
| tensor-fusion.ai/gpu-count | Requested GPU device count. Each vGPU worker maps to N physical GPU devices, and VRAM/TFLOPs consumption is scaled by this field; defaults to 1. Your AI workload sees devices starting from cuda:0 | '4' |
| tensor-fusion.ai/gpupool | Specifies the target GPU pool | default-pool |
| tensor-fusion.ai/vendor | Specifies the GPU/NPU vendor; one of NVIDIA, AMD, Ascend, Intel, Hygon, MetaX, MThreads, Cambricon, Enflame, Qualcomm, Cerebras, AWSNeuron, Google | NVIDIA |
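To illustrate the gpu-count scaling described above: with the hypothetical values below, each vGPU worker spans two physical GPUs, so per-device requests are doubled at the worker level:

```yaml
metadata:
  annotations:
    tensor-fusion.ai/gpu-count: '2'       # each vGPU worker maps to 2 physical GPUs
    tensor-fusion.ai/tflops-request: '10' # per GPU device, so 20 TFLOPs total per worker
    tensor-fusion.ai/vram-request: 4Gi    # per GPU device, so 8Gi total per worker
```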

Advanced Annotations

| Annotation | Description | Example Value |
| --- | --- | --- |
| tensor-fusion.ai/gpu-model | Specifies the GPU/NPU model, e.g. A100, H100, L4, L40s | H100 |
| tensor-fusion.ai/dedicated-gpu | Use together with the tensor-fusion.ai/gpu-model annotation; occupies the whole GPU for this workload | 'true' |
| tensor-fusion.ai/isolation | Isolation mode; one of shared, soft, hard, partitioned | soft |
| tensor-fusion.ai/compute-percent-request | Compute resource request as a percentage (0-100); mutually exclusive with the TFLOPs request, set only one of them | '100' |
| tensor-fusion.ai/compute-percent-limit | Compute resource limit as a percentage (0-100); mutually exclusive with the TFLOPs limit, set only one of them | '100' |
| tensor-fusion.ai/gpu-indices | Restricts scheduling to the given GPU device indices (0-N); comma-separated when requesting multiple cards | '0,1' |
| tensor-fusion.ai/container-gpu-count | Per-container GPU counts for multi-container Pods (JSON map of container name to GPU count). If not set, containers share the same GPUs | '{"container-a":1,"container-b":2}' |
| tensor-fusion.ai/container-gpus | Container-to-GPU ID mapping (JSON). Written back by the system after scheduling to record the GPU IDs per container | '{"container-a":["gpu-1","gpu-2"],"container-b":["gpu-3"]}' |
| tensor-fusion.ai/workload | TensorFusionWorkload name; if it exists, Pods will share the same vGPU workers | pytorch-example |
| tensor-fusion.ai/workload-profile | References a WorkloadProfile to reuse pre-defined parameters | default-profile |
| tensor-fusion.ai/enabled-replicas | Set to any number less than or equal to the ReplicaSet replicas, for canary releases of TensorFusion | '1', '42' |
| tensor-fusion.ai/autoscale | Enables autoscaling for this workload (GPU resources are set automatically based on historical usage) | 'true' |
| tensor-fusion.ai/autoscale-target | Target resource to autoscale: compute (compute-percent/TFLOPs only), vram (VRAM only), or all (both) | 'all' |
| tensor-fusion.ai/standalone-worker-mode | Only meaningful when is-local-gpu is true. When set to false, the vGPU worker is injected as an init container rather than running standalone, for best performance; the trade-off is that the user might bypass the vGPU worker and use the physical GPU directly. Invalid when is-local-gpu is false | 'true' |
| tensor-fusion.ai/disable-features | Kill switch to partially disable TensorFusion built-in features; comma-separated for multiple features | 'gpu-limiter,gpu-opt,mem-manager' |
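For instance, to pin a workload to whole GPUs of a specific model, gpu-model and dedicated-gpu can be combined (a sketch; the model value is illustrative and must exist in your pool):

```yaml
metadata:
  annotations:
    tensor-fusion.ai/gpu-model: H100        # restrict scheduling to this model
    tensor-fusion.ai/dedicated-gpu: 'true'  # occupy the whole GPU; use with gpu-model
```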

Example Config

kind: Deployment
apiVersion: apps/v1
metadata: {}
spec:
  template:
    metadata:
      labels:
        tensor-fusion.ai/enabled: "true"
      annotations:
        tensor-fusion.ai/inject-container: python # comma-separated if multiple containers use GPUs
        tensor-fusion.ai/tflops-limit: '20'
        tensor-fusion.ai/tflops-request: '10'
        tensor-fusion.ai/vram-limit: 4Gi
        tensor-fusion.ai/vram-request: 4Gi
        tensor-fusion.ai/qos: medium
        tensor-fusion.ai/workload-profile: default-profile # the WorkloadProfile acts as a lower-priority template
        tensor-fusion.ai/is-local-gpu: 'true'
        tensor-fusion.ai/gpu-count: '1' # number of GPU devices per vGPU worker
    spec: {}

Configure WorkloadProfile Custom Resource

For advanced features like auto-scaling, create a WorkloadProfile custom resource and reference it in your Pod annotations.

apiVersion: tensor-fusion.ai/v1
kind: WorkloadProfile
metadata:
  name: example-workload-profile
  namespace: same-namespace-as-your-workload
spec:
  # Specify AI computing resources needed
  resources:
    requests:
      tflops: "5"
      vram: "3Gi"
    limits:
      tflops: "15"
      vram: "3Gi"
  # Specify the number of vGPU workers, usually the same as Deployment replicas
  replicas: 1

  # Schedule the workload to the same GPU server that runs GPU worker for best performance
  isLocalGPU: true

  # Specify pool name (optional)
  poolName: default-pool

  # Specify QoS level (defaults to medium)
  qos: medium

  # Specify the number of GPU devices per vGPU worker (optional, default to 1)
  gpuCount: 1

  # Specify the GPU/NPU model (optional)
  gpuModel: A100

  # Auto-scaling configuration options (optional)
  autoScalingConfig: {}

Then reference this profile in your Pod annotation:

tensor-fusion.ai/workload-profile: example-workload-profile
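Pod annotations can still be set alongside the profile reference; since the profile acts as a lower-priority template, a per-Pod annotation overrides the corresponding profile value (a sketch, values illustrative):

```yaml
metadata:
  annotations:
    tensor-fusion.ai/workload-profile: example-workload-profile
    tensor-fusion.ai/vram-limit: 6Gi  # overrides the 3Gi limit defined in the profile
```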

For more details on WorkloadProfile schema, see the WorkloadProfile Schema Reference.
