# Workload Configuration
This doc explains how to allocate vGPU resources for your AI workloads using annotations and WorkloadProfile custom resources.
## Add Pod Annotations
Add the following annotations to your Pod metadata to configure GPU workload requirements.
### Annotation Reference

#### Basic Annotations
| Annotation | Description | Example Value |
|---|---|---|
| `tensor-fusion.ai/tflops-request` | Requested TFLOPs (FP16) per vGPU worker per GPU device | `'10'` |
| `tensor-fusion.ai/vram-request` | Requested VRAM (video memory / frame buffer) per vGPU worker per GPU device | `4Gi` |
| `tensor-fusion.ai/tflops-limit` | Maximum TFLOPs (FP16) allowed per vGPU worker per GPU device | `'20'` |
| `tensor-fusion.ai/vram-limit` | Maximum VRAM (video memory / frame buffer) allowed per vGPU worker per GPU device | `4Gi` |
| `tensor-fusion.ai/inject-container` | Container to inject GPU resources into; comma-separated for multiple containers | `python` |
| `tensor-fusion.ai/qos` | Quality of service level | `low` / `medium` / `high` / `critical` |
| `tensor-fusion.ai/is-local-gpu` | Schedule the workload onto the same GPU server that runs its vGPU worker for best performance; defaults to `false` | `'true'` |
| `tensor-fusion.ai/gpu-count` | Requested GPU device count; each vGPU worker maps to this many physical GPU devices, and VRAM/TFLOPs consumption is scaled by this field; defaults to `1`; your AI workload sees CUDA devices starting from `cuda:0` | `'4'` |
| `tensor-fusion.ai/gpupool` | Specifies the target GPU pool | `default-pool` |
| `tensor-fusion.ai/vendor` | Specifies the GPU/NPU vendor: NVIDIA, AMD, Ascend, Intel, Hygon, MetaX, MThreads, Cambricon, Enflame, Qualcomm, Cerebras, AWSNeuron, Google | `NVIDIA` |
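For example, the basic annotations go directly on the Pod metadata. Below is a minimal sketch of a bare Pod; the Pod name, image, and command are placeholders for illustration only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: basic-gpu-pod                    # hypothetical name for illustration
  labels:
    tensor-fusion.ai/enabled: "true"
  annotations:
    tensor-fusion.ai/inject-container: python
    tensor-fusion.ai/tflops-request: '10'
    tensor-fusion.ai/tflops-limit: '20'
    tensor-fusion.ai/vram-request: 4Gi
    tensor-fusion.ai/vram-limit: 4Gi
    tensor-fusion.ai/qos: medium
    tensor-fusion.ai/gpupool: default-pool
spec:
  containers:
    - name: python                       # must match the inject-container annotation
      image: python:3.11                 # placeholder image
      command: ["python", "train.py"]    # placeholder command
```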
#### Advanced Annotations
| Annotation | Description | Example Value |
|---|---|---|
| `tensor-fusion.ai/gpu-model` | Specifies the GPU/NPU model | `A100` / `H100` / `L4` / `L40s` |
| `tensor-fusion.ai/dedicated-gpu` | Use along with the `tensor-fusion.ai/gpu-model` annotation; occupies whole GPUs for this workload | `'true'` |
| `tensor-fusion.ai/isolation` | Isolation mode; one of `shared`, `soft`, `hard`, `partitioned` | `soft` |
| `tensor-fusion.ai/compute-percent-request` | Compute resource request as a percentage, range 0-100; mutually exclusive with the TFLOPs request, only one of them should be set | `'100'` |
| `tensor-fusion.ai/compute-percent-limit` | Compute resource limit as a percentage, range 0-100; mutually exclusive with the TFLOPs limit, only one of them should be set | `'100'` |
| `tensor-fusion.ai/gpu-indices` | Specifies GPU device indices, range 0-N, to limit the range of schedulable devices; comma-separated when requesting multiple cards | `'0,1'` |
| `tensor-fusion.ai/workload` | TensorFusionWorkload name; if it exists, Pods will share the same vGPU workers | `pytorch-example` |
| `tensor-fusion.ai/workload-profile` | References a WorkloadProfile to reuse pre-defined parameters | `default-profile` |
| `tensor-fusion.ai/enabled-replicas` | Set to any number less than or equal to the ReplicaSet replicas, for gray-releasing TensorFusion | `'1'`, `'42'` |
| `tensor-fusion.ai/auto-requests` | Automatically set VRAM and/or TFLOPs requests based on the workload's historical metrics; for detailed settings use the WorkloadProfile custom resource | `'true'` |
| `tensor-fusion.ai/auto-limits` | Automatically set VRAM and/or TFLOPs limits based on the workload's historical metrics; for detailed settings use the WorkloadProfile custom resource | `'true'` |
| `tensor-fusion.ai/auto-replicas` | Automatically set vGPU worker replicas based on the workload's historical metrics; for detailed settings use the WorkloadProfile custom resource | `'true'` |
| `tensor-fusion.ai/standalone-worker-mode` | Only relevant when `is-local-gpu` is `true`: if this option is `false`, the vGPU worker is injected as an init container instead of running as a standalone worker, to achieve the best performance; the trade-off is that the workload may bypass the vGPU worker and use the physical GPU directly. When `is-local-gpu` is `false`, this option has no effect | `'true'` |
| `tensor-fusion.ai/disable-features` | Kill switch to partially disable TensorFusion built-in features; comma-separated for multiple features | `'gpu-limiter,gpu-opt,mem-manager'` |
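As an illustration, advanced annotations can be layered on top of the basic ones in the same Pod metadata, for example to pin a workload to whole GPUs of a given model at specific device indices while gray-releasing TensorFusion and disabling selected features. This is a sketch only; which annotations you combine depends on your workload, and the values below are placeholders:

```yaml
annotations:
  tensor-fusion.ai/gpu-model: H100
  tensor-fusion.ai/dedicated-gpu: 'true'       # occupy whole GPUs of the specified model
  tensor-fusion.ai/gpu-indices: '0,1'          # only schedule onto device indices 0 and 1
  tensor-fusion.ai/enabled-replicas: '1'       # gray release: enable TensorFusion on one replica first
  tensor-fusion.ai/disable-features: 'gpu-limiter,gpu-opt'
```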
### Example Config
```yaml
kind: Deployment
apiVersion: apps/v1
metadata: {}
spec:
  template:
    metadata:
      labels:
        tensor-fusion.ai/enabled: "true"
      annotations:
        tensor-fusion.ai/inject-container: python # comma-separated if multiple containers use the GPU
        tensor-fusion.ai/tflops-limit: '20'
        tensor-fusion.ai/tflops-request: '10'
        tensor-fusion.ai/vram-limit: 4Gi
        tensor-fusion.ai/vram-request: 4Gi
        tensor-fusion.ai/qos: medium
        tensor-fusion.ai/workload-profile: default-profile # the referenced WorkloadProfile acts as a lower-priority template
        tensor-fusion.ai/is-local-gpu: 'true'
        tensor-fusion.ai/gpu-count: '1' # GPU device count per TensorFusion worker
    spec: {}
```

## Configure WorkloadProfile Custom Resource
For advanced features like auto-scaling, create a WorkloadProfile custom resource and reference it in your Pod annotations.
```yaml
apiVersion: tensor-fusion.ai/v1
kind: WorkloadProfile
metadata:
  name: example-workload-profile
  namespace: same-namespace-as-your-workload
spec:
  # Specify AI computing resources needed
  resources:
    requests:
      tflops: "5"
      vram: "3Gi"
    limits:
      tflops: "15"
      vram: "3Gi"
  # Specify the number of vGPU workers, usually the same as the Deployment replicas
  replicas: 1
  # Schedule the workload to the same GPU server that runs the vGPU worker for best performance
  isLocalGPU: true
  # Specify the pool name (optional)
  poolName: default-pool
  # Specify the QoS level (defaults to medium)
  qos: medium
  # Specify the number of GPU devices per vGPU worker (optional, defaults to 1)
  gpuCount: 1
  # Specify the GPU/NPU model (optional)
  gpuModel: A100
  # Auto-scaling configuration options (optional)
  autoScalingConfig: {}
```

Then reference this profile in your Pod annotations:
```yaml
tensor-fusion.ai/workload-profile: example-workload-profile
```

For more details on the WorkloadProfile schema, see the WorkloadProfile Schema Reference.
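For instance, a Deployment can reference the profile and still override individual values inline, since the referenced WorkloadProfile acts as a lower-priority template. The sketch below assumes that behavior; the Deployment name and image are placeholders:

```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: pytorch-example                  # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-example
  template:
    metadata:
      labels:
        app: pytorch-example
        tensor-fusion.ai/enabled: "true"
      annotations:
        tensor-fusion.ai/inject-container: python
        tensor-fusion.ai/workload-profile: example-workload-profile
        # inline annotations take precedence over the profile, e.g. raise the TFLOPs limit for this Deployment only
        tensor-fusion.ai/tflops-limit: '20'
    spec:
      containers:
        - name: python
          image: python:3.11             # placeholder image
```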