TensorFusion Docs

Create Workload

Create a workload for your AI application using TensorFusion vGPU

Step 1. Analyze Computing Resource Requirements and QoS Level

Calculate Initial Resource Requests

You can use TensorFusion cloud to get resource recommendations, or estimate the TFLOPs/VRAM yourself with the following rules of thumb:

VRAM:

  • Inference at FP8 precision needs about 1GiB of VRAM per 1B model parameters
  • For LLMs, each 1K tokens of context window adds about 1GiB of VRAM per concurrent user
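The rules of thumb above can be turned into a quick back-of-the-envelope calculation. This is a rough sketch (the helper name and the 20% safety margin are illustrative assumptions, not part of TensorFusion):

```python
def estimate_vram_gib(params_billions: float, context_k_tokens: float,
                      concurrent_users: int, headroom: float = 1.2) -> float:
    """Rough VRAM estimate from the rules of thumb:
    ~1 GiB per 1B parameters at FP8, plus ~1 GiB per 1K context tokens
    per concurrent user, with an illustrative 20% safety margin."""
    weights_gib = params_billions * 1.0
    kv_cache_gib = context_k_tokens * 1.0 * concurrent_users
    return (weights_gib + kv_cache_gib) * headroom

# e.g. an 8B-parameter model at FP8 serving 4 users with an 8K context window:
print(estimate_vram_gib(8, 8, 4))  # → 48.0 (GiB): (8 + 8*4) * 1.2
```

Round the result up to the nearest value your GPUs can actually provide before setting `tensor-fusion.ai/vram-request`.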

TFLOPs estimation is harder: training and inference frameworks differ widely, and so do model architectures. One practical approach is to run a representative case on a single GPU, monitor GPU utilization, scale the measured TFLOPs linearly to the expected number of users or dataset size, and then adjust that value or enable TFLOPs auto-scaling.
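The linear extrapolation described above is simple arithmetic; a minimal sketch (function name and the 20% headroom are assumptions for illustration):

```python
def extrapolate_tflops(observed_tflops: float, baseline_users: int,
                       target_users: int, headroom: float = 1.2) -> float:
    """Scale a single-GPU TFLOPs measurement linearly to the target load,
    then add headroom to absorb estimation error before auto-scaling kicks in."""
    return observed_tflops * (target_users / baseline_users) * headroom

# e.g. 5 TFLOPs measured at 10 users, extrapolated to 40 users:
print(extrapolate_tflops(5.0, 10, 40))  # → 24.0
```

Use the result as your initial `tensor-fusion.ai/tflops-request`, then refine it from observed utilization.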

Refer to: Common GPU Information

Choose QoS Levels

  • low: Best for training and labs. Ensures capacity but not latency. Accumulates credits for bursts when GPUs are available. VRAM cools down quickly.
  • medium: Ideal for offline tasks like embedding. Ensures capacity with bursts, preempting low QoS tasks. No latency guarantee. VRAM cools down moderately.
  • high: Suited for non-latency-sensitive online tasks like inference. Ensures capacity, preempts medium QoS tasks. VRAM stays at requested levels.
  • critical: For real-time, latency-critical tasks like live translation. Ensures capacity and low latency, preempts most tasks. VRAM remains at requested levels.

Step 2. Create Workload with Annotations

Add Pod Annotations

tensor-fusion.ai/inject-container: python
tensor-fusion.ai/tflops-limit: '20'
tensor-fusion.ai/tflops-request: '10'
tensor-fusion.ai/vram-limit: 4Gi
tensor-fusion.ai/vram-request: 4Gi
tensor-fusion.ai/qos: medium
tensor-fusion.ai/gpu-count: '1'
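Putting these annotations together, a Pod spec might look like the following sketch. The pod name, image, and container name are placeholders; the container name must match the `tensor-fusion.ai/inject-container` annotation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-ai-app                              # placeholder name
  labels:
    tensor-fusion.ai/enabled: 'true'
  annotations:
    tensor-fusion.ai/inject-container: python
    tensor-fusion.ai/tflops-request: '10'
    tensor-fusion.ai/tflops-limit: '20'
    tensor-fusion.ai/vram-request: 4Gi
    tensor-fusion.ai/vram-limit: 4Gi
    tensor-fusion.ai/qos: medium
    tensor-fusion.ai/gpu-count: '1'
spec:
  containers:
    - name: python                             # matches inject-container above
      image: python:3.12                       # placeholder image
```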

Use the WorkloadProfile

You can also create a WorkloadProfile and reference it in the Pod annotations, e.g. tensor-fusion.ai/workload-profile: default-profile, to use advanced features.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <...>
spec:
  template:
    metadata:
      labels:
        tensor-fusion.ai/enabled: 'true'
      annotations:
        tensor-fusion.ai/workload-profile: template-for-small-model

See all configuration options in Workload Configuration.

Step 3. Verify the App Status

  1. Check that a new container named inject-lib appears in your pods

  2. Open a shell in the first container, or in the containers specified by the tensor-fusion.ai/inject-container annotation, and run:

nvidia-smi

  3. Verify that:
  • The command runs successfully
  • The GPU memory quota matches your tensor-fusion.ai/vram-limit setting
  • The GPU utilization cap matches your tensor-fusion.ai/tflops-limit setting
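The verification steps above can be run with standard kubectl commands; `<pod-name>` is a placeholder for your pod, and the container name assumes the `tensor-fusion.ai/inject-container: python` annotation from earlier:

```shell
# List container names and confirm inject-lib is among them
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'

# Run nvidia-smi inside the container named in inject-container
kubectl exec -it <pod-name> -c python -- nvidia-smi
```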
