Helm On-premises Install

Deploy to on-premises environments using Helm, without the Cluster Agent and Cloud Console.

Installation

Use this option when you need a fully local installation and do not need the advanced features. Note that with this option you cannot use the TensorFusion Console for centralized management.

Step 1: deploy the TensorFusion cluster with the following Helm command.

helm upgrade --install --create-namespace --namespace tensor-fusion-sys \
  --repo https://download.tensor-fusion.ai --set agent.agentId="" \
  tensor-fusion-sys tensor-fusion

# Note: helm.tensor-fusion.ai is an alternative to download.tensor-fusion.ai; both domains work
helm upgrade --install --create-namespace --namespace tensor-fusion-sys \
  --repo https://download.tensor-fusion.ai \
  --set agent.enrollToken=xxx --set agent.agentId=xxx \
  --set agent.cloudEndpoint=wss://your-own.domain/_ws \
  tensor-fusion-sys tensor-fusion

Step 2: apply the TensorFusion cluster configuration to Kubernetes.

kubectl apply -f https://app.tensor-fusion.ai/tmpl/tf-cluster
kubectl apply -f https://app.tensor-fusion.ai/tmpl/tf-scheduling-config

Step 3: verify that the TensorFusion cluster is ready.

kubectl get pods -n tensor-fusion-sys
# Expected output:
# NAME                                      READY   STATUS    RESTARTS   AGE
# hypervisor-<node-name>                    1/1     Running   0          2m

kubectl get tensorfusionclusters
# Expected output:
# NAME                                  STATUS      AGE
# shared-tensor-fusion-cluster          Ready       2m
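
The readiness check in Step 3 can also be scripted. The following is a minimal sketch, not part of TensorFusion: the helper names `unready_pods` and `get_pods_output` are hypothetical, and the parser assumes the default tabular output of `kubectl get pods` (NAME, READY, STATUS, ... columns).

```python
import subprocess


def unready_pods(kubectl_output: str) -> list[str]:
    """Return names of pods that are not Running with all containers ready.

    Expects the default table printed by `kubectl get pods`.
    """
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    bad = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad


def get_pods_output(namespace: str = "tensor-fusion-sys") -> str:
    """Run the same kubectl command as in Step 3 and return its stdout."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `unready_pods(get_pods_output())` on a machine with cluster access returns an empty list when every listed pod is ready. An empty list is also returned when no pods are listed at all, so check that the hypervisor pods actually appear in the output.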

Next, deploy an application to verify that TensorFusion is working. Apply the following YAML to create a simple PyTorch deployment that uses a TensorFusion remote vGPU.

# simple-pytorch.yaml
# kubectl apply -f simple-pytorch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-example
  namespace: default
  labels:
    app: pytorch-example
    tensor-fusion.ai/enabled: 'true'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-example
  template:
    metadata:
      labels:
        app: pytorch-example
        tensor-fusion.ai/enabled: 'true'
      annotations:
        tensor-fusion.ai/inject-container: python
        tensor-fusion.ai/tflops-limit: '20'
        tensor-fusion.ai/tflops-request: '10'
        tensor-fusion.ai/vram-limit: 4Gi
        tensor-fusion.ai/vram-request: 4Gi
    spec:
      containers:
        - name: python
          image: pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime
          command:
            - sh
            - '-c'
            - sleep 1d
      restartPolicy: Always
      terminationGracePeriodSeconds: 0
      dnsPolicy: ClusterFirst
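
Before applying a manifest like this, it can help to sanity-check the resource annotations. The sketch below assumes, following the usual Kubernetes convention, that a request should not exceed its limit; whether TensorFusion enforces this is not stated here. The helper names `parse_quantity` and `check_requests_within_limits` are hypothetical, and the annotations are passed in as a plain dict rather than read from the YAML file.

```python
import re

# Suffixes of Kubernetes-style quantities (binary and decimal).
_UNITS = {
    "": 1,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(value: str) -> float:
    """Convert a quantity such as '10' or '4Gi' to a plain number."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", value.strip())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized quantity: {value!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def check_requests_within_limits(annotations: dict) -> list[str]:
    """Return one message per resource whose request exceeds its limit."""
    problems = []
    for resource in ("tflops", "vram"):
        request = annotations.get(f"tensor-fusion.ai/{resource}-request")
        limit = annotations.get(f"tensor-fusion.ai/{resource}-limit")
        if request is None or limit is None:
            continue  # nothing to compare
        if parse_quantity(request) > parse_quantity(limit):
            problems.append(f"{resource}: request {request} exceeds limit {limit}")
    return problems
```

For example, a TFLOPs request of `'10'` with a limit of `'20'` and equal VRAM request and limit of `4Gi` yields an empty list.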

You should then see a PyTorch Pod and the corresponding vGPU worker Pod start (don't worry, the worker is very lightweight). Use "kubectl exec" to open a shell in the PyTorch Pod, then run nvidia-smi to see the limited GPU memory and utilization.

nvidia-smi
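
To compare the reported memory with the configured VRAM limit programmatically, the output can be parsed. This is a sketch under assumptions: it relies on the standard `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` options being available inside the Pod (not verified here for the TensorFusion environment), and the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import subprocess


def parse_gpu_memory(csv_output: str) -> list[tuple[int, int]]:
    """Parse 'memory.total, memory.used' rows (in MiB), one tuple per GPU."""
    rows = []
    for line in csv_output.strip().splitlines():
        total, used = (int(part.strip()) for part in line.split(","))
        rows.append((total, used))
    return rows


def query_gpu_memory() -> str:
    """Ask nvidia-smi for total and used memory as header-less CSV."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Inside the Pod, `parse_gpu_memory(query_gpu_memory())` returns the total and used memory per visible GPU, which can be compared against the `vram-limit` annotation (4Gi corresponds to 4096 MiB).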

Finally, run python3 to start a Python REPL and test a simple Google T5 model inference. The following code should translate the English "Hello" to the German "Hallo" within seconds.

from transformers import pipeline
pipe = pipeline("translation_en_to_de", model="google-t5/t5-base", device="cuda:0")
pipe("Hello")

Refer to Deploy and Verify for end-to-end testing by running a PyTorch model inference with a TensorFusion virtual remote GPU.

Uninstall TensorFusion

Run the following command to uninstall all components and custom resources:

# export KUBECONFIG if needed
curl -sfL https://download.tensor-fusion.ai/uninstall.sh | sh -

Issues and Troubleshooting

If your TensorFusion hypervisor Pods are not showing up, check whether your GPU nodes have been labeled with nvidia.com/gpu.present=true:

kubectl get nodes --show-labels | grep nvidia.com/gpu.present=true

# Expected GPU nodes found in output:
# gpu-node-name   Ready   <none>   42h   v1.32.1 beta.kubernetes.io/arch=amd64,...,kubernetes.io/os=linux,nvidia.com/gpu.present=true
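
The same check can be done with an exact label match instead of a substring grep. This is a minimal sketch that assumes the default `kubectl get nodes --show-labels` table, where the comma-separated labels are the last column; the helper name `nodes_with_label` is hypothetical.

```python
def nodes_with_label(kubectl_output: str, label: str) -> list[str]:
    """Return node names whose LABELS column contains exactly `label`."""
    matches = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        labels = fields[-1].split(",")  # labels are the last column
        if label in labels:
            matches.append(fields[0])
    return matches
```

Passing the captured command output and `"nvidia.com/gpu.present=true"` returns the GPU nodes the default selector would find; an empty list indicates that the label is missing.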

To resolve this issue, you can either add the label to your GPU nodes or change the TensorFusionCluster resource to use your own labels to find GPU nodes.

# Use the Helm `initialGpuNodeLabelSelector` parameter to set the env var `INITIAL_GPU_NODE_LABEL_SELECTOR` on tensor-fusion-operator:
helm upgrade --install --create-namespace --namespace tensor-fusion-sys \
  --repo https://download.tensor-fusion.ai --set agent.agentId="" \
  --set initialGpuNodeLabelSelector="your-own-gpu-label-key=value" \
  tensor-fusion-sys tensor-fusion

# Download the cluster template so its node selector can be edited:
curl https://app.tensor-fusion.ai/tmpl/tf-cluster > tf-cluster.yaml

# Edit tf-cluster.yaml
# nodeManagerConfig:
#   nodeSelector:
#    nodeSelectorTerms:
#     - matchExpressions:
#       - key: nvidia.com/gpu.present  # replace with your own GPU node label key
#         operator: In
#         values:
#           - "true"

kubectl apply -f tf-cluster.yaml
