# Create Workload

Create a workload for your AI application using TensorFusion vGPU.
## Step 1. Analyze Computing Resource Requirements and QoS Level

### Calculate Initial Resource Requests
You can use TensorFusion Cloud to get resource recommendations, or estimate the TFLOPs/VRAM yourself as follows:

VRAM:
- Each 1B parameters at FP8 precision for inference needs about 1 GiB of VRAM
- For LLMs, each 1K tokens of context window adds about 1 GiB of VRAM per concurrent user

TFLOPs estimation is more complex: different training and inference frameworks differ greatly, and so do different types of AI models. One practical approach is to run a basic case on a single GPU and monitor the GPU utilization, extrapolate linearly to more users or a larger dataset, and then adjust that value or enable TFLOPs auto-scaling.
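The VRAM rules of thumb above can be turned into a quick back-of-the-envelope calculator. This is only a rough sketch of the estimation method described here; the function name and parameters are illustrative, not part of any TensorFusion API:

```python
def estimate_vram_gib(params_billions: float,
                      context_window_k: float = 0,
                      concurrent_users: int = 0) -> float:
    """Rough VRAM estimate per the rules of thumb above:
    ~1 GiB per 1B parameters at FP8 precision, plus
    ~1 GiB per 1K tokens of context window per concurrent user."""
    weights_gib = params_billions * 1.0
    kv_cache_gib = context_window_k * concurrent_users * 1.0
    return weights_gib + kv_cache_gib

# Example: an 8B model at FP8 serving 4 users with 2K context each
print(estimate_vram_gib(8, context_window_k=2, concurrent_users=4))  # 16.0
```

Treat the result as a starting point for `vram-request`, then refine it against observed usage.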
Reference: Common GPU Information
### Choose QoS Levels
- low: Best for training and labs. Ensures capacity but not latency. Accumulates credits for bursts when GPUs are available. VRAM cools down quickly.
- medium: Ideal for offline tasks like embedding. Ensures capacity with bursts, preempting low QoS tasks. No latency guarantee. VRAM cools down moderately.
- high: Suited for non-latency-sensitive online tasks like inference. Ensures capacity, preempts medium QoS tasks. VRAM stays at requested levels.
- critical: For real-time, latency-critical tasks like live translation. Ensures capacity and low latency, preempts most tasks. VRAM remains at requested levels.
## Step 2. Create Workload with Annotations

### Add Pod Annotations
```yaml
tensor-fusion.ai/inject-container: python
tensor-fusion.ai/tflops-limit: '20'
tensor-fusion.ai/tflops-request: '10'
tensor-fusion.ai/vram-limit: 4Gi
tensor-fusion.ai/vram-request: 4Gi
tensor-fusion.ai/qos: medium
tensor-fusion.ai/gpu-count: '1'
```

### Use the WorkloadProfile
You can also create a WorkloadProfile and reference it in the Pod annotations, e.g. `tensor-fusion.ai/workload-profile: default-profile`, to use advanced features.
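As a rough sketch, a WorkloadProfile could carry the same resource settings as the Pod annotations above. The `apiVersion` and field names below are assumptions for illustration only; see the Workload Configuration reference for the actual schema:

```yaml
# Hypothetical WorkloadProfile sketch -- field names are assumptions,
# not the verified schema.
apiVersion: tensor-fusion.ai/v1
kind: WorkloadProfile
metadata:
  name: default-profile
spec:
  qos: medium
  resources:
    requests:
      tflops: '10'
      vram: 4Gi
    limits:
      tflops: '20'
      vram: 4Gi
```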
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <...>
spec:
  template:
    metadata:
      labels:
        tensor-fusion.ai/enabled: 'true'
      annotations:
        tensor-fusion.ai/workload-profile: template-for-small-model
```

See all configuration options: Workload Configuration
## Step 3. Verify the App Status

1. Check that a new container named `inject-lib` appears in your pods.
2. Execute into the shell of the first container, or the container(s) specified in the `tensor-fusion.ai/inject-container` annotation, and run:

   ```shell
   nvidia-smi
   ```

3. Verify that:
   - The command runs successfully
   - The GPU memory quota has been updated to match your `tensor-fusion.ai/vram-limit` setting
   - The GPU utilization has been updated to match your `tensor-fusion.ai/tflops-limit` setting
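Assuming your Deployment is named `my-app` and the injected container is `python` (both placeholders for your own names), the checks above could be run like this:

```shell
# List the container names in a TensorFusion-enabled pod;
# look for the injected inject-lib container
kubectl get pods -l tensor-fusion.ai/enabled=true \
  -o jsonpath='{.items[0].spec.containers[*].name}'

# Run nvidia-smi inside the container named in the
# tensor-fusion.ai/inject-container annotation
kubectl exec -it deploy/my-app -c python -- nvidia-smi
```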