Command Line Reference

This document provides a comprehensive reference for all TensorFusion command-line interfaces.

Operator & Scheduler CLI

CLI Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `-enable-http2` | Enable HTTP/2 for the metrics and webhook servers | - |
| `-health-probe-bind-address` | Address the health probe endpoint binds to | `:8081` |
| `-kubeconfig` | Path to a kubeconfig file (only required when running out-of-cluster) | - |
| `-leader-elect` | Enable leader election for the controller manager to ensure only one active instance | - |
| `-metrics-bind-address` | Address the metrics endpoint binds to | `0` (disabled) |
| `-metrics-secure` | Serve the metrics endpoint securely via HTTPS (use `--metrics-secure=false` for HTTP) | - |
| `-zap-devel` | Use development mode for logging | `true` |
| `-zap-encoder` | Zap log encoding format (`json` or `console`) | - |
| `-zap-log-level` | Log verbosity level (`debug`, `info`, `error`, or any integer value > 0) | - |
| `-zap-stacktrace-level` | Level at or above which stack traces are captured (`info`, `error`, `panic`) | - |
| `-zap-time-encoding` | Time encoding format (`epoch`, `millis`, `nano`, `iso8601`, `rfc3339`, `rfc3339nano`) | `epoch` |
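As an illustration, these flags would typically be passed to the operator's container in its Deployment. The container name, image placeholder, and flag values below are assumptions for the sketch, not the actual chart output:

```yaml
# Hypothetical Deployment snippet; container name and image are placeholders.
spec:
  containers:
  - name: tensor-fusion-operator     # assumed name
    image: <operator-image>
    args:
    - -leader-elect
    - -health-probe-bind-address=:8081
    - -metrics-bind-address=:8080    # illustrative; default is disabled
    - -zap-encoder=json
    - -zap-log-level=info
```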

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| `INITIAL_GPU_NODE_LABEL_SELECTOR` | Initial label selector for GPU nodes | `nvidia.com/gpu.present=true` |
| `ENABLE_WEBHOOKS` | Enable webhook functionality | `true` |
| `OPERATOR_NAMESPACE` | Namespace the operator runs in | `tensor-fusion-sys` |
| `KUBECONFIG` | Path to a kubeconfig file | `<kubeconfig>` |
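These variables can be supplied in the same container spec as the flags; the values below are illustrative, not mandated:

```yaml
# Hypothetical env section for the operator container.
env:
- name: INITIAL_GPU_NODE_LABEL_SELECTOR
  value: nvidia.com/gpu.present=true
- name: ENABLE_WEBHOOKS
  value: "true"
- name: OPERATOR_NAMESPACE
  value: tensor-fusion-sys
```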

Hypervisor CLI

CLI Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `--sock_path` | Worker Unix socket path | `/tensor-fusion/worker/sock` |
| `--gpu_metrics_file` | GPU metrics file location | `/logs/metrics.log` |
| `--scheduler` | Scheduling policy for multiple processes on a single GPU node (applied when GPU load is high). Options: `FIFO` (simple first-in-first-out), `MLFQ` (multi-level feedback queue) | - |
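A sketch of how these flags might appear in the hypervisor's container spec; the binary path is an assumption:

```yaml
# Hypothetical container command; binary path is a placeholder.
command:
- /tensor-fusion/hypervisor          # assumed binary path
- --sock_path=/tensor-fusion/worker/sock
- --gpu_metrics_file=/logs/metrics.log
- --scheduler=MLFQ                   # or FIFO
```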

Node Discovery CLI

CLI Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| `--hostname` | Custom hostname used to bind the current node to its GPUNode custom resource | `<hostname>` |
| `--gpu-info-config` | Path to the GPU info configuration file | See example below |

GPU Info Config Example

```yaml
- model: RTX5090
  fullModelName: "NVIDIA GeForce RTX 5090"
  vendor: NVIDIA
  costPerHour: 0.65
  fp16TFlops: 419
```

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| `HOSTNAME` | Node hostname | `<hostname>` |
| `KUBECONFIG` | Path to a kubeconfig file | `<kubeconfig>` |
| `NODE_DISCOVERY_REPORT_GPU_NODE` | Name of the GPUNode custom resource to report | `<gpu-node-custom-resource-name>` |
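Putting the flags and variables together, a node-discovery container might look like the following sketch; the binary path and config mount path are assumptions:

```yaml
# Hypothetical container spec; binary and config paths are placeholders.
command:
- /tensor-fusion/node-discovery      # assumed binary path
- --hostname=$(HOSTNAME)
- --gpu-info-config=/etc/tensor-fusion/gpu-info.yaml   # assumed mount path
env:
- name: HOSTNAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName       # use the Kubernetes node name as hostname
```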

Worker CLI

CLI Parameters

| Parameter | Description | Default/Notes |
| --- | --- | --- |
| `-n` | Network protocol | Currently only `native` (native TCP communication) |
| `-p` | Worker port | Random value assigned by the TensorFusion Operator-Scheduler |
| `-s` | Unix socket path folder | Should be `/tensor-fusion/worker/sock/` in Kubernetes |
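For illustration, a worker invocation under Kubernetes could combine the three flags as below; the binary path is an assumption, and the port shown is illustrative since it is normally assigned by the Operator-Scheduler:

```yaml
# Hypothetical worker command; binary path and port are placeholders.
command:
- /tensor-fusion/worker              # assumed binary path
- -n=native
- -p=8765                            # illustrative; normally assigned automatically
- -s=/tensor-fusion/worker/sock/
```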

Environment Variables

| Variable | Description | Value |
| --- | --- | --- |
| `TF_ENABLE_LOG` | Enable logging | `1` |

GPU Client Stub

The GPU Client Stub consists of two libraries injected via `LD_PRELOAD` into every process started inside the container or server:

  • `libadd_path.so`: Adds additional library paths for AI application environments (e.g., the hooked NVML)
  • `libcuda.so`: Hooks into the CUDA runtime

Example configuration in the worker template:

```yaml
env:
- name: LD_PRELOAD
  value: /tensor-fusion/libadd_path.so:/tensor-fusion/libcuda.so
```

Environment Variables

| Variable | Description | Value/Notes |
| --- | --- | --- |
| `TF_PATH` | Appended to the `PATH` environment variable | `/tensor-fusion` |
| `TF_LD_PRELOAD` | Appended to `LD_PRELOAD` | Varies |
| `TF_LD_LIBRARY_PATH` | Appended to `LD_LIBRARY_PATH` | `/tensor-fusion` |
| `TF_ENABLE_LOG` | Enable (`1`) or disable (`0`) logging; disabled by default | `0` |
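Since each `TF_*` variable is appended to its non-prefixed counterpart, a minimal sketch of enabling stub logging and extending the library search path might be:

```yaml
# Hypothetical env section for a container using the GPU Client Stub.
env:
- name: TF_ENABLE_LOG
  value: "1"                         # logging is disabled (0) by default
- name: TF_LD_LIBRARY_PATH
  value: /tensor-fusion              # appended to LD_LIBRARY_PATH
```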
