TensorFusion Documentation

Monitoring Metric Definitions

TensorFusion collects comprehensive metrics for monitoring GPU infrastructure, workloads, and system performance. All metrics are stored in GreptimeDB with time-series indexing.

System Metrics

Measurement: tf_system_metrics

Cluster-wide system statistics and operational counters.

Tags

| Tag | Description |
| --- | --- |
| pool | GPU pool identifier |

Fields

| Field | Type | Description |
| --- | --- | --- |
| total_workers_cnt | int64 | Total active workers |
| total_nodes_cnt | int64 | Total nodes in cluster |
| total_allocation_fail_cnt | int64 | Cumulative allocation failures |
| total_allocation_success_cnt | int64 | Cumulative successful allocations |
| total_scale_up_cnt | int64 | Cumulative scale-up events |
| total_scale_down_cnt | int64 | Cumulative scale-down events |
| ts | timestamp | Record timestamp |
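
Because the allocation counters are cumulative, a rate over a time window comes from the difference between two samples, not from a single record. The sketch below assumes each record has already been loaded into a Python dict keyed by field name; the function name and the choice to treat a decreasing counter as a reset are illustrative assumptions, not part of TensorFusion.

```python
from typing import Optional


def allocation_success_rate(earlier: dict, later: dict) -> Optional[float]:
    """Success ratio of the allocations made between two tf_system_metrics samples.

    Both arguments are hypothetical dicts keyed by field name.
    Returns None when no allocation happened or a counter went backwards.
    """
    succeeded = (later["total_allocation_success_cnt"]
                 - earlier["total_allocation_success_cnt"])
    failed = (later["total_allocation_fail_cnt"]
              - earlier["total_allocation_fail_cnt"])

    # Assumption: a negative delta means the counter was reset in between.
    if succeeded < 0 or failed < 0:
        return None

    attempts = succeeded + failed
    if attempts == 0:
        return None
    return succeeded / attempts
```

The same delta approach applies to total_scale_up_cnt and total_scale_down_cnt.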

Worker Resource Metrics

Measurement: tf_worker_resources

Resource allocation and usage per worker.

Tags

| Tag | Description |
| --- | --- |
| worker | Worker identifier |
| workload | Associated workload |
| pool | GPU pool identifier |
| namespace | Kubernetes namespace |
| qos | Quality of Service class |

Fields

| Field | Type | Description |
| --- | --- | --- |
| tflops_request | float64 | Requested TFLOPs |
| tflops_limit | float64 | TFLOPs limit |
| vram_bytes_request | float64 | Requested VRAM in bytes |
| vram_bytes_limit | float64 | VRAM limit in bytes |
| gpu_count | int | Number of GPUs allocated |
| raw_cost | float64 | Raw compute cost |
| ready | bool | Worker readiness status |
| ts | timestamp | Record timestamp |
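
A common use of this measurement is summing requests per namespace. The following is a minimal sketch, assuming each record is a dict that merges tags and fields; the function name, the output keys, and the decision to skip workers that are not ready are illustrative assumptions.

```python
from collections import defaultdict

GIB = 1024 ** 3  # bytes per GiB


def requests_by_namespace(rows):
    """Sum requested TFLOPs and VRAM (in GiB) of ready workers per namespace."""
    totals = defaultdict(lambda: {"tflops_request": 0.0, "vram_gib_request": 0.0})
    for row in rows:
        if not row["ready"]:
            continue  # assumption: workers that are not ready are excluded
        entry = totals[row["namespace"]]
        entry["tflops_request"] += row["tflops_request"]
        entry["vram_gib_request"] += row["vram_bytes_request"] / GIB
    return dict(totals)
```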

Node Resource Metrics

Measurement: tf_node_resources

Resource allocation and utilization per node.

Tags

| Tag | Description |
| --- | --- |
| node | Node identifier |
| pool | GPU pool identifier |
| phase | Node phase/status |

Fields

| Field | Type | Description |
| --- | --- | --- |
| allocated_tflops | float64 | Allocated TFLOPs |
| allocated_tflops_percent | float64 | TFLOPs utilization percentage |
| allocated_vram_bytes | float64 | Allocated VRAM in bytes |
| allocated_vram_percent | float64 | VRAM utilization percentage |
| allocated_tflops_percent_virtual | float64 | TFLOPs vs virtual capacity |
| allocated_vram_percent_virtual | float64 | VRAM vs virtual capacity |
| raw_cost | float64 | Node compute cost |
| gpu_count | int | Number of GPUs on node |
| ts | timestamp | Record timestamp |
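
The percentage fields lend themselves to simple capacity alerts. The sketch below flags nodes whose TFLOPs or VRAM allocation exceeds a threshold. It assumes records are dicts merging tags and fields and that percentages are on a 0 to 100 scale, which this page does not state; the function name and the default threshold are hypothetical.

```python
def nodes_over_threshold(rows, threshold_percent=90.0):
    """Return sorted node names whose TFLOPs or VRAM allocation exceeds the threshold.

    Assumption: allocated_*_percent values are expressed on a 0-100 scale.
    """
    flagged = []
    for row in rows:
        worst = max(row["allocated_tflops_percent"], row["allocated_vram_percent"])
        if worst > threshold_percent:
            flagged.append(row["node"])
    return sorted(flagged)
```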

Worker Usage Metrics

Measurement: tf_worker_usage

Real-time worker resource usage from hypervisor.

Tags

| Tag | Description |
| --- | --- |
| workload | Associated workload |
| worker_name | Worker identifier |
| namespace | Kubernetes namespace |
| pool_name | GPU pool identifier |
| node_name | Host node name |
| uuid | GPU UUID |

Fields

| Field | Type | Description |
| --- | --- | --- |
| compute_percentage | float64 | GPU compute utilization |
| compute_tflops | float64 | Actual TFLOPs usage |
| memory_percentage | float64 | VRAM utilization percentage |
| memory_bytes | uint64 | VRAM usage in bytes |
| ts | timestamp | Record timestamp |
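
Note that this measurement tags the worker as worker_name, while tf_worker_resources uses worker. The sketch below compares actual TFLOPs usage against the request by joining the two. It assumes the two tags carry the same value for a given worker, that records are dicts merging tags and fields, and that a worker spanning several GPUs emits one usage record per GPU UUID; none of this is confirmed by this page, and the function name is hypothetical.

```python
def usage_vs_request(resource_rows, usage_rows):
    """Map (namespace, worker) to used TFLOPs divided by requested TFLOPs.

    resource_rows: records shaped like tf_worker_resources (tag `worker`).
    usage_rows:    records shaped like tf_worker_usage (tag `worker_name`).
    """
    requested = {
        (r["namespace"], r["worker"]): r["tflops_request"] for r in resource_rows
    }

    # Assumption: one usage record per GPU, so sum across GPUs per worker.
    used = {}
    for u in usage_rows:
        key = (u["namespace"], u["worker_name"])
        used[key] = used.get(key, 0.0) + u["compute_tflops"]

    ratios = {}
    for key, tflops in used.items():
        request = requested.get(key)
        if request:  # skip unknown workers and zero requests
            ratios[key] = tflops / request
    return ratios
```

A ratio well below 1.0 over a long window suggests the request could be lowered.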

GPU Usage Metrics

Measurement: tf_gpu_usage

Detailed GPU hardware metrics from hypervisor.

Tags

| Tag | Description |
| --- | --- |
| node | Host node name |
| pool | GPU pool identifier |
| uuid | GPU UUID |

Fields

| Field | Type | Description |
| --- | --- | --- |
| compute_percentage | float64 | GPU compute utilization |
| memory_percentage | float64 | VRAM utilization percentage |
| memory_bytes | uint64 | VRAM usage in bytes |
| compute_tflops | float64 | Actual TFLOPs usage |
| rx | float64 | PCIe receive KB/s |
| tx | float64 | PCIe transmit KB/s |
| temperature | float64 | GPU temperature (°C) |
| graphics_clock_mhz | float64 | Graphics clock frequency |
| sm_clock_mhz | float64 | SM clock frequency |
| memory_clock_mhz | float64 | Memory clock frequency |
| video_clock_mhz | float64 | Video clock frequency |
| power_usage | float64 | Power consumption (W) |
| nvlink_rx | float64 | NVLink receive throughput |
| nvlink_tx | float64 | NVLink transmit throughput |
| ts | timestamp | Record timestamp |
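
Since each GPU reports repeatedly over time, dashboards and alerts usually want only the most recent record per GPU. A minimal sketch, assuming records are dicts merging tags and fields and that ts values are mutually comparable (for example epoch milliseconds); the function names and the 85 °C default are hypothetical, not TensorFusion settings.

```python
def latest_per_gpu(rows):
    """Keep only the most recent record for each GPU UUID."""
    latest = {}
    for row in rows:
        current = latest.get(row["uuid"])
        if current is None or row["ts"] > current["ts"]:
            latest[row["uuid"]] = row
    return latest


def hot_gpus(rows, max_temp_c=85.0):
    """Return sorted UUIDs of GPUs whose latest temperature is at or above the limit."""
    return sorted(
        uuid
        for uuid, row in latest_per_gpu(rows).items()
        if row["temperature"] >= max_temp_c
    )
```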

Pool Metrics

Measurement: tf_pool_metrics

Pool-level metrics and capacity summary.

Tags

| Tag | Description |
| --- | --- |
| pool | GPU pool identifier |
| phase | Pool phase/status |

Fields

| Field | Type | Description |
| --- | --- | --- |
| allocated_tflops | float64 | Total allocated (requested) TFLOPs |
| allocated_tflops_percent | float64 | Allocated (requested) TFLOPs as a percentage of total capacity |
| allocated_tflops_percent_virtual | float64 | Allocated (requested) TFLOPs as a percentage of virtual capacity |
| allocated_vram_bytes | float64 | Total allocated (requested) VRAM in bytes |
| allocated_vram_percent | float64 | Allocated (requested) VRAM as a percentage of total capacity |
| allocated_vram_percent_virtual | float64 | Allocated (requested) VRAM as a percentage of virtual capacity |
| limited_tflops | float64 | Total limited TFLOPs |
| limited_vram_bytes | float64 | Total limited VRAM in bytes |
| limited_tflops_percent_virtual | float64 | Limited TFLOPs as a percentage of virtual capacity |
| limited_vram_percent_virtual | float64 | Limited VRAM as a percentage of virtual capacity |
| gpu_count | int | Number of GPUs in the pool |
| ts | timestamp | Record timestamp |
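
If allocated_tflops_percent and allocated_tflops_percent_virtual share the same numerator (allocated_tflops), their ratio gives the pool's virtual-to-physical capacity ratio, and the virtual capacity itself can be recovered from the allocated amount. The sketch below works under that assumption and a 0 to 100 percentage scale, neither of which this page confirms; the function name and output keys are hypothetical.

```python
def virtual_capacity_summary(row):
    """Derive oversubscription and headroom from one tf_pool_metrics record.

    Assumptions: both percentages are on a 0-100 scale and are computed
    from allocated_tflops against physical and virtual capacity respectively.
    Returns None when either percentage is zero or negative.
    """
    pct = row["allocated_tflops_percent"]
    pct_virtual = row["allocated_tflops_percent_virtual"]
    if pct <= 0 or pct_virtual <= 0:
        return None  # nothing allocated yet, capacity cannot be inferred

    virtual_capacity = row["allocated_tflops"] * 100.0 / pct_virtual
    return {
        # virtual capacity divided by physical capacity
        "oversubscription_ratio": pct / pct_virtual,
        "virtual_capacity_tflops": virtual_capacity,
        "virtual_headroom_tflops": virtual_capacity - row["allocated_tflops"],
    }
```

For example, a pool with 400 allocated TFLOPs at 80% of physical and 40% of virtual capacity yields a ratio of 2.0, a virtual capacity of 1000 TFLOPs, and 600 TFLOPs of virtual headroom.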
