© 2026 NexusGPU PTE. LTD. All Rights Reserved.

Changelog

Stay up to date with the latest changes in our product

Monthly Release — 2025-12

Production-ready core engine with improved isolation and portability

2025/12/31

Core Engine Features

  • Skip hooks installation when up_limit >= 100 (passthrough mode).
  • Added isolation annotation and skip hook init for "shard" and "hard" isolation levels.
  • Added nvidia-dev-root option for custom host path prefix in device plugin.
  • Refactored hypervisor from cgo to purego call for better portability.
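The passthrough and isolation-level hook skips above can be sketched as a single predicate. This is a minimal illustration with hypothetical function and constant names; the actual decision lives inside the TensorFusion hypervisor:

```python
# Sketch of the hook-skip decision (hypothetical names, not the real API).

PASSTHROUGH_THRESHOLD = 100  # up_limit >= 100% means no throttling is needed

# Isolation levels whose enforcement happens outside the CUDA hook layer,
# so hook initialization can be skipped entirely.
HOOK_FREE_ISOLATION_LEVELS = {"shard", "hard"}

def should_install_hooks(up_limit: int, isolation_level: str) -> bool:
    """Return False when the workload can run without hooks."""
    if up_limit >= PASSTHROUGH_THRESHOLD:
        return False  # passthrough mode: full GPU, no rate-limiting hooks
    if isolation_level in HOOK_FREE_ISOLATION_LEVELS:
        return False  # isolation enforced by partitioning, not by hooks
    return True
```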

Technical Improvements & Bug Fixes

  • Optimized pod registration API with fast-path caching.
  • Optimized rate_limiter fast path with up_limit cache.
  • Fixed hypervisor start issues and name mismatch bugs.
  • Fixed device plugin and index queue issues.
  • Refined device mock code for testing.
  • Switched to NexusGPU/frida-gum fork.
  • Optimized hypervisor, typing, and TUI.
  • Improved Karpenter permission handling.
  • Fixed pod index split logic.

Ecosystem

  • Improved Karpenter integration with better permission handling.

Artifacts Versions

  • tensor-fusion-operator:1.48.2
  • tensor-fusion-node-discovery:1.48.2
  • tensor-fusion-worker:1.68.0
  • tensor-fusion-client:1.61.1
  • tensor-fusion-ngpu:1.8.1
  • tensor-fusion-hypervisor:1.41.7

Monthly Release — 2025-11

Partitioned scheduling, non-locking device extension, and hard isolation milestone

2025/11/30

Core Engine Features

  • Implemented partitioned scheduling for hardware-partitioned isolation (e.g., MIG-like).
  • Added non-locking Kubernetes device extension for improved scalability.
  • Introduced device controller for managing accelerator lifecycle.
  • Added auto-freeze configuration for QoS levels in vgpu.rs.
  • Integrated Kubernetes device plugin into tensor-fusion hypervisor.
  • Added overwrite detection for pod registration.
  • Milestone: Hard isolation mode with spatial-division sharing (no oversubscription).
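The hard-isolation milestone's "no oversubscription" invariant can be sketched as an admission check: with spatial-division sharing, the compute percentages allocated on a single GPU may never exceed 100. Hypothetical names, shown only to illustrate the rule:

```python
# Sketch of the no-oversubscription admission rule for hard isolation
# (hypothetical names): allocations on one GPU must fit within 100%.

def can_admit(allocated_percents: list, request_percent: int) -> bool:
    """Admit a new workload only if the GPU stays within 100% compute."""
    return sum(allocated_percents) + request_percent <= 100
```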

Technical Improvements & Bug Fixes

  • Added an enqueue extension plugin for queue hints, enabling faster rescheduling.
  • Extended device ID range from 256 to 512 in K8s device plugin.
  • Aligned pod resource naming across crates.
  • Added integral_decay_factor for PID controller to forget old clamp data.
  • Fixed forward leader API auth issue.
  • Fixed node auto scale-up for large pending pod counts.
  • Eliminated code duplication and optimized performance.
  • Added compute shard support across modules.
  • Renamed compute-isolation to isolation in configs.
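The integral_decay_factor mentioned above can be illustrated with a minimal PID step. Decaying the accumulated integral each iteration makes the controller geometrically "forget" old clamp data, so stale error history cannot dominate the output. This is a generic sketch with hypothetical names, not the vgpu.rs implementation:

```python
# Sketch of a PID step with an integral decay factor (hypothetical names).

class PidController:
    def __init__(self, kp, ki, kd, integral_decay_factor=0.95):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.decay = integral_decay_factor
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float = 1.0) -> float:
        # Decay first, then accumulate: old contributions shrink each step.
        self.integral = self.integral * self.decay + error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```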

Ecosystem

  • None.

Monthly Release — 2025-10

Computing isolation modes (soft/hard/shared), elastic rate limiter, and VRAM hard-isolation

2025/10/31

Core Engine Features

  • Added soft/hard/shared computing isolation modes with compute percent scheduling.
  • Introduced sidecar worker mode for hard-isolation with worker customization in annotation.
  • Support VRAM hard-isolation for strict memory enforcement.
  • Implemented elastic rate limiter for adaptive compute throttling.
  • Support simpler migration from nvidia.com/gpu limits.
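The elastic rate limiter above can be sketched as a token bucket whose refill rate adapts to observed utilization: throttle harder when the device is saturated, relax when it is idle. Hypothetical names and a simplified adaptation rule, not the shipped algorithm:

```python
# Sketch of an elastic token-bucket limiter (hypothetical names).

class ElasticRateLimiter:
    def __init__(self, base_rate: float, capacity: float):
        self.base_rate = base_rate  # tokens/sec at the configured limit
        self.capacity = capacity    # burst size
        self.tokens = capacity
        self.rate = base_rate

    def adapt(self, utilization: float, target: float = 0.8) -> None:
        """Scale the refill rate toward the utilization target."""
        # Above target -> shrink rate; below target -> grow, capped at 2x.
        factor = max(0.1, min(2.0, target / max(utilization, 1e-6)))
        self.rate = self.base_rate * factor

    def try_acquire(self, elapsed: float, cost: float = 1.0) -> bool:
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```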

Technical Improvements & Bug Fixes

  • Fixed the "ld.so.preload is not an ELF file" bug by changing the config path.
  • Fixed remote worker compute percent and NVML hook issues.
  • Used ld.so.conf.d for dynamic libs rather than LD_LIBRARY_PATH.
  • Added so.1 fallback for apps detecting libcuda.so.1.
  • Avoided deadlock in shared memory cleanup.
  • Disabled ngpu mode by default.
  • Fixed node expansion and NUMA node not found issues.
  • Optimized historical metrics loading.
  • Fixed single workload generation for Deployment.

Ecosystem

  • None.

Monthly Release — 2025-09

Autoscaling, Karpenter node expansion, and GPU worker preemption

2025/09/30

Core Engine Features

  • Implemented autoscaling for GPU workloads based on resource utilization.
  • Added automatic node expansion when pods are pending, integrating with Karpenter.
  • Added preempt support for GPU workers to improve scheduling fairness.
  • Support configuration of auto-update for individual components.
  • Implemented shared memory TUI monitor for real-time debugging.
  • Support skipping kernel launch limits when up_limit >= 100 for passthrough scenarios.
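The utilization-based autoscaling above can be sketched with the usual Kubernetes HPA-style formula: scale replicas proportionally to the ratio of observed to target utilization, clamped to configured bounds. Hypothetical names; the actual controller logic is richer:

```python
# Sketch of a utilization-driven replica calculation (hypothetical names),
# in the spirit of the Kubernetes HPA formula.
import math

def desired_replicas(current: int, observed_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    raw = math.ceil(current * observed_util / target_util)
    return max(min_replicas, min(max_replicas, raw))
```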

Technical Improvements & Bug Fixes

  • Optimized default placement and scoring for better scheduling quality.
  • Improved K8s version compatibility and fixed Karpenter label/annotation issues.
  • Fixed dedicated GPU annotation causing webhook failures.
  • Added resource validation in Bind to prevent GPU over-allocation.
  • Fixed GPU UUID handling to ensure case-insensitive matching across modules.
  • Increased default shared memory size to 128 MB with padding.
  • Simplified sleep mechanism in Limiter to a fixed duration.
  • Added node hash for GPU K8s node and owner ref for hypervisor.
  • Isolated shm paths per cluster/namespace.
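The Bind-time resource validation above can be sketched as a last-moment re-check: before binding, verify that the chosen GPU still has enough free compute and VRAM, so a stale scheduling decision cannot over-allocate the device. Hypothetical field names for illustration:

```python
# Sketch of Bind-time validation (hypothetical names): re-check free
# capacity so stale decisions cannot over-allocate a GPU.

def validate_bind(gpu_free: dict, request: dict) -> bool:
    """Both dicts carry 'compute_percent' and 'vram_mb' keys."""
    return (request["compute_percent"] <= gpu_free["compute_percent"]
            and request["vram_mb"] <= gpu_free["vram_mb"])
```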

Ecosystem

  • Deeper integration with Karpenter for node expansion and label/annotation handling.

Monthly Release — 2025-08

Large-scale benchmarking, RDMA support, and hypervisor probes

2025/08/31

Core Engine Features

  • Added hypervisor probe for health checking and monitoring.
  • Implemented large-scale benchmark and performance optimization for high GPU count clusters.
  • Added compute percentage tracking for GPU metrics.
  • Introduced healthz/readyz API for hypervisor liveness and readiness probes.
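The healthz/readyz split above follows the usual probe convention: liveness only asserts the process is alive, while readiness additionally requires the hypervisor to have finished device discovery. A minimal sketch with hypothetical names:

```python
# Sketch of hypervisor probe state (hypothetical names): healthz = alive,
# readyz = alive AND device discovery complete. Returns HTTP status codes.

class ProbeState:
    def __init__(self):
        self.alive = True
        self.devices_discovered = False

    def healthz(self) -> int:
        return 200 if self.alive else 503

    def readyz(self) -> int:
        return 200 if self.alive and self.devices_discovered else 503
```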

Technical Improvements & Bug Fixes

  • Added kubelet pod-resource mount for real-time device-plugin allocation detection.
  • Fixed metrics recorder bugs that caused missing system metrics.
  • Added remap for extra metrics labels.
  • Optimized the order of Pods when scaling down.
  • Fixed K8s 1.20-1.22 compatibility issues.
  • Added mem percentage metrics and power usage / NVLink bandwidth placeholders.
  • Fixed lower version Kubernetes hypervisor auth issues.
  • Updated operator Dockerfile.
  • Updated README and improved unit test coverage.

Ecosystem

  • Milestone: Support RDMA transport for low-latency/high-throughput remote GPU access.

Monthly Release — 2025-07

Karpenter integration, progressive migration, and hypervisor/TUI enhancements

2025/07/31

Core Engine Features

  • Added GPUNodeClaim for cloud vendor integration and Karpenter auto-scaling.
  • Support progressive migration from existing NVIDIA operator/device-plugin setups.
  • Added built-in component manifests with JSON monitoring format and dynamic tags.
  • Support pod namespace and container name env vars for worker/hypervisor containers.
  • Introduced shared memory versioning and versioned device state for cross-process coordination.
  • Integrated Kubernetes device plugin into vgpu.rs hypervisor.
  • Added TUI for monitoring workers in real-time.
  • Implemented dlsym hooking and NVML hook for device limiting (instead of env var).
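The shared memory versioning above can be illustrated with a seqlock-style scheme: the writer bumps the version to odd before mutating and to even after, so readers can detect an in-progress update and retry. This single-process sketch uses hypothetical names; the real coordination happens across processes in shared memory:

```python
# Sketch of versioned shared state (hypothetical names), seqlock-style:
# even version = stable snapshot, odd version = write in flight.

class VersionedState:
    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.version += 1          # odd: update in flight
        self.data[key] = value
        self.version += 1          # even: stable again

    def read(self, key):
        while True:
            v1 = self.version
            value = self.data.get(key)
            v2 = self.version
            if v1 == v2 and v1 % 2 == 0:
                return value       # snapshot was consistent
```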

Technical Improvements & Bug Fixes

  • Fixed scheduler reserve plugin bugs and workload vGPU worker scaling issues.
  • Added shm device for shared limiter among processes communicating with hypervisor.
  • Fixed allocation debug/simulate, allocator memory state bugs, and NPE issues.
  • Fixed token review permission for remote workers.
  • Resolved deadlocks from orphaned locks in shared memory.
  • Fixed Helm typos and GPU node patching bugs.
  • Released bootstrap artifacts for x64 and arm64.
  • Added security context for hypervisor and init containers.

Ecosystem

  • Integrated with Karpenter for GPU node auto-scaling.

Monthly Release — 2025-06

Scheduler framework refactor, alerting integration, and NVIDIA remoting milestone

2025/06/30

Core Engine Features

  • Refactored TensorFusion scheduling to the Kubernetes scheduler framework (foundation for advanced policies).
  • Improved alerting + metrics pipelines and operational hardening.
  • Expanded limiter test coverage and improved engine runtime stability.

Technical Improvements & Bug Fixes

  • Fixed config path mismatches and multiple GPU deallocation issues.
  • Improved deployment configs and runtime synchronization (condvar vs busy-wait), and refined metrics/logging correctness.
  • Added hostType and log collection configuration support.
  • Improved installation guides and usage examples.

Ecosystem

  • Milestone: Full-fledged NVIDIA remoting, including Windows vGPU / Remote GPU support.
  • Integrated with Alertmanager for GPU cluster alerting.

Monthly Release — 2025-05

Multi-GPU requests, GPU model filtering, and per-GPU limiting via UUIDs

2025/05/31

Core Engine Features

  • Enabled clients to request multiple GPUs and added GPU model filtering.
  • Added per-GPU limiting via UUID (and/or index) and improved scheduling primitives in the engine.
  • Strengthened TensorFusionWorkload lifecycle signaling (status/conditions).
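Per-GPU limiting keyed by UUID can be sketched as a lookup table that normalizes UUIDs to lowercase on both insert and lookup, keeping matching case-insensitive across modules (the same concern fixed in the 2025-09 release). Hypothetical names for illustration:

```python
# Sketch of a per-GPU limit table keyed by UUID (hypothetical names).
# UUIDs are normalized to lowercase so matching is case-insensitive.

class GpuLimits:
    def __init__(self):
        self._limits = {}

    def set_limit(self, uuid: str, up_limit: int) -> None:
        self._limits[uuid.lower()] = up_limit

    def get_limit(self, uuid: str, default: int = 100) -> int:
        return self._limits.get(uuid.lower(), default)
```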

Technical Improvements & Bug Fixes

  • Improved allocation metrics and refined CRD schema/annotations.
  • Improved worker watcher decoupling and GPU utilization error handling.
  • Maintained all GPU state in memory to reduce apiserver access pressure.
  • Printed version info on startup for easier troubleshooting and tracking.

Ecosystem

  • None.

Monthly Release — 2025-04

Canary rollout support and deeper limiter foundations (memory hooks, runtime env utilities)

2025/04/30

Core Engine Features

  • Added canary/gray rollout support for TensorFusion-enabled Pods.
  • Advanced limiter foundations, including CUDA memory hooks and runtime env helpers.

Technical Improvements & Bug Fixes

  • Improved cleanup and finalizer semantics; fixed Helm chart and GPU info map issues.
  • CI/release workflow hardening for artifact handling.
  • Improved installation and usage documentation.

Ecosystem

  • None.

Monthly Release — 2025-03

TFLOPs-based limiting, workload lifecycle controls, and richer GPU device metrics

2025/03/31

Core Engine Features

  • Added TFLOPs-based resource limiting and GPU info configuration.
  • Hardened workload lifecycle controls (finalizers, events) and scheduling distribution controls.
  • Expanded engine-side device metrics and worker control primitives (pause, NVML resilience).
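TFLOPs-based limiting can be sketched as converting a requested TFLOPs budget into a compute percentage of the physical device, which is what the limiter ultimately enforces. Hypothetical names and a simplified conversion, shown only to illustrate the idea:

```python
# Sketch of TFLOPs-to-percent conversion (hypothetical names): map a
# requested TFLOPs budget onto the device's compute percentage.

def tflops_to_percent(requested_tflops: float, device_tflops: float) -> int:
    """Clamp to [1, 100] so a tiny request still gets scheduled."""
    percent = round(100 * requested_tflops / device_tflops)
    return max(1, min(100, percent))
```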

Technical Improvements & Bug Fixes

  • Improved compatibility management (worker version in connection URL) and worker metrics output.
  • Improved worker error handling and NVML initialization fallback behavior.
  • Added Docker image latest tag for easier integration and deployment.

Ecosystem

  • None.

Monthly Release — 2025-02

Cluster reconciliation hardening and control-plane stability improvements

2025/02/28

Core Engine Features

  • Improved cluster reconcile behavior and controller ownership logic for GPU nodes.
  • Hardened control-plane lifecycle management for node/controller resources.

Technical Improvements & Bug Fixes

  • Fixed stability issues around node/controller lifecycle (including controller panics).
  • Improved lifecycle handling (destroying phase, NotFound handling) and GPU pool controller robustness.

Ecosystem

  • None (vendor/transport-specific features start showing up in later months).

Monthly Release — 2025-01

Metrics foundations and early scheduling/observability building blocks

2025/01/31

Core Engine Features

  • Expanded GPU metrics foundations across controller/operator and the vGPU engine (TFLOPs/VRAM, logging pipelines).
  • Improved GPU pool / resource management building blocks and controller-level signal collection.

Technical Improvements & Bug Fixes

  • Fixed webhook/service configuration issues and avoided worker port conflicts.
  • Improved error handling around GPU process metrics and NVML initialization fallbacks.

Ecosystem

  • None (vendor/transport-specific features start showing up in later months).