© 2026 NexusGPU PTE. LTD. All Rights Reserved.

Changelog

Stay up to date with the latest changes in our product

Monthly Release — 2025-12

Production-ready core engine with improved isolation and portability

2025/12/31

Core Engine Features

  • Skip hooks installation when up_limit >= 100 (passthrough mode).
  • Added isolation annotation and skip hook init for "shard" and "hard" isolation levels.
  • Added nvidia-dev-root option for custom host path prefix in device plugin.
  • Refactored hypervisor from cgo to purego call for better portability.
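The passthrough and isolation-level hook skips above can be sketched as a single predicate. This is a minimal illustration with hypothetical function and constant names; the actual decision lives inside the TensorFusion hypervisor:

```python
# Sketch of the hook-skip decision (hypothetical names, not the real API).

PASSTHROUGH_THRESHOLD = 100  # up_limit >= 100% means no throttling is needed

# Isolation levels whose enforcement happens outside the CUDA hook layer,
# so hook initialization can be skipped entirely.
HOOK_FREE_ISOLATION_LEVELS = {"shard", "hard"}

def should_install_hooks(up_limit: int, isolation_level: str) -> bool:
    """Return False when the workload can run without hooks."""
    if up_limit >= PASSTHROUGH_THRESHOLD:
        return False  # passthrough mode: full GPU, no rate-limiting hooks
    if isolation_level in HOOK_FREE_ISOLATION_LEVELS:
        return False  # isolation enforced by partitioning, not by hooks
    return True
```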

Technical Improvements & Bug Fixes

  • Optimized pod registration API with fast-path caching.
  • Optimized rate_limiter fast path with up_limit cache.
  • Fixed hypervisor start issues and name mismatch bugs.
  • Fixed device plugin and index queue issues.
  • Refined device mock code for testing.
  • Switched to NexusGPU/frida-gum fork.
  • Optimized hypervisor, typing, and TUI.
  • Improved Karpenter permission handling.
  • Fixed pod index split logic.

Ecosystem

  • Improved Karpenter integration with better permission handling.

Artifacts Versions

  • tensor-fusion-operator:1.48.2
  • tensor-fusion-node-discovery:1.48.2
  • tensor-fusion-worker:1.68.0
  • tensor-fusion-client:1.61.1
  • tensor-fusion-ngpu:1.8.1
  • tensor-fusion-hypervisor:1.41.7

Monthly Release — 2025-11

Partitioned scheduling, non-locking device extension, and hard isolation milestone

2025/11/30

Core Engine Features

  • Implemented partitioned scheduling for hardware-partitioned isolation (e.g., MIG-like).
  • Added non-locking Kubernetes device extension for improved scalability.
  • Introduced device controller for managing accelerator lifecycle.
  • Added auto-freeze configuration for QoS levels in vgpu.rs.
  • Integrated Kubernetes device plugin into tensor-fusion hypervisor.
  • Added overwrite detection for pod registration.
  • Milestone: Hard isolation mode with spatial-division sharing (no oversubscription).
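The hard-isolation milestone's "no oversubscription" invariant can be sketched as an admission check: with spatial-division sharing, the compute percentages allocated on a single GPU may never exceed 100. Hypothetical names, shown only to illustrate the rule:

```python
# Sketch of the no-oversubscription admission rule for hard isolation
# (hypothetical names): allocations on one GPU must fit within 100%.

def can_admit(allocated_percents: list, request_percent: int) -> bool:
    """Admit a new workload only if the GPU stays within 100% compute."""
    return sum(allocated_percents) + request_percent <= 100
```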

Technical Improvements & Bug Fixes

  • Added an enqueue extension plugin for queue hints, enabling faster rescheduling.
  • Extended device ID range from 256 to 512 in K8s device plugin.
  • Aligned pod resource naming across crates.
  • Added integral_decay_factor for PID controller to forget old clamp data.
  • Fixed forward leader API auth issue.
  • Fixed node auto scale-up for large pending pod counts.
  • Eliminated code duplication and optimized performance.
  • Added compute shard support across modules.
  • Renamed compute-isolation to isolation in configs.
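The integral_decay_factor mentioned above can be illustrated with a minimal PID step. Decaying the accumulated integral each iteration makes the controller geometrically "forget" old clamp data, so stale error history cannot dominate the output. This is a generic sketch with hypothetical names, not the vgpu.rs implementation:

```python
# Sketch of a PID step with an integral decay factor (hypothetical names).

class PidController:
    def __init__(self, kp, ki, kd, integral_decay_factor=0.95):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.decay = integral_decay_factor
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float = 1.0) -> float:
        # Decay first, then accumulate: old contributions shrink each step.
        self.integral = self.integral * self.decay + error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```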

Ecosystem

  • None.

Monthly Release — 2025-10

Computing isolation modes (soft/hard/shared), elastic rate limiter, and VRAM hard-isolation

2025/10/31

Core Engine Features

  • Added soft/hard/shared computing isolation modes with compute percent scheduling.
  • Introduced sidecar worker mode for hard-isolation with worker customization in annotation.
  • Support VRAM hard-isolation for strict memory enforcement.
  • Implemented elastic rate limiter for adaptive compute throttling.
  • Support simpler migration from nvidia.com/gpu limits.
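The elastic rate limiter above can be sketched as a token bucket whose refill rate adapts to observed utilization: throttle harder when the device is saturated, relax when it is idle. Hypothetical names and a simplified adaptation rule, not the shipped algorithm:

```python
# Sketch of an elastic token-bucket limiter (hypothetical names).

class ElasticRateLimiter:
    def __init__(self, base_rate: float, capacity: float):
        self.base_rate = base_rate  # tokens/sec at the configured limit
        self.capacity = capacity    # burst size
        self.tokens = capacity
        self.rate = base_rate

    def adapt(self, utilization: float, target: float = 0.8) -> None:
        """Scale the refill rate toward the utilization target."""
        # Above target -> shrink rate; below target -> grow, capped at 2x.
        factor = max(0.1, min(2.0, target / max(utilization, 1e-6)))
        self.rate = self.base_rate * factor

    def try_acquire(self, elapsed: float, cost: float = 1.0) -> bool:
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```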

Technical Improvements & Bug Fixes

  • Fixed the "ld.so.preload is not an ELF file" bug by changing the config path.
  • Fixed remote worker compute percent and NVML hook issues.
  • Used ld.so.conf.d for dynamic libs rather than LD_LIBRARY_PATH.
  • Added so.1 fallback for apps detecting libcuda.so.1.
  • Avoided deadlock in shared memory cleanup.
  • Disabled ngpu mode by default.
  • Fixed node expansion and NUMA node not found issues.
  • Optimized historical metrics loading.
  • Fixed single workload generation for Deployment.

Ecosystem

  • None.

Monthly Release — 2025-09

Autoscaling, Karpenter node expansion, and GPU worker preemption

2025/09/30

Core Engine Features

  • Implemented autoscaling for GPU workloads based on resource utilization.
  • Added automatic node expansion when pods are pending, integrating with Karpenter.
  • Added preempt support for GPU workers to improve scheduling fairness.
  • Support configuration of auto-update for individual components.
  • Implemented shared memory TUI monitor for real-time debugging.
  • Support skipping kernel launch limits when up_limit >= 100 for passthrough scenarios.
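The utilization-based autoscaling above can be sketched with the usual Kubernetes HPA-style formula: scale replicas proportionally to the ratio of observed to target utilization, clamped to configured bounds. Hypothetical names; the actual controller logic is richer:

```python
# Sketch of a utilization-driven replica calculation (hypothetical names),
# in the spirit of the Kubernetes HPA formula.
import math

def desired_replicas(current: int, observed_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    raw = math.ceil(current * observed_util / target_util)
    return max(min_replicas, min(max_replicas, raw))
```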

Technical Improvements & Bug Fixes

  • Optimized default placement and scoring for better scheduling quality.
  • Improved K8s version compatibility and fixed Karpenter label/annotation issues.
  • Fixed dedicated GPU annotation causing webhook failures.
  • Added resource validation in Bind to prevent GPU over-allocation.
  • Fixed GPU UUID handling to ensure case-insensitive matching across modules.
  • Increased default shared memory size to 128 MB with padding.
  • Simplified sleep mechanism in Limiter to a fixed duration.
  • Added node hash for GPU K8s node and owner ref for hypervisor.
  • Isolated shm paths per cluster/namespace.
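The Bind-time resource validation above can be sketched as a last-moment re-check: before binding, verify that the chosen GPU still has enough free compute and VRAM, so a stale scheduling decision cannot over-allocate the device. Hypothetical field names for illustration:

```python
# Sketch of Bind-time validation (hypothetical names): re-check free
# capacity so stale decisions cannot over-allocate a GPU.

def validate_bind(gpu_free: dict, request: dict) -> bool:
    """Both dicts carry 'compute_percent' and 'vram_mb' keys."""
    return (request["compute_percent"] <= gpu_free["compute_percent"]
            and request["vram_mb"] <= gpu_free["vram_mb"])
```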

Ecosystem

  • Deeper integration with Karpenter for node expansion and label/annotation handling.

Monthly Release — 2025-08

Large-scale benchmarking, RDMA support, and hypervisor probes

2025/08/31

Core Engine Features

  • Added hypervisor probe for health checking and monitoring.
  • Implemented large-scale benchmark and performance optimization for high GPU count clusters.
  • Added compute percentage tracking for GPU metrics.
  • Introduced healthz/readyz API for hypervisor liveness and readiness probes.
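The healthz/readyz split above follows the usual probe convention: liveness only asserts the process is alive, while readiness additionally requires the hypervisor to have finished device discovery. A minimal sketch with hypothetical names:

```python
# Sketch of hypervisor probe state (hypothetical names): healthz = alive,
# readyz = alive AND device discovery complete. Returns HTTP status codes.

class ProbeState:
    def __init__(self):
        self.alive = True
        self.devices_discovered = False

    def healthz(self) -> int:
        return 200 if self.alive else 503

    def readyz(self) -> int:
        return 200 if self.alive and self.devices_discovered else 503
```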

Technical Improvements & Bug Fixes

  • Added kubelet pod-resource mount for real-time device-plugin allocation detection.
  • Fixed metrics recorder bugs that caused missing system metrics.
  • Added remap for extra metrics labels.
  • Optimized the order of Pods when scaling down.
  • Fixed K8s 1.20-1.22 compatibility issues.
  • Added mem percentage metrics and power usage / NVLink bandwidth placeholders.
  • Fixed lower version Kubernetes hypervisor auth issues.
  • Updated operator Dockerfile.
  • Updated README and improved unit test coverage.

Ecosystem

  • Milestone: Support RDMA transport for low-latency/high-throughput remote GPU access.

Monthly Release — 2025-07

Karpenter integration, progressive migration, and hypervisor/TUI enhancements

2025/07/31

Core Engine Features

  • Added GPUNodeClaim for cloud vendor integration and Karpenter auto-scaling.
  • Support progressive migration from existing NVIDIA operator/device-plugin setups.
  • Added built-in component manifests with JSON monitoring format and dynamic tags.
  • Support pod namespace and container name env vars for worker/hypervisor containers.
  • Introduced shared memory versioning and versioned device state for cross-process coordination.
  • Integrated Kubernetes device plugin into vgpu.rs hypervisor.
  • Added TUI for monitoring workers in real-time.
  • Implemented dlsym hooking and NVML hook for device limiting (instead of env var).
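The shared memory versioning above can be illustrated with a seqlock-style scheme: the writer bumps the version to odd before mutating and to even after, so readers can detect an in-progress update and retry. This single-process sketch uses hypothetical names; the real coordination happens across processes in shared memory:

```python
# Sketch of versioned shared state (hypothetical names), seqlock-style:
# even version = stable snapshot, odd version = write in flight.

class VersionedState:
    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.version += 1          # odd: update in flight
        self.data[key] = value
        self.version += 1          # even: stable again

    def read(self, key):
        while True:
            v1 = self.version
            value = self.data.get(key)
            v2 = self.version
            if v1 == v2 and v1 % 2 == 0:
                return value       # snapshot was consistent
```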

Technical Improvements & Bug Fixes

  • Fixed scheduler reserve plugin bugs and workload vGPU worker scaling issues.
  • Added shm device for shared limiter among processes communicating with hypervisor.
  • Fixed allocation debug/simulate, allocator memory state bugs, and NPE issues.
  • Fixed token review permission for remote workers.
  • Resolved deadlocks from orphaned locks in shared memory.
  • Fixed Helm typos and GPU node patching bugs.
  • Released bootstrap artifacts for x64 and arm64.
  • Added security context for hypervisor and init containers.

Ecosystem

  • Integrated with Karpenter for GPU node auto-scaling.

Monthly Release — 2025-06

Scheduler framework refactor, alerting integration, and NVIDIA remoting milestone

2025/06/30

Core Engine Features

  • Refactored TensorFusion scheduling to the Kubernetes scheduler framework (foundation for advanced policies).
  • Improved alerting + metrics pipelines and operational hardening.
  • Expanded limiter test coverage and improved engine runtime stability.

Technical Improvements & Bug Fixes

  • Fixed config path mismatches and multiple GPU deallocation issues.
  • Improved deployment configs and runtime synchronization (condvar vs busy-wait), and refined metrics/logging correctness.
  • Added hostType and log collection configuration support.
  • Improved installation guides and usage examples.

Ecosystem

  • Milestone: Full-fledged NVIDIA remoting, including Windows vGPU / Remote GPU support.
  • Integrated with Alertmanager for GPU cluster alerting.

Monthly Release — 2025-05

Multi-GPU requests, GPU model filtering, and per-GPU limiting via UUIDs

2025/05/31

Core Engine Features

  • Enabled clients to request multiple GPUs and added GPU model filtering.
  • Added per-GPU limiting via UUID (and/or index) and improved scheduling primitives in the engine.
  • Strengthened TensorFusionWorkload lifecycle signaling (status/conditions).
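Per-GPU limiting keyed by UUID can be sketched as a lookup table that normalizes UUIDs to lowercase on both insert and lookup, keeping matching case-insensitive across modules (the same concern fixed in the 2025-09 release). Hypothetical names for illustration:

```python
# Sketch of a per-GPU limit table keyed by UUID (hypothetical names).
# UUIDs are normalized to lowercase so matching is case-insensitive.

class GpuLimits:
    def __init__(self):
        self._limits = {}

    def set_limit(self, uuid: str, up_limit: int) -> None:
        self._limits[uuid.lower()] = up_limit

    def get_limit(self, uuid: str, default: int = 100) -> int:
        return self._limits.get(uuid.lower(), default)
```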

Technical Improvements & Bug Fixes

  • Improved allocation metrics and refined CRD schema/annotations.
  • Improved worker watcher decoupling and GPU utilization error handling.
  • Maintained all GPU state in memory to reduce apiserver access pressure.
  • Printed version info on startup for easier troubleshooting and tracking.

Ecosystem

  • None.

Monthly Release — 2025-04

Canary rollout support and deeper limiter foundations (memory hooks, runtime env utilities)

2025/04/30

Core Engine Features

  • Added canary/gray rollout support for TensorFusion-enabled Pods.
  • Advanced limiter foundations, including CUDA memory hooks and runtime env helpers.

Technical Improvements & Bug Fixes

  • Improved cleanup and finalizer semantics; fixed Helm chart and GPU info map issues.
  • CI/release workflow hardening for artifact handling.
  • Improved installation and usage documentation.

Ecosystem

  • None.

Monthly Release — 2025-03

TFLOPs-based limiting, workload lifecycle controls, and richer GPU device metrics

2025/03/31

Core Engine Features

  • Added TFLOPs-based resource limiting and GPU info configuration.
  • Hardened workload lifecycle controls (finalizers, events) and scheduling distribution controls.
  • Expanded engine-side device metrics and worker control primitives (pause, NVML resilience).
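TFLOPs-based limiting can be sketched as converting a requested TFLOPs budget into a compute percentage of the physical device, which is what the limiter ultimately enforces. Hypothetical names and a simplified conversion, shown only to illustrate the idea:

```python
# Sketch of TFLOPs-to-percent conversion (hypothetical names): map a
# requested TFLOPs budget onto the device's compute percentage.

def tflops_to_percent(requested_tflops: float, device_tflops: float) -> int:
    """Clamp to [1, 100] so a tiny request still gets scheduled."""
    percent = round(100 * requested_tflops / device_tflops)
    return max(1, min(100, percent))
```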

Technical Improvements & Bug Fixes

  • Improved compatibility management (worker version in connection URL) and worker metrics output.
  • Improved worker error handling and NVML initialization fallback behavior.
  • Added Docker image latest tag for easier integration and deployment.

Ecosystem

  • None.

Monthly Release — 2025-02

Cluster reconciliation hardening and control-plane stability improvements

2025/02/28

Core Engine Features

  • Improved cluster reconcile behavior and controller ownership logic for GPU nodes.
  • Hardened control-plane lifecycle management for node/controller resources.

Technical Improvements & Bug Fixes

  • Fixed stability issues around node/controller lifecycle (including controller panics).
  • Improved lifecycle handling (destroying phase, NotFound handling) and GPU pool controller robustness.

Ecosystem

  • None (vendor/transport-specific features start showing up in later months).

Monthly Release — 2025-01

Metrics foundations and early scheduling/observability building blocks

2025/01/31

Core Engine Features

  • Expanded GPU metrics foundations across controller/operator and the vGPU engine (TFLOPs/VRAM, logging pipelines).
  • Improved GPU pool / resource management building blocks and controller-level signal collection.

Technical Improvements & Bug Fixes

  • Fixed webhook/service configuration issues and avoided worker port conflicts.
  • Improved error handling around GPU process metrics and NVML initialization fallbacks.

Ecosystem

  • None (vendor/transport-specific features start showing up in later months).