NVIDIA/NVSentinel

Go 182 stars

NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments.

README badge: [![ngmi](https://ngmi.review/badge/NVIDIA/NVSentinel.svg)](https://ngmi.review/repo/NVIDIA/NVSentinel)
364 Merged PRs
2 days Avg Merge Time
4m Fastest PR
1 month Slowest PR
#234 Global Speed Rank

Recent Merged PRs

| # | Title | Author | Time | Reviews | Blocks |
|---|---|---|---|---|---|
| #714 | feat: added breakfix response time metrics | @nitz2407 | 1 month | 30 | |
| #894 | feat: Add terminate node template in fault remediation values | @XRFXLP | 1.8h | 2 | |
| #720 | feat: add device-api-server with NVML fallback provider | @ArangoGutierrez | 1 month | 29 | |
| #892 | fix: initialize variable in UAT GPU reset test | @natherz97 | 52m | 2 | |
| #887 | fix: set RuntimeClassName to nvidia in GPU reset pod | @natherz97 | 14.8h | 5 | |
| #885 | feat: include slinky drain in github CI pipeline | @XRFXLP | 1 day | 11 | |
| #837 | Add NCCL all reduce test in preflight framework | @XRFXLP | 3 days | 30 | |
| #879 | fix: wait for GPUReset CRD rather than check syslog in UAT | @natherz97 | 20.1h | 7 | |
| #877 | feat: Make IAM role name configurable in CSP health monitor for EKS | @KaivalyaMDabhadkar | 13.0h | 2 | |
| #881 | update nvsentinel version in readme | @XRFXLP | 57m | 2 | |
| #876 | fix: increase unquarantine timeout to 5 minutes | @XRFXLP | 51m | 1 | |
| #759 | fix: fault_quarantine_current_quarantined_nodes metric update | @tanishagoyal2 | 19 days | 30 | |
| #839 | build(deps): bump aquasecurity/trivy-action from 0.33.1 to 0.34.0 | @dependabot | 21.0h | 1 | |
| #875 | chore: merge dependabot updates | @lalitadithya | 1.5h | 0 | |
| #828 | fix: Clean up nolint directives marked as TODO - Part 1 | @cbumb | 3 days | 6 | |
| #768 | feat: enable GPU reset with e2e and UAT tests | @natherz97 | 12 days | 12 | |
| #834 | fix: add retry while injecting inforom error in E2E test | @tanishagoyal2 | 2.6h | 2 | |
| #831 | fix: Clean up nolint directives marked as TODO - Part 2 | @cbumb | 1 day | 5 | |
| #830 | Scale tests issue 386 | @ksaur | 18.9h | 7 | |
| #818 | [Feature] Add gang discovery in preflight framework | @XRFXLP | 2 days | 30 | |