Documentation · Rognix

Overview

Rognix is autonomous operations for AI infrastructure: GPU clusters, training fleets, and modern data centers. We watch every accelerator, rack, and workload, predict thermal and power events before they trigger sensors, and act on cooling imbalances and underutilization before workloads fail.

Standard Linux servers (cloud / VPS / on-premise) are also supported by the same agent; they fall under our self-serve standard server tiers. The AI infrastructure tiers add GPU telemetry, thermal analytics, and workload-aware failure prediction.

Primary wedge: AI servers, GPU clusters, AI data centers
Also supports: Cloud, VPS, on-premise (without GPU features)
Replaces: Static threshold alerting + manual rack walks
Complements: Datadog, Grafana, DCIM, NVIDIA DCGM

AI infrastructure

For the operators of GPU clusters, training fleets, and AI data centers, Rognix surfaces the signal classes that matter, and only those.

Signal coverage

Per-GPU health: utilization, memory pressure, temperatures, power draw, fan speed, clock state, ECC errors.
Cluster fabric: NVLink / InfiniBand / RoCE error counters, bandwidth degradation, congestion patterns.
Thermal & cooling: rack-level inlet/outlet baselines, hot-spot detection, predicted throttle events.
Power & rack density: live draw vs. budget, redistribution suggestions, stranded capacity.
Workload context: Slurm / Kubernetes / Ray job state correlated with hardware state.
BMC / IPMI / Redfish: chassis-level signals on AI Datacenter and Enterprise tiers.

AI-specific detectors

Out-of-the-box detector templates that fire automatically once GPU telemetry is enabled:

GPU thermal stress: predicts throttle events ahead of sensor trips by tracking deviation from per-device baselines.
GPU underutilization: flags accelerators sitting idle while jobs queue. Quantifies waste in $/hr.
Stuck-job detection: flags power draw without proportional utilization (zombie kernel / hung CUDA context).
Cluster cooling imbalance: detects cross-rack temperature divergence indicating CRAC issues.
Workload failure prediction: pattern-matches against PSU, NVLink, HBM, and disk failure modes so jobs can be migrated before checkpoint loss.
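
As a rough illustration of the baseline-deviation idea behind the GPU thermal stress detector, the sketch below keeps a rolling window of temperature samples for one device and scores each new reading against that window. The window size, the z-score threshold of 3, and the function names are assumptions made for this example; the detector's actual logic is not described in this document.

```javascript
// Hypothetical sketch, not Rognix's detector: score a GPU temperature
// sample against that device's own recent baseline.
function makeBaseline(windowSize = 60) {
  const samples = [];
  return {
    // Returns the z-score of `value` against the samples seen so far,
    // then adds `value` to the rolling window.
    observe(value) {
      let z = 0;
      if (samples.length >= 2) {
        const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
        const variance =
          samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
        const std = Math.sqrt(variance);
        // A perfectly flat baseline has std 0; this sketch reports 0 then.
        z = std > 0 ? (value - mean) / std : 0;
      }
      samples.push(value);
      if (samples.length > windowSize) samples.shift();
      return z;
    },
  };
}

// Only upward deviation is relevant for throttle prediction.
function isThermalAnomaly(z, threshold = 3) {
  return z > threshold;
}

// Usage: a device idling around 61-63 °C suddenly reads 71 °C.
const baseline = makeBaseline();
[61, 62, 61, 63, 62, 61, 62].forEach((t) => baseline.observe(t));
const z = baseline.observe(71);
console.log(z.toFixed(1), isThermalAnomaly(z));
```

A per-device baseline is what lets a reading like 71 °C stand out here even though it is far below any fixed throttle threshold.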

Plan gating

GPU telemetry, thermal analytics, and workload failure prediction unlock at the AI Cluster tier. Standard server tiers can run the agent on a GPU host but won't process the GPU signal. That is useful for evaluation, but it is not the full product.

How it works

A small read-only collector runs on each server. It streams resource and health metrics over an authenticated, encrypted channel to Rognix. Our intelligence engine continuously evaluates that data, surfaces incidents and opportunities you should know about, and optionally takes action when you've given permission.

You interact with the platform through three surfaces:

Dashboard: live view of every connected server.
Insights: plain-English explanations of detected events, with confidence scores and suggested actions.
Activity timeline: every decision (yours, Rognix's, scheduled jobs) logged with reason and outcome.

Connect a server

One line. The collector is a small static binary, ~7 MB.

curl -fsSL https://rognix.com/install/agent.sh \
  | sudo ROGNIX_KEY=rgx_bk_<bootstrap> bash

The installer:

Downloads the collector to /usr/local/bin/rognix-agent.
Stores a single-use bootstrap key at /etc/rognix-agent/key (mode 0600, root-owned).
Registers a hardened systemd unit (see the security section).
Starts the service. On first connection, the bootstrap key is exchanged for a long-lived runtime key, so you never pin a static credential into config.

Supported architectures: linux-amd64, linux-arm64. Tested on Ubuntu 22.04+, Debian 12, RHEL 9, Alpine 3.19.

GPU & accelerator telemetry

The Rognix collector samples GPU state via vendor tooling (NVIDIA nvidia-smi today; AMD ROCm and Intel Habana on the roadmap). If the vendor tool isn't installed on the host, GPU sampling is a silent no-op and the agent stays useful as a standard server collector.

Metrics emitted per GPU

gpu_util_pct          0–100, compute utilization
gpu_mem_pct           0–100, framebuffer used
gpu_mem_used_bytes    raw
gpu_temp_c            °C
gpu_power_w           current draw in watts
gpu_power_limit_w     enforced cap
gpu_fan_pct           0–100
gpu_clock_mhz         current graphics clock
gpu_ecc_errors        aggregate uncorrected ECC count
gpu_nvlink_errors     (planned)
gpu_pcie_rx_bps       (planned)
gpu_pcie_tx_bps       (planned)

Each sample carries labels gpu_index, gpu_uuid, and gpu_name (e.g. "NVIDIA H100 80GB HBM3"). Detectors pivot on gpu_index for per-device analysis.
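
To make the mapping from vendor tooling to these metric names concrete, here is a minimal sketch that parses one CSV line of nvidia-smi output into the fields listed above. The specific --query-gpu field list, the helper names, and the sample values are assumptions for illustration; the collector's real invocation is not documented here, and ECC counters are left out of the sketch.

```javascript
// Hypothetical mapping from one output line of
//   nvidia-smi --query-gpu=index,uuid,name,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw,power.limit,fan.speed,clocks.gr --format=csv,noheader,nounits
// to the metric names in this section. With `nounits`, memory is in MiB.
const MIB = 1024 * 1024;

// Converts a CSV field to a number; "[N/A]" and empty fields become null.
function num(field) {
  if (field === undefined || field === '') return null;
  const n = Number(field);
  return Number.isFinite(n) ? n : null;
}

function parseGpuLine(line) {
  const [index, uuid, name, util, memUsed, memTotal, temp, power, limit, fan, clock] =
    line.split(',').map((s) => s.trim());
  const usedMiB = num(memUsed);
  const totalMiB = num(memTotal);
  return {
    labels: { gpu_index: index, gpu_uuid: uuid, gpu_name: name },
    gpu_util_pct: num(util),
    gpu_mem_used_bytes: usedMiB === null ? null : usedMiB * MIB,
    gpu_mem_pct: usedMiB === null || !totalMiB ? null : (usedMiB / totalMiB) * 100,
    gpu_temp_c: num(temp),
    gpu_power_w: num(power),
    gpu_power_limit_w: num(limit),
    gpu_fan_pct: num(fan), // often "[N/A]" on passively cooled data center GPUs
    gpu_clock_mhz: num(clock),
  };
}

// Usage with a made-up sample line.
const sample = parseGpuLine(
  '0, GPU-00000000-0000-0000-0000-000000000000, NVIDIA H100 80GB HBM3, 87, 40960, 81559, 64, 512.30, 700.00, [N/A], 1980'
);
console.log(sample.labels.gpu_name, sample.gpu_util_pct, sample.gpu_fan_pct);
```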

What is not collected

Process names, command-line arguments, or PIDs running on the GPU.
Model weights, training data, dataset paths, checkpoints.
NVIDIA Triton Inference Server request bodies or responses.
Any application-layer payload of any kind.

Workload context (Slurm job IDs, K8s pod names) is opt-in via the integration setup and is never auto-discovered through process inspection.

Authentication model

For users (web)

Passwords hashed with argon2id (memory-hard, ASIC-resistant).
Access token is a 1-hour JWT in an HTTP-only, SameSite=Lax cookie.
Refresh token is a 48-byte opaque secret stored hashed in sessions; rotated on each refresh.
RBAC: owner, admin, operator, viewer.

For agents

Two-stage credential flow eliminates long-lived secrets at install time:

You generate a bootstrap key (rgx_bk_…) in the dashboard. It is single-use and expires in 1 hour.
The agent presents the bootstrap on first connection. The server validates, then issues a fresh runtime key (rgx_ak_…) in the hello acknowledgement.
The agent persists the runtime key at /etc/rognix-agent/key and uses it on every subsequent connection.
A host fingerprint (hostname + cpu + memory + arch) is pinned on first hello. Future connections from a different fingerprint are rejected and logged as agent.fingerprint_mismatch.
Server-side, only the SHA-256 hash of the key is stored. Plaintext is shown once and never again.
You can revoke any agent in Settings → Agents; the next connection attempt is refused.
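
The flow above can be sketched from the server's point of view as follows. This is a minimal in-memory model under stated assumptions: the Map-based stores, the function names, the key lengths, and the "bootstrap_invalid" and "unknown_or_revoked" reason strings are all hypothetical. Only the key prefixes, the 1-hour expiry, the hashed storage, and the agent.fingerprint_mismatch label come from the text.

```javascript
const { createHash, randomBytes } = require('node:crypto');

const sha256 = (s) => createHash('sha256').update(s).digest('hex');

// In-memory stand-ins for server-side tables (hypothetical).
// Only key hashes are stored, never plaintext keys.
const bootstrapKeys = new Map(); // keyHash -> { expiresAt, used }
const agents = new Map();        // keyHash -> { fingerprint, revoked }

function issueBootstrap(now = Date.now()) {
  const key = 'rgx_bk_' + randomBytes(24).toString('hex');
  bootstrapKeys.set(sha256(key), { expiresAt: now + 60 * 60 * 1000, used: false });
  return key; // plaintext is returned once and not kept
}

function fingerprintOf({ hostname, cpu, memory, arch }) {
  return sha256([hostname, cpu, memory, arch].join('|'));
}

// Handles an agent hello carrying either a bootstrap or a runtime key.
function hello(presentedKey, host, now = Date.now()) {
  const hash = sha256(presentedKey);
  const fp = fingerprintOf(host);

  const bk = bootstrapKeys.get(hash);
  if (bk) {
    if (bk.used || now > bk.expiresAt) return { ok: false, reason: 'bootstrap_invalid' };
    bk.used = true; // single-use
    const runtimeKey = 'rgx_ak_' + randomBytes(32).toString('hex');
    agents.set(sha256(runtimeKey), { fingerprint: fp, revoked: false }); // pin fingerprint
    return { ok: true, runtimeKey };
  }

  const agent = agents.get(hash);
  if (!agent || agent.revoked) return { ok: false, reason: 'unknown_or_revoked' };
  if (agent.fingerprint !== fp) return { ok: false, reason: 'agent.fingerprint_mismatch' };
  return { ok: true };
}

// Usage: first hello exchanges the bootstrap; later hellos use the runtime key.
const host = { hostname: 'gpu-01', cpu: 'example-cpu', memory: 1099511627776, arch: 'amd64' };
const first = hello(issueBootstrap(), host);
console.log(first.ok, hello(first.runtimeKey, host).ok);
```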

For cloud integrations

Cloud credentials (AWS keys, GCP service accounts) are encrypted at rest with AES-256-GCM using a versioned key set. Rotation is supported: the active key (v1) encrypts new ciphertext, while older keys remain available for decryption only, via ENCRYPTION_KEYS_PRIOR, until the rewrap script re-encrypts every row under the new key.
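
One way to picture versioned-key encryption with a rewrap step is the sketch below, which uses Node's built-in crypto module. The version labels (k1, k2), the colon-separated storage format, and the function names are assumptions for illustration, not the format Rognix uses.

```javascript
const { createCipheriv, createDecipheriv, randomBytes } = require('node:crypto');

// Hypothetical key set: version label -> 32-byte key. Real keys would come
// from configuration rather than being generated at startup.
const keys = { k1: randomBytes(32), k2: randomBytes(32) };
const activeVersion = 'k2';

function encrypt(plaintext, version = activeVersion) {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv('aes-256-gcm', keys[version], iv);
  const body = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  // The version travels with the ciphertext so decrypt can pick the right key.
  return [version, iv.toString('base64'), tag.toString('base64'), body.toString('base64')].join(':');
}

function decrypt(stored) {
  const [version, iv, tag, body] = stored.split(':');
  const decipher = createDecipheriv('aes-256-gcm', keys[version], Buffer.from(iv, 'base64'));
  decipher.setAuthTag(Buffer.from(tag, 'base64'));
  // final() throws if the ciphertext or tag was tampered with.
  return Buffer.concat([decipher.update(Buffer.from(body, 'base64')), decipher.final()]).toString('utf8');
}

// Re-encrypts a row under the active key unless it already uses it.
function rewrap(stored) {
  return stored.startsWith(activeVersion + ':') ? stored : encrypt(decrypt(stored));
}

// Usage: a row written under the old key is still readable, then rewrapped.
const oldRow = encrypt('example-credential', 'k1');
const newRow = rewrap(oldRow);
console.log(newRow.split(':')[0], decrypt(newRow) === 'example-credential');
```

Storing the version next to the ciphertext is what makes a gradual rewrap possible: rows under old and new keys can coexist until the script has visited every one.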

Intelligence engine

Rognix runs its own intelligence engine. It analyses your infrastructure's behaviour over time, learns what normal looks like, and tells you when something deviates, with a clear explanation and a recommended response. None of your metric data ever leaves Rognix to be processed by third-party AI providers.

What you actually get on every insight:

Plain-English title: specific, including the affected node and the numeric signal.
Description: 1–2 sentences on what was observed and what it usually means.
Reasoning: the underlying signals, baselines, and statistical evidence.
Confidence score: 0–100, calibrated against historical accuracy.
Impact estimate: quantified where possible (downtime minutes, affected services).
Suggested action: one-click executable when permitted, with rollback notes.

Cross-environment correlation happens automatically. If multiple servers in the same environment exhibit related anomalies simultaneously, you get one insight that names all of them, not five separate cards for what is really one upstream cause.
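
A simplified way to think about this grouping is shown below: anomalies from the same environment that arrive close together in time are merged into one group. Using a fixed time window as the test for "related" is an assumption of this sketch, as are the field names (envId, nodeId, at) and the 5-minute default; the engine's real correlation logic is not described here.

```javascript
// Hypothetical sketch: merge anomalies from one environment that occur
// within `windowMs` of each other into a single group.
function correlate(anomalies, windowMs = 5 * 60 * 1000) {
  const sorted = [...anomalies].sort((a, b) => a.at - b.at);
  const open = new Map(); // envId -> group currently being extended
  const groups = [];
  for (const a of sorted) {
    const g = open.get(a.envId);
    if (g && a.at - g.lastAt <= windowMs) {
      g.nodes.add(a.nodeId);
      g.lastAt = a.at;
    } else {
      const fresh = { envId: a.envId, nodes: new Set([a.nodeId]), firstAt: a.at, lastAt: a.at };
      open.set(a.envId, fresh);
      groups.push(fresh);
    }
  }
  return groups.map((g) => ({
    envId: g.envId,
    nodeIds: [...g.nodes],
    firstAt: g.firstAt,
    lastAt: g.lastAt,
  }));
}

// Usage: three nodes in env "a" misbehave within two minutes of each other.
const grouped = correlate([
  { envId: 'a', nodeId: 'n1', at: 0 },
  { envId: 'b', nodeId: 'n9', at: 30000 },
  { envId: 'a', nodeId: 'n2', at: 60000 },
  { envId: 'a', nodeId: 'n3', at: 120000 },
  { envId: 'a', nodeId: 'n1', at: 3600000 },
]);
console.log(grouped.length, grouped[0].nodeIds);
```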

Adaptive tuning. Every time you mark an insight as acted, dismissed, or snoozed, Rognix learns. Insight categories that you frequently dismiss for a given environment get higher confidence thresholds; categories you consistently act on are surfaced more eagerly. The engine recalibrates every 15 minutes.
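
The feedback loop can be pictured with a small sketch like the one below, which nudges a per-category confidence threshold up or down based on how often insights were dismissed versus acted on. The step size, the 30% and 70% cut-offs, the clamping bounds, and the decision to ignore snoozed insights are all assumptions for illustration rather than the engine's actual rules.

```javascript
// Hypothetical sketch: adjust the confidence threshold for one insight
// category in one environment from recent feedback counts.
// Snoozed insights are ignored in this sketch.
function recalibrate(threshold, feedback, step = 5) {
  const total = feedback.acted + feedback.dismissed;
  if (total === 0) return threshold; // no signal, no change
  const dismissRate = feedback.dismissed / total;
  let next = threshold;
  if (dismissRate > 0.7) next += step;      // mostly dismissed: require more confidence
  else if (dismissRate < 0.3) next -= step; // mostly acted on: surface more eagerly
  return Math.min(95, Math.max(5, next));   // stay inside the 0-100 score range
}

// Usage: a category that is dismissed nine times out of ten gets stricter.
console.log(recalibrate(60, { acted: 1, dismissed: 9 }));
```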

Security & privacy

We built Rognix assuming the only thing that matters is this: the system that monitors your infrastructure must not be the one that compromises it. Every claim below is enforced, audited, and inspectable.

In transit

TLS 1.3 everywhere. All traffic (web, API, collector) is encrypted in transit with auto-renewing certificates. No plaintext on the wire, ever.
Authenticated WebSocket ingest. The collector authenticates every reconnect with a per-host runtime key plus a host fingerprint. Mismatches are rejected and audit-logged.

At rest

Passwords: argon2id (memory-hard).
Agent keys, refresh tokens, bootstrap keys: stored as SHA-256 hashes only. Plaintext is shown to you exactly once at creation.
Cloud credentials (AWS/GCP/etc.): AES-256-GCM with versioned keys. Rotation is a one-command operation, no downtime.
Metric data never leaves Rognix to be processed by third parties. The intelligence engine runs entirely on our infrastructure.

On your servers

Read-only by default. Newly connected servers can only collect metrics. Actions require explicit, per-template permission in Settings.
Per-template grants, not a master switch. Authorize service.restart without authorizing compute.resize. Each template lists its risk level and reversibility.
systemd hardening. The collector runs with NoNewPrivileges, ProtectSystem=strict, PrivateTmp, MemoryDenyWriteExecute, and a writable path limited to its own key directory. It cannot touch the rest of your filesystem.
What it collects: CPU, memory, disk, network, process counts, SMART hardware telemetry. No log content, no environment variables, no application secrets, no file content.

Operationally

Approval queues for high-risk actions. Even in autonomous mode, anything classified as high- or critical-risk routes through your approval queue with an SLA timer.
Full audit trail. Every action (yours, ours, scheduled jobs, system) is logged with actor, subject, reason, and outcome. Human-readable and CSV-exportable.
Rate limiting. All API endpoints are rate-limited; abusive clients are throttled before they ever reach the database.
No silent updates. Collector versions are explicit; new versions don't auto-install.
Payments via Stripe. We never see, store, or transmit your card details.
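
The per-template grants and approval queue described above combine into a simple decision, sketched below. The template names and the pending_approval state appear elsewhere in this document; the risk labels other than high and critical, the mode names, and the "denied" and "execute" return values are assumptions made for this example.

```javascript
// Hypothetical decision function for a proposed action.
const NEEDS_APPROVAL = new Set(['high', 'critical']);

function decide(action, agent, mode) {
  // Read-only by default: without an explicit grant, nothing runs.
  if (!agent.allowedActionTemplates.includes(action.template)) return 'denied';
  // High- and critical-risk actions always go through the approval queue.
  if (NEEDS_APPROVAL.has(action.risk)) return 'pending_approval';
  // Lower-risk actions run directly only in autonomous mode.
  return mode === 'autonomous' ? 'execute' : 'pending_approval';
}

// Usage: service.restart is granted, compute.resize is not.
const agent = { allowedActionTemplates: ['service.restart'] };
console.log(decide({ template: 'service.restart', risk: 'low' }, agent, 'autonomous'));
console.log(decide({ template: 'compute.resize', risk: 'high' }, agent, 'autonomous'));
```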

You don't have to take our word. The collector's permissions are visible in the systemd unit at /etc/systemd/system/rognix-agent.service. The audit log is queryable at any time via the API. And every action that happens on your infrastructure shows up in your activity timeline within seconds.

API reference

Base URL: https://rognix.com/api. All endpoints expect/return JSON. Auth via Authorization: Bearer <jwt> header or rgx_at cookie.

Auth

POST  /auth/signup    Create user + org. Body: { email, password, displayName?, orgName? }
POST  /auth/login     Body: { email, password }. Sets cookies, returns access token.
POST  /auth/refresh   Rotate access + refresh. Reads rgx_rt cookie.
POST  /auth/logout    Revoke session and clear cookies.
GET   /auth/me        Current user, org, role.

Environments

GET    /environments       List envs in your org.
POST   /environments       Body: { name, kind: cloud|vps|onprem|kubernetes }
PATCH  /environments/:id   Body: { aiMode?, name? }

Agents

GET     /agents                   List agents in your org (no key material returned).
POST    /agents                   Body: { envId, label? }. Returns one-time bootstrap key.
DELETE  /agents/:id               Revoke. Next connection refused.
POST    /agents/:id/permissions   Body: { allowedActionTemplates: string[] }

Nodes & metrics

GET  /nodes               List discovered nodes (?envId optional).
GET  /nodes/:id           Node detail.
GET  /nodes/:id/metrics   Query: ?kinds=cpu_pct,mem_pct&sinceMin=60
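
As an example of calling the metrics endpoint, the sketch below builds the URL and request options for a bearer-token client. The helper name, the node ID, and the token placeholder are assumptions; the response shape is not documented here, so the sketch stops short of interpreting it.

```javascript
const BASE_URL = 'https://rognix.com/api';

// Builds the URL and fetch options for GET /nodes/:id/metrics.
// URLSearchParams percent-encodes the commas in `kinds`; the decoded
// value is still the comma-separated list shown in the reference.
function metricsRequest(nodeId, { kinds, sinceMin }, jwt) {
  const url = new URL(`${BASE_URL}/nodes/${encodeURIComponent(nodeId)}/metrics`);
  url.searchParams.set('kinds', kinds.join(','));
  url.searchParams.set('sinceMin', String(sinceMin));
  return {
    url: url.toString(),
    options: { method: 'GET', headers: { Authorization: `Bearer ${jwt}` } },
  };
}

// Usage: last 60 minutes of CPU and memory for a hypothetical node ID.
const req = metricsRequest('node_123', { kinds: ['cpu_pct', 'mem_pct'], sinceMin: 60 }, '<jwt>');
console.log(req.url);
// To send it: fetch(req.url, req.options).then((r) => r.json())
```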

Insights & actions

GET   /insights                Query: ?envId=&state=open|acted|dismissed|snoozed
POST  /insights/:id/feedback   Body: { state, snoozeMinutes? }
GET   /actions                 Query: ?envId=&state=...
GET   /actions/templates       Available action templates with risk levels.
POST  /actions/:id/approve     Move from pending_approval → approved.
POST  /actions/:id/reject      Refuse a proposed action.
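
For the feedback endpoint, a client might build the request body along these lines. Which states the endpoint accepts is an assumption here (the sketch allows acted, dismissed, and snoozed and rejects open), as is the rule that snoozeMinutes is required only when snoozing.

```javascript
// Hypothetical client-side helper for POST /insights/:id/feedback.
const FEEDBACK_STATES = ['acted', 'dismissed', 'snoozed'];

function feedbackBody(state, snoozeMinutes) {
  if (!FEEDBACK_STATES.includes(state)) {
    throw new RangeError(`unsupported feedback state: ${state}`);
  }
  if (state === 'snoozed') {
    if (!Number.isInteger(snoozeMinutes) || snoozeMinutes <= 0) {
      throw new RangeError('snoozeMinutes must be a positive integer');
    }
    return JSON.stringify({ state, snoozeMinutes });
  }
  return JSON.stringify({ state }); // snoozeMinutes is omitted otherwise
}

// Usage: snooze an insight for half an hour.
console.log(feedbackBody('snoozed', 30));
```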

Activity

GET

/events

?envId=&limit=&before=

GET

/realtime/stream

Server-Sent Events. See realtime section.

Realtime stream

Subscribe to /realtime/stream as a Server-Sent Events stream. Authenticated by the same cookie or bearer token as REST. Events are scoped to your organisation.

const es = new EventSource('/api/realtime/stream', { withCredentials: true });
es.addEventListener('insight.created', (e) => {
  const { envId, insightId, title } = JSON.parse(e.data);
  // refresh insights panel...
});
es.addEventListener('metrics', (e) => {
  // node activity ping, debounce a refresh
});
es.addEventListener('agent.hello', (e) => {
  // a new server just connected
});
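
Outside the browser, where EventSource may not be available, the stream can be read as text and split into events by hand. The sketch below is a minimal parser for the standard Server-Sent Events framing (event and data fields, blank-line dispatch). The sample payload values are invented, and whether the server sends comment lines as keepalives is an assumption.

```javascript
// Minimal SSE frame parser: handles `event:` and `data:` fields and
// blank-line dispatch. Comment lines, `id:` and `retry:` are ignored.
function parseSse(text) {
  const events = [];
  let type = 'message'; // default event type per the SSE format
  let data = [];
  const value = (line, prefixLength) => line.slice(prefixLength).replace(/^ /, '');
  for (const line of text.split(/\r\n|\n|\r/)) {
    if (line === '') {
      // A blank line ends the current event; events without data are dropped.
      if (data.length > 0) events.push({ type, data: data.join('\n') });
      type = 'message';
      data = [];
    } else if (line.startsWith('event:')) {
      type = value(line, 6);
    } else if (line.startsWith('data:')) {
      data.push(value(line, 5));
    }
  }
  return events;
}

// Usage with a made-up chunk of stream text.
const chunk =
  'event: insight.created\n' +
  'data: {"envId":"env_1","insightId":"ins_1","title":"Example insight"}\n' +
  '\n' +
  ': keepalive\n' +
  '\n' +
  'event: metrics\n' +
  'data: {}\n' +
  '\n';
const parsedEvents = parseSse(chunk);
console.log(parsedEvents.map((e) => e.type));
```

This parser assumes it is given complete events; a streaming client would also need to carry over a partial trailing event between network chunks.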

Integrations

What you can wire up today:

Notification channels: Slack, PagerDuty, GitHub issues, email digests, generic webhook.
Sources: Linux hosts (the Rognix collector), Docker, Proxmox, VMware, AWS / GCP / Azure read-only credentials.
Identity: email + password today. SSO (OIDC and SAML) on Business and Enterprise plans.
Status pages: a public, auto-generated status page derived from your Rognix data, included on Starter and above.

Need something not listed? Email hello@rognix.com. Most integrations get added within a sprint.

FAQ

Where does my data live?

In Rognix's own infrastructure. Metric data is never sent to third-party AI providers, and never leaves Rognix at all unless you explicitly export it.

Can I pause monitoring without canceling?

Yes. Stop the collector service on any host (systemctl stop rognix-agent) and Rognix will mark it as offline without deleting history. Your other servers are unaffected.

What if Rognix is down? Do I lose visibility?

The collector buffers metrics during disconnects and replays them when the connection returns. Your local journalctl -u rognix-agent log shows what was sent and when. Keeping your monitoring independent of our uptime is a hard requirement we take seriously.
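
Buffer-and-replay behaviour of this kind can be modelled with a small bounded queue, as in the sketch below. The capacity, the drop-oldest policy when the buffer fills, and the class and method names are assumptions for illustration; the collector's real buffering limits are not stated in this document.

```javascript
// Hypothetical bounded buffer: keeps the newest `capacity` samples while
// offline and drains them oldest-first once sending works again.
class ReplayBuffer {
  constructor(capacity = 1000) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0;
  }

  push(sample) {
    this.items.push(sample);
    if (this.items.length > this.capacity) {
      this.items.shift(); // buffer full: drop the oldest sample
      this.dropped += 1;
    }
  }

  // Calls `send` for each buffered sample in order. Stops at the first
  // failure (send returns false) and keeps the rest for the next attempt.
  replay(send) {
    let sent = 0;
    while (this.items.length > 0) {
      if (!send(this.items[0])) break;
      this.items.shift();
      sent += 1;
    }
    return sent;
  }
}

// Usage: five samples arrive while offline, but only three fit.
const buffer = new ReplayBuffer(3);
[1, 2, 3, 4, 5].forEach((s) => buffer.push(s));
const received = [];
const sentCount = buffer.replay((s) => { received.push(s); return true; });
console.log(sentCount, received, buffer.dropped);
```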

Can I run Rognix on-premise / in my own VPC?

The Enterprise plan ships a self-hosted control plane image with an air-gapped install path. Email us for a deployment guide.

How do I cancel?

Settings → Billing → Manage. The Stripe customer portal lets you change plan, update card, or cancel any time. No emails, no calls, no retention dark patterns.