Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 78 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Agent instructions — probing

This repository uses the [Agent Skills](https://agentskills.io) layout for training diagnostics.

## Module boundaries

Before adding code, read **`docs/src/design/modularity.md`** (中文: `modularity.zh.md`).

| Layer | Where | Your change belongs if… |
|-------|--------|-------------------------|
| L1 Platform | `probing/core`, `memtable`, `proto` | SQL engine, federation, storage format |
| L2 Collectors | `probing/extensions/*`, `python/probing/profiling` | New metrics / tables |
| L3 Control | `probing/server`, `probing/cli` | HTTP, inject, fan-out |
| L4 Experience | `skills/`, `web/`, Python hooks | Diagnostics UX, skills, UI |

**Contracts:** `ProbeDataSource` (tables), `ProbeExtension` (config/HTTP), `@table` (Python data), `skills/*/steps.yaml` (workflows). Do not add cross-collector calls — use SQL JOINs.

## Skills

All diagnostic skills live under **`skills/`**. Each subdirectory contains:

- **`SKILL.md`** — when to use the skill and how to interpret results (read this for routing)
- **`steps.yaml`** — executable probe steps (used by `probing skill run` and the Web Investigate agent)

Browse the catalog: `skills/catalog.yaml`

## Install skills into your agent

So Cursor / Claude Code / Codex can discover and invoke skills:

```bash
./skills/install.sh
```

This copies `skills/<id>/` into:

- `.cursor/skills/` (Cursor)
- `.claude/skills/` (Claude Code)
- `.agents/skills/` (Codex)

Use `probing skill install --user` for global install under `~/`.

## Run diagnostics

Requires a probed training process (`PROBING=1` or `probing -t <pid> inject`):

```bash
probing skill list
probing -t <pid> skill run health_overview
probing -t <pid> skill run slow_rank --global
probing -t <pid> skill run nccl_culprit_victim
```

From Python (e.g. in agent-generated scripts):

```python
from probing.skills.tools import list_skills, run_skill
run_skill("health_overview", target="<pid>")
```

## Built-in skills (summary)

| id | use when |
|----|----------|
| `health_overview` | first look / triage |
| `training_hang` | stall or hang |
| `slow_rank` | straggler rank |
| `comm_bottleneck` | NCCL / collective slow |
| `nccl_culprit_victim` | NCCL culprit/victim analysis |
| `memory_leak` | GPU memory growth |
| `module_bottleneck` | slow PyTorch modules |
| `gpu_pressure` | GPU util / headroom |

Details in each `skills/<id>/SKILL.md`.

## Extending

Add table plugins under `python/probing/ext/` (data). Add diagnostic skills under `skills/` (how to investigate). NCCL profiler plugin: `docs/src/design/nccl-profiler.md`. See `docs/src/design/extensibility.md`.
4 changes: 3 additions & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@

> Uncover the Hidden Truth of AI Performance

Probing is a production-grade performance profiler designed specifically for distributed AI workloads. Built on dynamic probe injection technology, it delivers zero-overhead runtime introspection with SQL-queryable performance metrics and cross-node correlation analysis.
Probing is a production-grade performance profiler designed specifically for distributed AI workloads. Built on dynamic probe injection technology, it delivers **minimal-overhead** runtime introspection—lock-free mmap writes on the hot path, background sampling—with SQL-queryable performance metrics and cross-node correlation analysis.

### What probing delivers...

Expand Down
23 changes: 22 additions & 1 deletion docs/src/design/modularity.md
Original file line number Diff line number Diff line change
Expand Up @@ -113,7 +113,7 @@ Python-side collectors (same layer, different language):

| Unit | Path | Responsibility |
|------|------|----------------|
| **probing-server** | `probing/server/` | Axum routes, auth, `initialize_engine()`, cluster fan-out |
| **probing-server** | `probing/server/` | Axum routes, auth, `initialize_engine()`, cluster fan-out, **torchrun heartbeat** (`torchrun_cluster.rs`) |
| **probing-cli** | `probing/cli/` | HTTP client to probe; inject/list/query/repl/**skill** |

Stable HTTP surface: `probing/server/API.md`, enforced by `tests/regression/spec/api_spec.json`.
Expand Down Expand Up @@ -325,8 +325,11 @@ Use this table to decide **where a change belongs**:
| Concern | Owner module | Interface |
|---------|--------------|-----------|
| SQL parsing, federation rewrite | probing-core | `Engine::async_query` |
| Table/column semantic docs | probing-core | `semantic_catalog` → `probe.probing.table_docs` / `column_docs` |
| mmap format, compaction | probing-memtable | `RowWriter`, `ColdStore` |
| Torchrun cluster heartbeat | probing-server | `torchrun_cluster.rs`, `cluster_report_backoff.rs`, PUT `/apis/nodes` |
| Mixed Python/C stack | probing-python/features | `python.backtrace`, pprof |
| macOS per-thread SIGUSR2 | probing-core | `signal::send_sigusr2_to_thread_id` |
| Torch module sampling | python/probing/profiling | `python.torch_trace` |
| Collective wall time | python/probing/profiling/collective | `python.comm_collective` |
| NCCL wait decomposition | probing-nccl-profiler | `nccl.proxy_ops` |
Expand Down Expand Up @@ -372,13 +375,29 @@ Track and fix incrementally:
| Issue | Current | Target |
|-------|---------|--------|
| Python ext → CLI | `probing-python` → `probing-cli` | **Accepted** for maturin wheel (`cli_main` only); keep import surface minimal |
| Python ext → CC | ~~`probing-python` → `probing-cc`~~ | **Done** — `send_sigusr2_to_thread_id` moved to `probing-core::signal` |
| Core → NCCL/HCCL | `probing-core` → `probing-nccl-profiler` / `probing-hccl-shim` (`builtin-schema-docs` feature) for `semantic_catalog` | Register docs via collector hooks or manifest; drop L1→L2 compile deps |
| Core → skills YAML | `semantic_catalog.rs` `include_str!(skills/semantic/tables.yaml)` | Accept as L4 overlay SSOT, or move YAML under `probing/core/resources/` |
| Server → python `features/*` | ~~`server/profiling.rs`~~ removed | Flamegraphs via `torchextension` / `pprofextension` `ProbeExtensionCall` |
| Server → python REPL internals | ~~`PythonRepl` in server~~ | `/ws` uses `ReplSession` facade only |
| Composition sprawl | All wiring in `server/engine.rs` | Optional: manifest TOML listing enabled extensions |
| Skills triple loader | Rust + Python + Web embed `skills/` | Keep `skills/` SSOT; loaders versioned together in CI |
| kmsg collector | Registered (Linux/kmsg feature gate) | Done |
| Architecture doc | 2-layer diagram | Superseded by this doc + [Data Layer](data-layer.md) |

### Cluster membership: Torchrun heartbeat vs Pulsing

Two **complementary** paths populate `cluster.nodes` and power `global.*` federation.
They do not replace each other.

| Path | Layer | When | Mechanism |
|------|-------|------|-----------|
| **Torchrun cluster heartbeat** | L3 `probing-server` | Default for `torchrun` / elastic jobs (`WORLD_SIZE > 1`, `PROBING=1/2`) | Hierarchical HTTP PUT + TCPStore side channel (`probing/torchrun/<run_id>/…`). Does **not** touch torch rendezvous keys. See [Torchrun cluster heartbeat](torchrun-cluster.md). |
| **Pulsing integration** | L4 passive + external runtime | Another process already runs [Pulsing](cluster-pulsing.md) and writes `pulsing.*` memtables | Probing discovers `pulsing.*` mmap tables; no probing-owned heartbeat thread. Optional bootstrap via Pulsing APIs. |

**Default for torchrun users:** heartbeat auto-starts from the Rust ctor (`maybe_start_torchrun_cluster()`).
**Pulsing:** use when the job already centers on Pulsing for membership/failure detection, or you need Pulsing actors alongside probing tables.

---

## 9. Adding a new feature (decision tree)
Expand Down Expand Up @@ -414,6 +433,8 @@ Need new raw signals?
| [Data Layer](data-layer.md) | MEMT/MEMC internals |
| [Extensibility](extensibility.md) | Public extension paths (table + skill) |
| [Distributed](distributed.md) | Federation & cluster |
| [Torchrun cluster heartbeat](torchrun-cluster.md) | Hierarchical torchrun membership |
| [Cluster with Pulsing](cluster-pulsing.md) | Optional Pulsing-based membership |
| [NCCL Profiler](nccl-profiler.md) | NCCL plugin boundary |
| [web/DESIGN.md](https://github.com/DeepLink-org/probing/blob/main/web/DESIGN.md) | UI module layout |
| [AGENTS.md](https://github.com/DeepLink-org/probing/blob/main/AGENTS.md) | Agent skill usage |
24 changes: 22 additions & 2 deletions docs/src/design/modularity.zh.md
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ Python 侧采集(同层,不同语言):

| 单元 | 路径 | 职责 |
|------|------|------|
| **probing-server** | `probing/server/` | 路由、认证、`initialize_engine()`、fan-out |
| **probing-server** | `probing/server/` | 路由、认证、`initialize_engine()`、fan-out、**torchrun 心跳**(`torchrun_cluster.rs`) |
| **probing-cli** | `probing/cli/` | HTTP 客户端;inject/query/repl/**skill** |

HTTP 契约:`probing/server/API.md` + `tests/regression/spec/api_spec.json`。
Expand Down Expand Up @@ -269,8 +269,11 @@ sequenceDiagram
| 关注点 | 归属 | 接口 |
|--------|------|------|
| SQL 解析、federation 重写 | probing-core | `Engine::async_query` |
| 表/列语义文档 | probing-core | `semantic_catalog` → `probe.probing.table_docs` / `column_docs` |
| mmap 格式、冷压缩 | probing-memtable | `RowWriter`, `ColdStore` |
| Torchrun 集群心跳 | probing-server | `torchrun_cluster.rs`、`cluster_report_backoff.rs`、`PUT /apis/nodes` |
| 混合 Python/C 栈 | probing-python/features | `python.backtrace`、pprof |
| macOS 线程 SIGUSR2 | probing-core | `signal::send_sigusr2_to_thread_id` |
| Torch 模块采样 | python/profiling | `python.torch_trace` |
| Collective 墙钟 | python/profiling/collective | `python.comm_collective` |
| NCCL wait 分解 | nccl-profiler | `nccl.proxy_ops` |
Expand Down Expand Up @@ -313,13 +316,28 @@ sequenceDiagram
| 问题 | 现状 | 目标 |
|------|------|------|
| python → cli | `probing-python` → `probing-cli` | **可接受**(maturin wheel,仅 `cli_main`);禁止扩散 import |
| python → cc | ~~`probing-python` → `probing-cc`~~ | **已完成** — `send_sigusr2_to_thread_id` 迁至 `probing-core::signal` |
| core → NCCL/HCCL | `probing-core` → nccl/hccl shim(`builtin-schema-docs`)供 `semantic_catalog` | 改由采集器注册 docs,去掉 L1→L2 编译依赖 |
| core → skills YAML | `semantic_catalog` `include_str!(skills/semantic/tables.yaml)` | 接受为 L4 overlay SSOT,或迁到 `probing/core/resources/` |
| server → python features | ~~`server/profiling.rs`~~ 已删 | 火焰图走 Extension |
| server → REPL 内部 | ~~`PythonRepl`~~ | `/ws` 仅用 `ReplSession` 门面 |
| 组装集中 | 全在 server/engine.rs | 可选 extension manifest |
| Skill 三份 loader | Rust/Python/Web | `skills/` 为 SSOT,CI 同步校验 |
| kmsg 采集器 | 已实现未注册 | 在 engine 注册或删除 |
| kmsg 采集器 | 已注册(Linux/kmsg feature gate) | Done |
| Architecture 文档 | 二层旧图 | 以本文 + [数据层](data-layer.zh.md) 为准 |

### 集群成员:Torchrun 心跳 vs Pulsing

两条**互补**路径填充 `cluster.nodes` 并支撑 `global.*` 联邦,互不取代。

| 路径 | 层 | 适用场景 | 机制 |
|------|-----|----------|------|
| **Torchrun 集群心跳** | L3 `probing-server` | 默认:`torchrun`/elastic(`WORLD_SIZE > 1`,`PROBING=1/2`) | 分层 HTTP PUT + TCPStore 旁路键(`probing/torchrun/<run_id>/…`),**不**写 rendezvous 键。见 [Torchrun 集群心跳](torchrun-cluster.zh.md)。 |
| **Pulsing 集成** | L4 被动 + 外部运行时 | 作业已跑 [Pulsing](cluster-pulsing.zh.md) 并写 `pulsing.*` memtable | Probing 发现 mmap 表;无 probing 自有心跳线程。 |

**torchrun 用户默认**:Rust ctor 自动 `maybe_start_torchrun_cluster()`。
**Pulsing**:作业以 Pulsing 为中心做成员/故障检测,或需要 Pulsing actor 与 probing 表并存时使用。

---

## 9. 新功能决策树
Expand Down Expand Up @@ -347,6 +365,8 @@ sequenceDiagram
| [数据层](data-layer.zh.md) | MEMT/MEMC 内部实现 |
| [扩展机制](extensibility.zh.md) | 对外扩展路径(表 + skill + NCCL) |
| [分布式](distributed.zh.md) | 联邦与集群 |
| [Torchrun 集群心跳](torchrun-cluster.zh.md) | 分层 torchrun 成员注册 |
| [基于 Pulsing 的集群](cluster-pulsing.zh.md) | 可选 Pulsing 成员发现 |
| [NCCL Profiler](nccl-profiler.zh.md) | NCCL 插件边界 |
| [web/DESIGN.md](https://github.com/DeepLink-org/probing/blob/main/web/DESIGN.md) | 前端模块布局 |
| [AGENTS.md](https://github.com/DeepLink-org/probing/blob/main/AGENTS.md) | Agent 使用 skill |
6 changes: 3 additions & 3 deletions probing/cli/src/cli/bench/runners/coldscan.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@

use std::time::Instant;

use anyhow::{bail, Result};
use anyhow::{bail, Context, Result};
use probing_memtable::memc::{ColdStore, SegmentReader};

use crate::cli::bench::args::ColdscanArgs;
Expand Down Expand Up @@ -48,8 +48,8 @@ pub fn run(args: &ColdscanArgs, json: bool, seed: u64) -> Result<()> {
let mut rows = 0u64;
let mut disk = 0u64;
for path in &segments {
let reader = SegmentReader::open(path)
.map_err(|e| anyhow::anyhow!("open {}: {e}", path.display()))?;
let reader =
SegmentReader::open(path).with_context(|| format!("open {}", path.display()))?;
for (i, page) in reader.pages().iter().enumerate() {
disk += page.block_len as u64;
let cols = reader
Expand Down
7 changes: 5 additions & 2 deletions probing/cli/src/cli/bench/runners/mixed.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

use anyhow::{bail, Result};
use anyhow::{bail, Context, Result};
use probing_memtable::memc::{ColdStore, Compactor, CompactorConfig};
use probing_memtable::{DType, MemTable};

Expand Down Expand Up @@ -58,7 +58,10 @@ pub fn run(args: &MixedArgs, json: bool, seed: u64) -> Result<()> {
args.ring.chunk_size,
args.ring.chunks,
)?;
let path = creator.path().expect("shared path").to_path_buf();
let path = creator
.path()
.context("shared memtable has no file path")?
.to_path_buf();
(Attach::File(path), creator)
}
};
Expand Down
5 changes: 4 additions & 1 deletion probing/cli/src/cli/bench/runners/mp.rs
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,10 @@ fn orchestrate(args: &MpArgs, json: bool, seed: u64) -> Result<()> {
args.ring.chunk_size,
args.ring.chunks,
)?;
let path = creator.path().expect("shared path").to_path_buf();
let path = creator
.path()
.context("shared memtable has no file path")?
.to_path_buf();
(Attach::File(path), creator)
}
};
Expand Down
4 changes: 2 additions & 2 deletions probing/cli/src/cli/bench/runners/write.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
use std::sync::Barrier;
use std::time::Instant;

use anyhow::{bail, Result};
use anyhow::{bail, Context, Result};
use probing_memtable::MemTable;

use super::common;
Expand Down Expand Up @@ -94,7 +94,7 @@ pub fn run(args: &WriteArgs, json: bool, seed: u64) -> Result<()> {
)?;
let path = creator
.path()
.expect("shared table has a path")
.context("shared memtable has no file path")?
.to_path_buf();
(Source::File(path), Some(creator))
}
Expand Down
6 changes: 3 additions & 3 deletions probing/cli/src/cli/ctrl.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
use anyhow::Result;
use anyhow::{Context, Result};
use std::io::Write;

use http_body_util::{BodyExt, Full};
Expand Down Expand Up @@ -232,13 +232,13 @@ pub async fn request(ctrl: ProbeEndpoint, url: &str, body: Option<String>) -> Re
.method("POST")
.uri(url)
.body(Full::<Bytes>::from(body))
.map_err(|e| anyhow::anyhow!("Failed to build POST request: {e}"))?
.context("Failed to build POST request")?
} else {
Request::builder()
.method("GET")
.uri(url)
.body(Full::<Bytes>::default())
.map_err(|e| anyhow::anyhow!("Failed to build GET request: {e}"))?
.context("Failed to build GET request")?
};

let res = sender.send_request(request).await?;
Expand Down
8 changes: 4 additions & 4 deletions probing/cli/src/cli/repl.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
use anyhow::Result;
use anyhow::{Context, Result};
use futures_util::sink::Sink;
use futures_util::stream::Stream;
use futures_util::{SinkExt, StreamExt};
Expand Down Expand Up @@ -45,7 +45,7 @@ pub async fn start_repl(ctrl: ProbeEndpoint) -> Result<()> {
editor.read_line(&prompt)
})
.await
.map_err(|e| anyhow::anyhow!("Error reading input task: {}", e))?;
.context("Error reading input task")?;

match sig {
Ok(Signal::Success(line)) => {
Expand Down Expand Up @@ -155,7 +155,7 @@ async fn connect_tcp_websocket(addr: &str) -> Result<WsConnection> {
let url = format!("ws://{}/ws", addr);
let (ws_stream, _) = connect_async(&url)
.await
.map_err(|e| anyhow::anyhow!("WebSocket connection failed: {}", e))?;
.context("WebSocket connection failed")?;

Ok(boxed_connection(ws_stream))
}
Expand All @@ -182,7 +182,7 @@ async fn connect_unix_websocket(pid: i32) -> Result<WsConnection> {

let (ws_stream, _) = client_async("ws://localhost/ws", stream)
.await
.map_err(|e| anyhow::anyhow!("WebSocket connection failed: {}", e))?;
.context("WebSocket connection failed")?;

Ok(boxed_connection(ws_stream))
}
Expand Down
4 changes: 4 additions & 0 deletions probing/core/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ similar_names = "allow"
test-utils = []
default = ["builtin-schema-docs"]
builtin-schema-docs = ["dep:probing-hccl-shim", "dep:probing-nccl-profiler"]
python-bridge = ["dep:pyo3"]

[lib]
crate-type = ["rlib"]
Expand Down Expand Up @@ -57,6 +58,9 @@ bincode = "1.3.3"
uuid = { version = "1.0", features = ["v4", "serde"] }
url = "2.5"
libc = "0.2"
pyo3 = { version = "0.29.0", optional = true, default-features = false, features = [
"macros",
] }

[dev-dependencies]
tempfile = "3.8"
4 changes: 2 additions & 2 deletions probing/core/src/core/cluster.rs
Original file line number Diff line number Diff line change
Expand Up @@ -59,8 +59,8 @@ static CLUSTER_VERSION: AtomicU64 = AtomicU64::new(0);
fn now_micros() -> u64 {
std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_micros() as u64
.map(|d| d.as_micros() as u64)
.unwrap_or(0)
}

fn stale_threshold_micros() -> u64 {
Expand Down
Loading
Loading