LangFuse Observability
Summary
Add production AI observability by integrating LangFuse at the AI Gateway (Cloudflare Worker) level. The gateway already intercepts every LLM request/response — it gains a LangFuse integration that records traces with model, token usage, latency, and cost data. The client (Flutter/Rust) passes trace context headers (X-Trace-Id, X-Trace-Span-Id, X-Trace-Parent-Span-Id, X-Trace-Session-Id, X-Trace-Operation) through the gateway. The gateway uses LangFuse’s native SDK to build nested trace trees — parent agent spans containing child agent spans containing LLM generations — giving full semantic hierarchy without OTEL, and without exposing any secrets to the client.
Motivation
The AI pipeline currently has two observability paths, both inadequate for production:
- File-based logging (`logging.rs` in Rust) — flat text written to `ai_chat.log` via direct file I/O. No structure, no aggregation, no cost tracking. Exists to debug tool calls during development.
- Cloudflare `console.log` (in ai-gateway) — JSON log of `{uid, provider, status, key, requestId}` per request. Visible in `wrangler tail` but no aggregation, no token tracking, no prompt/completion capture.
We need:
- Per-session trace grouping — all LLM calls within an editing session linked together
- Per-user cost tracking — token usage attributed to users for billing/monitoring
- Latency breakdown — time spent in LLM calls, visible per provider and model
- Token usage tracking — input/output tokens per generation, aggregated by model
- Error visibility — rate limits (429s), provider errors (5xx), and fallback patterns
- Parent/child hierarchy — lesson plan generation shows parent agent → child whiteboard agents as linked traces
- Tool call visibility — which tools the agent called and with what arguments
Why the gateway, not the client
The Rust/Rig code runs on the user’s device (compiled into the Flutter app). Integrating LangFuse at the Rust level would require embedding LangFuse API keys in the client binary. This is a security risk:
- Key extraction — anyone who decompiles the app can extract the LangFuse secret key
- Trace poisoning — with the key, an attacker can write arbitrary traces to LangFuse, corrupting all observability data (fake token counts, phantom sessions, misleading error rates)
- Data exfiltration — depending on key permissions, the attacker could read other users’ prompts and AI responses
The AI gateway is the trust boundary. It’s server-side infrastructure we control, it already sees every LLM request/response, and it already has the authenticated user ID (X-Uid) and request ID (X-Request-Id). LangFuse keys stay server-side.
Why LangFuse
- JavaScript SDK — native integration for Cloudflare Workers
- Open source, self-hostable — start with cloud, move to self-hosted when volume justifies
- Session/user grouping — first-class concepts in the data model
- Cost calculation — automatic from model name + token counts
- Prompt/completion capture — full request/response bodies for debugging
- Purpose-built for LLM observability — token tracking, cost dashboards, and prompt inspection out of the box, unlike general-purpose tools (Datadog, Grafana)
Data & privacy
LangFuse captures full request and response bodies. If user prompts contain personal information (student names, learning context), this data is stored in LangFuse Cloud.
Mitigations:
- LangFuse Cloud is SOC 2 Type II compliant with data processing in the US/EU.
- Self-hosting is a planned follow-up once volume justifies it — this keeps all data on our own infrastructure.
- For the initial deployment, prompt/completion capture is enabled by default (essential for debugging). If privacy review requires it, we can truncate or hash prompt bodies before sending to LangFuse — this is a one-line change in the gateway integration code.
- No student-identifiable data is stored in LangFuse metadata fields — only `uid` (Firebase UID), which is opaque without access to our user database.
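If that privacy switch is ever flipped, the truncation could look like the sketch below. The helper name and the 500-character limit are illustrative placeholders, not part of the current gateway code:

```typescript
// Hypothetical redaction helper — one possible "truncate prompt bodies"
// mitigation. Keeps the first maxChars characters and records how much
// was dropped, so traces remain useful for debugging.
function redactForLangfuse(bodyText: string, maxChars = 500): string {
  if (bodyText.length <= maxChars) return bodyText;
  const dropped = bodyText.length - maxChars;
  return bodyText.slice(0, maxChars) + ` … [truncated ${dropped} chars]`;
}
```

The gateway would apply this to `bodyText` just before passing it to LangFuse as `input`, leaving the forwarded binary body untouched.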
Design
Architecture
```text
Flutter App (untrusted — no secrets)
 │
 │ POST /ai/cerebras/v1/chat/completions
 │ Headers:
 │   Authorization: Bearer {firebase_jwt}
 │   X-Trace-Session-Id: {editing_session_id}
 │   X-Trace-Id: {operation_tree_id}
 │   X-Trace-Span-Id: {agent_phase_id}
 │   X-Trace-Parent-Span-Id: {parent_phase_id}   (optional)
 │   X-Trace-Operation: generate | chat | generate_parent
 │   X-Trace-Tags: whiteboard,lesson-plan        (optional)
 ▼
jwt-worker (Firebase JWT + Oso auth)
 │ Sets: X-Uid, X-Request-Id
 │ Forwards: X-Trace-* headers via Service Binding (in-process)
 ▼
ai-gateway Worker   ◄── LangFuse integration here
 │ 1. Create/reuse LangFuse trace (by X-Trace-Id)
 │ 2. Create/reuse span (by X-Trace-Span-Id, nested under parent)
 │ 3. Proxy to CF AI Gateway → LLM Provider
 │ 4. Record generation under span (model, tokens, latency, status)
 │ 5. Flush trace (via waitUntil)
 ▼
CF AI Gateway → LLM Provider (Cerebras, OpenAI, etc.)
```

Gateway LangFuse integration
The ai-gateway worker (`infrastructure/ai-gateway/src/index.ts`) gains LangFuse as a dependency and instruments every request.
Dependencies
```json
{
  "dependencies": {
    "langfuse": "^3.0.0"
  }
}
```

Environment configuration
Section titled “Environment configuration”# wrangler.toml — add to each environment[env.dev.vars]LANGFUSE_BASE_URL = "https://us.cloud.langfuse.com"
# Secrets (set via `wrangler secret put`):# LANGFUSE_PUBLIC_KEY# LANGFUSE_SECRET_KEYHandler signature change
The current ai-gateway handler does not accept an `ExecutionContext`:
```ts
// BEFORE
async fetch(request: Request, env: Env): Promise<Response>

// AFTER — ctx is required for waitUntil
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response>
```

This is required because `ctx.waitUntil()` is the only way to defer work (the LangFuse flush) until after the response is returned to the client.
CORS update
The current CORS headers in both `handleCors()` and `setCorsHeaders()` only allow `Authorization`, `Content-Type`, `X-Uid`, `X-Request-Id`. The new `X-Trace-*` headers must be added:
```ts
const ALLOWED_HEADERS = [
  'Authorization',
  'Content-Type',
  'X-Uid',
  'X-Request-Id',
  'X-Trace-Id',
  'X-Trace-Span-Id',
  'X-Trace-Parent-Span-Id',
  'X-Trace-Session-Id',
  'X-Trace-Operation',
  'X-Trace-Tags',
].join(', ');
```

Note: The jwt-worker → ai-gateway path uses a Cloudflare Service Binding (`env.AI_GATEWAY.fetch(forwardReq)` in `ai-routing.ts`), which is in-process and bypasses CORS. However, the jwt-worker itself negotiates CORS with the browser and forwards all headers — so the jwt-worker’s own CORS config must also include these headers if it has an explicit allowlist.
Integration code
The LangFuse client is initialized as a module-level singleton (Workers reuse isolates across requests within the same instance):
```ts
import { Langfuse } from "langfuse";

// Module-level — persists across requests within the same Worker isolate.
// Each isolate creates its own instance; unflushed data is lost on eviction,
// which is why we flush via waitUntil on every request.
let langfuse: Langfuse | null = null;

function getLangfuse(env: Env): Langfuse | null {
  // Gracefully degrade if secrets aren't configured
  if (!env.LANGFUSE_PUBLIC_KEY || !env.LANGFUSE_SECRET_KEY) return null;
  if (!langfuse) {
    langfuse = new Langfuse({
      publicKey: env.LANGFUSE_PUBLIC_KEY,
      secretKey: env.LANGFUSE_SECRET_KEY,
      baseUrl: env.LANGFUSE_BASE_URL,
    });
  }
  return langfuse;
}
```

Per-request instrumentation — all LangFuse calls are wrapped in try/catch so a LangFuse failure never affects the LLM response path:
```ts
// In the request handler, after auth check:
const lf = getLangfuse(env);

// Extract trace context headers
const traceId = request.headers.get("X-Trace-Id") ?? requestId;
const spanId = request.headers.get("X-Trace-Span-Id") ?? requestId;
const parentSpanId = request.headers.get("X-Trace-Parent-Span-Id");
const traceSessionId = request.headers.get("X-Trace-Session-Id");
const traceOperation = request.headers.get("X-Trace-Operation") ?? "unknown";
const traceTags = request.headers.get("X-Trace-Tags")?.split(",").filter(Boolean) ?? [];

// Buffer request body for both forwarding and logging.
// The existing code uses request.arrayBuffer() — we keep binary for forwarding
// and decode to string only for LangFuse logging.
const bodyBuffer = await request.arrayBuffer();
const bodyText = new TextDecoder().decode(bodyBuffer);

let trace, span, generation;
try {
  if (lf) {
    // Create or reuse trace (LangFuse deduplicates by ID)
    trace = lf.trace({
      id: traceId,
      name: traceOperation,
      sessionId: traceSessionId ?? undefined,
      userId: uid,
      tags: [provider, ...traceTags],
      metadata: { requestId, provider, environment: env.ENVIRONMENT },
    });

    // Create a span for this agent phase (nested under parent if provided)
    span = trace.span({
      id: spanId,
      name: traceOperation,
      parentObservationId: parentSpanId ?? undefined,
    });

    // Create generation nested under span (before proxying)
    generation = span.generation({
      name: `${provider}.chat`,
      model: extractModelFromBody(bodyText),
      input: bodyText,
      metadata: {
        keyAlias: keyLabel(usedKey),
        fallbackAttempt: attemptIndex,
      },
    });
  }
} catch (e) {
  console.error("[langfuse] trace creation failed:", e);
}

// Proxy to CF AI Gateway — forward the original binary body
const upstreamResponse = await fetch(gatewayUrl, {
  method: "POST",
  headers: forwardHeaders,
  body: bodyBuffer,
});

// Parse response for token usage (non-streaming path)
if (!isStreaming) {
  const responseBody = await upstreamResponse.text();
  // Guard the parse — upstream error bodies may not be valid JSON.
  let parsed: any = {};
  try { parsed = JSON.parse(responseBody); } catch { /* non-JSON body */ }

  try {
    generation?.end({
      output: responseBody,
      usage: parsed.usage
        ? {
            inputTokens: parsed.usage.prompt_tokens,
            outputTokens: parsed.usage.completion_tokens,
            totalTokens: parsed.usage.total_tokens,
          }
        : undefined,
      statusMessage: upstreamResponse.ok ? undefined : `HTTP ${upstreamResponse.status}`,
      level: upstreamResponse.ok ? "DEFAULT" : "ERROR",
    });
    if (lf) ctx.waitUntil(lf.flushAsync());
  } catch (e) {
    console.error("[langfuse] generation end failed:", e);
  }

  return new Response(responseBody, {
    status: upstreamResponse.status,
    headers: responseHeaders,
  });
}

// Streaming path — see below
```

extractModelFromBody helper
Extracts the model name from the request body. All supported providers use the OpenAI-compatible `{ "model": "..." }` format since they go through CF AI Gateway:
```ts
function extractModelFromBody(bodyText: string): string | undefined {
  try {
    const parsed = JSON.parse(bodyText);
    return parsed.model ?? undefined;
  } catch {
    return undefined;
  }
}
```

Streaming responses
For the initial deployment, streaming responses log the trace without token usage or completion body (Option B). This is the simplest correct approach — adding SSE parsing is a follow-up.
SSE chunk boundaries do not align with ReadableStream read boundaries (a single read can contain partial lines or multiple events), making correct SSE parsing non-trivial. Rather than ship a buggy parser, we log what we can and add usage capture later.
```ts
if (isStreaming) {
  try {
    generation?.end({
      // No output or usage for streaming — added in follow-up
      statusMessage: upstreamResponse.ok ? undefined : `HTTP ${upstreamResponse.status}`,
      level: upstreamResponse.ok ? "DEFAULT" : "ERROR",
    });
    if (lf) ctx.waitUntil(lf.flushAsync());
  } catch (e) {
    console.error("[langfuse] streaming generation end failed:", e);
  }

  // Pass through the stream unmodified
  return new Response(upstreamResponse.body, {
    status: upstreamResponse.status,
    headers: responseHeaders,
  });
}
```

Follow-up: streaming token capture. When needed, add a `TransformStream` tee that watches for the final SSE usage chunk. OpenAI includes usage when `stream_options.include_usage` is set; Cerebras and Google may not. This requires per-provider testing and a proper SSE line parser.
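The line-buffering core of that follow-up could be sketched as below — a hypothetical `SseUsageWatcher` that reassembles `data:` lines split across stream reads and remembers the last `usage` object it sees. Real provider payloads still need per-provider testing before this ships:

```typescript
// Sketch only — illustrates why SSE parsing needs a line buffer: a single
// ReadableStream read can end mid-line, so we keep the trailing partial
// line and only parse complete "data:" lines.
class SseUsageWatcher {
  private buffer = "";
  usage: { prompt_tokens?: number; completion_tokens?: number } | null = null;

  feed(chunkText: string): void {
    this.buffer += chunkText;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for the next read
    for (const line of lines) {
      if (!line.startsWith("data:")) continue;
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") continue;
      try {
        const event = JSON.parse(payload);
        if (event.usage) this.usage = event.usage; // last usage chunk wins
      } catch {
        // ignore malformed payloads — never break the passthrough
      }
    }
  }
}
```

A `TransformStream` tee would call `feed()` on each decoded chunk and read `usage` once the stream closes, then end the generation with the captured token counts.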
Trace context headers
The client passes trace context as HTTP headers. These are not secrets — they’re metadata for grouping and filtering.
| Header | Purpose | Example | Required |
|---|---|---|---|
| `X-Trace-Session-Id` | Groups all LLM calls in an editing session | `"session-abc123"` | No |
| `X-Trace-Id` | Unique ID for an entire operation tree | `"trace-xyz789"` | No |
| `X-Trace-Span-Id` | Unique ID for this specific agent phase | `"span-parent"` | No |
| `X-Trace-Parent-Span-Id` | Parent span for nesting | `"span-parent"` (or empty for root) | No |
| `X-Trace-Operation` | Names the trace/span in LangFuse UI | `"generate"`, `"chat"`, `"generate_parent"` | No |
| `X-Trace-Tags` | Comma-separated filterable tags | `"whiteboard,one-shot"` | No |
Abuse mitigation: A client with a valid JWT could send unique X-Trace-Id values per request, creating many trace objects in LangFuse (cost amplification). Since trace IDs default to requestId when not provided (and requestId is already one-per-request), the attack surface only exists when explicit trace IDs are sent. Mitigation: the gateway validates that X-Trace-Id is a reasonable UUID format and ignores malformed values, falling back to requestId.
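A minimal version of that validation might look like the following sketch. The regex is deliberately looser than a strict UUID check so that slug-style IDs such as `trace-abc` still pass; the helper name and the 64-character cap are illustrative assumptions:

```typescript
// Hypothetical trace-ID validation — rejects anything that isn't a short
// URL-safe token, falling back to the per-request requestId so malformed
// or abusive values cannot amplify trace-object creation.
const TRACE_ID_RE = /^[A-Za-z0-9_-]{1,64}$/;

function resolveTraceId(headerValue: string | null, requestId: string): string {
  if (headerValue && TRACE_ID_RE.test(headerValue)) return headerValue;
  return requestId;
}
```

The same check can be applied to `X-Trace-Span-Id` and `X-Trace-Parent-Span-Id`, since all three share the same format.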
Client-side header injection (Rust)
The `TraceContext` struct carries trace hierarchy through the Rust pipeline:
```rust
pub struct TraceContext {
    pub session_id: Option<String>,      // from Dart — editing session ID
    pub trace_id: String,                // generated in Rust — unique per operation tree
    pub span_id: String,                 // generated in Rust — unique per agent phase
    pub parent_span_id: Option<String>,  // generated in Rust — links child → parent
    pub operation: String,               // set in Rust — "generate", "chat", "generate_parent"
    pub tags: Vec<String>,               // from Dart + Rust — ["whiteboard", "one-shot"]
}
```

Only `session_id` comes from Dart (via `AgentConfig`). The `trace_id`, `span_id`, and `parent_span_id` are generated in Rust because the agent orchestration code (`run_generate_parent`, `run_agent_loop` in `crates/session/src/agent.rs`) is the only layer that knows the multi-phase parent/child structure. Dart doesn’t know how many LLM calls a generation will make or which are parent vs child.
Rig header injection mechanism
Rig v0.29’s `Client<Ext, H>` stores headers as `Arc<HeaderMap>` — immutable after construction. The `post()` and `post_sse()` methods copy these default headers onto every outgoing request. This works for static headers, but at first glance it seems to rule out per-request trace context, since the header map cannot change after the client is built.
The key insight: trace_id and span_id are stable within a single chat_stream call (one operation tree, one agent phase). They only change between calls (e.g., parent phase → child phase). Since each chat_stream call constructs a new Rig Agent with a new Client, we can set default headers at client construction time:
```rust
fn build_openai_model(
    config: &AiConfig,
    trace_ctx: Option<&TraceContext>,
) -> Result<openai::CompletionModel, RunnerError> {
    let mut builder = openai::CompletionsClient::builder()
        .api_key(&config.api_key)
        .base_url(&config.base_url);

    if let Some(ctx) = trace_ctx {
        let mut headers = http::HeaderMap::new();
        headers.insert("X-Trace-Id", ctx.trace_id.parse().unwrap());
        headers.insert("X-Trace-Span-Id", ctx.span_id.parse().unwrap());
        if let Some(ref parent) = ctx.parent_span_id {
            headers.insert("X-Trace-Parent-Span-Id", parent.parse().unwrap());
        }
        if let Some(ref session) = ctx.session_id {
            headers.insert("X-Trace-Session-Id", session.parse().unwrap());
        }
        headers.insert("X-Trace-Operation", ctx.operation.parse().unwrap());
        if !ctx.tags.is_empty() {
            headers.insert("X-Trace-Tags", ctx.tags.join(",").parse().unwrap());
        }
        builder = builder.http_headers(headers);
    }

    let client = builder.build().map_err(|e| RunnerError::Config(e.to_string()))?;
    Ok(client.completion_model(&config.model))
}
```

Rig’s `ClientBuilder::http_headers(headers)` sets the `HeaderMap` that gets Arc-wrapped at build time and applied to every request via `post()` / `post_sse()`. Since a new client is built per `chat_stream` call, each call gets the correct trace context. Multi-turn requests within the same call share the same `span_id` — which is correct, as they represent multiple LLM turns within one agent phase.
Span nesting via LangFuse SDK
The gateway reconstructs a proper parent/child trace tree using LangFuse’s native `trace.span()` and `span.generation()` nesting APIs — no OTEL needed. The key insight: the client sends `X-Trace-Id` (shared across all requests in one operation tree) and `X-Trace-Span-Id` / `X-Trace-Parent-Span-Id` (describing the tree structure). The gateway uses these to build nested observations within a single LangFuse trace.
How the client sets trace context
For a lesson plan generation (parent → 2 children), the Rust agent code sets headers on each HTTP request:
```text
Request 1 — parent agent, multi-turn call 1:
  X-Trace-Id: trace-abc
  X-Trace-Span-Id: span-parent
  X-Trace-Parent-Span-Id: (empty)
  X-Trace-Operation: generate_parent

Request 2 — parent agent, multi-turn call 2:
  X-Trace-Id: trace-abc
  X-Trace-Span-Id: span-parent          ← same span, another LLM turn
  X-Trace-Parent-Span-Id: (empty)
  X-Trace-Operation: generate_parent

Request 3 — child slide-1:
  X-Trace-Id: trace-abc                 ← same trace tree
  X-Trace-Span-Id: span-slide-1
  X-Trace-Parent-Span-Id: span-parent   ← linked to parent
  X-Trace-Operation: generate
  X-Trace-Tags: whiteboard,child

Request 4 — child slide-2:
  X-Trace-Id: trace-abc
  X-Trace-Span-Id: span-slide-2
  X-Trace-Parent-Span-Id: span-parent
  X-Trace-Operation: generate
  X-Trace-Tags: whiteboard,child
```

The `trace_id` is generated once per top-level operation (e.g. one “Generate” button click). The `span_id` is generated per agent phase. The `parent_span_id` links children to their parent. In Rust, `run_generate_parent` sets these:
```rust
// In run_generate_parent (crates/session/src/agent.rs):
let trace_id = uuid();
let parent_span_id = uuid();

// Phase 1: parent agent — set on TraceContext before calling chat_stream
let parent_ctx = TraceContext {
    trace_id: trace_id.clone(),
    span_id: parent_span_id.clone(),
    parent_span_id: None,
    operation: "generate_parent".into(),
    ..
};

// Phase 2: each child — set on TraceContext before calling child's chat_stream
let child_ctx = TraceContext {
    trace_id: trace_id.clone(),                    // same tree
    span_id: uuid(),                               // unique per child
    parent_span_id: Some(parent_span_id.clone()),  // linked to parent
    operation: "generate".into(),
    ..
};
```

How the gateway builds the LangFuse tree
The gateway integration code (shown above) uses the same trace → span → generation nesting for every request. The key behavior that makes this work across multiple requests is LangFuse’s ID-based deduplication:
- Same `traceId` across requests → all observations land in one trace (e.g. all 4 requests in the lesson plan example share `trace-abc`)
- Same `spanId` across requests → multiple generations nest under one span (e.g. the parent agent’s multi-turn calls both use `span-parent`, so both LLM turns appear as sibling generations under it)
- `parentObservationId` links child spans to parent spans within the trace (e.g. `span-slide-1` and `span-slide-2` both reference `span-parent`)
No special gateway logic is needed per request — the same integration code runs identically for every request. The trace tree structure emerges entirely from the IDs the client sets in headers.
Resulting LangFuse trace tree
```text
Trace: "generate_parent" (trace-abc)
└─ Span: "generate_parent" (span-parent)
   ├─ Generation: cerebras.chat (turn 1)
   │     model: llama-4-scout, tokens: 1200/800
   ├─ Generation: cerebras.chat (turn 2)
   │     model: llama-4-scout, tokens: 400/200
   ├─ Span: "generate" (span-slide-1, parent: span-parent)
   │  └─ Generation: cerebras.chat
   │        model: llama-4-scout, tokens: 800/600
   └─ Span: "generate" (span-slide-2, parent: span-parent)
      └─ Generation: cerebras.chat
            model: llama-4-scout, tokens: 700/500
```

This gives full semantic nesting — identical to what OTEL span trees would provide — using only HTTP headers and LangFuse’s native SDK. No OpenTelemetry, no tracing subscriber, no span propagation across threads.
Trade-off: gen_ai.* OTEL attributes
Rig v0.29 internally emits rich tracing spans with OpenTelemetry `gen_ai.*` semantic convention attributes:
```rust
// Rig's OpenAI provider creates these spans automatically:
info_span!(
    "chat",
    gen_ai.operation.name = "chat",
    gen_ai.provider.name = "openai",
    gen_ai.request.model = self.model,
    gen_ai.usage.input_tokens = Empty,
    gen_ai.usage.output_tokens = Empty,
    gen_ai.response.id = Empty,
    gen_ai.input.messages = ...,
    gen_ai.output.messages = ...,
);
```

With the gateway approach, these spans are not exported (there’s no OTEL subscriber in the Rust process). The gateway reconstructs equivalent data from HTTP request/response bodies:
| Data Point | Rig OTEL (unused) | Gateway (actual source) |
|---|---|---|
| Model name | gen_ai.request.model | Parsed from request body JSON |
| Provider | gen_ai.provider.name | Extracted from URL path segment |
| Input tokens | gen_ai.usage.input_tokens | Parsed from response body / SSE final chunk |
| Output tokens | gen_ai.usage.output_tokens | Parsed from response body / SSE final chunk |
| Prompt content | gen_ai.input.messages | Full request body captured |
| Completion | gen_ai.output.messages | Full response body captured |
| Response ID | gen_ai.response.id | Parsed from response body |
| Latency | Span duration | Date.now() delta in the worker |
What we genuinely lose by not using Rig’s OTEL spans:
- Client-side tool execution timing — Rig spans would measure how long each tool call took to execute in Rust. The gateway only sees the time between LLM requests. Mitigated by the optional `X-Trace-Tool-Calls` header (see Tool call visibility below).
- Internal Rig metadata — response model name (can differ from request model), system prompt content (set via Rig’s builder, not in the HTTP body).
These are acceptable losses. Rig’s tracing spans remain useful for local development by adding a tracing-subscriber fmt layer — they just don’t export to LangFuse.
Tool call visibility
Section titled “Tool call visibility”Tool calls happen client-side (Rust) — the gateway only sees the resulting LLM requests. To capture tool call detail, two complementary approaches:
Approach 1: Tool metadata in request body.
The LLM request body already contains the tool definitions and tool results (as conversation history). LangFuse captures the full request body as input, so tool calls are visible in the prompt inspector.
Approach 2: Client-side tool call headers (optional, future).
Add an `X-Trace-Tool-Calls` header with a compact JSON summary:
```text
X-Trace-Tool-Calls: [{"name":"set_title","duration_ms":12},{"name":"add_element","duration_ms":45}]
```

The gateway records this as trace metadata. This is optional and can be added incrementally.
Fallback tracking
Section titled “Fallback tracking”The ai-gateway already has fallback logic (try default key → numbered aliases on 429/5xx). LangFuse integration captures this naturally:
```ts
generation.update({
  metadata: {
    keyAlias: keyLabel(usedKey),
    fallbackAttempt: attemptIndex,   // 0 = first try, 1+ = fallback
    fallbackReason: previousStatus,  // 429, 500, etc.
  },
});
```

This enables filtering in LangFuse for “requests that required fallback” — useful for monitoring rate limit pressure.
Error isolation
LangFuse must never block or break the LLM proxy path. All LangFuse operations are wrapped in try/catch:
- If `getLangfuse(env)` returns `null` (missing secrets), the gateway proxies normally with no tracing.
- If trace/span/generation creation throws, the error is logged to `console.error` and the request proceeds.
- If `flushAsync()` fails in `waitUntil`, it fails silently after the response is already sent.
- The gateway continues to function identically if LangFuse is down, misconfigured, or rate-limited.
Dependency graph
```text
infrastructure/ai-gateway   (gains langfuse dependency + integration code)
infrastructure/jwt-worker   (unchanged — Service Binding forwards X-Trace-* headers in-process)
crates/core                 (adds TraceContext to AgentConfig)
crates/platform/ai          (sets X-Trace-* headers on outgoing requests via Rig's http_headers)
crates/session              (passes TraceContext through agent spawn)
crates/api                  (exposes TraceContext fields to FRB)
```

No new Rust crates. No OTEL pipeline. No secrets on the client.
Implementation Plan
Section titled “Implementation Plan”Phase 1: Gateway-only (deployable independently, no client changes)
- Add `langfuse` to ai-gateway — `npm install langfuse`; add `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_BASE_URL` secrets/vars to wrangler.toml
- Update handler signature — add `ctx: ExecutionContext` as third parameter to `fetch()`
- Update CORS — add `X-Trace-*` headers to `handleCors()` and `setCorsHeaders()` allowlists
- Instrument ai-gateway — create trace + generation per request, flush via `ctx.waitUntil`. Buffer the body as an `arrayBuffer` (preserving existing behavior), decode to string for LangFuse only. Wrap all LangFuse calls in try/catch.
- Handle streaming — log the trace without token usage (Option B). Streaming token capture is a follow-up.
- Verify SDK compatibility — deploy to dev, run an LLM call, confirm the trace appears in LangFuse Cloud. If the `langfuse` npm package fails in Workers (Node.js API dependency), fall back to `langfuse-core` or direct REST API calls.
Phase 2: Client trace context (rich hierarchy)
- Add `TraceContext` to `AgentConfig` — `session_id` (from Dart), plus internal fields `trace_id`, `span_id`, `parent_span_id`, `operation`, `tags` in `modality_core`
- Set `X-Trace-*` headers in the runner — pass `TraceContext` to `build_openai_model` / `build_gemini_model`, set via `ClientBuilder::http_headers()`. Each `chat_stream` call builds a new client with the correct context.
- Wire trace context through agent spawn — `run_generate` / `run_generate_parent` / `run_agent_loop` create `TraceContext` with appropriate IDs and pass it to `chat_stream`
- Pass `session_id` from Dart — through FRB → `AgentConfig` → agent thread
- Run FRB codegen — regenerate Dart bindings for the new `AgentConfig` fields
- Validate the full flow — confirm session grouping, user attribution, and parent/child linking in LangFuse
Alternatives Considered
Client-side OTEL integration (Rust) — The original direction of this RFC. A `modality_telemetry` crate would initialize an OpenTelemetry pipeline in the Flutter app, exporting Rig’s `gen_ai.*` spans to LangFuse. Rejected because:
- Security — LangFuse API keys would be embedded in the client binary, extractable by anyone who decompiles the app. An attacker could write arbitrary traces, poisoning all observability data.
- Complexity — Required a dedicated OTEL runtime thread (FRB has no global tokio runtime), span propagation fixes at 3+ `std::thread::spawn` sites, and workarounds for Rig’s span reuse behavior (the `Span::none()` trick).
- Scope — Only captured LLM calls from the Rust client. The gateway captures ALL LLM traffic regardless of client.
Cloudflare AI Gateway built-in analytics — CF AI Gateway has native logging and analytics. Rejected as the sole solution because it lacks session grouping, user attribution, prompt/completion inspection, and cost dashboards. However, it complements LangFuse — CF handles rate limiting and key management, LangFuse handles observability.
LangFuse via OpenTelemetry at the gateway — Instead of the JS SDK, export OTEL traces from the Cloudflare Worker. Rejected because Cloudflare Workers don’t have native OTEL support, and the LangFuse JS SDK is purpose-built for this use case with a simpler API.
Custom observability dashboard — Build our own with ClickHouse/Grafana. Rejected because LangFuse provides LLM-specific features (token tracking, cost calculation, prompt inspection) that would take months to build. Can always migrate later — the gateway integration is the stable interface.
Unresolved Questions
LangFuse JS SDK in Cloudflare Workers — The `langfuse` npm package uses `fetch` and standard Web APIs, which should work in Cloudflare Workers. However, it may rely on Node.js APIs (timers, `process.env`) for its internal batching and flush logic. The ai-gateway already has `nodejs_compat` enabled in wrangler.toml, which may cover this. Needs a quick spike: `npm install langfuse` in the ai-gateway, call `new Langfuse(...)`, and verify `flushAsync()` completes in a `waitUntil` context. If incompatible, fallback options: use LangFuse’s REST API directly, or use the `langfuse-core` package which has fewer Node dependencies.
Streaming token usage extraction (follow-up) — Not all providers include usage in the final SSE chunk during streaming. OpenAI does (when stream_options.include_usage is set), but Cerebras and Google may not. Needs testing per provider. Initial deployment logs traces without token counts for streaming requests.
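For the OpenAI case, opting in could be a small body rewrite at the gateway before proxying — a sketch assuming the OpenAI `stream_options.include_usage` field; providers that don’t support it should simply ignore the extra field:

```typescript
// Hypothetical helper: for streaming requests, ask the provider to include
// token usage in the final SSE chunk. Leaves non-streaming and non-JSON
// bodies untouched.
function withStreamUsage(bodyText: string): string {
  try {
    const body = JSON.parse(bodyText);
    if (body.stream === true) {
      body.stream_options = { ...body.stream_options, include_usage: true };
    }
    return JSON.stringify(body);
  } catch {
    return bodyText; // not JSON — forward as-is
  }
}
```

Whether the rewritten body is safe to send to Cerebras and Google is exactly the per-provider testing this follow-up calls for.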
Rate limiting on LangFuse ingestion — At high volume, the langfuse.flushAsync() call in waitUntil could add latency or fail silently. LangFuse Cloud has ingestion rate limits. Needs monitoring after deployment. Mitigation: LangFuse SDK has built-in batching and retry, and our error isolation ensures failures don’t affect the proxy path.