
LangFuse Observability

Add production AI observability by integrating LangFuse at the AI Gateway (Cloudflare Worker) level. The gateway already intercepts every LLM request/response — it gains a LangFuse integration that records traces with model, token usage, latency, and cost data. The client (Flutter/Rust) passes trace context headers (X-Trace-Id, X-Trace-Span-Id, X-Trace-Parent-Span-Id, X-Trace-Session-Id, X-Trace-Operation) through the gateway. The gateway uses LangFuse’s native SDK to build nested trace trees — parent agent spans containing child agent spans containing LLM generations — giving full semantic hierarchy without OTEL, and without exposing any secrets to the client.

The AI pipeline currently has two observability paths, both inadequate for production:

  • File-based logging (logging.rs in Rust) — flat text written to ai_chat.log via direct file I/O. No structure, no aggregation, no cost tracking. Exists to debug tool calls during development.
  • Cloudflare console.log (in ai-gateway) — JSON log of {uid, provider, status, key, requestId} per request. Visible in wrangler tail but no aggregation, no token tracking, no prompt/completion capture.

We need:

  • Per-session trace grouping — all LLM calls within an editing session linked together
  • Per-user cost tracking — token usage attributed to users for billing/monitoring
  • Latency breakdown — time spent in LLM calls, visible per provider and model
  • Token usage tracking — input/output tokens per generation, aggregated by model
  • Error visibility — rate limits (429s), provider errors (5xx), and fallback patterns
  • Parent/child hierarchy — lesson plan generation shows parent agent → child whiteboard agents as linked traces
  • Tool call visibility — which tools the agent called and with what arguments

The Rust/Rig code runs on the user’s device (compiled into the Flutter app). Integrating LangFuse at the Rust level would require embedding LangFuse API keys in the client binary. This is a security risk:

  • Key extraction — anyone who decompiles the app can extract the LangFuse secret key
  • Trace poisoning — with the key, an attacker can write arbitrary traces to LangFuse, corrupting all observability data (fake token counts, phantom sessions, misleading error rates)
  • Data exfiltration — depending on key permissions, the attacker could read other users’ prompts and AI responses

The AI gateway is the trust boundary. It’s server-side infrastructure we control, it already sees every LLM request/response, and it already has the authenticated user ID (X-Uid) and request ID (X-Request-Id). LangFuse keys stay server-side. LangFuse itself is a good fit for this layer:

  • JavaScript SDK — native integration for Cloudflare Workers
  • Open source, self-hostable — start with cloud, move to self-hosted when volume justifies
  • Session/user grouping — first-class concepts in the data model
  • Cost calculation — automatic from model name + token counts
  • Prompt/completion capture — full request/response bodies for debugging
  • Purpose-built for LLM observability — token tracking, cost dashboards, and prompt inspection out of the box, unlike general-purpose tools (Datadog, Grafana)

LangFuse captures full request and response bodies. If user prompts contain personal information (student names, learning context), this data is stored in LangFuse Cloud.

Mitigations:

  • LangFuse Cloud is SOC 2 Type II compliant with data processing in the US/EU.
  • Self-hosting is a planned follow-up once volume justifies it — this keeps all data on our own infrastructure.
  • For the initial deployment, prompt/completion capture is enabled by default (essential for debugging). If privacy review requires it, we can truncate or hash prompt bodies before sending to LangFuse — this is a one-line change in the gateway integration code.
  • No student-identifiable data is stored in LangFuse metadata fields — only uid (Firebase UID), which is opaque without access to our user database.
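The truncate-or-hash option mentioned above could be sketched as follows — a hypothetical helper, not part of the current gateway code:

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: reduce prompt bodies before they are sent to LangFuse,
// if privacy review requires it. "truncate" keeps a debuggable prefix;
// "hash" keeps only a stable fingerprint for deduplication/correlation.
function sanitizePrompt(
  bodyText: string,
  mode: "truncate" | "hash",
  maxLen = 256,
): string {
  if (mode === "hash") {
    return "sha256:" + createHash("sha256").update(bodyText).digest("hex");
  }
  return bodyText.length <= maxLen
    ? bodyText
    : bodyText.slice(0, maxLen) + "…[truncated]";
}
```

The gateway would call this on `bodyText` just before passing it as the generation `input`, leaving the forwarded request body untouched.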
Flutter App (untrusted — no secrets)
│ POST /ai/cerebras/v1/chat/completions
│ Headers:
│ Authorization: Bearer {firebase_jwt}
│ X-Trace-Session-Id: {editing_session_id}
│ X-Trace-Id: {operation_tree_id}
│ X-Trace-Span-Id: {agent_phase_id}
│ X-Trace-Parent-Span-Id: {parent_phase_id} (optional)
│ X-Trace-Operation: generate | chat | generate_parent
│ X-Trace-Tags: whiteboard,lesson-plan (optional)
jwt-worker (Firebase JWT + Oso auth)
│ Sets: X-Uid, X-Request-Id
│ Forwards: X-Trace-* headers via Service Binding (in-process)
ai-gateway Worker ◄── LangFuse integration here
│ 1. Create/reuse LangFuse trace (by X-Trace-Id)
│ 2. Create/reuse span (by X-Trace-Span-Id, nested under parent)
│ 3. Proxy to CF AI Gateway → LLM Provider
│ 4. Record generation under span (model, tokens, latency, status)
│ 5. Flush trace (via waitUntil)
CF AI Gateway → LLM Provider (Cerebras, OpenAI, etc.)

The ai-gateway worker (infrastructure/ai-gateway/src/index.ts) gains LangFuse as a dependency and instruments every request.

package.json
{
  "dependencies": {
    "langfuse": "^3.0.0"
  }
}
# wrangler.toml — add to each environment
[env.dev.vars]
LANGFUSE_BASE_URL = "https://us.cloud.langfuse.com"
# Secrets (set via `wrangler secret put`):
# LANGFUSE_PUBLIC_KEY
# LANGFUSE_SECRET_KEY
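The corresponding `Env` additions could be typed as below. This is a sketch — the real `Env` interface in `index.ts` has more fields, and the guard duplicates the null check that the singleton accessor performs:

```typescript
// Sketch of the Env additions (names taken from the wrangler config above).
interface Env {
  LANGFUSE_PUBLIC_KEY?: string; // secret, set via `wrangler secret put`
  LANGFUSE_SECRET_KEY?: string; // secret, set via `wrangler secret put`
  LANGFUSE_BASE_URL?: string;   // plain var, per environment
}

// True only when both keys are configured — tracing degrades gracefully otherwise.
function hasLangfuseConfig(env: Env): boolean {
  return Boolean(env.LANGFUSE_PUBLIC_KEY && env.LANGFUSE_SECRET_KEY);
}
```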

The current ai-gateway handler does not accept an ExecutionContext:

// BEFORE
async fetch(request: Request, env: Env): Promise<Response>
// AFTER — ctx is required for waitUntil
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response>

This is required because ctx.waitUntil() is the only way to defer work (LangFuse flush) after the response is returned to the client.
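A minimal sketch of the pattern, using mock types rather than the real Workers runtime: the response returns immediately, while `waitUntil` keeps the isolate alive until the deferred promise settles.

```typescript
// Mock of the Workers ExecutionContext shape — illustration only.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

async function handle(
  req: { url: string },
  ctx: ExecutionContextLike,
): Promise<string> {
  const flush = async () => "flushed"; // stand-in for lf.flushAsync()
  ctx.waitUntil(flush()); // deferred — does not delay the response
  return "response"; // sent to the client right away
}
```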

The current CORS headers in both handleCors() and setCorsHeaders() only allow Authorization, Content-Type, X-Uid, X-Request-Id. The new X-Trace-* headers must be added:

const ALLOWED_HEADERS = [
  'Authorization', 'Content-Type', 'X-Uid', 'X-Request-Id',
  'X-Trace-Id', 'X-Trace-Span-Id', 'X-Trace-Parent-Span-Id',
  'X-Trace-Session-Id', 'X-Trace-Operation', 'X-Trace-Tags',
].join(', ');

Note: The jwt-worker → ai-gateway path uses a Cloudflare Service Binding (env.AI_GATEWAY.fetch(forwardReq) in ai-routing.ts), which is in-process and bypasses CORS. However, the jwt-worker itself negotiates CORS with the browser and forwards all headers — so the jwt-worker’s own CORS config must also include these headers if it has an explicit allowlist.

The LangFuse client is initialized as a module-level singleton (Workers reuse isolates across requests within the same instance):

import { Langfuse } from "langfuse";

// Module-level — persists across requests within the same Worker isolate.
// Each isolate creates its own instance; unflushed data is lost on eviction,
// which is why we flush via waitUntil on every request.
let langfuse: Langfuse | null = null;

function getLangfuse(env: Env): Langfuse | null {
  // Gracefully degrade if secrets aren't configured
  if (!env.LANGFUSE_PUBLIC_KEY || !env.LANGFUSE_SECRET_KEY) return null;
  if (!langfuse) {
    langfuse = new Langfuse({
      publicKey: env.LANGFUSE_PUBLIC_KEY,
      secretKey: env.LANGFUSE_SECRET_KEY,
      baseUrl: env.LANGFUSE_BASE_URL,
    });
  }
  return langfuse;
}

Per-request instrumentation — all LangFuse calls are wrapped in try/catch so a LangFuse failure never affects the LLM response path:

// In the request handler, after auth check:
const lf = getLangfuse(env);

// Extract trace context headers
const traceId = request.headers.get("X-Trace-Id") ?? requestId;
const spanId = request.headers.get("X-Trace-Span-Id") ?? requestId;
const parentSpanId = request.headers.get("X-Trace-Parent-Span-Id");
const traceSessionId = request.headers.get("X-Trace-Session-Id");
const traceOperation = request.headers.get("X-Trace-Operation") ?? "unknown";
const traceTags = request.headers.get("X-Trace-Tags")?.split(",").filter(Boolean) ?? [];

// Buffer request body for both forwarding and logging.
// The existing code uses request.arrayBuffer() — we keep binary for forwarding
// and decode to string only for LangFuse logging.
const bodyBuffer = await request.arrayBuffer();
const bodyText = new TextDecoder().decode(bodyBuffer);

let trace, span, generation;
try {
  if (lf) {
    // Create or reuse trace (LangFuse deduplicates by ID)
    trace = lf.trace({
      id: traceId,
      name: traceOperation,
      sessionId: traceSessionId ?? undefined,
      userId: uid,
      tags: [provider, ...traceTags],
      metadata: { requestId, provider, environment: env.ENVIRONMENT },
    });

    // Create a span for this agent phase (nested under parent if provided)
    span = trace.span({
      id: spanId,
      name: traceOperation,
      parentObservationId: parentSpanId ?? undefined,
    });

    // Create generation nested under span (before proxying)
    generation = span.generation({
      name: `${provider}.chat`,
      model: extractModelFromBody(bodyText),
      input: bodyText,
      metadata: {
        keyAlias: keyLabel(usedKey),
        fallbackAttempt: attemptIndex,
      },
    });
  }
} catch (e) {
  console.error("[langfuse] trace creation failed:", e);
}

// Proxy to CF AI Gateway — forward the original binary body
const upstreamResponse = await fetch(gatewayUrl, {
  method: "POST",
  headers: forwardHeaders,
  body: bodyBuffer,
});

// Parse response for token usage (non-streaming path)
if (!isStreaming) {
  const responseBody = await upstreamResponse.text();
  try {
    // JSON.parse stays inside the try — error responses may not be JSON,
    // and a parse failure must never break the proxy path
    const parsed = JSON.parse(responseBody);
    generation?.end({
      output: responseBody,
      // Generic v3 usage shape (input/output/total) — verify the field names
      // against the installed SDK version
      usage: parsed.usage ? {
        input: parsed.usage.prompt_tokens,
        output: parsed.usage.completion_tokens,
        total: parsed.usage.total_tokens,
      } : undefined,
      statusMessage: upstreamResponse.ok ? undefined : `HTTP ${upstreamResponse.status}`,
      level: upstreamResponse.ok ? "DEFAULT" : "ERROR",
    });
    if (lf) ctx.waitUntil(lf.flushAsync());
  } catch (e) {
    console.error("[langfuse] generation end failed:", e);
  }
  return new Response(responseBody, { status: upstreamResponse.status, headers: responseHeaders });
}

// Streaming path — see below

extractModelFromBody extracts the model name from the request body. All supported providers use the OpenAI-compatible { "model": "..." } format, since they all go through CF AI Gateway:

function extractModelFromBody(bodyText: string): string | undefined {
  try {
    const parsed = JSON.parse(bodyText);
    return parsed.model ?? undefined;
  } catch {
    return undefined;
  }
}

For the initial deployment, streaming responses log the trace without token usage or completion body (Option B). This is the simplest correct approach — adding SSE parsing is a follow-up.

SSE chunk boundaries do not align with ReadableStream read boundaries (a single read can contain partial lines or multiple events), making correct SSE parsing non-trivial. Rather than ship a buggy parser, we log what we can and add usage capture later.

if (isStreaming) {
  try {
    generation?.end({
      // No output or usage for streaming — added in follow-up
      statusMessage: upstreamResponse.ok ? undefined : `HTTP ${upstreamResponse.status}`,
      level: upstreamResponse.ok ? "DEFAULT" : "ERROR",
    });
    if (lf) ctx.waitUntil(lf.flushAsync());
  } catch (e) {
    console.error("[langfuse] streaming generation end failed:", e);
  }

  // Pass through the stream unmodified
  return new Response(upstreamResponse.body, {
    status: upstreamResponse.status,
    headers: responseHeaders,
  });
}

Follow-up: streaming token capture. When needed, add a TransformStream tee that watches for the final SSE usage chunk. OpenAI includes usage when stream_options.include_usage is set; Cerebras and Google may not. This requires per-provider testing and a proper SSE line parser.
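When that follow-up lands, the line-buffering half of the problem could be sketched as below. The chunk shape assumed is the OpenAI-compatible SSE format; whether the final chunk actually carries a usage object still needs per-provider testing.

```typescript
// Sketch: buffer SSE text across reads, split only on complete lines, and
// remember the last usage object seen (usually emitted on the final chunk).
class SseUsageScanner {
  private buf = "";
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number } | null = null;

  feed(chunk: string): void {
    this.buf += chunk;
    const lines = this.buf.split("\n");
    this.buf = lines.pop() ?? ""; // keep the trailing partial line for the next read
    for (const line of lines) {
      if (!line.startsWith("data:")) continue;
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") continue;
      try {
        const parsed = JSON.parse(payload);
        if (parsed.usage) this.usage = parsed.usage;
      } catch {
        // ignore non-JSON or malformed events rather than fail the stream
      }
    }
  }
}
```

In the gateway this would sit behind a TransformStream tee so the client still receives the bytes unmodified.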

The client passes trace context as HTTP headers. These are not secrets — they’re metadata for grouping and filtering.

| Header | Purpose | Example | Required |
| --- | --- | --- | --- |
| X-Trace-Session-Id | Groups all LLM calls in an editing session | "session-abc123" | No |
| X-Trace-Id | Unique ID for an entire operation tree | "trace-xyz789" | No |
| X-Trace-Span-Id | Unique ID for this specific agent phase | "span-parent" | No |
| X-Trace-Parent-Span-Id | Parent span for nesting | "span-parent" (or empty for root) | No |
| X-Trace-Operation | Names the trace/span in LangFuse UI | "generate", "chat", "generate_parent" | No |
| X-Trace-Tags | Comma-separated filterable tags | "whiteboard,one-shot" | No |

Abuse mitigation: A client with a valid JWT could send unique X-Trace-Id values per request, creating many trace objects in LangFuse (cost amplification). Since trace IDs default to requestId when not provided (and requestId is already one-per-request), the attack surface only exists when explicit trace IDs are sent. Mitigation: the gateway validates that X-Trace-Id is a reasonable UUID format and ignores malformed values, falling back to requestId.
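That validation could be sketched as below (helper name hypothetical; the short IDs like trace-abc used elsewhere in this doc are illustrative, real client-generated IDs would be UUIDs):

```typescript
// Accept only UUID-shaped trace IDs from the client; anything else falls
// back to the per-request ID, capping trace-object creation at one per request.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function resolveTraceId(header: string | null, requestId: string): string {
  return header && UUID_RE.test(header) ? header : requestId;
}
```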

The TraceContext struct carries trace hierarchy through the Rust pipeline:

pub struct TraceContext {
    pub session_id: Option<String>,     // from Dart — editing session ID
    pub trace_id: String,               // generated in Rust — unique per operation tree
    pub span_id: String,                // generated in Rust — unique per agent phase
    pub parent_span_id: Option<String>, // generated in Rust — links child → parent
    pub operation: String,              // set in Rust — "generate", "chat", "generate_parent"
    pub tags: Vec<String>,              // from Dart + Rust — ["whiteboard", "one-shot"]
}

Only session_id comes from Dart (via AgentConfig). The trace_id, span_id, and parent_span_id are generated in Rust because the agent orchestration code (run_generate_parent, run_agent_loop in crates/session/src/agent.rs) is the only layer that knows the multi-phase parent/child structure. Dart doesn’t know how many LLM calls a generation will make or which are parent vs child.

Rig v0.29’s Client<Ext, H> stores headers as Arc<HeaderMap> — immutable after construction. The post() and post_sse() methods copy these default headers onto every outgoing request. This works for static headers, but at first glance seems to rule out per-request trace context, since the header values differ per agent phase. Multi-turn calls within a single chat_stream are not a problem: they deliberately share the same span_id, and distinguishing the individual generations is handled by LangFuse’s dedup behavior on the gateway side.

The key insight: trace_id and span_id are stable within a single chat_stream call (one operation tree, one agent phase). They only change between calls (e.g., parent phase → child phase). Since each chat_stream call constructs a new Rig Agent with a new Client, we can set default headers at client construction time:

fn build_openai_model(
    config: &AiConfig,
    trace_ctx: Option<&TraceContext>,
) -> Result<openai::CompletionModel, RunnerError> {
    let mut builder = openai::CompletionsClient::builder()
        .api_key(&config.api_key)
        .base_url(&config.base_url);
    if let Some(ctx) = trace_ctx {
        let mut headers = http::HeaderMap::new();
        // parse() cannot fail for locally generated UUIDs; session_id comes
        // from Dart and must be validated there before reaching this layer
        headers.insert("X-Trace-Id", ctx.trace_id.parse().unwrap());
        headers.insert("X-Trace-Span-Id", ctx.span_id.parse().unwrap());
        if let Some(ref parent) = ctx.parent_span_id {
            headers.insert("X-Trace-Parent-Span-Id", parent.parse().unwrap());
        }
        if let Some(ref session) = ctx.session_id {
            headers.insert("X-Trace-Session-Id", session.parse().unwrap());
        }
        headers.insert("X-Trace-Operation", ctx.operation.parse().unwrap());
        if !ctx.tags.is_empty() {
            headers.insert("X-Trace-Tags", ctx.tags.join(",").parse().unwrap());
        }
        builder = builder.http_headers(headers);
    }
    let client = builder.build().map_err(|e| RunnerError::Config(e.to_string()))?;
    Ok(client.completion_model(&config.model))
}

Rig’s ClientBuilder::http_headers(headers) sets the HeaderMap that gets Arc-wrapped at build time and applied to every request via post() / post_sse(). Since a new client is built per chat_stream call, each call gets the correct trace context. Multi-turn requests within the same call share the same span_id — which is correct, as they represent multiple LLM turns within one agent phase.

The gateway reconstructs a proper parent/child trace tree using LangFuse’s native trace.span() and span.generation() nesting APIs — no OTEL needed. The key insight: the client sends X-Trace-Id (shared across all requests in one operation tree) and X-Trace-Span-Id / X-Trace-Parent-Span-Id (describing the tree structure). The gateway uses these to build nested observations within a single LangFuse trace.

For a lesson plan generation (parent → 2 children), the Rust agent code sets headers on each HTTP request:

Request 1 — parent agent, multi-turn call 1:
  X-Trace-Id: trace-abc
  X-Trace-Span-Id: span-parent
  X-Trace-Parent-Span-Id: (empty)
  X-Trace-Operation: generate_parent

Request 2 — parent agent, multi-turn call 2:
  X-Trace-Id: trace-abc
  X-Trace-Span-Id: span-parent          ← same span, another LLM turn
  X-Trace-Parent-Span-Id: (empty)
  X-Trace-Operation: generate_parent

Request 3 — child slide-1:
  X-Trace-Id: trace-abc                 ← same trace tree
  X-Trace-Span-Id: span-slide-1
  X-Trace-Parent-Span-Id: span-parent   ← linked to parent
  X-Trace-Operation: generate
  X-Trace-Tags: whiteboard,child

Request 4 — child slide-2:
  X-Trace-Id: trace-abc
  X-Trace-Span-Id: span-slide-2
  X-Trace-Parent-Span-Id: span-parent
  X-Trace-Operation: generate
  X-Trace-Tags: whiteboard,child

The trace_id is generated once per top-level operation (e.g. one “Generate” button click). The span_id is generated per agent phase. The parent_span_id links children to their parent. In Rust, run_generate_parent sets these:

// In run_generate_parent (crates/session/src/agent.rs):
let trace_id = uuid();
let parent_span_id = uuid();

// Phase 1: parent agent — set on TraceContext before calling chat_stream
let parent_ctx = TraceContext {
    trace_id: trace_id.clone(),
    span_id: parent_span_id.clone(),
    parent_span_id: None,
    operation: "generate_parent".into(),
    ..
};

// Phase 2: each child — set on TraceContext before calling child's chat_stream
let child_ctx = TraceContext {
    trace_id: trace_id.clone(),                   // same tree
    span_id: uuid(),                              // unique per child
    parent_span_id: Some(parent_span_id.clone()), // linked to parent
    operation: "generate".into(),
    ..
};

The gateway integration code (shown above) uses the same trace → span → generation nesting for every request. The key behavior that makes this work across multiple requests is LangFuse’s ID-based deduplication:

  • Same traceId across requests → all observations land in one trace (e.g. all 4 requests in the lesson plan example share trace-abc)
  • Same spanId across requests → multiple generations nest under one span (e.g. the parent agent’s multi-turn calls both use span-parent, so both LLM turns appear as sibling generations under it)
  • parentObservationId links child spans to parent spans within the trace (e.g. span-slide-1 and span-slide-2 both reference span-parent)

No special gateway logic is needed per request — the same integration code runs identically for every request. The trace tree structure emerges entirely from the IDs the client sets in headers.
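The dedup behavior can be illustrated with a toy in-memory model (this is not the LangFuse SDK, just a simulation of the keyed-by-ID semantics described above):

```typescript
// Traces and spans are keyed by ID, so repeated requests that reuse the same
// IDs accumulate into one tree instead of creating duplicates.
type Span = { id: string; parentSpanId?: string; generations: string[] };
type Trace = { id: string; spans: Map<string, Span> };

class TraceStore {
  traces = new Map<string, Trace>();

  record(traceId: string, spanId: string, parentSpanId: string | undefined, generation: string) {
    let trace = this.traces.get(traceId);
    if (!trace) {
      trace = { id: traceId, spans: new Map() };
      this.traces.set(traceId, trace);
    }
    let span = trace.spans.get(spanId); // dedup: reuse the span when the ID repeats
    if (!span) {
      span = { id: spanId, parentSpanId, generations: [] };
      trace.spans.set(spanId, span);
    }
    span.generations.push(generation);
  }
}

// The four lesson-plan requests from the example above:
const store = new TraceStore();
store.record("trace-abc", "span-parent", undefined, "turn-1");
store.record("trace-abc", "span-parent", undefined, "turn-2");
store.record("trace-abc", "span-slide-1", "span-parent", "gen");
store.record("trace-abc", "span-slide-2", "span-parent", "gen");
```

After the four calls, the store holds one trace with three spans: the parent span with two generations, and two child spans linked to it.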

Trace: "generate_parent" (trace-abc)
└─ Span: "generate_parent" (span-parent)
   ├─ Generation: cerebras.chat (turn 1)
   │    model: llama-4-scout, tokens: 1200/800
   ├─ Generation: cerebras.chat (turn 2)
   │    model: llama-4-scout, tokens: 400/200
   ├─ Span: "generate" (span-slide-1, parent: span-parent)
   │  └─ Generation: cerebras.chat
   │       model: llama-4-scout, tokens: 800/600
   └─ Span: "generate" (span-slide-2, parent: span-parent)
      └─ Generation: cerebras.chat
           model: llama-4-scout, tokens: 700/500

This gives full semantic nesting — identical to what OTEL span trees would provide — using only HTTP headers and LangFuse’s native SDK. No OpenTelemetry, no tracing subscriber, no span propagation across threads.

Rig v0.29 internally emits rich tracing spans with OpenTelemetry gen_ai semantic convention attributes:

// Rig's OpenAI provider creates these spans automatically:
info_span!(
    "chat",
    gen_ai.operation.name = "chat",
    gen_ai.provider.name = "openai",
    gen_ai.request.model = self.model,
    gen_ai.usage.input_tokens = Empty,
    gen_ai.usage.output_tokens = Empty,
    gen_ai.response.id = Empty,
    gen_ai.input.messages = ...,
    gen_ai.output.messages = ...,
);

With the gateway approach, these spans are not exported (there’s no OTEL subscriber in the Rust process). The gateway reconstructs equivalent data from HTTP request/response bodies:

| Data Point | Rig OTEL (unused) | Gateway (actual source) |
| --- | --- | --- |
| Model name | gen_ai.request.model | Parsed from request body JSON |
| Provider | gen_ai.provider.name | Extracted from URL path segment |
| Input tokens | gen_ai.usage.input_tokens | Parsed from response body / SSE final chunk |
| Output tokens | gen_ai.usage.output_tokens | Parsed from response body / SSE final chunk |
| Prompt content | gen_ai.input.messages | Full request body captured |
| Completion | gen_ai.output.messages | Full response body captured |
| Response ID | gen_ai.response.id | Parsed from response body |
| Latency | Span duration | Date.now() delta in the worker |
What we genuinely lose by not using Rig’s OTEL spans:

  1. Client-side tool execution timing — Rig spans would measure how long each tool call took to execute in Rust. The gateway only sees the time between LLM requests. Mitigated by the optional X-Trace-Tool-Calls header (see Tool Call Visibility below).
  2. Internal Rig metadata — response model name (can differ from request model), system prompt content (set via Rig’s builder, not in the HTTP body).

These are acceptable losses. Rig’s tracing spans remain useful for local development by adding a tracing-subscriber fmt layer — they just don’t export to LangFuse.

Tool calls happen client-side (Rust) — the gateway only sees the resulting LLM requests. To capture tool call detail, two complementary approaches:

Approach 1: Tool metadata in request body. The LLM request body already contains the tool definitions and tool results (as conversation history). LangFuse captures the full request body as input, so tool calls are visible in the prompt inspector.

Approach 2: Client-side tool call headers (optional, future). Add an X-Trace-Tool-Calls header with a compact JSON summary:

X-Trace-Tool-Calls: [{"name":"set_title","duration_ms":12},{"name":"add_element","duration_ms":45}]

The gateway records this as trace metadata. This is optional and can be added incrementally.
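Gateway-side parsing of that header might look like the following — a sketch against the proposed format above, not shipped code:

```typescript
type ToolCall = { name: string; duration_ms: number };

// Parse the optional X-Trace-Tool-Calls header defensively: a malformed
// header is ignored rather than failing the request.
function parseToolCalls(header: string | null): ToolCall[] {
  if (!header) return [];
  try {
    const parsed = JSON.parse(header);
    return Array.isArray(parsed)
      ? parsed.filter(
          (t) => typeof t?.name === "string" && typeof t?.duration_ms === "number",
        )
      : [];
  } catch {
    return [];
  }
}
```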

The ai-gateway already has fallback logic (try default key → numbered aliases on 429/5xx). LangFuse integration captures this naturally:

generation.update({
  metadata: {
    keyAlias: keyLabel(usedKey),
    fallbackAttempt: attemptIndex,  // 0 = first try, 1+ = fallback
    fallbackReason: previousStatus, // 429, 500, etc.
  },
});

This enables filtering in LangFuse for “requests that required fallback” — useful for monitoring rate limit pressure.

LangFuse must never block or break the LLM proxy path. All LangFuse operations are wrapped in try/catch:

  • If getLangfuse(env) returns null (missing secrets), the gateway proxies normally with no tracing.
  • If trace/span/generation creation throws, the error is logged to console.error and the request proceeds.
  • If flushAsync() fails in waitUntil, it fails silently after the response is already sent.
  • The gateway continues to function identically if LangFuse is down, misconfigured, or rate-limited.
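These invariants can be centralized in a small wrapper instead of scattering try/catch blocks — a hypothetical helper, equivalent to the inline try/catch used in the instrumentation code:

```typescript
// Run a LangFuse call, logging and swallowing any failure so the
// proxy path is never affected.
function safeLf<T>(label: string, fn: () => T): T | undefined {
  try {
    return fn();
  } catch (e) {
    console.error(`[langfuse] ${label} failed:`, e);
    return undefined;
  }
}
```

Usage would look like `const trace = safeLf("trace create", () => lf.trace({ ... }))`.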
infrastructure/ai-gateway (gains langfuse dependency + integration code)
infrastructure/jwt-worker (unchanged — Service Binding forwards X-Trace-* headers in-process)
crates/core (adds TraceContext to AgentConfig)
crates/platform/ai (sets X-Trace-* headers on outgoing requests via Rig's http_headers)
crates/session (passes TraceContext through agent spawn)
crates/api (exposes TraceContext fields to FRB)

No new Rust crates. No OTEL pipeline. No secrets on the client.

Phase 1: Gateway-only (deployable independently, no client changes)

  1. Add langfuse to ai-gateway — npm install langfuse, add LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_BASE_URL secrets/vars to wrangler.toml
  2. Update handler signature — add ctx: ExecutionContext as third parameter to fetch()
  3. Update CORS — add X-Trace-* headers to handleCors() and setCorsHeaders() allowlists
  4. Instrument ai-gateway — create trace + generation per request, flush via ctx.waitUntil. Buffer body as arrayBuffer (preserving existing behavior), decode to string for LangFuse only. Wrap all LangFuse calls in try/catch.
  5. Handle streaming — log trace without token usage (Option B). Streaming token capture is a follow-up.
  6. Verify SDK compatibility — deploy to dev, run an LLM call, confirm trace appears in LangFuse Cloud. If the langfuse npm package fails in Workers (Node.js API dependency), fall back to langfuse-core or direct REST API calls.

Phase 2: Client trace context (rich hierarchy)

  1. Add TraceContext to AgentConfig — session_id (from Dart), plus internal fields trace_id, span_id, parent_span_id, operation, tags in modality_core
  2. Set X-Trace-* headers in runner — pass TraceContext to build_openai_model / build_gemini_model, set via ClientBuilder::http_headers(). Each chat_stream call builds a new client with the correct context.
  3. Wire trace context through agent spawn — run_generate / run_generate_parent / run_agent_loop create TraceContext with appropriate IDs and pass to chat_stream
  4. Pass session_id from Dart — through FRB → AgentConfig → agent thread
  5. Run FRB codegen — regenerate Dart bindings for new AgentConfig fields
  6. Validate full flow — confirm session grouping, user attribution, and parent/child linking in LangFuse

Client-side OTEL integration (Rust) — The original direction of this RFC. A modality_telemetry crate would initialize an OpenTelemetry pipeline in the Flutter app, exporting Rig’s gen_ai.* spans to LangFuse. Rejected because:

  • Security — LangFuse API keys would be embedded in the client binary, extractable by anyone who decompiles the app. An attacker could write arbitrary traces, poisoning all observability data.
  • Complexity — Required a dedicated OTEL runtime thread (FRB has no global tokio runtime), span propagation fixes at 3+ std::thread::spawn sites, and workarounds for Rig’s span reuse behavior (Span::none() trick).
  • Scope — Only captured LLM calls from the Rust client. The gateway captures ALL LLM traffic regardless of client.

Cloudflare AI Gateway built-in analytics — CF AI Gateway has native logging and analytics. Rejected as the sole solution because it lacks session grouping, user attribution, prompt/completion inspection, and cost dashboards. However, it complements LangFuse — CF handles rate limiting and key management, LangFuse handles observability.

LangFuse via OpenTelemetry at the gateway — Instead of the JS SDK, export OTEL traces from the Cloudflare Worker. Rejected because Cloudflare Workers don’t have native OTEL support, and the LangFuse JS SDK is purpose-built for this use case with a simpler API.

Custom observability dashboard — Build our own with ClickHouse/Grafana. Rejected because LangFuse provides LLM-specific features (token tracking, cost calculation, prompt inspection) that would take months to build. Can always migrate later — the gateway integration is the stable interface.

LangFuse JS SDK in Cloudflare Workers — The langfuse npm package uses fetch and standard Web APIs, which should work in Cloudflare Workers. However, it may rely on Node.js APIs (timers, process.env) for its internal batching and flush logic. The ai-gateway already has nodejs_compat enabled in wrangler.toml, which may cover this. Needs a quick spike: npm install langfuse in the ai-gateway, call new Langfuse(...), and verify flushAsync() completes in a waitUntil context. If incompatible, fallback options: use LangFuse’s REST API directly, or use the langfuse-core package which has fewer Node dependencies.
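If the spike fails, the direct REST fallback might be sketched as below. The /api/public/ingestion endpoint and Basic auth scheme (publicKey:secretKey) are from LangFuse's public API, but the exact batch event shape should be re-checked against current LangFuse API docs before relying on this:

```typescript
// Build a batched ingestion request for LangFuse's REST API.
// Event shape (type/body fields) is an assumption to verify against the docs.
function buildIngestionRequest(
  baseUrl: string,
  publicKey: string,
  secretKey: string,
  events: unknown[],
) {
  const auth = Buffer.from(`${publicKey}:${secretKey}`).toString("base64");
  return {
    url: `${baseUrl}/api/public/ingestion`,
    method: "POST" as const,
    headers: {
      Authorization: `Basic ${auth}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ batch: events }),
  };
}
```

The gateway would pass the result to fetch() inside waitUntil, preserving the same deferred-flush behavior as the SDK path.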

Streaming token usage extraction (follow-up) — Not all providers include usage in the final SSE chunk during streaming. OpenAI does (when stream_options.include_usage is set), but Cerebras and Google may not. Needs testing per provider. Initial deployment logs traces without token counts for streaming requests.

Rate limiting on LangFuse ingestion — At high volume, the langfuse.flushAsync() call in waitUntil could add latency or fail silently. LangFuse Cloud has ingestion rate limits. Needs monitoring after deployment. Mitigation: LangFuse SDK has built-in batching and retry, and our error isolation ensures failures don’t affect the proxy path.