Cover Image for Agentic + AI Observability Night SF
Registration
Approval Required
Your registration is subject to host approval.
Welcome! To join the event, please register below.
About Event

Join us for an Agentic + AI Observability Night on Thursday, March 26 from 3:30pm to 9:00pm PDT in SF, our marquee multi-track event on how to rigorously evaluate, ship, and monitor AI agents in production.

This event is built for engineers, ML practitioners, GenAI engineers, and AI startup founders who are already running agents in production (or about to) and need a serious evaluation and observability practice, not just a demo. Across several tracks and multiple speakers from leading AI companies, we’ll go deep on evaluation design, harnesses, tracing, and metrics for agentic systems, plus the architectures and tooling that make those evaluations actionable. Whether you’re at an early-stage startup or an established company, if you care about getting AI agents into production and keeping them healthy, we hope you will join us!

You’ll hear from the following leaders in the AI space: Databricks, Anthropic, Factory.ai, Arize, Sentry.io, and LanceDB!


Why you should attend

  • Learn modern eval patterns: How teams design evaluation suites, datasets, and pipelines for multi-step agents, tools, and RAG-heavy workflows.

  • See how evaluations drive architecture: Real examples of how evaluation results shape data/feature platforms, retrieval setups, and tool orchestration.

  • Go beyond logs: Build structured traces, evals, and performance dashboards that explain why agents behave the way they do over time.

  • Watch live demos: Observability and evaluation tools, tracing integrations, and evaluation workflows purpose-built for agentic systems.

  • Learn from multiple AI companies: Multiple speakers share battle-tested patterns, failure modes, and “never again” stories from real production agents.


AGENDA

  • 3:30pm: Registration & Mingling 

  • 5:00pm: Keynote

  • 5:45pm: Breakout #1 (2 tracks)

    • AI Agent Debugging: Four Lessons from Shipping Alyx (Aparna Dhinakaran, Founder / CPO, Arize)

    • Skills and Security (Greg Pstrucha, AI/ML Team, Sentry)

  • 6:30pm: Breakout #2 (2 tracks)

    • Cascading Failures in Multi-Agent Systems: Tracing and Evaluating Multi-Agent Deployments (Oleksandra Bovkun, Sr. Developer Advocate, Databricks)

    • New Data Challenges in AI Observability (Lei Xu, Co-founder / CTO, LanceDB)

  • 7:15pm: Breakout #3 (2 tracks)

    • Demystifying Evals for AI Agents (Marius Buleandra, Member of Technical Staff, Anthropic)

    • Self-healing agents: How Droid closes its own feedback loop (David Gomez-Urquiza, Member of Technical Staff, Factory.ai)

  • 7:45pm: Reception

  • 9:00pm: Goodnight


Session Descriptions

  • Skills and Security

    • Agent skills have recently gained a lot of popularity as a way to extend agent capabilities in day-to-day development. However, skills can also be malicious and introduce a new attack vector that companies need to be aware of. This talk will dive into the security risks that come with skill adoption and what teams can do to stay ahead of them.

  • New Data Challenges in AI Observability

    • AI systems, especially sophisticated agents, require robust observability for reliability and performance. The sheer volume and complexity of multimodal data present significant challenges for traditional monitoring solutions. Imagine doing timeline correlations across traces, embeddings, metadata, and even raw blobs to get to root-cause analysis (RCA), or searching for the culprit behind a bad result. We’ll talk about how LanceDB’s architecture enables AI observability systems to easily conduct real-time analysis and efficient RCA, no matter how messy the agent logs are.
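      To make the timeline-correlation idea concrete, here is a toy sketch in plain Python (illustrative only, not LanceDB's API): given trace spans with start/end timestamps, find the spans that were active at the moment of a failure.

      ```python
      from dataclasses import dataclass

      @dataclass
      class Span:
          name: str
          start: float  # epoch seconds
          end: float

      def spans_at(spans, t):
          # Return the spans whose time interval covers timestamp t --
          # the first candidates to inspect during root-cause analysis.
          return [s.name for s in spans if s.start <= t <= s.end]

      spans = [
          Span("retrieve_docs", 0.0, 2.0),
          Span("rank_results", 1.5, 3.0),
          Span("generate_answer", 3.0, 6.0),
      ]
      # An error logged at t=1.8 overlaps two spans:
      print(spans_at(spans, 1.8))  # ['retrieve_docs', 'rank_results']
      ```

      A real system does this correlation at scale across traces, embeddings, and raw blobs, which is exactly where a purpose-built storage layer earns its keep.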

  • Demystifying Evals for AI Agents

    • Before Anthropic, I built eval tooling for voice AI teams. Most of them still reviewed calls by hand for hours daily, and they weren't wrong to. Manual review catches things evals miss. But agent evals have come a long way since then. This talk covers how to build ones worth running: designing graders that grade outcomes instead of paths, reasoning about non-determinism with pass@k vs pass^k, and what we learned at Anthropic when one of our models found a loophole in our own test.
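      For intuition on the pass@k vs pass^k distinction mentioned above: if a single agent run succeeds with probability p, the two metrics diverge quickly as k grows. A minimal sketch with the standard formulas (generic math, not Anthropic's tooling):

      ```python
      def pass_at_k(p: float, k: int) -> float:
          # Probability that at least one of k independent runs succeeds.
          return 1.0 - (1.0 - p) ** k

      def pass_pow_k(p: float, k: int) -> float:
          # Probability that all k independent runs succeed (consistency).
          return p ** k

      # A 70%-reliable agent given 5 attempts:
      print(round(pass_at_k(0.7, 5), 3))   # 0.998 -- "can it ever do it?"
      print(round(pass_pow_k(0.7, 5), 3))  # 0.168 -- "does it do it every time?"
      ```

      The gap between the two numbers is why a demo that passes pass@k can still be far from production-ready.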

  • Cascading Failures in Multi-Agent Systems: Tracing and Evaluating Multi-Agent Deployments

    • Multi-agent systems shift the evaluation challenge from individual model outputs to the integrity of the coordination layer. When a supervisor agent delegates a task with flawed context, the error propagates and amplifies through the chain, leading to distributed hallucinations that bypass traditional end-to-end testing. Debugging these systems requires treating them as distributed networks rather than isolated LLM calls.
      We'll cover tracing, evaluation, and governance for multi-agent systems and how to ensure that your agentic workflows remain reliable, transparent, and secure at scale. This session provides a technical deep dive into solving observability challenges in complex agentic workflows. We explore how to move from black-box testing to a transparent architecture using MLflow tracing.

  • Self-healing agents: How Droid closes its own feedback loop

    • Every team shipping software hits the same wall: users stumble upon a bug, someone digs through logs and stack traces, reproduces the issue, and pushes a fix. At Factory, we gave the agent the tools to do that itself.
      We've built a system where Droid triages its own bug reports, queries production observability data, reproduces failures in a sandboxed terminal, and ships fixes as pull requests. This talk walks through the tools that make this possible. Who this is for: engineers building or using agentic systems who want a practical playbook for making their systems self-diagnosing and self-healing, and who believe the best way to scale an engineering team is to make the AI agent part of its own on-call rotation.

  • AI Agent Debugging: Four Lessons from Shipping Alyx

    • We shipped Alyx, Arize's agent for AX, and it broke in ways we didn't expect: it forgot multi-step requests, got buried under tool output and huge experiment payloads, and "looked fine" right up until small changes caused quiet regressions. This talk shares four fixes that held up in production:
      • Enforce rules in code with a plan tool plus a finish gate that throws when work is incomplete.
      • Keep large structured data out of context with handles, structure-preserving previews, and jq/grep-style querying under hard output limits.
      • Turn great production sessions into tests with golden traces, fact-based assertions, and trajectory replay scored semantically.
      • Speed up root-cause debugging by chaining traces, APM, and infra logs with markdown runbooks ("skills") and read-only wrappers.
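      The "finish gate" pattern from the first fix can be sketched in a few lines (hypothetical names, not Arize's actual implementation): the agent declares its planned steps up front, and the finish tool raises if any step is still unchecked.

      ```python
      class IncompleteWorkError(Exception):
          """Raised when the agent tries to finish with steps left undone."""

      class PlanGate:
          # Hypothetical plan tool + finish gate: rules enforced in code,
          # not in the prompt, so an agent cannot quietly skip work.
          def __init__(self, steps):
              self.todo = {step: False for step in steps}

          def complete(self, step):
              self.todo[step] = True

          def finish(self):
              pending = [s for s, done in self.todo.items() if not done]
              if pending:
                  raise IncompleteWorkError(f"unfinished steps: {pending}")
              return "done"

      gate = PlanGate(["fetch data", "run eval", "write summary"])
      gate.complete("fetch data")
      gate.complete("run eval")
      # Calling gate.finish() here would raise IncompleteWorkError.
      gate.complete("write summary")
      print(gate.finish())  # prints "done"
      ```

      Enforcing the rule in code rather than in the prompt means an incomplete run fails loudly instead of "looking fine."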

Location
Terra Gallery & Event Venue
511 Harrison St, San Francisco, CA 94105, USA