Skip to content

feat: improve app insights telemetry#226

Open
Shreyas-Microsoft wants to merge 6 commits into
devfrom
psl-sw/37816-app-insights-telemetry
Open

feat: improve app insights telemetry#226
Shreyas-Microsoft wants to merge 6 commits into
devfrom
psl-sw/37816-app-insights-telemetry

Conversation

@Shreyas-Microsoft
Copy link
Copy Markdown
Collaborator

Purpose

This pull request introduces robust Application Insights and OpenTelemetry integration for the backend API, providing production-grade observability while minimizing noise and ingestion costs. It adds environment-driven configuration for telemetry, custom span processors to filter noisy spans (such as per-chunk ASGI responses and Cosmos DB dependencies), and a lightweight helper for emitting structured business events. Logging verbosity for known noisy packages is also clamped to keep logs actionable. The infrastructure Bicep templates are updated to avoid duplicate log ingestion.

Observability and Telemetry Integration:

  • Added Application Insights and OpenTelemetry support, configurable via the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable. This includes wiring up Azure Monitor and FastAPI instrumentation, with exclusion of health/startup probe routes from telemetry to avoid flooding. [1] [2] [3] [4] [5]
  • Introduced libs/logging/event_utils.py with the track_event_if_configured helper for emitting custom App Insights events, which is safe to call even when telemetry is not configured. [1] [2] [3]

Noise Suppression and Filtering:

  • Added custom OpenTelemetry SpanProcessor implementations (DropASGIResponseBodySpanProcessor, DropCosmosDependencySpanProcessor) in libs/logging/span_filters.py to filter out high-volume, low-value spans before export, reducing cost and improving signal-to-noise. [1] [2] [3]
  • Hard-clamped logging levels to WARNING for known noisy packages (e.g., Azure SDK internals, Cosmos diagnostics), regardless of operator configuration, to prevent log flooding in Application Insights. [1] [2]

Infrastructure Improvements:

  • Updated Bicep modules (infra/main.bicep, infra/main_custom.bicep) to document and prevent duplicate log ingestion by removing redundant diagnosticSettings when Application Insights is already wired to Log Analytics via workspaceResourceId. [1] [2]

These changes collectively ensure that the backend API emits actionable telemetry and business events to Application Insights with minimal configuration, while keeping the signal clean and ingestion costs predictable.

Does this introduce a breaking change?

  • Yes
  • No

Golden Path Validation

  • I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

  • I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

  • ...

Other Information

Shreyas-Microsoft and others added 5 commits May 6, 2026 19:11
Adds the small foundation needed for the rest of the App Insights
wiring on this branch:

- New `libs/logging/event_utils.py` with `track_event_if_configured`,
  a tiny gated wrapper around `azure.monitor.events.extension.track_event`
  that no-ops when `APPLICATIONINSIGHTS_CONNECTION_STRING` is unset.
  Lazy-imports the SDK and swallows export errors so telemetry can
  never break a request. Single warn-once latch keeps logs clean in
  unit tests / local dev.
- New `libs/logging/span_filters.py` with two custom `SpanProcessor`
  implementations:
    * `DropASGIResponseBodySpanProcessor` strips `http.response.body`
      child spans (one per streamed chunk) so streamed downloads / SSE
      do not flood Application Insights.
    * `DropCosmosDependencySpanProcessor` drops Cosmos DB dependency
      spans (matched on `db.system==cosmosdb` and the
      `documents.azure.com` peer host) so the Application Map is not
      dominated by per-call Cosmos operations.
- Adds `azure-monitor-events-extension` and (explicitly)
  `opentelemetry-instrumentation-fastapi` to `pyproject.toml`, then
  refreshes `uv.lock` via `uv lock` (no hand edits).

No call sites change in this commit; the new helpers are wired into
`Application.initialize` and the routers in the next two commits.

Implements groundwork for AC #1, #2, #3, #4 of AB#37816.
Work item: https://dev.azure.com/CSACTOSOL/CSA%20Solutioning/_workitems/edit/37816

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation.initialize

Glues the helpers from the previous commit into the FastAPI startup
path so AC #1, #2, #3, #4 of AB#37816 are satisfied at the code
layer:

`Application.initialize`
- New `_configure_azure_monitor`: when
  `APPLICATIONINSIGHTS_CONNECTION_STRING` is set, calls
  `azure.monitor.opentelemetry.configure_azure_monitor` with
  `enable_live_metrics=True` and the two custom span processors
  (`DropASGIResponseBodySpanProcessor`,
  `DropCosmosDependencySpanProcessor`) added in the previous commit.
  Connection string is read from env, never logged.
- New `_instrument_fastapi`: attaches
  `FastAPIInstrumentor.instrument_app(self.app, excluded_urls="health,startup")`
  AFTER all routers are registered so every business route is
  instrumented but the liveness / startup probes do not flood
  Application Insights with empty request rows.
- Both helpers fail-soft: any exception during telemetry setup is
  logged and swallowed, so a misconfigured App Insights instance can
  never block backend startup.
- Both helpers are no-ops when the connection string is unset, so
  local dev / unit tests behave exactly as they did before.

`Application_Base`
- Adds a `_NOISY_LOGGER_PACKAGES` hard-suppression list
  (`azure.core.pipeline.policies.http_logging_policy`, `azure.cosmos`,
  `opentelemetry.sdk`,
  `azure.monitor.opentelemetry.exporter.export._base`) clamped to
  WARNING regardless of the operator-supplied `AZURE_LOGGING_PACKAGES`.
  Without this clamp, the App Insights logs view is dominated by per-
  request HTTP policy logs and per-call Cosmos diagnostics.
- The clamp is one-way: never lowers a level the operator has
  explicitly raised below WARNING.

`.env.example`
- Documents `APPLICATIONINSIGHTS_CONNECTION_STRING` (commented) so
  local dev mirrors the Bicep-injected production env.

Validation
- `uv run pytest src/tests --no-cov -c /dev/null` -> 44 passed
  (with `PYTHONPATH=src/app`, matching the existing repo convention).
- Manual `Application()` smoke with the env var unset -> single
  INFO line, 26 routes registered, no exceptions.
- Manual `Application()` smoke with a fake conn string -> Azure
  Monitor configured, FastAPIInstrumentor attached, exporter
  fails-soft on DNS resolution, app continues.

Implements AC #1, #2, #3, #4 of AB#37816.
Work item: https://dev.azure.com/CSACTOSOL/CSA%20Solutioning/_workitems/edit/37816

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Application Insights custom-event emission and per-request span
annotation to every business handler in:
- routers/router_process.py
- routers/router_files.py
- routers/router_debug.py

For each handler:
- On success, emits a `<RouteName>Success` event via
  `track_event_if_configured` with the relevant domain ids
  (process_id, file_id, filename, counts, kill_state, ...).
- On exception, emits a `<RouteName>Error` event carrying
  `error`, `error_type`, and (when available) `status_code`.
  Then calls `current_span.record_exception(e)` and
  `set_status(StatusCode.ERROR)` so the failure is also visible in
  the App Insights end-to-end transaction view.
- Stamps domain ids onto the active span via `set_attribute` so
  custom queries on `requests | extend customDimensions` can pivot
  by `process_id` / `file_id` directly without re-parsing message
  strings.

Two small private helpers `_annotate_span` and
`_record_exception_on_span` are duplicated into both `router_process.py`
and `router_files.py` (rather than placed in `libs/logging/`) to keep
the OpenTelemetry import surface visible at the call site and to
avoid a new public helper that the rest of the codebase might
accidentally depend on.

`routers/http_probes.py` is left untouched — its routes match the
`excluded_urls="health,startup"` configured on `FastAPIInstrumentor`
in the previous commit and so are intentionally not telemetered.

The pre-existing `ILoggerService.log_info(f"...")` / `log_error(...)`
calls are preserved as-is. The new code emitted by this commit uses
`%s` lazy formatting in its log strings.

Validation
- `uv run pytest src/tests --no-cov -c /dev/null` -> 44 passed
  (with `PYTHONPATH=src/app`).
- `Application()` smoke-init -> 26 routes registered, all routers
  import cleanly.

Implements AC #3 of AB#37816 (telemetry visibility) and contributes
to AC #4 (continuous log collection).
Work item: https://dev.azure.com/CSACTOSOL/CSA%20Solutioning/_workitems/edit/37816

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ts gating

Adds the first dedicated test package for the new
`libs/logging/event_utils.py` helper at
`src/tests/logging/test_event_utils.py` (mirroring the existing
`src/tests/<area>/test_<thing>.py` layout — no new tooling, no new
config).

Coverage:

- `test_no_op_when_connection_string_unset`: with the env var absent,
  the helper does NOT call `track_event` and the SDK module is never
  imported (verified via a `sys.modules` fake).
- `test_warning_fires_only_once_per_process`: subsequent unconfigured
  calls are silent (one-shot warning latch).
- `test_unconfigured_warning_message_is_stable`: pins the exact
  warning template that the rest of the system asserts against.
- `test_forwards_to_track_event_when_configured`: with the env var
  set, the helper forwards the (name, properties) pair through to
  `azure.monitor.events.extension.track_event`.
- `test_none_properties_normalised_to_empty_dict`: `properties=None`
  surfaces as `{}` to the SDK so `customDimensions` is always an
  object.
- `test_whitespace_only_connection_string_is_treated_as_unset`:
  belt-and-braces gating against accidentally-blank env values.
- `test_sdk_exception_is_swallowed`: a `RuntimeError` thrown by the
  SDK is caught and logged; telemetry must never break a request.

The tests use a `sys.modules` fake for `azure.monitor.events.extension`
so they never import the real Azure Monitor SDK at runtime — keeps
the suite hermetic, fast (<1s), and CI-friendly.

Validation
- `uv run pytest src/tests --no-cov -c /dev/null` -> 51 passed
  (44 pre-existing + 7 new), with `PYTHONPATH=src/app`.

Implements verification scope of AC #3 of AB#37816.
Work item: https://dev.azure.com/CSACTOSOL/CSA%20Solutioning/_workitems/edit/37816

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ts AVM module

The AVM `insights/component@0.6.0` module already wires Application
Insights to Log Analytics through the `workspaceResourceId` parameter
(workspace-based App Insights). Adding a separate `diagnosticSettings`
entry pointing at the SAME workspace causes the platform-level
`AppAvailabilityResults`, `AppRequests`, etc. tables to be ingested
twice — once via the workspace-based App Insights pipeline and again
via the diagnostic-settings forwarder.

Removes the redundant line from both:
- `infra/main.bicep`           (line 305)
- `infra/main_custom.bicep`    (line 283)

The custom-event / OpenTelemetry path the rest of this branch wires
up sends data straight to App Insights via
`APPLICATIONINSIGHTS_CONNECTION_STRING`, which is already injected
into the backend-api and processor container apps by the same Bicep
files (no change required to those env-var blocks).

Validation
- `bicep build infra/main.bicep --outfile <tmp>`        -> OK, no diagnostics.
- `bicep build infra/main_custom.bicep --outfile <tmp>` -> OK, only a
  pre-existing BCP334 warning at line 694 (unrelated to this edit).

Implements AC #4 of AB#37816 (continuous, non-duplicated log collection).
Reference: microsoft/Conversation-Knowledge-Mining-Solution-Accelerator#811
Work item: https://dev.azure.com/CSACTOSOL/CSA%20Solutioning/_workitems/edit/37816

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
src/backend-api/src/app
   application.py601378%111–112, 114, 124, 127–128, 145–146, 148, 151, 155–156, 236
src/backend-api/src/app/libs/base
   application_base.py491373%37, 41, 62, 71, 74, 77–78, 83, 92, 100–103
src/backend-api/src/app/libs/logging
   __init__.py00100% 
   event_utils.py260100% 
   span_filters.py22863%110–117
src/backend-api/src/app/routers
   router_debug.py251060%41–42, 48–52, 55–57
   router_files.py881484%34–40, 48–52, 89, 96
   router_process.py4278181%42–48, 56–60, 93, 107–108, 116–117, 148–149, 157–158, 181–182, 190–191, 214, 218, 225, 232, 333, 337, 344, 411–413, 422–423, 447, 454, 519, 526, 599, 658–660, 668–669, 696, 729–730, 739–740, 780, 830–832, 842–846, 855–856, 884, 990–992, 1000–1001, 1027, 1060, 1063, 1087–1088, 1092–1093, 1109–1111, 1119–1120
TOTAL332529291% 

Tests Skipped Failures Errors Time
595 0 💤 0 ❌ 0 🔥 20.878s ⏱️

… timeout/connect events

Aligns the property shape of the Timeout/Connect error events with every
other error event emitted by the router (which all include both
`error` and `error_type`). Without `error`, an App Insights query
like `customEvents | where customDimensions.error contains "..."`
silently misses these two failure paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Collaborator Author

@Shreyas-Microsoft Shreyas-Microsoft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Telemetry / Event-Mapping Audit — PR #226

Pulled the branch (70fe7cb), traced every frontend call through to its router, and checked the event coverage end-to-end. Net assessment: events map the user-visible flow accurately, with one inconsistency I fixed and a few notes for follow-up.

Frontend → Backend → Event mapping

Walked through src/frontend/src/{api,slices,pages} and confirmed all routes the UI actually hits emit both success and error events:

User action API call Event family
Create a batch POST /api/process/create CreateProcess*
Upload a single file POST /api/file/upload UploadFile*
Upload many files POST /api/process/upload UploadFiles*
Start migration POST /api/process/start-processing StartProcessing*
Poll status (SSE-style) GET /api/process/status/{id}/render/ RenderProcessStatus*
Get summary GET /api/process/process-summary/{id} GetProcessSummary*
View a file GET /api/process/{id}/file/{filename} GetFileContent*
Delete file / batch DELETE /api/process/delete-file/... · delete-process/{id} DeleteFile* · DeleteProcess*
Cancel migration POST /api/process/cancel/{id} + GET /api/process/cancel/{id}/status CancelProcess* · GetCancelStatus*
Download converted ZIP GET /api/process/{id}/download DownloadProcessFiles*

Fix I pushed (commit 6589e61)

Inconsistent property shape on TimeoutException / ConnectError paths. In cancel_process (~L948) and get_cancel_status (~L1073), the timeout / connect-error events emitted only error_type while every other error event in the PR also includes error: str(e). That means an App Insights KQL query like customEvents | where customDimensions.error contains "..." silently misses these two failure paths. Added the error field for shape consistency. All 51 backend-api tests still green.

Observations worth a follow-up (not blocking this PR)

  1. Pre-existing bug, surfaced by the audit (not from this PR). src/backend-api/src/app/routers/router_process.py lines 79–83 stack two decorators on the same handler:

    PROCESS_AGENT_ACTIVITIES = "/status/{process_id}/activities"
    ...
    @router.get(process_router_paths.PROCESS_AGENT_ACTIVITIES, response_model=Process)
    @router.post("/create")
    async def create(request: Request):

    git blame shows this came from the initial commit (084317d), not from this branch. Calling GET /status/{id}/activities would invoke create() — which doesn't take a process_id path param and would emit CreateProcessSuccess for a brand-new process. From a telemetry standpoint this means the /activities route is invisible in custom events; from an API correctness standpoint it's a separate bug worth a dedicated fix outside this PR.

  2. Success event is emitted before the ZIP is streamed in download_process_files (~L631). For the current implementation it's correct because io.BytesIO(zip_buffer.read()) materialises the whole archive before the response — the work is done. Worth keeping in mind if anyone later converts this to a true chunked / SSE stream; the same pattern would then falsely report success before any bytes leave the server.

  3. Filenames in event properties. UploadFileSuccess, DeleteFileSuccess, GetFileContentSuccess etc. include filename. Useful for triage, but worth a quick sanity-check with whoever owns compliance — file names in a migration workload can carry business-sensitive identifiers.

  4. enable_live_metrics=True is hard-coded in Application._configure_azure_monitor. Great for the maintenance window described in AC #2 and for the demo, but I'd consider gating it on an env var so production deployments can disable the always-on Live Metrics channel if they don't need it.

  5. Router unit-test coverage is still 0%. The new test_event_utils.py is solid (190 lines covering the gating helper and the lazy-import path). The router-level emissions remain uncovered — that was already true on dev, so not in scope here, but worth queuing a follow-up so the event-name contract doesn't silently drift.

Branch state

  • Pushed 6589e61 on top of 70fe7cb.
  • 51/51 backend-api unit tests pass locally.
  • PR is MERGEABLE; all CI jobs were green on the previous head and should stay green for this one-character-class change.

LGTM with the fix applied — no further blockers from my side.

@github-actions
Copy link
Copy Markdown

Coverage

Processor Coverage Report •
FileStmtsMissCoverMissing
TOTAL572571987% 
report-only-changed-files is enabled. No files were changed during this commit :)

Tests Skipped Failures Errors Time
812 0 💤 0 ❌ 0 🔥 19.726s ⏱️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant