Improvements to the results analysis app by fanny-riols · Pull Request #77 · ServiceNow/eva

fanny-riols · 2026-04-24T20:30:04Z

Summary

Cascading System filter: replaced the flat Model filter in the cross-run comparison with a System filter that shows a human-readable label and is constrained by the selected Provider, so irrelevant systems are hidden automatically.
Complete-runs-only & hide-incomplete toggles: two new sidebar toggles filter out partial runs from the comparison table and scatter plot; both are URL-bound so the view is shareable.
Per-sample heatmap: a new interactive heatmap below the comparison tables shows per-record metric scores across all runs, with a metric selector, swap-axes toggle, and correct RdYlGn / RdYlGn_r colorscale based on whether lower or higher is better.
Pinned columns & row numbers: the #, link, and System columns are now pinned in the dataframe so they stay visible when scrolling horizontally; the link column is labelled "Run".
URL-bound state: the latest-run toggle and show-sub-metrics toggle are now bound to query params, so deep-linking to a specific view works end-to-end.
Lower-is-better metric set: restricted the inverted colorscale to response_speed and stt_wer only (was applied too broadly).
Latest-run filter fix: timestamp-only folders now derive their system name from config.json so that different models sharing the same folder format are treated as distinct systems and deduplicated correctly.
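
The URL-bound toggles described above could be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `read_bool_param` and `write_bool_param` are hypothetical helper names, the accepted string values are an assumption, and `params` stands in for any dict-like mapping of strings (such as Streamlit's `st.query_params`).

```python
from typing import Mapping, MutableMapping

# Assumed string encodings for booleans in the URL (hypothetical choice).
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def read_bool_param(params: Mapping[str, str], key: str, default: bool) -> bool:
    """Read a toggle value from query params, falling back to the default."""
    raw = params.get(key)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default  # Unrecognised values are ignored rather than raising.


def write_bool_param(
    params: MutableMapping[str, str], key: str, value: bool, default: bool
) -> None:
    """Store a toggle value, omitting it when it equals the default."""
    if value == default:
        params.pop(key, None)  # Keep shared links short.
    else:
        params[key] = "1" if value else "0"
```

Dropping a parameter when it matches the default keeps shared links short, at the cost that a later change of default would alter what an old link shows.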

Two bugs in filter_latest_runs:
- Timestamp-only folders (no system suffix) used the full folder name as system identity, so multiple haiku runs would never deduplicate. Now loads config.json to derive the system name the same way the display does.
- When multiple output directories are entered, runs were concatenated per-directory without a global sort, so an older run from dir1 could win over a newer run from dir2. Now sorted globally by folder name before filtering.
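
The global-sort fix can be illustrated with a small sketch. It is not the real `filter_latest_runs`: the function name `latest_run_per_system`, the `system_of` callback, and the folder-name format (`<timestamp>` or `<timestamp>_<system>`, with timestamps that sort lexicographically) are all assumptions made for the example.

```python
from typing import Callable, Iterable


def latest_run_per_system(
    run_dirs: Iterable[list[str]], system_of: Callable[[str], str]
) -> list[str]:
    """Keep only the newest run folder for each system identity."""
    # Merge the folders of every directory first, then sort once globally,
    # so a newer run in one directory can replace an older run in another.
    all_runs = sorted(name for names in run_dirs for name in names)
    latest: dict[str, str] = {}
    for name in all_runs:  # Ascending order: later names overwrite earlier ones.
        latest[system_of(name)] = name
    return sorted(latest.values())


def suffix_system(name: str) -> str:
    """Hypothetical identity rule: use the suffix after the timestamp if any."""
    _, _, suffix = name.partition("_")
    return suffix or name  # Timestamp-only folders stay unique.


dir1 = ["20260401T0900_haiku", "20260403T0900_sonnet"]
dir2 = ["20260402T0900_haiku", "20260404T0900"]
kept = latest_run_per_system([dir1, dir2], suffix_system)
```

Passing the identity rule in as a callback keeps the deduplication logic separate from the question of how a system name is derived (folder suffix or config.json).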

The previous change made timestamp-only folders (no system suffix) derive their system name from config.json. This caused a newer timestamp-only run to claim the same system name as an older suffixed run with a matching model config, silently suppressing the suffixed run in filter_latest_runs. Timestamp-only folders now keep their full folder name as their system name (which is unique), so they never suppress suffixed runs with matching model configs.

…ison

Timestamp-only folders (no system suffix) derive their system identity from config.json so that runs with different models (e.g. different LLMs with the same STT/TTS) are treated as distinct systems, and multiple runs of the same model are deduplicated to the latest one.

Bind 'Show sub-metrics' to query params so it persists in shared links. Add 'Hide incomplete results' toggle (default on, URL-bound) that drops rows missing any EVA-A or EVA-X metric from the table and scatter plot; diagnostic and validation metrics are ignored.
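
A minimal sketch of the "hide incomplete results" rule, assuming each table row is a plain dict keyed by metric name. The helper names and the metric names in the example (`eva_a_score`, `eva_x_score`, `diag_latency`) are hypothetical; the real app works on a dataframe.

```python
import math
from typing import Any, Iterable


def is_missing(value: Any) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def hide_incomplete(
    rows: Iterable[dict[str, Any]], required_metrics: Iterable[str]
) -> list[dict[str, Any]]:
    """Drop rows missing any required metric; other columns are ignored."""
    required = list(required_metrics)
    return [
        row
        for row in rows
        if not any(is_missing(row.get(metric)) for metric in required)
    ]


rows = [
    {"system": "A", "eva_a_score": 0.8, "eva_x_score": 0.7, "diag_latency": None},
    {"system": "B", "eva_a_score": 0.9, "eva_x_score": float("nan")},
    {"system": "C", "eva_a_score": 0.6},
]
complete = hide_incomplete(rows, ["eva_a_score", "eva_x_score"])
```

Row A survives even though its diagnostic column is empty, because only the required metrics are checked.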

… comparison tables

Renders an interactive Plotly heatmap below the results tables with sample IDs on one axis, systems on the other, and cell color encoding the selected metric. Includes a metric dropdown, RdYlGn/RdYlGn_r colorscale auto-selected by directionality, and a Swap Axes toggle that reverses sample order on the y-axis to preserve top-to-bottom reading order.
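
The colorscale choice and axis swap might look roughly like the sketch below. It only prepares the data a heatmap would need and does not call Plotly. The function names are hypothetical, the default orientation (samples on the x-axis) is an assumption, and the reversal relies on Plotly drawing the first y category at the bottom by default.

```python
from typing import Optional

Grid = list[list[Optional[float]]]


def colorscale_for(lower_is_better: bool) -> str:
    """Green should mark good cells, so reverse the scale for inverted metrics."""
    return "RdYlGn_r" if lower_is_better else "RdYlGn"


def heatmap_grid(
    scores: dict[str, dict[str, float]], swap_axes: bool = False
) -> tuple[list[str], list[str], Grid]:
    """Build (x_labels, y_labels, z) from scores[system][sample_id]."""
    systems = list(scores)
    samples = sorted({s for per_system in scores.values() for s in per_system})
    if not swap_axes:
        # Assumed default: samples along x, one row per system.
        z = [[scores[sys].get(s) for s in samples] for sys in systems]
        return samples, systems, z
    # Samples along y: reverse them so the first sample ends up at the top.
    y_labels = samples[::-1]
    z = [[scores[sys].get(s) for sys in systems] for s in y_labels]
    return systems, y_labels, z


scores = {"A": {"s1": 0.1, "s2": 0.9}, "B": {"s1": 0.5}}
x, y, z = heatmap_grid(scores, swap_axes=True)
```

Missing cells are left as `None`, which a heatmap would typically render as gaps.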

Replace substring-based keyword matching with an exact-name set containing only the two metrics that should use inverted color scales.
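
To see why substring matching can be too broad, consider the sketch below. The keyword tuple and the metric name `answer_relevance` are hypothetical (the original keywords are not shown in this PR); only the two names in the exact set come from the change itself.

```python
# Hypothetical keywords standing in for the old substring-based check.
LOWER_BETTER_KEYWORDS = ("speed", "wer")

# Exact names taken from the change described above.
LOWER_BETTER_METRICS = {"response_speed", "stt_wer"}


def is_lower_better_substring(metric: str) -> bool:
    """Old-style check: any keyword appearing anywhere in the name."""
    return any(keyword in metric for keyword in LOWER_BETTER_KEYWORDS)


def is_lower_better_exact(metric: str) -> bool:
    """New-style check: the name must be in the explicit set."""
    return metric in LOWER_BETTER_METRICS


# "answer" happens to contain "wer", so the substring check misfires.
false_positive = is_lower_better_substring("answer_relevance")
exact_result = is_lower_better_exact("answer_relevance")
```

An explicit set has to be updated whenever a new inverted metric is added, which is the trade-off for avoiding accidental matches.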

gabegma · 2026-04-25T02:53:30Z

+# Metric names for which lower values are better
+_LOWER_BETTER_METRICS = {"response_speed", "stt_wer"}
+


This is now detected automatically, as of #65.

JosephMarinier · 2026-04-25T18:18:32Z

+        sub_df.insert(0, "#", range(1, len(sub_df) + 1))
+        sub_df.insert(1, "link", link_series)
        sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename})
        score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics]
        styled = sub_df.style.map(_color_cell, subset=score_cols)
        styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—")
        st.dataframe(
            styled,
            hide_index=True,


Is it important that the # column start at 1? If it can start at 0, we can simply show the index column by removing hide_index=True.

Suggested change

-        sub_df.insert(0, "#", range(1, len(sub_df) + 1))
-        sub_df.insert(1, "link", link_series)
+        sub_df.insert(0, "link", link_series)
         sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename})
         score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics]
         styled = sub_df.style.map(_color_cell, subset=score_cols)
         styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—")
         st.dataframe(
             styled,
-            hide_index=True,

JosephMarinier · 2026-04-25T18:19:05Z

            hide_index=True,
-            column_config={"link": st.column_config.LinkColumn(" ", display_text="🔍", width=40)},
+            column_config={
+                "#": st.column_config.NumberColumn("#", width=50, pinned=True),


If we apply this suggestion, we can also remove this since the index column is pinned by default.

Suggested change

-                "#": st.column_config.NumberColumn("#", width=50, pinned=True),

JosephMarinier · 2026-04-25T18:44:40Z

+            ctrl_col, swap_col = st.columns([4, 1])
+            with ctrl_col:
+                selected_heatmap_metric = st.selectbox(
+                    "Metric",
+                    available_heatmap_metrics,
+                    format_func=_format_metric_name,
+                    key="heatmap_metric",
+                )
+            with swap_col:
+                st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True)
+                swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes")


Streamlit now has a st.space() function that can be used instead of st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True). Also, we can use st.container(horizontal=True) so that items flow naturally, instead of forcing arbitrary columns with st.columns(). Lastly, with a few tweaks, we can vertically center the toggle relative to the selectbox:

Suggested change

-            ctrl_col, swap_col = st.columns([4, 1])
-            with ctrl_col:
-                selected_heatmap_metric = st.selectbox(
-                    "Metric",
-                    available_heatmap_metrics,
-                    format_func=_format_metric_name,
-                    key="heatmap_metric",
-                )
-            with swap_col:
-                st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True)
-                swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes")
+            with st.container(horizontal=True, vertical_alignment="center", gap="medium"):
+                selected_heatmap_metric = st.selectbox(
+                    "Metric",
+                    available_heatmap_metrics,
+                    format_func=_format_metric_name,
+                    key="heatmap_metric",
+                )
+                with st.container(width="content", gap=None):
+                    st.space(28)  # Height of the selectbox's label.
+                    swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes")

- Remove redundant _LOWER_BETTER_METRICS set and _is_lower_better(); use _is_lower_is_better() (registry-driven) for heatmap colorscale instead
- Drop manual # row counter column; show default dataframe index
- Replace st.columns + markdown padding with st.container(horizontal=True) + st.space() for heatmap controls

fanny-riols added 13 commits April 17, 2026 15:27

Persist latest-run toggle in URL and propagate to run overview links

11727c5

Replace Model filter with cascading System filter in cross-run compar…

e94f9c3

…ison

Add 'Complete runs only' toggle to cross-run comparison

179b75e

Apply complete-runs-only filter to scatter plot

fa24303

Add pinned row IDs, link label, and pinned System column to cross-run…

079e210

… comparison tables

Restrict lower-is-better to response_speed and stt_wer only

3e8b5d3

Replace substring-based keyword matching with an exact-name set containing only the two metrics that should use inverted color scales.

Merge branch 'main' into pr/fr/results_app

1d9855d

Hide app errors from UI, log file-load failures to stderr

9f7dfc3

gabegma reviewed Apr 25, 2026

JosephMarinier reviewed Apr 25, 2026

JosephMarinier approved these changes Apr 25, 2026

gabegma and others added 2 commits April 25, 2026 16:08

Merge branch 'main' into pr/fr/results_app

d648c86

fanny-riols wants to merge 15 commits into main from pr/fr/results_app

3 participants
