Improvements to the results analysis app #77

Draft
fanny-riols wants to merge 15 commits into main from pr/fr/results_app

@fanny-riols fanny-riols commented Apr 24, 2026

Summary

  • Cascading System filter: replaced the flat Model filter in the cross-run comparison with a System filter that shows a human-readable label and is constrained by the selected Provider, so irrelevant systems are hidden automatically.
  • Complete-runs-only & hide-incomplete toggles: two new sidebar toggles filter out partial runs from the comparison table and scatter plot; both are URL-bound so the view is shareable.
  • Per-sample heatmap: a new interactive heatmap below the comparison tables shows per-record metric scores across all runs, with a metric selector, swap-axes toggle, and correct RdYlGn / RdYlGn_r colorscale based on whether lower or higher is better.
  • Pinned columns & row numbers: the #, link, and System columns are now pinned in the dataframe so they stay visible when scrolling horizontally; the link column is labelled "Run".
  • URL-bound state: the latest-run toggle and show-sub-metrics toggle are now bound to query params, so deep-linking to a specific view works end-to-end.
  • Lower-is-better metric set: restricted the inverted colorscale to response_speed and stt_wer only (was applied too broadly).
  • Latest-run filter fix: timestamp-only folders now derive their system name from config.json so that different models sharing the same folder format are treated as distinct systems and deduplicated correctly.
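
For readers who want the cascading behaviour spelled out, here is a minimal sketch of how a System filter could be constrained by the selected Provider. It is not the code in apps/analysis.py: the `Run` record, `system_options`, and `prune_selection` are hypothetical names, and it assumes that an empty provider selection means "no constraint".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    # Hypothetical record: one row of the cross-run comparison.
    provider: str
    system: str  # human-readable label shown in the System filter


def system_options(runs, selected_providers):
    """Return the sorted system labels offered for the selected providers.

    Assumption: an empty provider selection places no constraint.
    """
    allowed = set(selected_providers)
    return sorted({r.system for r in runs if not allowed or r.provider in allowed})


def prune_selection(selected_systems, options):
    """Drop previously selected systems that the cascade no longer offers."""
    offered = set(options)
    return [s for s in selected_systems if s in offered]
```

Pruning the previous selection matters because a system picked under one provider would otherwise keep filtering the table after the provider changes.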

Two bugs in filter_latest_runs:
- Timestamp-only folders (no system suffix) used the full folder name as
  system identity, so multiple haiku runs would never deduplicate. Now loads
  config.json to derive the system name the same way the display does.
- When multiple output directories are entered, runs were concatenated
  per-directory without a global sort, so an older run from dir1 could win
  over a newer run from dir2. Now sorted globally by folder name before filtering.

The previous change made timestamp-only folders (no system suffix) derive
their system name from config.json. This caused a newer timestamp-only run
to claim the same system name as an older suffixed run with a matching model
config, silently suppressing the suffixed run in filter_latest_runs.

Timestamp-only folders keep their full folder name as system name (unique),
so they never suppress suffixed runs with matching model configs.

Timestamp-only folders (no system suffix) derive their system identity
from config.json so that runs with different models (e.g. different LLMs
with the same STT/TTS) are treated as distinct systems, and multiple runs
of the same model are deduplicated to the latest one.
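
The deduplication described in these commit messages can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual filter_latest_runs: it assumes folder names start with a lexicographically sortable timestamp that contains no underscore, optionally followed by `_<system>`, and it uses a plain dict (`config_systems`) as a stand-in for reading config.json. All function names are hypothetical.

```python
from pathlib import PurePosixPath


def make_system_of(config_systems):
    """Build a lookup from run path to system identity.

    Assumption: folders are named "<timestamp>" or "<timestamp>_<system>".
    config_systems stands in for the system name derived from config.json.
    """
    def system_of(path):
        name = PurePosixPath(path).name
        _timestamp, _, suffix = name.partition("_")
        # Suffixed folders use the suffix; timestamp-only folders use the config.
        return suffix or config_systems[path]
    return system_of


def latest_run_per_system(run_paths, system_of):
    """Keep only the newest run per system across all output directories."""
    def folder(path):
        return PurePosixPath(path).name

    latest = {}
    # Sort globally by folder name, so the result does not depend on which
    # directory a run came from; newer runs overwrite older ones.
    for path in sorted(run_paths, key=folder):
        latest[system_of(path)] = path
    return sorted(latest.values(), key=folder)
```

As the commit messages above point out, a config-derived name that equals a folder suffix makes the two runs compete for the same slot; this sketch does not guard against that case.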
Bind 'Show sub-metrics' to query params so it persists in shared links.

Add 'Hide incomplete results' toggle (default on, URL-bound) that drops
rows missing any EVA-A or EVA-X metric from the table and scatter plot;
diagnostic and validation metrics are ignored.
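
A rough sketch of the "hide incomplete" rule, for illustration only. It assumes rows are plain dicts and that the caller passes the list of required (EVA-A and EVA-X) metric names; the metric names used in the usage note are placeholders, not the real metric identifiers.

```python
import math


def _present(value):
    """True when a metric has a usable value (neither None nor NaN)."""
    if value is None:
        return False
    return not (isinstance(value, float) and math.isnan(value))


def hide_incomplete(rows, required_metrics, enabled=True):
    """Drop rows that are missing any required metric.

    Metrics outside required_metrics (e.g. diagnostic or validation ones)
    are ignored, so a missing diagnostic value never hides a row.
    """
    if not enabled:
        return list(rows)
    return [r for r in rows if all(_present(r.get(m)) for m in required_metrics)]
```

For example, with `required_metrics=["eva_a", "eva_x"]` (placeholder names), a row with a NaN `eva_x` is dropped while a row whose only gap is a diagnostic metric is kept.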
Renders an interactive Plotly heatmap below the results tables with
sample IDs on one axis, systems on the other, and cell color encoding
the selected metric. Includes a metric dropdown, RdYlGn/RdYlGn_r
colorscale auto-selected by directionality, and a Swap Axes toggle
that reverses sample order on the y-axis to preserve top-to-bottom
reading order.

Replace substring-based keyword matching with an exact-name set containing only the two metrics that should use inverted color scales.
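
To illustrate why exact-name matching is safer than substring matching for picking the colorscale, here is a small sketch. The two metric names come from this PR; `colorscale_for`, `matches_keyword`, the keyword "speed", and the metric "speed_accuracy" are hypothetical and exist only for the example. Note that later in this PR the set is dropped in favour of the registry-driven `_is_lower_is_better()`.

```python
# The two metrics named in this PR for which lower values are better.
LOWER_IS_BETTER = frozenset({"response_speed", "stt_wer"})


def colorscale_for(metric):
    """Pick the Plotly colorscale name for a metric.

    "RdYlGn" runs red (low) to green (high); "RdYlGn_r" is its reverse,
    so low values render green for lower-is-better metrics.
    """
    return "RdYlGn_r" if metric in LOWER_IS_BETTER else "RdYlGn"


def matches_keyword(metric, keywords=("speed", "wer")):
    """Hypothetical substring matcher, shown only to contrast behaviour."""
    return any(k in metric for k in keywords)
```

With the substring matcher, a made-up metric such as "speed_accuracy" would be treated as lower-is-better, while `colorscale_for("speed_accuracy")` returns the regular "RdYlGn" scale.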
Comment thread apps/analysis.py Outdated
Comment on lines +83 to +85
# Metric names for which lower values are better
_LOWER_BETTER_METRICS = {"response_speed", "stt_wer"}

I now detect this automatically, as of #65.

Comment thread apps/analysis.py Outdated
Comment on lines 1184 to 1192
sub_df.insert(0, "#", range(1, len(sub_df) + 1))
sub_df.insert(1, "link", link_series)
sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename})
score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics]
styled = sub_df.style.map(_color_cell, subset=score_cols)
styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—")
st.dataframe(
    styled,
    hide_index=True,

Is it important that the # column start at 1? If it can start at 0, we can simply show the index column by removing hide_index=True.

Suggested change
- sub_df.insert(0, "#", range(1, len(sub_df) + 1))
- sub_df.insert(1, "link", link_series)
- sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename})
- score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics]
- styled = sub_df.style.map(_color_cell, subset=score_cols)
- styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—")
- st.dataframe(
-     styled,
-     hide_index=True,
+ sub_df.insert(0, "link", link_series)
+ sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename})
+ score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics]
+ styled = sub_df.style.map(_color_cell, subset=score_cols)
+ styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—")
+ st.dataframe(
+     styled,

Comment thread apps/analysis.py Outdated
  hide_index=True,
- column_config={"link": st.column_config.LinkColumn(" ", display_text="🔍", width=40)},
+ column_config={
+     "#": st.column_config.NumberColumn("#", width=50, pinned=True),

@JosephMarinier JosephMarinier Apr 25, 2026

If we apply this suggestion, we can also remove this since the index column is pinned by default.

Suggested change
- "#": st.column_config.NumberColumn("#", width=50, pinned=True),

Comment thread apps/analysis.py Outdated
Comment on lines +1237 to +1247
ctrl_col, swap_col = st.columns([4, 1])
with ctrl_col:
    selected_heatmap_metric = st.selectbox(
        "Metric",
        available_heatmap_metrics,
        format_func=_format_metric_name,
        key="heatmap_metric",
    )
with swap_col:
    st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True)
    swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes")

Streamlit now has a st.space() function that can be used instead of st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True). Also, we can use st.container(horizontal=True) so that items flow naturally, instead of forcing arbitrary columns with st.columns(). Lastly, with a few tweaks, we can vertically align the toggle with the selectbox:

Suggested change
- ctrl_col, swap_col = st.columns([4, 1])
- with ctrl_col:
-     selected_heatmap_metric = st.selectbox(
-         "Metric",
-         available_heatmap_metrics,
-         format_func=_format_metric_name,
-         key="heatmap_metric",
-     )
- with swap_col:
-     st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True)
-     swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes")
+ with st.container(horizontal=True, vertical_alignment="center", gap="medium"):
+     selected_heatmap_metric = st.selectbox(
+         "Metric",
+         available_heatmap_metrics,
+         format_func=_format_metric_name,
+         key="heatmap_metric",
+     )
+     with st.container(width="content", gap=None):
+         st.space(28)  # Height of the selectbox's label.
+         swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes")

gabegma and others added 2 commits April 25, 2026 16:08
- Remove redundant _LOWER_BETTER_METRICS set and _is_lower_better(); use _is_lower_is_better() (registry-driven) for heatmap colorscale instead
- Drop manual # row counter column; show default dataframe index
- Replace st.columns + markdown padding with st.container(horizontal=True) + st.space() for heatmap controls