Draft
Conversation
Two bugs in filter_latest_runs: - Timestamp-only folders (no system suffix) used the full folder name as system identity, so multiple haiku runs would never deduplicate. Now loads config.json to derive the system name the same way the display does. - When multiple output directories are entered, runs were concatenated per-directory without a global sort, so an older run from dir1 could win over a newer run from dir2. Now sorted globally by folder name before filtering.
The previous change made timestamp-only folders (no system suffix) derive their system name from config.json. This caused a newer timestamp-only run to claim the same system name as an older suffixed run with a matching model config, silently suppressing the suffixed run in filter_latest_runs. Timestamp-only folders keep their full folder name as system name (unique), so they never suppress suffixed runs with matching model configs.
Timestamp-only folders (no system suffix) derive their system identity from config.json so that runs with different models (e.g. different LLMs with the same STT/TTS) are treated as distinct systems, and multiple runs of the same model are deduplicated to the latest one.
Bind 'Show sub-metrics' to query params so it persists in shared links. Add 'Hide incomplete results' toggle (default on, URL-bound) that drops rows missing any EVA-A or EVA-X metric from the table and scatter plot; diagnostic and validation metrics are ignored.
… comparison tables
Renders an interactive Plotly heatmap below the results tables with sample IDs on one axis, systems on the other, and cell color encoding the selected metric. Includes a metric dropdown, RdYlGn/RdYlGn_r colorscale auto-selected by directionality, and a Swap Axes toggle that reverses sample order on the y-axis to preserve top-to-bottom reading order.
Replace substring-based keyword matching with an exact-name set containing only the two metrics that should use inverted color scales.
gabegma
reviewed
Apr 25, 2026
Comment on lines
+83
to
+85
| # Metric names for which lower values are better | ||
| _LOWER_BETTER_METRICS = {"response_speed", "stt_wer"} | ||
|
|
Comment on lines
1184
to
1192
| sub_df.insert(0, "#", range(1, len(sub_df) + 1)) | ||
| sub_df.insert(1, "link", link_series) | ||
| sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename}) | ||
| score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics] | ||
| styled = sub_df.style.map(_color_cell, subset=score_cols) | ||
| styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—") | ||
| st.dataframe( | ||
| styled, | ||
| hide_index=True, |
Collaborator
There was a problem hiding this comment.
Is it important that the # column start at 1? If it can start at 0, we can simply show the index column by removing hide_index=True.
Suggested change
| sub_df.insert(0, "#", range(1, len(sub_df) + 1)) | |
| sub_df.insert(1, "link", link_series) | |
| sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename}) | |
| score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics] | |
| styled = sub_df.style.map(_color_cell, subset=score_cols) | |
| styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—") | |
| st.dataframe( | |
| styled, | |
| hide_index=True, | |
| sub_df.insert(0, "link", link_series) | |
| sub_df = sub_df.rename(columns={**id_rename, **composite_rename, **col_rename}) | |
| score_cols = [composite_rename[c] for c in composites] + [col_rename[m] for m in metrics] | |
| styled = sub_df.style.map(_color_cell, subset=score_cols) | |
| styled = styled.format(dict.fromkeys(score_cols, "{:.3f}"), na_rep="—") | |
| st.dataframe( | |
| styled, |
| hide_index=True, | ||
| column_config={"link": st.column_config.LinkColumn(" ", display_text="🔍", width=40)}, | ||
| column_config={ | ||
| "#": st.column_config.NumberColumn("#", width=50, pinned=True), |
Collaborator
There was a problem hiding this comment.
If we apply this suggestion, we can also remove this since the index column is pinned by default.
Suggested change
| "#": st.column_config.NumberColumn("#", width=50, pinned=True), |
Comment on lines
+1237
to
+1247
| ctrl_col, swap_col = st.columns([4, 1]) | ||
| with ctrl_col: | ||
| selected_heatmap_metric = st.selectbox( | ||
| "Metric", | ||
| available_heatmap_metrics, | ||
| format_func=_format_metric_name, | ||
| key="heatmap_metric", | ||
| ) | ||
| with swap_col: | ||
| st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True) | ||
| swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes") |
Collaborator
There was a problem hiding this comment.
Streamlit now has a st.space() function that can be used instead of st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True). Also, we can use st.container(horizontal=True) to that items flow naturally, instead of forcing arbitrary columns with st.columns(). Lastly, with a few tweaks, we can center the toggle to the selectbox:
Suggested change
| ctrl_col, swap_col = st.columns([4, 1]) | |
| with ctrl_col: | |
| selected_heatmap_metric = st.selectbox( | |
| "Metric", | |
| available_heatmap_metrics, | |
| format_func=_format_metric_name, | |
| key="heatmap_metric", | |
| ) | |
| with swap_col: | |
| st.markdown("<div style='padding-top:28px'></div>", unsafe_allow_html=True) | |
| swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes") | |
| with st.container(horizontal=True, vertical_alignment="center", gap="medium"): | |
| selected_heatmap_metric = st.selectbox( | |
| "Metric", | |
| available_heatmap_metrics, | |
| format_func=_format_metric_name, | |
| key="heatmap_metric", | |
| ) | |
| with st.container(width="content", gap=None): | |
| st.space(28) # Height of the selectbox's label. | |
| swap_axes = st.toggle("Swap axes", key="heatmap_swap_axes") |
JosephMarinier
approved these changes
Apr 25, 2026
- Remove redundant _LOWER_BETTER_METRICS set and _is_lower_better(); use _is_lower_is_better() (registry-driven) for heatmap colorscale instead - Drop manual # row counter column; show default dataframe index - Replace st.columns + markdown padding with st.container(horizontal=True) + st.space() for heatmap controls
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
#, link, and System columns are now pinned in the dataframe so they stay visible when scrolling horizontally; the link column is labelled "Run".response_speedandstt_weronly (was applied too broadly).config.jsonso that different models sharing the same folder format are treated as distinct systems and deduplicated correctly.