Fix TUI bugs and UX issues from real hardware testing #15

Merged
eshork merged 32 commits into main from tui-fixes
Apr 24, 2026
eshork commented Apr 24, 2026

Summary

Addresses 27 user-reported bugs and UX issues discovered during live testing on an RTX 3080 system booting NeuralDrive from USB, plus a critical GPU acceleration bug where Ollama was silently falling back to CPU-only inference. Includes comprehensive documentation updates across 17 files to reflect all implementation changes. All code changes have been deployed to the live system and verified on hardware.

Commits

Commit Description
6ecd510 Fix TUI bugs and UX issues from real hardware testing (15 files, +1476/-174)
334ef93 Show model metadata and fix button visibility in model list
927df1a Save API key to persistent disk alongside overlay
a1e82c4 Add live clock to dashboard top-right corner
bdfd694 Fix chat screen layout and text wrapping
d0f1ca8 Add model delete and fix chat model persistence
ff34cab Harden wizard finalization, add --wizard flag, and Enter-to-pull
fbac7b6 Harden partition detection, wizard source of truth, and subprocess error checking
6efe83a Move partition snapshot before mkpart to prevent race condition
5534f54 Harden partition creation safety and boot device detection
c0e802c Guard pull button and Enter against concurrent submissions
b0d8a88 Remove dual wizard marker, check all subprocess returns, normalize live-media path, guard _pull_next
64a9514 Fall through to findmnt when live-media PKNAME fails
b1003b1 Fix Header crash on screen transitions and simplify --wizard flag
493fe4e Fix GPU acceleration: load nvidia-uvm at boot and remove cgroup device filter
5e1d376 Escape Rich markup in [GPU]/[CPU] tags so they render visibly
5f8908e Add arrow-key navigation with scroll-follow to installed models list
b555d1f Unify models screen focus: zone-based Tab, arrow-key list+button nav
3c12f7d Models screen: skip disabled buttons, Loading... feedback, column legend
8064d81 Restore _unload_from_vram and add legend column separators
fe0a28f Fix unload race condition and keep manually loaded models in VRAM
4bc2f32 Fix keep_alive: pass integer -1 instead of string
78fbc0d Redesign services screen to match models screen UX
c8e3a71 Remap screen hotkeys to F1-F5: Dash, Models, Svc, Logs, Chat
ea9fcfc Guard service poll timer against widget rebuild race
2a704c6 Allow concurrent model loading and persist Ollama config
37d0330 Widen services Restart button to fit label
2160fa8 Create webui data directory on persistence partition
5dec79c Update documentation to reflect TUI redesign, GPU fixes, and config changes
c621827 Mark VRAM-loaded models with * in chat selector and retain input focus

Issues Addressed

Closes #6, #8, #9, #10, #11, #13. References #4, #5, #7, #12, #14.

TUI Changes

Navigation Overhaul

  • Replaced single-letter hotkeys with F1-F5 function keys (Dashboard, Models, Services, Logs, Chat)
  • Implemented zone-based Tab navigation within screens
  • Arrow key navigation for lists and per-item action buttons
  • Enter key activates focused elements
  • Removed command palette and all hidden hotkeys

Models Screen (complete redesign)

  • Three-zone layout: installed models list, browse catalog, pull-by-name
  • Inline Load/Unload/Delete buttons per model with Left/Right arrow navigation
  • Column legend with metadata: Params / Quant / Disk / VRAM / Status
  • VRAM usage cache persisted to /var/lib/neuraldrive/config/
  • Download progress bar with cancel support
  • Loading... feedback and disabled button skip logic
  • Unload race condition fix (poll /api/ps until confirmed)
  • keep_alive: -1 (integer) for infinite retention on manual loads
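The keep_alive behaviour in the last bullet can be sketched as follows. This is a minimal illustration rather than the code in this PR: `OLLAMA_URL`, `build_keep_alive_payload`, and `send_keep_alive` are hypothetical names, the default local Ollama endpoint is assumed, and using `keep_alive: 0` for the unload request is an assumption based on the Ollama API.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumption: default local Ollama endpoint


def build_keep_alive_payload(model: str, load: bool) -> bytes:
    """Return the JSON body for a manual load (keep forever) or an unload."""
    # keep_alive must be the integer -1, not the string "-1": the string form
    # is parsed as a duration and rejected for lacking a unit.
    keep_alive = -1 if load else 0
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode()


def send_keep_alive(model: str, load: bool) -> int:
    """POST the payload to /api/generate and return the HTTP status code."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_keep_alive_payload(model, load),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Keeping the payload builder separate from the network call makes the integer-versus-string distinction easy to check without a running Ollama instance.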

Services Screen (complete redesign)

  • ServiceItem widget with inline Start/Stop/Restart buttons (colored: green/red/amber)
  • Arrow key navigation matching models screen behavior
  • Auto-poll every 5 seconds with _loading guard against widget rebuild race

Chat Screen

  • Model selector dropdown with persistence across screen switches
  • VRAM-loaded models marked with * prefix in selector, refreshed every 10 seconds
  • Input focus retained after sending messages (no re-click needed)
  • Streaming responses via @work(exclusive=True)

Dashboard

  • Live system clock in upper-right corner
  • GPU/CPU tags ([GPU]/[CPU]) rendered correctly (escaped Rich markup)

Wizard & First Boot

  • Correct 6-step flow: Welcome → Storage → Security → Network → Models → Done
  • Creates persistence directories including /var/lib/neuraldrive/webui/
  • --wizard CLI flag to force re-run
  • Sentinel file: /etc/neuraldrive/first-boot-complete

Reliability

  • SafeHeader widget catches Textual Header NoMatches bug (#4258)
  • Crash dumps written to /var/lib/neuraldrive/logs/tui-crash-*.log
  • Screenshots saved to /var/lib/neuraldrive/screenshots/

GPU / System Changes

Critical: GPU Acceleration Fix

  • nvidia-uvm: Added modprobe nvidia-current-uvm + nvidia-modprobe -u as ExecStartPre in Ollama service and /etc/modules-load.d/nvidia-uvm.conf for boot-time loading
  • DeviceAllow removed: cgroup v2 eBPF device filters blocked CUDA even with explicit allow rules; removed all DeviceAllow from Ollama service, kept PrivateDevices=no
  • Result: Ollama now uses GPU (was silently CPU-only before)
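Because the CPU fallback was silent, it can help to classify each loaded model from the `/api/ps` response. The sketch below assumes the `size` and `size_vram` fields that Ollama reports per loaded model; the function names and the three labels are illustrative and are not taken from the PR.

```python
from typing import Iterable


def placement(entry: dict) -> str:
    """Classify one /api/ps model entry by where its weights reside."""
    size = entry.get("size", 0)
    vram = entry.get("size_vram", 0)
    if size <= 0 or vram <= 0:
        return "CPU"      # nothing resident in VRAM: CPU-only inference
    if vram >= size:
        return "GPU"      # fully offloaded to the GPU
    return "GPU+CPU"      # partially offloaded


def cpu_only_models(models: Iterable[dict]) -> list[str]:
    """Return the names of loaded models that are not using the GPU at all."""
    return [m.get("name", "?") for m in models if placement(m) == "CPU"]
```

On a system with a working NVIDIA setup, a non-empty result from `cpu_only_models` for a model that should fit in VRAM would be a hint that the fallback described above has reappeared.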

Ollama Configuration

  • OLLAMA_MAX_LOADED_MODELS=0 (auto, was 1) — concurrent model loading with LRU eviction
  • Persistent config override via EnvironmentFile=-/var/lib/neuraldrive/config/ollama.conf
  • API key synced to persistent disk alongside overlay

Documentation Updates (17 files)

User Guide

  • TUI pages (5 files): Complete rewrite of models, services, chat, dashboard, and main TUI docs with F1-F5 hotkeys, zone-based navigation, and accurate interface descriptions
  • First boot: Corrected wizard steps, sentinel file path, --wizard flag
  • Config/Performance/Recommendations (3 files): Updated OLLAMA_MAX_LOADED_MODELS to 0 (auto), added persistent config override docs, expanded config inventory
  • Services reference: Added GPU access note for Ollama
  • Troubleshooting (2 files): Added nvidia-uvm and cgroup v2 GPU troubleshooting, updated concurrent model support

Developer Guide

  • TUI component: Added chat screen, custom widgets (SafeHeader, ServiceItem, ModelItem), crash dumps, F1-F5 nav
  • Ollama component: DeviceAllow removal, persistent EnvironmentFile, nvidia-uvm ExecStartPre, API usage details
  • First-boot wizard: Corrected trigger mechanism (TUI, not systemd service), sentinel path, wizard steps
  • GPU detection: nvidia-current-uvm Debian naming, device node creation, cgroup v2 note
  • Security architecture: DeviceAllow removal explanation with cgroup v2 eBPF context

Testing

All TUI changes deployed and verified on live hardware (RTX 3080, 500GB USB, Debian 12, kernel 6.1, Ollama 0.21.1). GPU acceleration confirmed: inference compute: CUDA compute=8.6, NVIDIA GeForce RTX 3080, 10.0 GiB available.

eshork added 30 commits April 23, 2026 22:38
Addresses 27 user-reported issues from live testing on an RTX 3080
system booting from USB. All changes deployed and verified on hardware.

Crash handling:
- Override App._handle_exception() to capture Textual runtime crashes
- Write crash dumps to persistent disk (/var/lib/neuraldrive/logs/)
- Screenshots routed to persistent disk via TEXTUAL_SCREENSHOT_LOCATION
- Outer try/except in __main__ catches startup crashes
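The crash-dump idea above can be sketched without Textual. This is not the PR's `_handle_exception()` override; `write_crash_dump` and `main` are hypothetical helpers, and the timestamp format in the file name is an assumption (the source only specifies the `tui-crash-*.log` pattern).

```python
import time
import traceback
from pathlib import Path
from typing import Callable


def write_crash_dump(exc: BaseException, log_dir: Path) -> Path:
    """Write the full traceback of exc to a timestamped tui-crash-*.log file."""
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")  # assumption: timestamp naming
    path = log_dir / f"tui-crash-{stamp}.log"
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    path.write_text(text)
    return path


def main(run_app: Callable[[], None], log_dir: Path) -> int:
    """Outer guard: catch startup or runtime crashes and record them."""
    try:
        run_app()
    except Exception as exc:
        dump = write_crash_dump(exc, log_dir)
        print(f"TUI crashed; details written to {dump}")
        return 1
    return 0
```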

Chat screen:
- Fix TypeError from RichLog.write(end='') — removed invalid param
- Move streaming response to @work(exclusive=True) to unblock UI
- Add on_screen_resume to refresh model list on every screen visit
- Add model selector (Select widget) on dedicated row with amber border

Models screen:
- Rewrite catalog with two-zone keyboard navigation (list + buttons)
- Arrow keys navigate, Enter/Space toggle, PgUp/PgDn page jump
- Add download cancel button with worker cancellation
- Handle asyncio.CancelledError in _start_pull
- Add model load/unload via Ollama generate API (keep_alive)
- Show both Load and Unload buttons per model (disable irrelevant one)
- Fix ModelItem._size/_name collision with Textual Widget internals

Services screen:
- Fix DuplicateIds crash: await remove_children() before mounting
- Use sudo systemctl for service start/stop/restart
- Arrow-key service selection with yellow highlight
- Use Binding() objects for show/priority params (not 4-element tuples)

Dashboard:
- Expand GPU StatsBox to show Device, VRAM, Temp, Utilization
- Rename 'Loaded Models' to 'Active Models (VRAM)'

Wizard:
- Rewrite _create_persistence_partition(): fix parted start position,
  detect actual free space, immediate mount, correct Ollama dirs,
  proper ownership, restart Ollama after partition creation
- Add YAML config persistence (persistent disk with overlay fallback)

Navigation:
- Replace single-letter hotkeys with F2-F6 function keys (priority=True)
- Remove old silent hotkeys entirely
- Disable command palette via ENABLE_COMMAND_PALETTE=False
  (COMMAND_PALETTE_BINDING=None crashes Textual 8.2.4)

Security:
- Add scoped NOPASSWD sudoers (/etc/sudoers.d/neuraldrive-tui) that
  survives wizard _finalize() stripping NOPASSWD from neuraldrive-admin
- Covers systemctl, parted, mkfs, mount, chpasswd, and file ops

New files:
- utils/config.py: YAML config read/write with persistent/overlay fallback
- utils/hardware.py: Boot device detection, partition enumeration
- etc/sudoers.d/neuraldrive-tui: Scoped NOPASSWD rules for TUI ops
- dev-reset.sh: Development reset script (password, NOPASSWD, sentinel)

Build:
- Add pyyaml to TUI venv dependencies
- Set neuraldrive-tui sudoers permissions in build hook
Display parameter count, quantization level, disk size, and VRAM usage
for each installed model. VRAM is cached to persistent config on first
load so it remains visible after unloading.
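A minimal sketch of such a cache, assuming a JSON file that maps model names to VRAM bytes. The file format and the helper names are assumptions; the PR only states that the cache lives under /var/lib/neuraldrive/config/.

```python
import json
from pathlib import Path


def load_vram_cache(path: Path) -> dict[str, int]:
    """Read the cache, treating a missing or corrupt file as empty."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def remember_vram(path: Path, model: str, size_vram: int) -> dict[str, int]:
    """Record VRAM usage while the model is loaded; never overwrite with 0."""
    cache = load_vram_cache(path)
    if size_vram > 0:
        cache[model] = size_vram
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(cache, indent=2))
    return cache
```

Ignoring a zero reading is what keeps the last known value visible after the model has been unloaded.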

Fix model-item height (3->5) so Load/Unload buttons render inside the
bordered container instead of being clipped. Show both buttons per model
with the irrelevant one disabled. Add disabled button styles.
Write api.key and credentials.conf to both /etc/neuraldrive/ (overlay)
and /var/lib/neuraldrive/config/ (persistent disk) when available.
Update wizard completion text to show where the key is stored instead
of telling the user to save it manually.
Updates every 2 seconds alongside the system stats refresh. Shows
HH:MM:SS so the user can tell at a glance the dashboard is live.
- Compact model selector into horizontal row with inline label
- Remove clipping on Select widget (border removed, height auto)
- Enable text wrapping in chat log (wrap=True on RichLog)
- Remove dock:bottom on input row to prevent footer collision
- Center Send button label vertically
Save Select value before refreshing options list, restore it
if the model is still available. Falls back to first model only
when previous selection is no longer present.
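The save-and-restore rule described above reduces to a small pure function. `restore_selection` is a hypothetical name used for illustration; the real code operates on a Textual Select widget.

```python
from typing import Optional, Sequence


def restore_selection(previous: Optional[str], available: Sequence[str]) -> Optional[str]:
    """Keep the previous model if it still exists, else fall back to the first."""
    if previous in available:
        return previous
    return available[0] if available else None
```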
- Add red Delete button to each installed model item
- Auto-unload from VRAM before deleting if model is loaded
- Fix httpx DELETE with json body (use client.request instead)
- Preserve selected chat model when returning to chat screen
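The DELETE-with-body issue is easy to reproduce in any HTTP client whose `delete()` helper takes no body. The PR fixes it with httpx's generic `client.request`; the sketch below deliberately swaps in the standard library's `urllib` to stay dependency-free. The `/api/delete` path and the `model` field follow the Ollama API, and the base URL is an assumed default.

```python
import json
import urllib.request


def build_delete_request(model: str, base_url: str = "http://127.0.0.1:11434") -> urllib.request.Request:
    """Build a DELETE request that carries the model name as a JSON body."""
    return urllib.request.Request(
        f"{base_url}/api/delete",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
        method="DELETE",  # explicit method: a request with data defaults to POST
    )
```

Passing the result to `urllib.request.urlopen` would send it; building the request separately keeps the method and body easy to inspect.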
- Gate sentinel write behind errors check: sentinel is only written
  after config.save() and all prior writes succeed, preventing the
  wizard from being silently skipped after partial failures
- Guard partition detection: reject if lsblk returns base device
  instead of new partition, preventing accidental whole-disk format
- Add --wizard CLI flag to force wizard rerun on demand
- Add on_input_submitted to ModelsScreen so Enter in the pull-input
  field triggers model download
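The sentinel gating in the first bullet can be sketched as follows. `finalize` and its `steps` argument are hypothetical; the point is only that the sentinel is written after every earlier write has succeeded.

```python
from pathlib import Path
from typing import Callable, Iterable


def finalize(steps: Iterable[Callable[[], None]], sentinel: Path) -> list[str]:
    """Run every finalization step; write the sentinel only if none failed."""
    errors: list[str] = []
    for step in steps:
        try:
            step()
        except Exception as exc:
            errors.append(f"{getattr(step, '__name__', 'step')}: {exc}")
    if not errors:
        # Only a fully successful run marks first boot as complete.
        sentinel.parent.mkdir(parents=True, exist_ok=True)
        sentinel.touch()
    return errors
```

With this ordering, a partial failure leaves the sentinel absent, so the wizard runs again on the next launch instead of being skipped.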
…ror checking

- Launcher now forwards "$@" so neuraldrive-tui --wizard works
- Partition detection uses before/after diff instead of fragile last-line
- Wizard completion uses sentinel file as single source of truth
- config.save() and wizard._sudo_write() check all subprocess return codes
lsblk before-snapshot was taken after mkpart, which could show
the new partition if the kernel auto-detected the table change.
Snapshot now taken before mkpart so the diff is always reliable.
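A minimal sketch of the before/after diff, assuming each snapshot is the set of device names listed by lsblk for the target disk. `find_new_partition` is a hypothetical helper, not the function in the PR.

```python
def find_new_partition(before: set[str], after: set[str], base_device: str) -> str:
    """Return the single partition that appeared between two lsblk snapshots."""
    new = after - before
    # Never hand the whole disk to mkfs, even if the snapshots are inconsistent.
    new.discard(base_device)
    if len(new) != 1:
        raise RuntimeError(f"expected exactly one new partition, found {sorted(new)}")
    return new.pop()
```

Taking the `before` snapshot ahead of mkpart matters here: if the new partition were already present in `before`, the diff would be empty and the function would refuse to continue.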
- Abort before mkpart if pre-lsblk snapshot fails (no disk mutation
  without a valid baseline)
- Check partprobe return code; poll lsblk with bounded retry loop
  instead of fixed sleep(2)
- Replace fragile regex in get_boot_device() with lsblk PKNAME
  (supports NVMe, MMC, and sd devices)
- Guard Enter-to-pull against re-submission during active download
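The bounded retry loop that replaces the fixed sleep can be sketched generically. `poll_until` is a hypothetical helper, and the attempt count and delay are illustrative defaults, not values from the PR.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    probe: Callable[[], Optional[T]],
    attempts: int = 10,
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call probe until it returns a value, giving up after a fixed number of tries."""
    for attempt in range(attempts):
        result = probe()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final attempt
    return None
```

Here `probe` would wrap the lsblk snapshot and diff, returning the new partition name once the kernel has picked up the table change.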
Set _pulling=True immediately in both user-facing entry points
before scheduling the @work worker, closing the race window.
Pull button handler now mirrors the Enter-to-pull guard.
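The ordering that closes the race window can be shown without Textual. `PullGuard` is a hypothetical stand-in for the screen, and `start_worker` stands in for scheduling the @work worker.

```python
from typing import Callable


class PullGuard:
    """Reject new pull submissions while one is already in flight."""

    def __init__(self, start_worker: Callable[[str], None]) -> None:
        self._pulling = False
        self._start_worker = start_worker

    def submit(self, model: str) -> bool:
        """Shared entry point for both the Pull button and Enter in the input."""
        name = model.strip()
        if self._pulling or not name:
            return False
        # Set the flag before scheduling the worker, so a second key press
        # that arrives before the worker starts is still rejected.
        self._pulling = True
        self._start_worker(name)
        return True

    def finished(self) -> None:
        """Called by the worker on completion, failure, or cancellation."""
        self._pulling = False
```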
…ve-media path, guard _pull_next

- Remove wizard_complete config key write from wizard finalize; sentinel
  file is now the single source of truth for wizard completion
- Remove unused wizard_complete() function from config.py
- Check return codes for all subprocess calls in partition creation:
  mkdir, chown, umount, systemctl (warning-only for restart)
- Normalize live-media= cmdline path through lsblk PKNAME for NVMe/MMC
- Set _pulling=True in _pull_next() before _start_pull() to prevent
  concurrent pull submissions from all entry points
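Checking every subprocess return code, with a warning-only mode for non-critical commands such as the service restart, might look like the sketch below. `run_checked` and its `fatal` parameter are hypothetical names, not the helpers in the PR.

```python
import subprocess
from typing import Sequence


def run_checked(cmd: Sequence[str], errors: list[str], fatal: bool = True) -> bool:
    """Run cmd, record any failure in errors, and return True if the caller may continue."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
        code, stderr = proc.returncode, proc.stderr.strip()
    except OSError as exc:  # e.g. the binary does not exist
        code, stderr = 127, str(exc)
    if code == 0:
        return True
    level = "error" if fatal else "warning"
    errors.append(f"{level}: {' '.join(cmd)}: {stderr or f'exit code {code}'}")
    return not fatal
```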
Instead of returning the raw live-media= partition path when lsblk
PKNAME resolution fails, fall through to the findmnt detection path.
This prevents handing an unvalidated partition/symlink path to the
storage wizard for partition creation.
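The fall-through logic can be sketched with an injectable command runner. The function names are hypothetical, and the `/run/live/medium` mount point is an assumption about where the live medium is mounted; `lsblk -no PKNAME` and `findmnt -n -o SOURCE` are the standard util-linux invocations.

```python
import subprocess
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], str]


def _run(cmd: Sequence[str]) -> str:
    """Return stripped stdout, or an empty string on any failure."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
    except OSError:
        return ""
    return proc.stdout.strip() if proc.returncode == 0 else ""


def parent_disk(partition: str, run: Runner = _run) -> Optional[str]:
    """Resolve a partition to its parent disk via lsblk PKNAME (sd, NVMe, MMC)."""
    name = run(["lsblk", "-no", "PKNAME", partition])
    return f"/dev/{name.splitlines()[0]}" if name else None


def get_boot_device(
    live_media: Optional[str],
    run: Runner = _run,
    mountpoint: str = "/run/live/medium",  # assumption: live medium mount point
) -> Optional[str]:
    """Prefer the live-media= hint, but never return it unvalidated."""
    if live_media:
        disk = parent_disk(live_media, run)
        if disk:
            return disk
    # PKNAME resolution failed: fall through to findmnt instead of
    # returning the raw partition or symlink path.
    source = run(["findmnt", "-n", "-o", "SOURCE", mountpoint])
    return parent_disk(source, run) if source else None
```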
Replace Textual's Header with SafeHeader subclass that catches
NoMatches during title watcher updates. Textual 8.2.4 only catches
NoScreen in the set_title watcher but not NoMatches, causing crashes
when screens are pushed/popped and HeaderTitle hasn't recomposed yet.
This is a known upstream bug (Textualize/textual#4258, PR #4817).

Simplify --wizard: instead of a separate force_wizard constructor
flag, --wizard now removes the sentinel file before launch so the
existing on_mount check triggers the wizard naturally.
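A minimal sketch of the simplified flag handling. `wizard_should_run` is a hypothetical function; the sentinel path is the one named in this PR, and the real check lives in the app's on_mount.

```python
import argparse
from pathlib import Path
from typing import Sequence

SENTINEL = Path("/etc/neuraldrive/first-boot-complete")


def wizard_should_run(argv: Sequence[str], sentinel: Path = SENTINEL) -> bool:
    """Handle --wizard by deleting the sentinel, then apply the normal check."""
    parser = argparse.ArgumentParser(prog="neuraldrive-tui")
    parser.add_argument("--wizard", action="store_true",
                        help="force the first-boot wizard to run again")
    args = parser.parse_args(list(argv))
    if args.wizard:
        sentinel.unlink(missing_ok=True)
    # Single source of truth: the wizard runs whenever the sentinel is absent.
    return not sentinel.exists()
```

Because the flag only removes the file, no separate force-wizard state has to be threaded through the application.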
…e filter

- Add ExecStartPre to load nvidia-current-uvm module and create
  /dev/nvidia-uvm device nodes before Ollama starts (with - prefix
  for non-fatal failure on non-NVIDIA systems)
- Remove DeviceAllow lines that blocked CUDA access under cgroup v2
- Add nvidia-modprobe to NVIDIA package list for device node creation
- Add /etc/modules-load.d/nvidia-uvm.conf for early boot module load
- Show [GPU]/[CPU] tags with VRAM usage per model on dashboard
Rich interprets [GPU] and [CPU] as style tags and silently drops them.
Escape with backslash-bracket on dashboard. Also change model_item
status from 'VRAM' to 'GPU' for consistency.
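The backslash-bracket escape can be sketched as a small helper. `literal_tag` and `model_line` are hypothetical names, and the line layout is illustrative rather than the dashboard's actual format.

```python
def literal_tag(label: str) -> str:
    """Return a bracketed tag with the opening bracket escaped for markup."""
    return rf"\[{label}]"


def model_line(name: str, on_gpu: bool, vram_gib: float) -> str:
    """Build one dashboard line with a visible [GPU] or [CPU] tag."""
    tag = literal_tag("GPU" if on_gpu else "CPU")
    return f"{tag} {name} ({vram_gib:.1f} GiB)"
```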
Up/Down/PgUp/PgDn navigate between model items with a yellow
highlight border. The scroll container follows the highlighted
item via scroll_visible(), matching the catalog popup behavior.
Tab cycles between zones: model list, Browse button, Pull input,
Pull button. Within the model list zone, Up/Down navigates models
with scroll-follow, Left/Right selects Load/Unload/Delete per
model, Enter activates the selected button. All ModelItem buttons
are non-focusable — navigation is fully managed by the screen.
Left/Right nav now skips disabled buttons (Unload when not loaded,
Load when already loaded). Load button shows 'Loading...' and
disables during VRAM load. Added column header row (Params, Quant,
Disk, VRAM, Status) aligned with model item columns.
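The skip logic for Left/Right navigation amounts to a linear search for the next enabled button. `next_enabled` is a hypothetical helper operating on a list of enabled flags, for example in Load, Unload, Delete order.

```python
from typing import Sequence


def next_enabled(enabled: Sequence[bool], current: int, step: int) -> int:
    """Move one enabled button left (-1) or right (+1), skipping disabled ones."""
    if step not in (-1, 1):
        raise ValueError("step must be -1 or 1")
    index = current + step
    while 0 <= index < len(enabled):
        if enabled[index]:
            return index
        index += step
    return current  # nothing enabled in that direction: stay put
```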
Poll /api/ps after unload until model is actually evicted (Ollama
returns 200 before eviction completes). Await remove_children() to
prevent stale widgets. Use keep_alive=-1 for manual loads so models
stay loaded until explicitly unloaded.
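Waiting for the eviction to complete can be sketched as a deadline-bounded poll of `/api/ps`. The function names, the timeout, and the interval are assumptions; the `models` list with a `name` per entry follows the Ollama API.

```python
import json
import time
import urllib.request
from typing import Callable


def loaded_model_names(base_url: str = "http://127.0.0.1:11434") -> set[str]:
    """Return the names of the models Ollama currently reports as loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        data = json.load(resp)
    return {m["name"] for m in data.get("models", [])}


def wait_until_unloaded(
    model: str,
    list_loaded: Callable[[], set[str]] = loaded_model_names,
    timeout: float = 10.0,
    interval: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the model disappears from the loaded list or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if model not in list_loaded():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Only after this returns True would the UI mark the model as unloaded and rebuild the list.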
Ollama rejects "-1" with 'missing unit in duration', but accepts
the integer -1 for infinite keep-alive.
Each service gets its own row with inline Start/Stop/Restart buttons.
Arrow keys navigate services (Up/Down) and buttons (Left/Right).
Disabled buttons are skipped. Enter activates the highlighted button.
Service status auto-polls every 5 seconds and updates in place.
Poll fires every 5s but _load_services clears and remounts items.
Skip poll while _loading flag is set to avoid NoMatches on .svc-state.
Set OLLAMA_MAX_LOADED_MODELS=0 (auto) so Ollama manages concurrency
based on available VRAM. Add persistent EnvironmentFile override so
config on /var/lib/neuraldrive/config/ollama.conf survives reboots,
falling back to baked-in defaults when persistent disk is unavailable.
Wizard was missing /var/lib/neuraldrive/webui from the directory list,
causing systemd NAMESPACE failure (status=226) when ReadWritePaths
referenced the missing path.
eshork added 2 commits April 24, 2026 11:18
…hanges

Rewrite 17 docs files across user-guide and dev-guide to match the
current implementation after the TUI UX overhaul, GPU/VRAM fixes,
and Ollama configuration changes.

Key updates:
- Replace old single-letter hotkeys with F1-F5 function key nav
- Rewrite models and services screen docs for zone-based navigation
- Correct first-boot wizard steps, sentinel file path, and --wizard flag
- Update OLLAMA_MAX_LOADED_MODELS from 1 to 0 (auto/LRU eviction)
- Document DeviceAllow removal (cgroup v2 eBPF incompatibility)
- Document nvidia-current-uvm module naming and boot-time loading
- Add nvidia-uvm and cgroup v2 GPU troubleshooting sections
- Add persistent config override (EnvironmentFile) documentation
- Document crash dump logging, VRAM cache, and chat model selector
Chat model dropdown now prefixes loaded models with * so users can
see which models are ready without loading delay. A 10-second poll
timer keeps the indicators current as models load/unload.
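Building the selector entries with the loaded marker could look like the sketch below. `selector_options` is a hypothetical helper, and the exact label format ("* " followed by the name) is an assumption; the (label, value) pair shape matches what Textual's Select widget accepts.

```python
from typing import Iterable, Sequence


def selector_options(installed: Sequence[str], loaded: Iterable[str]) -> list[tuple[str, str]]:
    """Return (label, value) pairs, prefixing VRAM-loaded models with '*'."""
    loaded_set = set(loaded)
    return [
        (f"* {name}" if name in loaded_set else name, name)
        for name in installed
    ]
```

Keeping the plain model name as the value means the marker can change on each poll without disturbing the saved selection.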

Input focus is restored after each response completes so users can
type follow-up messages without re-clicking the input box.

Development

Successfully merging this pull request may close these issues.

No persistence partition auto-creation on first boot
