FIX: Skip [u] inside links + collapse HTML whitespace per spec#36

Merged: gschlager merged 3 commits into main from skip-underline-inside-links on May 11, 2026

@gschlager
Member

Summary

Three related changes that surfaced from Outlook-style HTML pasted into Discourse:

  • [u] inside [url]/[email]: Discourse's BBCode plugin doesn't re-cook BBCode inside Markdown link text, so [[u]X[/u]](url) stayed literal. UnderlineTag now drops the wrapper when rendering under a Url or Email ancestor (transitively, so [url][b][u]X[/u][/b][/url] works too).
  • HTML whitespace collapsing per spec: text nodes outside <pre>/<code>/<textarea>/<tt> collapse \s+ to a single space; leading whitespace at an element's start is dropped; trailing whitespace at an element's end is trimmed; whitespace immediately before a block-level child collapses against the boundary. Block-ness is opt-in via a new empty marker module AST::Block included by Paragraph, Heading, List, ListItem, Quote, Table/Row/Cell, HorizontalRule, and Align — third-party AST extensions get the same treatment by including the module.
  • <br>/<hr> use VoidHandler instead of inline lambdas: closes a hole where <hr> (a block-level element) didn't expose element_class and silently bypassed the block-trim. Proc-handler support is left in place; @gschlager will remove it as a breaking change in #30 (Redesign migration API around Conversion/Parse result types).

The pre-existing test that treated raw \n\n in <blockquote> source as a paragraph break is updated — per HTML spec, newlines in source are whitespace, not structural breaks. Use <p> or <br> for actual breaks.

Test plan

  • bundle exec rspec — 3077 examples, 0 failures
  • bin/lint — no offenses
  • bin/mutant run 'Markbridge::Parsers::HTML::Parser*' 'Markbridge::Renderers::Discourse::Tags::UnderlineTag*' 'Markbridge::Parsers::HTML::Handlers::VoidHandler*' — 100% mutation coverage (551 mutations, 0 alive)
  • Each of the three commits passes its own test suite (verified by checking out each SHA in turn)
  • Three real-world Outlook HTML samples (<a><u>Facebook</u></a>, <a>\n<u>Twitter</u></a>, <a href=…>\n<u>X</u></a>) now round-trip to clean Markdown

gschlager added 3 commits May 11, 2026 10:47
Discourse's BBCode plugin cooks `[u]…[/u]` from Markdown source but does
not re-process BBCode inside Markdown link text, so `[[u]X[/u]](url)`
stays literal. Drop the wrapper when rendering under a Url or Email
ancestor (transitively, so it also covers `[url][b][u]X[/u][/b][/url]`).

Imported HTML like `<a><u>Facebook</u></a>` now round-trips cleanly.
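
The ancestor check described above can be sketched as follows. This is a minimal illustration, not Markbridge's actual `UnderlineTag`: the `Node` struct, the `render` function, the symbol-based node types, and the fixed `https://example.com` target are all hypothetical stand-ins.

```ruby
# Hypothetical stand-in for AST nodes; not Markbridge's real classes.
Node = Struct.new(:type, :children)

# Renders a node to Discourse-flavoured text. `ancestors` holds the node
# types between the root and the node currently being rendered.
def render(node, ancestors = [])
  return node if node.is_a?(String)

  inner = node.children.map { |child| render(child, ancestors + [node.type]) }.join

  case node.type
  when :url  then "[#{inner}](https://example.com)"
  when :bold then "**#{inner}**"
  when :underline
    # Drop the [u] wrapper anywhere below a link, not only directly under it.
    ancestors.any? { |type| %i[url email].include?(type) } ? inner : "[u]#{inner}[/u]"
  else inner
  end
end

nested = Node.new(:url, [Node.new(:bold, [Node.new(:underline, ["X"])])])
puts render(nested)                       # [**X**](https://example.com)
puts render(Node.new(:underline, ["X"]))  # [u]X[/u]
```

Checking the whole ancestor chain rather than only the direct parent is what makes the nested `[url][b][u]X[/u][/b][/url]` case work.
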

Match browser whitespace handling so authors can indent source HTML
without ending up with stray spaces or line breaks in the rendered
Markdown:

- Collapse `\s+` runs in text nodes to a single space, except inside
  `<pre>`, `<code>`, `<textarea>`, or `<tt>` ancestors.
- Drop leading whitespace at the start of an element's content.
- Trim trailing whitespace from an element's last Text child after its
  children are processed.
- Trim trailing whitespace on the parent's previous Text sibling before
  a block-level AST node starts, mirroring the CSS rule that whitespace
  collapses against block boundaries.
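
A rough sketch of the four rules above, applied to a toy tree. It is an assumption-laden simplification: `Element`, `normalize`, `trim_last_text!`, and the tag-name `BLOCKS` list are hypothetical, whereas the real parser works on AST nodes and decides block-ness via the `AST::Block` marker.

```ruby
# Hypothetical toy tree: children are Strings (text) or Elements.
Element = Struct.new(:tag, :children)

PRESERVE = %w[pre code textarea tt].freeze  # whitespace kept verbatim
BLOCKS   = %w[p blockquote hr ul li].freeze # illustrative subset only

def trim_last_text!(nodes)
  return unless nodes.last.is_a?(String)

  nodes[-1] = nodes.last.rstrip
  nodes.pop if nodes.last.empty?
end

def normalize(element, preserve = false)
  preserve ||= PRESERVE.include?(element.tag)
  out = []

  element.children.each do |child|
    if child.is_a?(String)
      # Rule 1: collapse whitespace runs unless a preserving ancestor exists.
      text = preserve ? child : child.gsub(/\s+/, " ")
      # Rule 2: drop leading whitespace at the start of the element.
      text = text.lstrip if out.empty? && !preserve
      out << text unless text.empty?
    else
      # Rule 4: trim the previous text sibling before a block-level child.
      trim_last_text!(out) if BLOCKS.include?(child.tag) && !preserve
      out << normalize(child, preserve)
    end
  end

  # Rule 3: trim trailing whitespace at the end of the element.
  trim_last_text!(out) unless preserve
  Element.new(element.tag, out)
end

tree = Element.new("div", [
  "\n  Hello   world \n",
  Element.new("p", ["  Hi\n"]),
  Element.new("pre", ["  a\n  b"])
])
p normalize(tree).children
```

With this input the `div` ends up with the text `"Hello world"`, a `p` containing `"Hi"`, and the `pre` content untouched.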

Block-ness is opt-in via a new `AST::Block` marker module included by
Paragraph, Heading, List, ListItem, Quote, Table, TableRow, TableCell,
HorizontalRule, and Align. Code is intentionally not marked so inline
`<code>` between text segments keeps its surrounding whitespace.
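
The marker-module pattern might look roughly like this. Only the name `AST::Block` comes from this change; the simplified class bodies, the `Callout` extension, and the `block_level?` helper are hypothetical.

```ruby
module AST
  # Empty marker: including it only tags a class as block-level.
  module Block; end

  class Node; end

  class Paragraph < Node
    include Block
  end

  # Deliberately not a Block, so inline code keeps surrounding whitespace.
  class Code < Node; end
end

# Hypothetical third-party node opting in to the same treatment.
class Callout < AST::Node
  include AST::Block
end

# Hypothetical helper: accepts either a node instance or a node class.
def block_level?(node_or_class)
  klass = node_or_class.is_a?(Class) ? node_or_class : node_or_class.class
  klass.include?(AST::Block)
end

puts block_level?(AST::Paragraph) # true
puts block_level?(AST::Code.new)  # false
puts block_level?(Callout.new)    # true
```

Because the module is empty, opting in costs nothing beyond the `include` line and does not constrain the class hierarchy.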

The pre-existing test that treated raw `\n\n` in `<blockquote>` source
as a paragraph break is updated to reflect the spec — newlines in HTML
source are whitespace, not structural breaks.

The default HTML registry used inline lambdas for `<br>` and `<hr>`,
which meant they didn't expose `element_class` like every other
handler. As a result the `AST::Block`-based whitespace trim could not
see that `<hr>` produces a block-level node, so trailing whitespace
before `<hr>` was silently kept.

Add a small `VoidHandler` (counterpart to `SimpleHandler` for void
elements that take no children) and register `<br>`/`<hr>` through it.
Trailing whitespace before `<hr>` now trims, matching browser layout.
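
To illustrate why a handler object helps where a lambda does not, here is a minimal sketch. The `call` signature, the `REGISTRY` hash, the `AST::LineBreak` class name, and `produces_block?` are assumptions made for illustration; only `VoidHandler`, `element_class`, and `AST::Block` are names taken from this change.

```ruby
module AST
  module Block; end

  class LineBreak; end

  class HorizontalRule
    include Block
  end
end

# Handler for void elements: builds a childless node and, unlike a bare
# lambda, exposes which AST class it produces.
class VoidHandler
  attr_reader :element_class

  def initialize(element_class)
    @element_class = element_class
  end

  # Assumed interface: the HTML element is ignored because void elements
  # have no children or content to process.
  def call(_html_element = nil)
    element_class.new
  end
end

REGISTRY = {
  "br" => VoidHandler.new(AST::LineBreak),
  "hr" => VoidHandler.new(AST::HorizontalRule),
  "legacy" => ->(_el) { AST::HorizontalRule.new } # proc handler: opaque
}.freeze

# The trim can only fire if the handler says what kind of node it builds.
def produces_block?(handler)
  handler.respond_to?(:element_class) && handler.element_class.include?(AST::Block)
end

puts produces_block?(REGISTRY["hr"])     # true
puts produces_block?(REGISTRY["br"])     # false
puts produces_block?(REGISTRY["legacy"]) # false
```

The `legacy` entry shows the hole described above: the lambda builds a block-level node, but nothing outside it can tell.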

Proc handler support is left in the parser and registry — it stays for
now and will be removed in a separate breaking change.

@gschlager gschlager force-pushed the skip-underline-inside-links branch from aadecbe to 914e9d1 Compare May 11, 2026 09:52
@gschlager gschlager merged commit 2504c08 into main May 11, 2026
8 checks passed
@gschlager gschlager deleted the skip-underline-inside-links branch May 11, 2026 09:59