Skip to content

Stop trimming escaped spaces off the end regex#635

Open
ratmice wants to merge 2 commits intosoftdevteam:masterfrom
ratmice:regex_trailing_ws
Open

Stop trimming escaped spaces off the end regex#635
ratmice wants to merge 2 commits intosoftdevteam:masterfrom
ratmice:regex_trailing_ws

Conversation

@ratmice
Copy link
Copy Markdown
Collaborator

@ratmice ratmice commented May 3, 2026

This is a first attempt to fix #634 basically it continues to interpret escaped spaces
as a separator between regex and token ident, however it stops trimming the escaped space.

This has pretty thorough testing of the trim function, but only minimal testing of the stream of tokens produce by lrlex in cttests. Understand if you want me to expand on that, but this seemed like a good start.

@ratmice ratmice linked an issue May 3, 2026 that may be closed by this pull request
Comment thread lrlex/src/lib/parser.rs Outdated
// First loop over spaces
for ch in s.chars().rev().into_iter() {
if RE_SPACE_SEP.is_match(ch.encode_utf8(&mut cbuf)) {
last_char_width = ch.width().unwrap_or(0);
Copy link
Copy Markdown
Collaborator Author

@ratmice ratmice May 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably wrong, and we want the width of the character in bytes, and what this width function gives is the width of the character in columns.

So I think this should be ch.len_utf8()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed in 26e8886

Comment thread lrlex/src/lib/parser.rs
let mut total_ws_bytes = 0;
let mut last_ws_char_len = 0;
// First loop over spaces
for ch in s.chars().rev().into_iter() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be something like in s.chars().rev().into_iter().take_while(|x| RE_SPACE_SEP.is_match(...)?

Copy link
Copy Markdown
Collaborator Author

@ratmice ratmice May 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so for a couple of reasons, it seems like take_while takes ownership of the self iterator, and returns an iterator over the matching elements.

What we'd really want is skip_while, so the iterator resumes from where the skipping returned false. Unfortunately skip_while doesn't lend itself to having a fold like accumulator, this kind of needs something that does both.

Beyond that I looked at a couple of things,

  • try_reduce could work but is nightly only.
  • scan seemed like it could work, but I couldn't find a way that improved clarity.

So part of the problem is that we want to accumulate into a value,
and then continue iteration from that point exactly where accumulation stopped.

Here is a try_fold version after which we do still need to make a second iterator on a sub-slice of the string.

    // First loop over space characters
    let (last_ws_char_len, mut total_ws_bytes) = match s.chars().rev().into_iter().try_fold((0, 0), |(last_ch_len, total_ws_bytes), ch| {
        if RE_SPACE_SEP.is_match(ch.encode_utf8(&mut cbuf)) {
            let this_ch_len = ch.len_utf8();
            ControlFlow::Continue((this_ch_len, total_ws_bytes + this_ch_len))
        } else {
            ControlFlow::Break((last_ch_len, total_ws_bytes))
        }
    }) {
        // The nightly function into_value would help here
       // https://doc.rust-lang.org/stable/std/ops/enum.ControlFlow.html#method.into_value
        ControlFlow::Continue((x, y)) => (x, y),
        ControlFlow::Break((x, y)) => (x, y),
    };

I kind of think this fold might be an improvement, opinions?

Copy link
Copy Markdown
Collaborator Author

@ratmice ratmice May 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I forgot to cargo test that try_fold implementation, after getting it to compile, so of course it doesn't seem to work yet.

Nevermind, that was me messing up some other part of the function.

Copy link
Copy Markdown
Collaborator Author

@ratmice ratmice May 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious if AI had any ideas to improve the code, and it did have a novel idea.
Using the trim_end_matches function to trim all spaces, then re-add the space if needed.

I ended up rewriting basically all the AI code, but this seems like it might be more idiomatic, or at least less loopy.

fn trim_end_unescaped(s: &str) -> &str {
    let trimmed = s.trim_end_matches(matches_whitespace);
    if trimmed.len() == s.len() {
        return s;
    }
    // If the number of backslashes is odd then the first space in the trimmed portion is escaped so re-add it.
    if trimmed.chars().rev().take_while(|&c| c == '\\').count() % 2 == 1 {
        // Panic safety: the trimmed portion is at least one char long.
        &s[..trimmed.len() + s[trimmed.len()..].chars().next().unwrap().len_utf8()]
    } else {
        trimmed
    }
}

@ltratt
Copy link
Copy Markdown
Member

ltratt commented May 3, 2026

I have one shallow comment: otherwise, this looks good, particularly with the test suite!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

regex/token name separator for regexes that end with trailing spaces

2 participants