Stop trimming escaped spaces off the end regex#635
Stop trimming escaped spaces off the end regex#635ratmice wants to merge 2 commits intosoftdevteam:masterfrom
Conversation
| // First loop over spaces | ||
| for ch in s.chars().rev().into_iter() { | ||
| if RE_SPACE_SEP.is_match(ch.encode_utf8(&mut cbuf)) { | ||
| last_char_width = ch.width().unwrap_or(0); |
There was a problem hiding this comment.
I think this is probably wrong, and we want the width of the character in bytes, and what this width function gives is the width of the character in columns.
So I think this should be ch.len_utf8()
| let mut total_ws_bytes = 0; | ||
| let mut last_ws_char_len = 0; | ||
| // First loop over spaces | ||
| for ch in s.chars().rev().into_iter() { |
There was a problem hiding this comment.
Could this be something like in s.chars().rev().into_iter().take_while(|x| RE_SPACE_SEP.is_match(...)?
There was a problem hiding this comment.
I don't think so for a couple of reasons, it seems like take_while takes ownership of the self iterator, and returns an iterator over the matching elements.
What we'd really want is skip_while, so the iterator resumes from where the skipping returned false. Unfortunately skip_while doesn't lend itself to having a fold like accumulator, this kind of needs something that does both.
Beyond that I looked at a couple of things,
try_reducecould work but is nightly only.scanseemed like it could work, but I couldn't find a way that improved clarity.
So part of the problem is that we want to accumulate into a value,
and then continue iteration from that point exactly where accumulation stopped.
Here is a try_fold version after which we do still need to make a second iterator on a sub-slice of the string.
// First loop over space characters
let (last_ws_char_len, mut total_ws_bytes) = match s.chars().rev().into_iter().try_fold((0, 0), |(last_ch_len, total_ws_bytes), ch| {
if RE_SPACE_SEP.is_match(ch.encode_utf8(&mut cbuf)) {
let this_ch_len = ch.len_utf8();
ControlFlow::Continue((this_ch_len, total_ws_bytes + this_ch_len))
} else {
ControlFlow::Break((last_ch_len, total_ws_bytes))
}
}) {
// The nightly function into_value would help here
// https://doc.rust-lang.org/stable/std/ops/enum.ControlFlow.html#method.into_value
ControlFlow::Continue((x, y)) => (x, y),
ControlFlow::Break((x, y)) => (x, y),
};
I kind of think this fold might be an improvement, opinions?
There was a problem hiding this comment.
Oops, I forgot to cargo test that try_fold implementation, after getting it to compile, so of course it doesn't seem to work yet.
Nevermind, that was me messing up some other part of the function.
There was a problem hiding this comment.
I was curious if AI had any ideas to improve the code, and it did have a novel idea.
Using the trim_end_matches function to trim all spaces, then re-add the space if needed.
I ended up rewriting basically all the AI code, but this seems like it might be more idiomatic, or at least less loopy.
fn trim_end_unescaped(s: &str) -> &str {
let trimmed = s.trim_end_matches(matches_whitespace);
if trimmed.len() == s.len() {
return s;
}
// If the number of backslashes is odd then the first space in the trimmed portion is escaped so re-add it.
if trimmed.chars().rev().take_while(|&c| c == '\\').count() % 2 == 1 {
// Panic safety: the trimmed portion is at least one char long.
&s[..trimmed.len() + s[trimmed.len()..].chars().next().unwrap().len_utf8()]
} else {
trimmed
}
}
|
I have one shallow comment: otherwise, this looks good, particularly with the test suite! |
This is a first attempt to fix #634 basically it continues to interpret escaped spaces
as a separator between regex and token ident, however it stops trimming the escaped space.
This has pretty thorough testing of the trim function, but only minimal testing of the stream of tokens produce by lrlex in cttests. Understand if you want me to expand on that, but this seemed like a good start.