Skip to content

XS regular expression matching with empty disjunction alternatives is not always correct #1484

@gibson042

Description

@gibson042

Environment: XS 16.8.1, slot 32 bytes, ID 4 bytes

Description

Regular expression matching fails with some scenarios involving an empty alternative. I haven't exactly pinpointed the failure conditions, but have included some small reproductions.

This issue, while not critical, is more concerning than most because it manifests as non-throwing misbehavior of regular expression patterns that can reasonably be expected.

Also: AFAICT, RegExp.prototype.exec seems to not be affected.

Steps to Reproduce

(() => {
  const re1 = /<[^:>]*(:[^>]*|)>/g; // captures ":…" or "" (fails)
  const re2 = /<[^:>]*(:[^>]*)?>/g; // captures ":…" or `undefined` (works as expected)
  const str = "<foo> <bar> <foo:bar>";
  const matchResult1 = str.match(re1);
  const matchAllResult1 = str.matchAll(re1);
  const matchResult2 = str.match(re2);
  const matchAllResult2 = str.matchAll(re2);

  return {
    matchResult1,
    matchAllResult1: [...(matchAllResult1 ?? [matchAllResult1])],
    matchResult2,
    matchAllResult2: [...(matchAllResult2 ?? [matchAllResult2])],
  };
})();

Actual behavior

{
  "matchResult1": ["<foo:bar>"],
  "matchAllResult1": [
    ["<foo:bar>", ":bar"]
  ],
  "matchResult2": ["<foo>", "<bar>", "<foo:bar>"],
  "matchAllResult2": [
    ["<foo>", undefined],
    ["<bar>", undefined],
    ["<foo:bar>", ":bar"]
  ]
}

Expected behavior

{
  "matchResult1": ["<foo>", "<bar>", "<foo:bar>"],
  "matchAllResult1": [
    ["<foo>", ""],
    ["<bar>", ""],
    ["<foo:bar>", ":bar"]
  ],
  "matchResult2": ["<foo>", "<bar>", "<foo:bar>"],
  "matchAllResult2": [
    ["<foo>", undefined],
    ["<bar>", undefined],
    ["<foo:bar>", ":bar"]
  ]
}
diff
--- actual
+++ expected
 {
-  "matchResult1": ["<foo:bar>"],
+  "matchResult1": ["<foo>", "<bar>", "<foo:bar>"],
   "matchAllResult1": [
+    ["<foo>", ""],
+    ["<bar>", ""],
     ["<foo:bar>", ":bar"]
   ],
   "matchResult2": ["<foo>", "<bar>", "<foo:bar>"],
   "matchAllResult2": [
     ["<foo>", undefined],
     ["<bar>", undefined],
     ["<foo:bar>", ":bar"]
   ]
 }

The first regular expression (with a capture containing an empty alternative) should return matches equivalent to those of the second (with an optional capture).
This is basically driven by CompileSubpattern of Disjunction :: Alternative | Disjunction, which uses MatchTwoAlternatives to match with Alternative if possible and otherwise attempt to match with Disjunction (where Disjunction :: AlternativeAlternative :: [empty] uses EmptyMatcher to always continue).

Second reproduction

(() => {
  const re1 = /[^: ]+(:[^ ]*|)/g; // captures ":…" or "" (fails)
  const re2 = /[^: ]+(:[^ ]*)?/g; // captures ":…" or `undefined` (works as expected)
  const str = "foo bar foo:bar";
  const matchResult1 = str.match(re1);
  const matchAllResult1 = str.matchAll(re1);
  const matchResult2 = str.match(re2);
  const matchAllResult2 = str.matchAll(re2);

  return {
    matchResult1,
    matchAllResult1: [...(matchAllResult1 ?? [matchAllResult1])],
    matchResult2,
    matchAllResult2: [...(matchAllResult2 ?? [matchAllResult2])],
  };
})();

Actual behavior

{
  "matchResult1": ["f", "o", "o", "b", "a", "r", "foo:bar"],
  "matchAllResult1": [
    ["f", ""],
    ["o", ""],
    ["o", ""],
    ["b", ""],
    ["a", ""],
    ["r", ""],
    ["foo:bar", ":bar"]
  ],
  "matchResult2": ["foo", "bar", "foo:bar"],
  "matchAllResult2": [
    ["foo", undefined],
    ["bar", undefined],
    ["foo:bar", ":bar"]
  ]
}

Expected behavior

The first regular expression (with a capture containing an empty alternative) should return matches equivalent to those of the second (with an optional capture).

{
  "matchResult1": ["foo", "bar", "foo:bar"],
  "matchAllResult1": [
    ["foo", ""],
    ["bar", ""],
    ["foo:bar", ":bar"]
  ],
  "matchResult2": ["foo", "bar", "foo:bar"],
  "matchAllResult2": [
    ["foo", undefined],
    ["bar", undefined],
    ["foo:bar", ":bar"]
  ]
}
diff
--- actual
+++ expected
 {
-  "matchResult1": ["f", "o", "o", "b", "a", "r", "foo:bar"],
+  "matchResult1": ["foo", "bar", "foo:bar"],
   "matchAllResult1": [
-    ["f", ""],
-    ["o", ""],
-    ["o", ""],
-    ["b", ""],
-    ["a", ""],
-    ["r", ""],
+    ["foo", ""],
+    ["bar", ""],
     ["foo:bar", ":bar"]
   ],
   "matchResult2": ["foo", "bar", "foo:bar"],
   "matchAllResult2": [
     ["foo", undefined],
     ["bar", undefined],
     ["foo:bar", ":bar"]
   ]
 }

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions