Improve serialization performance with memchr by mrobinson · Pull Request #740 · servo/html5ever

mrobinson · 2026-05-15T19:38:27Z

This change greatly improves the performance of serialization (up to
95% on some benchmarks) by changing the way that escaping of HTML
entities works. It uses memchr to avoid creating a chars() iterator
on the output stream. When run with the benchmark from #739, I see these
results:

serialize "lipsum.html" time:   [6.4817 µs 6.5021 µs 6.5212 µs]
                        change: [−95.179% −95.013% −94.846%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild

serialize "lipsum-zh.html"
                        time:   [2.0815 µs 2.0888 µs 2.0947 µs]
                        change: [−91.533% −90.940% −90.407%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  15 (15.00%) low severe

serialize "medium-fragment.html"
                        time:   [7.7625 µs 7.7927 µs 7.8147 µs]
                        change: [−84.424% −83.952% −83.486%] (p = 0.00 < 0.05)
                        Performance has improved.

serialize "small-fragment.html"
                        time:   [879.01 ns 886.43 ns 892.78 ns]
                        change: [−89.813% −89.711% −89.610%] (p = 0.00 < 0.05)
                        Performance has improved.

serialize "tiny-fragment.html"
                        time:   [332.13 ns 332.78 ns 333.60 ns]
                        change: [−27.768% −27.617% −27.457%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

serialize "strong.html" time:   [5.4946 µs 5.4988 µs 5.5030 µs]
                        change: [−0.3133% −0.0322% +0.2349%] (p = 0.83 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

In this case lipsum.html serialization time dropped from 122.81 µs
to 6.5021 µs.
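
For readers who have not seen the diff, the byte-search approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `find_any` is a hypothetical stand-in for `memchr::memchr3` and friends so that the example compiles without the crate, `escape_text` is an invented name, and the set of escaped characters (`&`, U+00A0, `<`, `>`) is assumed from the HTML text-escaping rules.

```rust
/// Hypothetical stand-in for the `memchr` crate's multi-needle search.
/// The real crate uses vectorized routines; this is a plain byte scan.
fn find_any(needles: &[u8], haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| needles.contains(b))
}

/// Escape text content by jumping between special bytes instead of
/// walking the string one `char` at a time.
fn escape_text(input: &str, out: &mut String) {
    let bytes = input.as_bytes();
    let mut start = 0;
    // 0xC2 is the first byte of U+00A0 (NBSP) in UTF-8: 0xC2 0xA0.
    while let Some(offset) = find_any(&[b'&', b'<', b'>', 0xC2], &bytes[start..]) {
        let hit = start + offset;
        // Copy the run of ordinary bytes in one go. `hit` is always a char
        // boundary: the needles are ASCII or a UTF-8 lead byte.
        out.push_str(&input[start..hit]);
        match bytes[hit] {
            b'&' => {
                out.push_str("&amp;");
                start = hit + 1;
            }
            b'<' => {
                out.push_str("&lt;");
                start = hit + 1;
            }
            b'>' => {
                out.push_str("&gt;");
                start = hit + 1;
            }
            _ => {
                // 0xC2 starts a two-byte sequence; only 0xC2 0xA0 is NBSP.
                if bytes[hit + 1] == 0xA0 {
                    out.push_str("&nbsp;");
                } else {
                    out.push_str(&input[hit..hit + 2]);
                }
                start = hit + 2;
            }
        }
    }
    // Whatever is left contains no special bytes.
    out.push_str(&input[start..]);
}
```

Because the search works on bytes, a hit on 0xC2 can be a false positive for characters such as `©` (0xC2 0xA9), so the sketch checks the following byte before emitting `&nbsp;`.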

mrobinson · 2026-05-15T20:32:34Z

I plan to extend support for this optimization to the XML serializer in a followup. In order to do that I need to make some XML benchmarks. That seems like enough to split into a separate PR.

simonwuelker

Very nice speedup!

We solve a very similar problem for the data state in the HTML tokenizer with SIMD intrinsics. I wonder how that compares to two memchr calls in terms of performance.

html5ever/html5ever/src/tokenizer/mod.rs

Lines 1945 to 1990 in 201534e

    
               unsafe fn data_state_simd_fast_path(&self, input: &mut StrTendril) -> Option<SetResult> { 
        
                   #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] 
        
                   let (mut i, mut n_newlines) = self.data_state_sse2_fast_path(input); 
        
                   #[cfg(target_arch = "aarch64")] 
        
                   let (mut i, mut n_newlines) = self.data_state_neon_fast_path(input); 
        
                   // Process any remaining bytes (less than STRIDE) 
        
                   while let Some(c) = input.as_bytes().get(i) { 
        
                       if matches!(*c, b'<' | b'&' | b'\r' | b'\0') { 
        
                           break; 
        
                       } 
        
                       if *c == b'\n' { 
        
                           n_newlines += 1; 
        
                       } 
        
                       i += 1; 
        
                   } 
        
                   let set_result = if i == 0 { 
        
                       let first_char = input.pop_front_char().unwrap(); 
        
                       debug_assert!(matches!(first_char, '<' | '&' | '\r' | '\0')); 
        
                       // FIXME: Passing a bogus input queue is only relevant when c is \n, which can never happen in this case. 
        
                       // Still, it would be nice to not have to do that. 
        
                       // The same is true for the unwrap call. 
        
                       let preprocessed_char = self 
        
                           .get_preprocessed_char(first_char, &BufferQueue::default()) 
        
                           .unwrap(); 
        
                       SetResult::FromSet(preprocessed_char) 
        
                   } else { 
        
                       debug_assert!( 
        
                           input.len() >= i, 
        
                           "Trying to remove {:?} bytes from a tendril that is only {:?} bytes long", 
        
                           i, 
        
                           input.len() 
        
                       ); 
        
                       let consumed_chunk = input.unsafe_subtendril(0, i as u32); 
        
                       input.unsafe_pop_front(i as u32); 
        
                       SetResult::NotFromSet(consumed_chunk) 
        
                   }; 
        
                   self.current_line.set(self.current_line.get() + n_newlines); 
        
                   Some(set_result) 
        
               }
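
For comparison with the serializer case, the scalar part of the fast path above boils down to the scan below. This is a simplified sketch: `scan_data_run` is a hypothetical name, it takes a plain byte slice rather than a `StrTendril`, and it leaves out the SSE2/NEON prefix entirely.

```rust
/// Returns the length of the leading run that contains none of the bytes
/// the data state treats specially, plus the number of newlines in that run.
fn scan_data_run(input: &[u8]) -> (usize, u64) {
    let mut n_newlines = 0;
    for (i, &c) in input.iter().enumerate() {
        if matches!(c, b'<' | b'&' | b'\r' | b'\0') {
            // Stop at the first special byte; everything before it is plain data.
            return (i, n_newlines);
        }
        if c == b'\n' {
            n_newlines += 1;
        }
    }
    (input.len(), n_newlines)
}
```

The newline counter is the part that a plain `memchr`-style search cannot provide, which is one difference between the tokenizer case and the serializer case.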

simonwuelker · 2026-05-17T07:56:54Z

+            let result2 = memchr2(b'&', 0xC2, &slice[..result]).unwrap_or(slice.len());
+            result.min(result2)


nit: If you do unwrap_or(result) for the memchr2 call then you don't need result.min(result2)
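
The nit can be checked with a small sketch. Only the `memchr2(b'&', 0xC2, ...)` line is taken from the diff; the first search (for `<` and `>`) and the helper `find_any`, which stands in for the `memchr` crate so the example has no dependencies, are assumptions made for illustration.

```rust
/// Hypothetical stand-in for `memchr::memchr2`.
fn find_any(needles: &[u8], haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| needles.contains(b))
}

/// Shape of the code as quoted in the review: fall back to `slice.len()`
/// and then take the minimum of the two results.
fn first_special_with_min(slice: &[u8]) -> usize {
    // Assumed first search; the diff excerpt does not show it.
    let result = find_any(&[b'<', b'>'], slice).unwrap_or(slice.len());
    let result2 = find_any(&[b'&', 0xC2], &slice[..result]).unwrap_or(slice.len());
    result.min(result2)
}

/// Suggested simplification: the second search only looks at
/// `slice[..result]`, so any hit is already smaller than `result`,
/// and `result` itself is the right fallback.
fn first_special_simplified(slice: &[u8]) -> usize {
    let result = find_any(&[b'<', b'>'], slice).unwrap_or(slice.len());
    find_any(&[b'&', 0xC2], &slice[..result]).unwrap_or(result)
}
```

Both functions return the same index for every input, since a hit in the second search is always less than `result` and a miss falls back to a value that the `min` would have discarded anyway.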

mrobinson · 2026-05-17T09:51:54Z

Thanks for the review!

> We solve a very similar problem for the data state in the HTML tokenizer with SIMD intrinsics. I wonder how that compares to two memchr calls in terms of performance.

I've been doing a lot of rough experimentation over the past couple of days with SSE3 and AVX2 versions of the parser optimization. It's quite possible that we could use a similar technique here (and it would benefit from not having to count newlines). It would be nice to move some of these routines into markup5ever utilities and make them more general, though very carefully in order to avoid hurting performance.

I think you are ultimately correct in #703, though, that bigger wins are likely to come from structural changes to the API, such as supporting a mode that doesn't count newlines.

Signed-off-by: Martin Robinson <mrobinson@igalia.com>

github-actions Bot added the V-non-breaking A non-breaking change label May 15, 2026

simonwuelker approved these changes May 17, 2026


mrobinson force-pushed the serialize-memchr branch from e4688ab to 630777d Compare May 17, 2026 10:22

mrobinson enabled auto-merge May 17, 2026 10:22

mrobinson added this pull request to the merge queue May 17, 2026

Merged via the queue into main with commit a32b0d2 May 17, 2026
9 checks passed

github-actions Bot added V-non-breaking A non-breaking change and removed V-non-breaking A non-breaking change labels May 17, 2026
