Handle surrogate pair when converting from LSP encoding #1785

bullno1 · 2025-06-06T19:52:11Z

The generic LSP completer adds 1 to the code unit position before conversion to get a 1-based index:

ycmd/ycmd/completers/language_server/language_server_completer.py

Lines 3489 to 3490 in a51329a

    
    lsp.UTF16CodeUnitsToCodepoints( line_value,
                                    location[ 'character' ] + 1 ) )

However, this is a problem when the original index points at a surrogate pair.
For example, given the string 😊, the language server sends index 0.
Adding 1 moves the offset to the low surrogate, and the following conversion then fails with a UnicodeError:

ycmd/ycmd/completers/language_server/language_server_protocol.py

Lines 826 to 827 in a51329a

    
    bytes_included = value_as_utf16_bytes[ : code_unit_offset * 2 ]
    return len( bytes_included.decode( 'utf-16-le' ) )

This is because the surrogate pair is sliced in half.
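A minimal reproduction of the failure, assuming a simplified stand-in for the conversion (the function name `naive_code_units_to_codepoints` is hypothetical and only mirrors the two quoted lines, not the full ycmd function):

```python
def naive_code_units_to_codepoints(line_value, code_unit_offset):
    # Hypothetical stand-in: encode as UTF-16-LE, slice at the code unit
    # offset (2 bytes per code unit), then count the decoded codepoints.
    value_as_utf16_bytes = line_value.encode('utf-16-le')
    bytes_included = value_as_utf16_bytes[: code_unit_offset * 2]
    return len(bytes_included.decode('utf-16-le'))


line_value = '😊'          # U+1F60A, encoded as two UTF-16 code units
server_character = 0       # position reported by the language server

try:
    naive_code_units_to_codepoints(line_value, server_character + 1)
    failed = False
except UnicodeError:
    # The slice ends after the high surrogate, so decoding is rejected.
    failed = True
```

Here the slice keeps only the first two bytes (the high surrogate), which is not valid UTF-16 on its own.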

This PR checks whether the high byte of the code unit at the offset belongs to a low surrogate and, if so, advances the offset by one more code unit to skip the entire pair.
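The idea can be sketched as follows. This is an illustration of the approach described here, not the PR's actual diff; the function name `code_units_to_codepoints_skipping_pair` is hypothetical, and the range check reflects the second commit's title.

```python
def code_units_to_codepoints_skipping_pair(line_value, code_unit_offset):
    value_as_utf16_bytes = line_value.encode('utf-16-le')
    byte_offset = code_unit_offset * 2

    # In little-endian order the high byte of a code unit is its second byte.
    # Check the range first so an offset at the end of the line is safe.
    if byte_offset + 1 < len(value_as_utf16_bytes):
        high_byte = value_as_utf16_bytes[byte_offset + 1]
        # Low surrogates are U+DC00..U+DFFF, so their high byte is 0xDC..0xDF.
        if 0xDC <= high_byte <= 0xDF:
            # The offset would split a pair: include the low surrogate too.
            code_unit_offset += 1

    bytes_included = value_as_utf16_bytes[: code_unit_offset * 2]
    return len(bytes_included.decode('utf-16-le'))


# Index 0 from the server, plus 1, now decodes the whole pair.
result = code_units_to_codepoints_skipping_pair('😊', 0 + 1)
```

Offsets that do not land on a low surrogate are left unchanged, so plain BMP text behaves as before.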

This only happens for positions sent from the language server, so the conversion is not reversible.
For that reason I added a test case separate from test_CodepointsToUTF16CodeUnitsAndReverse.


puremourning

thanks!

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained (waiting on @bullno1)

puremourning · 2025-06-16T20:07:18Z

@Mergifyio rebase

mergify · 2025-06-16T20:07:31Z

rebase

✅ Branch has been successfully rebased

bullno1 force-pushed the handle-utf16-surrogate-pair branch from 1c73ef0 to 8fbddc5 Compare June 6, 2025 19:58

puremourning approved these changes Jun 6, 2025


bullno1 added 2 commits June 16, 2025 20:07

Handle surrogate pair when converting from LSP encoding

6912f9e

Perform a range check before accessing the high byte

ecd1bb8

puremourning force-pushed the handle-utf16-surrogate-pair branch from 8fbddc5 to ecd1bb8 Compare June 16, 2025 20:07
