No Handling of 4-Byte UTF-16 Characters During UTF-8 Conversion on Windows #196

@LXY1226

Description:

When I integrated zpaq via cmd in a Go program, I ran into an issue while using zpaq l -checksum to verify that files were written correctly. Inspecting zpaq's output, I noticed filenames that do not exist. After investigation, I found that the supposedly UTF-8 output contained UTF-16 surrogate pairs (characters that take 4 bytes in UTF-16), each surrogate encoded as its own 3-byte sequence instead of the pair being encoded as one 4-byte UTF-8 sequence.

Below is my fix. It converts the UTF-8 input to UTF-16 and then back to UTF-8 with the correct conversion function (WideCharToMultiByte), which resolved the issue in my case:

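// Convert a UTF-16 string to UTF-8 with the Windows API.
std::string wtou(const std::wstring& w) {
	if (w.empty()) return std::string();
	// With a length of -1, the returned size includes the terminating NUL.
	int size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, nullptr, 0, nullptr, nullptr);
	if (size <= 0) return std::string();  // conversion failed
	std::string result(size - 1, 0);
	WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, &result[0], size, nullptr, nullptr);
	return result;
}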

void printUTF8(const char *s, FILE *f = stdout) {
	assert(f);
	assert(s);
	if (flagsilent)
		return;
#ifdef unix
	fprintf(f, "%s", s);
#else
	const HANDLE h = (HANDLE) _get_osfhandle(_fileno(f));
	DWORD ft = GetFileType(h);
	std::wstring w = utow(s, '/');  // <<<<< decode the input to UTF-16
	std::string utf8_str = wtou(w); // <<<<< re-encode it as UTF-8
	if (ft == FILE_TYPE_CHAR) {
		// Console: write the UTF-16 text directly
		fflush(f);
		DWORD n = 0;
		WriteConsole(h, w.c_str(), (DWORD) w.size(), &n, 0);
		if (g_output_handle != 0) {
			fprintf(g_output_handle, "%s", utf8_str.c_str());
		}
	} else
		fprintf(f, "%s", utf8_str.c_str());  // file or pipe: write the re-encoded UTF-8
#endif
}
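
For readers without a Windows build environment, here is a minimal portable sketch of what a surrogate-aware UTF-16 to UTF-8 conversion does. It is not zpaq's code and not the Windows implementation; the names utf16_to_utf8 and append_utf8 are hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding (1 to 4 bytes) of one code point to `out`.
static void append_utf8(std::string& out, uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 to UTF-8, combining each
// high/low surrogate pair into a single code point before encoding.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: rebuild the supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // assumption: unpaired surrogate becomes U+FFFD
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Under these assumptions, passing the two code units D83E DD14 yields the four bytes F0 9F A4 94, not two 3-byte sequences.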

Impact Path:

When archiving, the conversion of Windows-provided UTF-16 filenames to UTF-8 may mishandle characters that take 4 bytes in UTF-16 (surrogate pairs): each surrogate is stored as a separate 3-byte sequence, so the saved name is a mix of valid UTF-8 and encoded UTF-16 surrogates. This does not cause issues when extracting on Windows, because the name is converted back to UTF-16. However, it may cause problems on other platforms.
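
On a non-Windows platform, names already stored this way could in principle be repaired after the fact. The sketch below is only an illustration, not part of zpaq: the function name repair_surrogate_utf8 is hypothetical, and it assumes the damage is limited to well-formed high/low surrogate pairs, each encoded as a 3-byte sequence (a pattern commonly known as CESU-8). All other bytes are copied through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: rewrite every "ED A0..AF xx  ED B0..BF xx" run
// (a surrogate pair encoded as two 3-byte sequences) as the single
// 4-byte UTF-8 sequence for the code point it represents.
std::string repair_surrogate_utf8(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (i + 5 < in.size()) {
            const unsigned char b0 = in[i],     b1 = in[i + 1], b2 = in[i + 2];
            const unsigned char b3 = in[i + 3], b4 = in[i + 4], b5 = in[i + 5];
            const bool high = b0 == 0xED && (b1 & 0xF0) == 0xA0 && (b2 & 0xC0) == 0x80;
            const bool low  = b3 == 0xED && (b4 & 0xF0) == 0xB0 && (b5 & 0xC0) == 0x80;
            if (high && low) {
                // Recover the two UTF-16 code units, then the code point.
                const uint32_t hi = 0xD000u | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
                const uint32_t lo = 0xD000u | ((b4 & 0x3Fu) << 6) | (b5 & 0x3Fu);
                const uint32_t cp = 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
                i += 6;
                continue;
            }
        }
        out += in[i++];  // anything else is copied as-is
    }
    return out;
}
```

Applied to the bytes ED A0 BE ED B4 94, this produces F0 9F A4 94; a name that contains only valid UTF-8 comes back unchanged.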

Test Byte Sequences:

  • ☝ (U+261D) - Correctly encoded as 3-byte UTF-8
  • 🤔 (U+1F914) - Requires 4-byte UTF-8, but is encoded incorrectly

AI-Generated Analysis

Correct UTF-8 Sequence:

F0 9F A4 94  // 🤔 emoji's correct UTF-8 encoding

The Sequence You Got:

ED A0 BE ED B4 94  // This is the result of incorrectly encoding UTF-16 surrogate pairs as UTF-8

Root Cause

This is a typical UTF-16 to UTF-8 conversion error:

  1. The Unicode code point for 🤔 is U+1F914.
  2. In UTF-16, this character is represented using surrogate pairs: D83E DD14.
  3. zpaq might first convert the character to UTF-16 and then incorrectly treat the surrogate pairs as independent Unicode characters when encoding to UTF-8:
    • U+D83E → ED A0 BE (UTF-8)
    • U+DD14 → ED B4 94 (UTF-8)
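
The arithmetic behind these steps can be checked with a small standalone sketch. It is not zpaq's code: the function names are hypothetical, and encode_per_surrogate only models the suspected faulty path (each surrogate encoded as if it were an independent character).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair (high, low).
std::pair<uint16_t, uint16_t> to_surrogates(uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const uint32_t v = cp - 0x10000;  // 20-bit offset
    return { static_cast<uint16_t>(0xD800 + (v >> 10)),
             static_cast<uint16_t>(0xDC00 + (v & 0x3FF)) };
}

// Encode a value in 0x800..0xFFFF as 3 UTF-8 bytes. Applied blindly to a
// surrogate, this yields the invalid "ED xx xx" form.
std::string encode_3byte(uint32_t u) {
    std::string out;
    out += static_cast<char>(0xE0 | (u >> 12));
    out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (u & 0x3F));
    return out;
}

// Suspected faulty path: encode each surrogate separately (6 bytes).
std::string encode_per_surrogate(uint32_t cp) {
    const std::pair<uint16_t, uint16_t> s = to_surrogates(cp);
    return encode_3byte(s.first) + encode_3byte(s.second);
}

// Correct path: encode the code point itself as 4 UTF-8 bytes.
std::string encode_4byte(uint32_t cp) {
    std::string out;
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}
```

For U+1F914, to_surrogates gives D83E and DD14, encode_per_surrogate gives ED A0 BE ED B4 94, and encode_4byte gives F0 9F A4 94, matching the sequences listed above.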

Apologies for the lack of further explanation. I am currently trying to rescue data that is about to be destroyed by OneDrive. If you need more detailed information, please let me know.

I'm not a native English speaker, so the explanation might seem a bit confusing.
