Description:
When I integrated zpaq into a Go program via cmd, I ran into a problem while using zpaq l -checksum to verify that files had been written correctly. The zpaq output contained filenames that do not exist on disk. After investigating, I found that the supposedly UTF-8 output contained UTF-16 surrogate pairs, with each surrogate encoded as its own 3-byte sequence, in place of the correct 4-byte UTF-8 sequences.
Below is my fix. The string is converted to UTF-16 and then back to UTF-8 with the correct conversion function (WideCharToMultiByte with CP_UTF8), which resolved the issue in my case:
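std::string wtou(const std::wstring& w) {
    if (w.empty()) return std::string();
    // First call: query the required size in bytes, including the terminating NUL.
    int size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, nullptr, 0, nullptr, nullptr);
    if (size <= 0) return std::string(); // conversion failed
    std::string result(size - 1, 0);
    // Second call: convert; the NUL is written onto the string's own terminator.
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, &result[0], size, nullptr, nullptr);
    return result;
}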
void printUTF8(const char *s, FILE *f = stdout) {
    assert(f);
    assert(s);
    if (flagsilent)
        return;
#ifdef unix
    fprintf(f, "%s", s);
#else
    const HANDLE h = (HANDLE) _get_osfhandle(_fileno(f));
    DWORD ft = GetFileType(h);
    std::wstring w = utow(s, '/');  // <<<<<
    std::string utf8_str = wtou(w); // <<<<<
    if (ft == FILE_TYPE_CHAR) {
        fflush(f);
        DWORD n = 0;
        WriteConsole(h, w.c_str(), w.size(), &n, 0);
        if (g_output_handle != 0) {
            fprintf(g_output_handle, "%s", utf8_str.c_str());
        }
    } else
        fprintf(f, "%s", utf8_str.c_str());
#endif
}
Impact Path:
When archiving, the conversion of Windows-provided UTF-16 filenames to UTF-8 may mishandle characters that occupy 4 bytes in UTF-16 (surrogate pairs): each surrogate is encoded as its own 3-byte sequence, so the stored name is not valid UTF-8. This does not cause problems when extracting on Windows, because the name is converted back to UTF-16. However, it may cause problems on other platforms.
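For names that are already stored this way, a reader on another platform could in principle repair them before use. The following is only a minimal portable sketch of that idea, not zpaq code: the function name repairSurrogatePairs is hypothetical, and it assumes the only damage is a high and low surrogate each written as a 3-byte sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of zpaq): rewrites every high+low surrogate
// pair that was encoded as two 3-byte sequences (ED A0..AF xx, ED B0..BF xx)
// into the proper 4-byte UTF-8 form. All other bytes, including unpaired
// surrogates, are copied unchanged.
std::string repairSurrogatePairs(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        if (i + 5 < in.size()) {
            unsigned char b[6];
            for (int k = 0; k < 6; ++k)
                b[k] = static_cast<unsigned char>(in[i + k]);
            const bool high = b[0] == 0xED && (b[1] & 0xF0) == 0xA0 && (b[2] & 0xC0) == 0x80;
            const bool low  = b[3] == 0xED && (b[4] & 0xF0) == 0xB0 && (b[5] & 0xC0) == 0x80;
            if (high && low) {
                // Decode the two 3-byte sequences back to UTF-16 code units.
                const unsigned hi = 0xD000u | ((b[1] & 0x3Fu) << 6) | (b[2] & 0x3Fu);
                const unsigned lo = 0xD000u | ((b[4] & 0x3Fu) << 6) | (b[5] & 0x3Fu);
                // Combine the surrogates into one code point (U+10000..U+10FFFF).
                const unsigned cp = 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
                // Emit the 4-byte UTF-8 encoding of that code point.
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
                i += 6;
                continue;
            }
        }
        out += in[i++];
    }
    return out;
}
```

Applied to the bytes ED A0 BE ED B4 94 this yields F0 9F A4 94, while names that are already valid UTF-8 pass through untouched.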
Test Byte Sequences:
- ☝ (U+261D) - Correctly encoded as 3-byte UTF-8
- 🤔 (U+1F914) - Incorrectly encoded: it should be 4-byte UTF-8 but comes out as two 3-byte sequences
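The two test characters can be told apart mechanically. The sketch below is an illustration only (hasEncodedSurrogate is a hypothetical name, not something zpaq provides): it looks for a surrogate code point written as a 3-byte sequence, which valid UTF-8 never contains.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` contains a UTF-16 surrogate
// (U+D800..U+DFFF) encoded directly as three bytes: ED, then A0..BF,
// then 80..BF. In valid UTF-8 the byte ED is always followed by 80..9F.
bool hasEncodedSurrogate(const std::string& s) {
    for (std::size_t i = 0; i + 2 < s.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        const unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
        if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF && b2 >= 0x80 && b2 <= 0xBF)
            return true;
    }
    return false;
}
```

A check like this could be run on each filename in the listing output before comparing it with names on disk.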
AI-Generated Analysis
Correct UTF-8 Sequence:
F0 9F A4 94 // 🤔 emoji's correct UTF-8 encoding
The Sequence You Got:
ED A0 BE ED B4 94 // This is the result of incorrectly encoding UTF-16 surrogate pairs as UTF-8
Root Cause
This is a typical UTF-16 to UTF-8 conversion error:
- The Unicode code point for 🤔 is U+1F914.
- In UTF-16, this character is represented by the surrogate pair D83E DD14.
- zpaq might first convert the character to UTF-16 and then incorrectly treat the two surrogates as independent Unicode characters when encoding to UTF-8:
  U+D83E → ED A0 BE (UTF-8)
  U+DD14 → ED B4 94 (UTF-8)
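For comparison, a conversion that avoids this error combines each surrogate pair into a single code point before encoding. The following is a portable sketch under that assumption, not the actual zpaq conversion routine; the names appendUtf8 and utf16ToUtf8 are hypothetical, and replacing unpaired surrogates with U+FFFD is one possible policy among several.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one code point (assumed <= U+10FFFF).
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-16 to UTF-8, combining surrogate pairs first.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        if (isHigh && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High + low surrogate: merge into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            // Unpaired surrogate: substitute U+FFFD (an assumed policy).
            cp = 0xFFFD;
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With the pair D83E DD14 as input this produces F0 9F A4 94, and U+261D still encodes as the 3-byte sequence E2 98 9D.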
Apologies for the lack of further explanation. I am currently trying to rescue data that is about to be destroyed by OneDrive. If you need more detailed information, please let me know.
I'm not a native English speaker, so the explanation might seem a bit confusing.