No Handling of 4-Byte UTF-16 Characters During UTF-8 Conversion on Windows

#### Description:
While integrating `zpaq` via `cmd` in a Go program, I used `zpaq l -checksum` to verify that files were written correctly. The output of `zpaq` contained filenames that do not exist. After investigating, I found that the supposedly UTF-8 output contains UTF-16 surrogate pairs, with each surrogate encoded as its own 3-byte sequence, instead of one 4-byte UTF-8 sequence per character.

Below is my fix. It converts the UTF-8 string to UTF-16 and then back to UTF-8 with the correct conversion function (`WideCharToMultiByte`), which resolved the issue in my case:

```cpp
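// Convert UTF-16 to UTF-8 via the Win32 API, which combines each surrogate
// pair into a single 4-byte UTF-8 sequence.
std::string wtou(const std::wstring& w) {
	if (w.empty()) return std::string();
	// With cchWideChar == -1 the returned size includes the terminating NUL.
	int size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, nullptr, 0, nullptr, nullptr);
	if (size <= 0) return std::string(); // conversion failed
	std::string result(size, 0);
	WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, &result[0], size, nullptr, nullptr);
	result.resize(size - 1); // drop the terminating NUL
	return result;
}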

void printUTF8(const char *s, FILE *f = stdout) {
	assert(f);
	assert(s);
	if (flagsilent)
		return;
#ifdef unix
	fprintf(f, "%s", s);
#else
	const HANDLE h = (HANDLE) _get_osfhandle(_fileno(f));
	DWORD ft = GetFileType(h);
	std::wstring w = utow(s, '/');  // <<<<< decode to UTF-16 with the existing helper
	std::string utf8_str = wtou(w); // <<<<< re-encode as well-formed UTF-8
	if (ft == FILE_TYPE_CHAR) {
		fflush(f);
		DWORD n = 0;
		WriteConsole(h, w.c_str(), (DWORD) w.size(), &n, 0);
		if (g_output_handle != 0) {
			fprintf(g_output_handle, "%s", utf8_str.c_str());
		}
	} else
		fprintf(f, "%s", utf8_str.c_str());
#endif
}
```

#### Impact Path:
When archiving, the conversion of Windows-provided UTF-16 filenames to UTF-8 does not handle characters that take 4 bytes in UTF-16 (surrogate pairs). Each surrogate is encoded separately, so the stored name is not valid UTF-8. This causes no issues when extracting on Windows, because the name is converted back to UTF-16. However, it *may* cause problems on other platforms.
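
For names that are already stored this way, a portable repair step could look roughly like the sketch below. `repairSurrogatePairs` is a hypothetical helper of mine, not something that exists in zpaq, and I have only reasoned about it against the byte sequences in this report. It rewrites a high/low surrogate pair that was stored as two 3-byte sequences into the 4-byte UTF-8 form and copies everything else unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of zpaq): rewrites each surrogate pair stored
// as ED A0..AF xx ED B0..BF xx into the 4-byte UTF-8 encoding of the code
// point it represents. All other bytes, including lone surrogates, are copied
// unchanged.
std::string repairSurrogatePairs(const std::string& in) {
	std::string out;
	out.reserve(in.size());
	std::size_t i = 0;
	while (i < in.size()) {
		const unsigned char* p =
			reinterpret_cast<const unsigned char*>(in.data()) + i;
		const bool pair = i + 6 <= in.size()
			&& p[0] == 0xED && (p[1] & 0xF0) == 0xA0 && (p[2] & 0xC0) == 0x80  // high surrogate
			&& p[3] == 0xED && (p[4] & 0xF0) == 0xB0 && (p[5] & 0xC0) == 0x80; // low surrogate
		if (!pair) {
			out.push_back(in[i++]);
			continue;
		}
		// Recover the two UTF-16 code units, then the code point they form.
		const std::uint32_t hi = 0xD000u | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
		const std::uint32_t lo = 0xD000u | ((p[4] & 0x3Fu) << 6) | (p[5] & 0x3Fu);
		const std::uint32_t cp = 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
		// Emit the 4-byte UTF-8 form.
		out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
		out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
		i += 6;
	}
	return out;
}
```

Whether such a step belongs at list/extract time on non-Windows platforms, or whether the name should be fixed once when archiving, is a design decision I leave to the maintainer.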

#### Test Byte Sequences:
- ☝ (U+261D) - Correctly encoded as 3-byte UTF-8
- 🤔 (U+1F914) - Incorrectly encoded: 6 bytes (two 3-byte sequences, one per surrogate) instead of 4-byte UTF-8
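
A quick way to check output for this problem: in well-formed UTF-8 the lead byte `ED` is always followed by a byte in `80..9F`, so `ED` followed by `A0..BF` means a surrogate was encoded directly. The function below is only a sketch with a hypothetical name (`hasEncodedSurrogate`); it is not taken from zpaq.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical check (not part of zpaq): returns true if `s` contains a
// UTF-16 surrogate (U+D800..U+DFFF) encoded as a 3-byte sequence, which is
// never allowed in well-formed UTF-8.
bool hasEncodedSurrogate(const std::string& s) {
	for (std::size_t i = 0; i + 1 < s.size(); ++i) {
		const unsigned char b0 = static_cast<unsigned char>(s[i]);
		const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
		if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF)
			return true;
	}
	return false;
}
```

Applied to the two test characters, the 3-byte ☝ and a correctly encoded 4-byte 🤔 pass, while the 6-byte form of 🤔 is flagged.
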
<details>
<summary>AI-Generated Analysis</summary>

**Correct UTF-8 Sequence:**

```
F0 9F A4 94  // 🤔 emoji's correct UTF-8 encoding
```

**The Sequence You Got:**
```
ED A0 BE ED B4 94  // This is the result of incorrectly encoding UTF-16 surrogate pairs as UTF-8
```

#### Root Cause

This is a typical **UTF-16 to UTF-8 conversion error**:

1. The Unicode code point for `🤔` is `U+1F914`.
2. In UTF-16, this character is represented using surrogate pairs: `D83E DD14`.
3. On Windows the filename arrives as UTF-16. zpaq then appears to treat each surrogate as an independent Unicode character when encoding to UTF-8, which yields a CESU-8 style sequence instead of valid UTF-8:
   - `U+D83E` → `ED A0 BE` (UTF-8)
   - `U+DD14` → `ED B4 94` (UTF-8)
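
The steps above can be reproduced without any Windows API. The sketch below is an illustration under my own assumptions, not zpaq's actual code: `encodeUtf8`, `naiveUtf16ToUtf8`, and `utf16ToUtf8` are hypothetical names. The naive version encodes every UTF-16 code unit on its own, while the second version first combines a surrogate pair into one code point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one value (up to 0x10FFFF) with the standard UTF-8 bit layout.
// No surrogate check is done here, which is what makes the bug possible.
std::string encodeUtf8(std::uint32_t v) {
	std::string out;
	if (v < 0x80) {
		out += static_cast<char>(v);
	} else if (v < 0x800) {
		out += static_cast<char>(0xC0 | (v >> 6));
		out += static_cast<char>(0x80 | (v & 0x3F));
	} else if (v < 0x10000) {
		out += static_cast<char>(0xE0 | (v >> 12));
		out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (v & 0x3F));
	} else {
		out += static_cast<char>(0xF0 | (v >> 18));
		out += static_cast<char>(0x80 | ((v >> 12) & 0x3F));
		out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (v & 0x3F));
	}
	return out;
}

// Buggy variant: every UTF-16 code unit is treated as a separate character.
std::string naiveUtf16ToUtf8(const std::u16string& w) {
	std::string out;
	for (char16_t u : w) out += encodeUtf8(u);
	return out;
}

// Corrected variant: a high surrogate followed by a low surrogate is combined
// into one code point before encoding. A lone surrogate is passed through
// unchanged here; a stricter version might substitute U+FFFD instead.
std::string utf16ToUtf8(const std::u16string& w) {
	std::string out;
	for (std::size_t i = 0; i < w.size(); ++i) {
		std::uint32_t cp = w[i];
		if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()
				&& w[i + 1] >= 0xDC00 && w[i + 1] <= 0xDFFF) {
			cp = 0x10000 + ((cp - 0xD800) << 10) + (w[i + 1] - 0xDC00);
			++i; // the low surrogate has been consumed
		}
		out += encodeUtf8(cp);
	}
	return out;
}
```

For the input `D83E DD14`, the naive variant yields `ED A0 BE ED B4 94` and the corrected variant yields `F0 9F A4 94`.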

</details>

Apologies for the lack of further explanation. I am currently trying to rescue data that is about to be destroyed by OneDrive. If you need more detailed information, please let me know.



I'm not a native English speaker, so the explanation might seem a bit confusing.
