Skip to content

Support erroring (possibly by default) when detecting the host prefix in the binary #1507

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
vyasr opened this issue Mar 21, 2025 · 2 comments · May be fixed by #1658
Open

Support erroring (possibly by default) when detecting the host prefix in the binary #1507

vyasr opened this issue Mar 21, 2025 · 2 comments · May be fixed by #1658
Assignees

Comments

@vyasr
Copy link
Contributor

vyasr commented Mar 21, 2025

Currently rattler-build supports building conda packages that will have the host prefix replaced in built binaries at install time by conda using similar logic to conda-build as of #598. However, there are flaws with this approach. As discussed extensively in rapidsai/cudf#18251 (comment) and rapidsai/cudf#18251 (comment) (those two comments give a complete picture of exactly how this failed), a sufficiently smart compiler can always produce a binary for which binary prefix replacement is unsafe, regardless of suitable null character padding or other safety checks. In conda/conda-build#1674 (comment), I proposed that conda-build offer a new option to error on package build if such a prefix is detected, since in practice the best solution to these kinds of problems is to fix the package source or build process to avoid embedding the host prefix in the first place. With rattler-build, as a newer packaging tool we have more flexibility and I would suggest that we go further.

I think the default behavior of rattler-build should be to error if the prefix is in the binary. This can be implemented fairly trivially using an extra flag in the config and by checking the value of contains_prefix_binary here. My reasoning for this is that it is easy enough for a project to opt in to prefix patching if there is absolutely no way they can avoid the need, but at least if it is opt in they must know what they are choosing and accept the potential pitfalls and maybe know where to look if something goes wrong. Conversely, if patching is automatic the error modes are potentially arbitrarily bad (silent failures, data corruption, program crashes) and extremely difficult to debug. Realistically most developers would struggle to fully debug a problem like rapidsai/cudf#18251 and trace it back to conda's binary prefix replacement as the source. If at all possible, we should encourage packages to fix their source code and build process to avoid needing binary prefix patching rather than having them implicitly rely on this behavior without realizing it.

Text prefix patching
As an aside, I also have a weak preference for the same behavior in text prefix patching, but in that case I think the problems are far less difficult to debug and also the process is far less likely to go wrong in unpredictable ways, so I am less concerned with the behavior there.

@wolfv
Copy link
Member

wolfv commented Mar 22, 2025

Thanks for the write up. I agree that it might be a good change. Would be interesting to scan the existing packages to see how many do rely on binary prefix replacement (maybe even looking in a few larger environments could be useful, to get an idea).

But yes, I also had to debug some issues with std::string(PREFIX) in the past and it's definitely tricky.

@vyasr
Copy link
Contributor Author

vyasr commented Mar 22, 2025

Would be interesting to scan the existing packages to see how many do rely on binary prefix replacement (maybe even looking in a few larger environments could be useful, to get an idea).

I agree. It could also be helpful to make it a loud warning for a couple of releases and see if anybody reports anything interesting.

But yes, I also had to debug some issues with std::string(PREFIX) in the past and it's definitely tricky.

tl;dr I don't think that this issue is specific to std::string, except for the fact that I don't think modern compilers are as smart at optimizing C code yet, so I would not want to chalk this up to C strings really being safe either.

Longer answer:

I should be clear here. The previous instances of binary prefix replacement causing problems with strings that I've seen -- this one also from RAPIDS and the original conda-build issue, both of which I found out about after rapidsai/cudf#18251 -- both boil down to std::string not being null-terminated resulting in differences in how strings are printed. That is absolutely a pain to debug, but in the grand scheme of things it is debuggable: at least you are looking at localized code and when debugging the normal control flow of the program you can use normal C++ debugging techniques in gdb with a debug build to see that string's length does not match its contents. Also, those issues are strictly due to differences in how std::string and C strings are represented, so once you get to that point knowing how the language works is sufficient to understand the problem and trace it back to the prefix patching.

rapidsai/cudf#18251 is far worse. The exact same optimization that I observed is, so far as I know, completely valid for a C string; it is only related to std::string insofar as I suspect that modern C++ compilers have more advanced heuristics to optimize code C++ strings than C strings, and because the fact that c strings are null terminated means that the probability of finding one string that is a substring of another is far lower because it also requires that they end the same way, i.e. it cannot be an arbitrary substring in the middle. Here, the compiler decided that it would be a good idea for one part of the code to reuse a substring of the string constant embedded by a completely different part of the code. The error was not localized in any way, in that the prefix string being replaced was from an error in a completely different part of the code than where string operations were performed. In fact, based on the fact that the header where the prefix was embedded is included in every single file in our library, since I only see the prefix string for that error once in the final binary I think the two constants may actually have come from separate TUs and been combined at link-time. Furthermore, since the string constant was just passed as a literal to a std::string function and this probably was only observed when compiled with -O2 or higher, there was no way to see anything wrong by inspecting the contents of a std::string in a debug session for inconsistencies between the content and the data. The only way to really root cause the issue was to disassemble near the point of error, inspect a section of assembly instructions manually, identify all possible instructions referencing memory in specific regions, find one that referenced a null string character, then attempt to remap the mapped memory regions from the process space (via info proc mappings or equivalent) back to the original sections of the ELF file (as determined via readelf and hexdump or equivalents).

Hence why I think we should change the default here: these kinds of errors are almost impossible to debug without deep knowledge of the conda/rattler build/install stack, the C/C++ languages, and the actual ELF binary format, and they can manifest in arbitrarily bad ways, not just misprinted strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants