-
Notifications
You must be signed in to change notification settings - Fork 110
Skyline/work/20251019 utf8 no bom #3651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Phase 1 Analysis complete: - 3,439 files with BOM (25.36%) - 10,123 files without BOM (74.64%) - Majority are .cs files (2,818) - 5 files excluded (.tli/.tlh auto-generated) Phase 2 Tooling: - Created remove-bom.ps1 script - Supports dry-run mode for safe testing - Preserves file timestamps - Validates conversion success - Excludes auto-generated COM headers Files added: - todos/active/files-with-bom.txt: Complete list (to be removed before completion) - scripts/misc/remove-bom.ps1: BOM removal tool (permanent)
Standardize all source files to UTF-8 without BOM except for auto-generated COM type library files. Phase 2 Complete - BOM Removal: - Converted 3,434 files from UTF-8 with BOM to UTF-8 without BOM - Excluded 5 auto-generated COM headers (.tli/.tlh files) - Preserved all file timestamps - Verified only BOM bytes removed, no content changes Files still with BOM (intentionally excluded): - pwiz_aux/msrc/utility/vendor_api/thermo/xrawfile2.tli - pwiz_tools/Skyline/Executables/BuildMethod/BuildLTQMethod/ltmethod.tlh - pwiz_tools/Skyline/Executables/BuildMethod/BuildLTQMethod/ltmethod.tli - pwiz_tools/Skyline/Executables/BuildMethod/BuildTSQEZMethod/tsqezmethod.tlh - pwiz_tools/Skyline/Executables/BuildMethod/BuildTSQEZMethod/tsqezmethod.tli Results verified by re-running analyze-bom-git.ps1: - Before: 3,439 files with BOM (25.36%) - After: 5 files with BOM (0.04%) All changes are reversible with a single git revert if needed.
Implement InspectUtf8Bom() method that: - Automatically detects and removes UTF-8 BOM from source files - FAILS the test when BOMs are found (even after fixing) - Excludes auto-generated COM headers (.tli, .tlh) - Covers Skyline and Shared directories - Provides clear error message with actionable steps Developer workflow when BOM is accidentally introduced: 1. Run CodeInspection test (or push to PR) 2. Test FAILS with clear error message 3. BOM automatically removed from files 4. Developer reviews with 'git diff' 5. Developer commits the fix 6. Test passes on next run This prevents BOMs from being merged into master by: - Failing PR builds on TeamCity - Appearing in nightly test failure emails - Auto-fixing for developer convenience Much better UX than just failing without fixing.
- implement new static analysis to catch this use case and error to keep it from being added in the future - improved error output format in CodeInspectionTest to produce links to the file and line - added a new TODO for work related to fixing the CodeInspectionTest warnings about UI code in the Model folder
- add todos/backlog/TODO-compress_vendor_test_data.md for future work to compress these folders - add validate-bom-compliance.ps1 as a tool that can check for any unexpected BOMs project-wide - defer project-wide validation of BOMs for later, as it cannot be done automatically with a .ps1 file
|
||
- **Binary integrity**: Compressed archives are explicitly binary, preventing accidental text modifications | ||
- **Repository size**: Significant space savings (XML compresses extremely well) | ||
- **Consistency**: Matches how vendor reader DLLs are stored in `pwiz_aux/msrc/utility` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vendor DLLs are stored encrypted. I would love to not have to do that because we could update individual DLLs instead of the entire archive. I suppose we could have encrypted each DLL separately, but that would have been pretty awkward.
|
||
## Benefits | ||
|
||
- **Binary integrity**: Compressed archives are explicitly binary, preventing accidental text modifications |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to zip (tarball) all vendor test data and their reference mzMLs, and I stopped doing that because it was annoying to update, maintain, and diff. There is already architecture in place to use .tar.bz2 files for this. A lot of the old test data is still maintained in tar.bz2 files, and only the updated reference mzMLs are submitted plaintext. I would definitely want to keep doing that for reference mzMLs. The vendor data itself probably would be ok to put back in tarballs, but I'm not sure it's really necessary or useful except for "modify everything in the repository" scripts like you've done here, which are exceedingly unusual. For git itself, there's a gitattributes file that tells it to treat everything in vendor_readers as binary:
pwiz\data\vendor_readers.gitattributes
## Benefits | ||
|
||
- **Binary integrity**: Compressed archives are explicitly binary, preventing accidental text modifications | ||
- **Repository size**: Significant space savings (XML compresses extremely well) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no large vendor test data XMLs. Only the reference mzMLs are large and I am against going back to zipping those unless there's a way to easily diff stuff within zips.
- **Repository size**: Significant space savings (XML compresses extremely well) | ||
- **Consistency**: Matches how vendor reader DLLs are stored in `pwiz_aux/msrc/utility` | ||
- **Protection**: Prevents inadvertent BOM removal, line ending changes, or other text transformations | ||
- **Git performance**: Fewer objects to track, faster clones/pulls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a fair point for the directory-based vendor data (not the mzMLs), because those can have tons of small files. But the biggest ones are already stored in tar.bz2 files. I see one Agilent .d, TOFsulfasMS4GHzDualMode+DADSpectra+UVSignal272-NoProfile.d, got committed outside the tarball probably because updating the tarball would mean a 9mb change when the new .d was only 1mb. So if we're going to go back to tarballing all vendor data (and I'm not convinced we should for formats like Bruker TDF with only 2-4 files per .d), it should be per raw file rather than per vendor. That's really the only infrastructure change that would be needed in this TODO, because the existing infrastructure is per vendor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL about .editorconfig. Does the one in root work for Skyline.sln, or does that directory need a separate .editorconfig?
Apparently not. Visual Studio searches up the directory tree stopping when it reaches one with root = true. So, I am going to add that to our /.editorconfig. We don't need any new .editorconfig files until we have settings specific to another project or subfolder (e.g. .NET-specific settings might bot into pwiz_tools, and Skyline-specific settings could go into pwiz_tools/Skyline). |
This work removes a byte order mark (BOM) from many files in our project, fixes our own code that adds these marks and adds a test to the CodeInspectionTest that should fail any file with a BOM within the set of files this test inspects, mostly under the Skyline folder.
Differences between files when only the BOM-v-noBOM changes has long been a pain, but becomes more painful with the introduction of LLM coding tools that seem to prefer noBOM and which is apparently standard in 'nix coding environments. The BOM may be largely going out of style and it is certainly unnecessary, appearing in less than 25% of our files before this change.
This branch also adds a new script scripts/misc/analyze-bom-git.ps1 to analyze the files that have a BOM.