Skyline/work/20251019 utf8 no bom #3651
# Validate UTF-8 BOM compliance against approved list
#
# Usage: .\validate-bom-compliance.ps1
# Returns: Exit code 0 if compliant, 1 if unexpected BOMs found
#
# This script is intended to be used in CI/commit hooks to prevent
# unwanted BOM introduction.

$ErrorActionPreference = "Stop"
$utf8Bom = @(0xEF, 0xBB, 0xBF)

# Approved files that are allowed to have a BOM, with explanations
$approvedBomFiles = @{
    # Visual Studio auto-generated COM type library files
    "pwiz_aux/msrc/utility/vendor_api/thermo/xrawfile2.tli" = "Visual Studio generated COM type library"
    "pwiz_tools/Skyline/Executables/BuildMethod/BuildLTQMethod/ltmethod.tlh" = "Visual Studio generated COM type library"
    "pwiz_tools/Skyline/Executables/BuildMethod/BuildLTQMethod/ltmethod.tli" = "Visual Studio generated COM type library"
    "pwiz_tools/Skyline/Executables/BuildMethod/BuildTSQEZMethod/tsqezmethod.tlh" = "Visual Studio generated COM type library"
    "pwiz_tools/Skyline/Executables/BuildMethod/BuildTSQEZMethod/tsqezmethod.tli" = "Visual Studio generated COM type library"

    # Agilent vendor data format test files (represent real instrument output)
    "pwiz/data/vendor_readers/Agilent/Reader_Agilent_Test.data/TOFsulfasMS4GHzDualMode+DADSpectra+UVSignal272-NoProfile.d/AcqData/DefaultMassCal.xml" = "Agilent vendor data format"
    "pwiz/data/vendor_readers/Agilent/Reader_Agilent_Test.data/TOFsulfasMS4GHzDualMode+DADSpectra+UVSignal272-NoProfile.d/AcqData/Devices.xml" = "Agilent vendor data format"
    "pwiz/data/vendor_readers/Agilent/Reader_Agilent_Test.data/TOFsulfasMS4GHzDualMode+DADSpectra+UVSignal272-NoProfile.d/AcqData/MSTS.xml" = "Agilent vendor data format"
    "pwiz/data/vendor_readers/Agilent/Reader_Agilent_Test.data/TOFsulfasMS4GHzDualMode+DADSpectra+UVSignal272-NoProfile.d/AcqData/acqmethod.xml" = "Agilent vendor data format"
    "pwiz_tools/Bumbershoot/bumberdash/Tests/Data/AgilentTest.d/AcqData/Devices.xml" = "Agilent vendor data format"
    "pwiz_tools/Bumbershoot/bumberdash/Tests/Data/AgilentTest.d/AcqData/acqmethod.xml" = "Agilent vendor data format"
}

Write-Host "Validating UTF-8 BOM compliance..." -ForegroundColor Cyan
Write-Host ""

# Get all Git-tracked files
$gitFiles = @(git ls-files)
$filesWithBom = @()

foreach ($file in $gitFiles) {
    # Convert forward slashes to backslashes for Windows
    $filePath = $file.Replace('/', '\')
    $fullPath = Join-Path (Get-Location) $filePath

    if (-not [System.IO.File]::Exists($fullPath)) {
        continue
    }

    try {
        # Read the first 3 bytes and compare them to the UTF-8 BOM signature
        $stream = [System.IO.File]::OpenRead($fullPath)
        try {
            $bytes = New-Object byte[] 3
            $bytesRead = $stream.Read($bytes, 0, 3)
        } finally {
            # Close in finally so a failed read cannot leak the file handle
            $stream.Close()
        }

        if ($bytesRead -ge 3 -and
            $bytes[0] -eq $utf8Bom[0] -and
            $bytes[1] -eq $utf8Bom[1] -and
            $bytes[2] -eq $utf8Bom[2]) {
            $filesWithBom += $file
        }
    } catch {
        # Skip files that can't be read
    }
}

Write-Host "Found $($filesWithBom.Count) files with UTF-8 BOM" -ForegroundColor $(if ($filesWithBom.Count -eq 0) { "Green" } else { "Yellow" })
Write-Host ""

# Check for unexpected BOMs
$unexpectedBoms = @()
$approvedBoms = @()

foreach ($file in $filesWithBom) {
    $normalizedPath = $file.Replace('\', '/')
    if ($approvedBomFiles.ContainsKey($normalizedPath)) {
        $approvedBoms += @{
            File = $file
            Reason = $approvedBomFiles[$normalizedPath]
        }
    } else {
        $unexpectedBoms += $file
    }
}

# Report approved BOMs
if ($approvedBoms.Count -gt 0) {
    Write-Host "=== Approved BOMs ($($approvedBoms.Count)) ===" -ForegroundColor Green
    foreach ($entry in $approvedBoms) {
        Write-Host "  [OK] $($entry.File)" -ForegroundColor Green
        Write-Host "       Reason: $($entry.Reason)" -ForegroundColor Gray
    }
    Write-Host ""
}

# Report unexpected BOMs
if ($unexpectedBoms.Count -gt 0) {
    Write-Host "=== UNEXPECTED BOMs FOUND ($($unexpectedBoms.Count)) ===" -ForegroundColor Red
    Write-Host ""
    Write-Host "The following files have UTF-8 BOMs but are not on the approved list:" -ForegroundColor Red
    Write-Host ""
    foreach ($file in $unexpectedBoms) {
        Write-Host "  [ERROR] $file" -ForegroundColor Red
    }
    Write-Host ""
    Write-Host "Action required:" -ForegroundColor Yellow
    Write-Host "  1. If these files should NOT have BOM, remove it using scripts/misc/remove-bom.ps1" -ForegroundColor Yellow
    Write-Host "  2. If these files MUST have BOM, add them to the approved list in this script" -ForegroundColor Yellow
    Write-Host ""
    Write-Host "For more information, see todos/active/TODO-20251019_utf8_no_bom.md" -ForegroundColor Yellow
    Write-Host ""
    exit 1
}

# Success
Write-Host "=== VALIDATION PASSED ===" -ForegroundColor Green
Write-Host "All files with BOM are on the approved list." -ForegroundColor Green
Write-Host "Project is BOM-compliant!" -ForegroundColor Green
Write-Host ""
exit 0
# TODO: Compress vendor test data directories

**Status:** Backlog
**Priority:** Low
**Estimated Effort:** Small to Medium (1-2 days)

## Problem

Vendor data format test directories (`.d`, `.raw`, etc.) contain XML and other text files that have binary integrity requirements. These files are:

- Sensitive to encoding changes (BOM addition/removal changes SHA-1 hashes)
- Sensitive to line ending changes (`\n` vs `\r\n` changes byte offsets)
- Treated as "binary files masquerading as text" by developers
- Large and take up significant repository space

Currently, some test data like the Agilent `.d` directories contain XML files with UTF-8 BOMs that represent real instrument output and must not be modified. However, Git treats them as text files, so they are vulnerable to inadvertent modification during bulk operations (like BOM removal).
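The hash sensitivity described above can be illustrated with a short sketch (Python used here purely for illustration; the payload is a made-up minimal XML, not project data):

```python
import hashlib

# The same XML payload with and without a UTF-8 BOM prefix.
data = b'<?xml version="1.0"?><root/>'
with_bom = b'\xef\xbb\xbf' + data

# The BOM is three extra bytes at the front, so any content-addressed
# check (e.g. a SHA-1 hash) sees a completely different file.
print(hashlib.sha1(data).hexdigest())
print(hashlib.sha1(with_bom).hexdigest())
```

The two hexdigests differ, which is why adding or stripping a BOM breaks hash-based integrity checks even though the visible text is unchanged.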
## Current State

Test data is stored uncompressed in directories like:

- `pwiz/data/vendor_readers/Agilent/Reader_Agilent_Test.data/*.d/`
- `pwiz/data/vendor_readers/Waters/Reader_Waters_Test.data/*.raw/`
- Similar directories for other vendors

## Proposed Solution

Store vendor test data directories in compressed archives (.zip, .7z, or .gz):

1. **Compress test data**
   - One-time compression of existing `.d` and other vendor directories
   - Use .zip (widely supported) or .7z (better compression)
   - Expected compression: 80-90% size reduction for XML-heavy formats

2. **Update test harness**
   - Modify `VendorReaderTestHarness.cpp` to:
     - Detect if test data is compressed (check for .zip/.7z extension)
     - Extract to a temp directory before running the test
     - Run the test on the extracted data
     - Clean up the temp directory after the test
   - Handle extraction failures gracefully with clear error messages

3. **Update build system**
   - Ensure extraction tools are available in the CI environment
   - Add an extraction step if needed for build-time test data generation
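The detect/extract/run/clean-up flow in the test-harness step could look roughly like this. This is a Python sketch of the control flow only; the real harness is `VendorReaderTestHarness.cpp`, and the function name and zip-only handling here are illustrative assumptions:

```python
import shutil
import tempfile
import zipfile

def run_with_test_data(data_path, run_test):
    """Run a test against vendor data, extracting it first if compressed."""
    if not data_path.endswith('.zip'):
        # Uncompressed directory: run against it directly.
        return run_test(data_path)
    # Compressed archive: extract to a temp directory, run, always clean up.
    tmp_dir = tempfile.mkdtemp(prefix='vendor_test_')
    try:
        try:
            with zipfile.ZipFile(data_path) as archive:
                archive.extractall(tmp_dir)
        except zipfile.BadZipFile as e:
            # Fail with a clear message rather than a confusing downstream error.
            raise RuntimeError(f'Could not extract test data {data_path}: {e}')
        return run_test(tmp_dir)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

The `finally` block guarantees cleanup even when the test itself throws, which matters for CI machines that run many tests per session.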
## Benefits

- **Binary integrity**: Compressed archives are explicitly binary, preventing accidental text modifications
- **Repository size**: Significant space savings (XML compresses extremely well)

  > **Reviewer comment:** There are no large vendor test data XMLs. Only the reference mzMLs are large and I am against going back to zipping those unless there's a way to easily diff stuff within zips.

- **Consistency**: Matches how vendor reader DLLs are stored in `pwiz_aux/msrc/utility`

  > **Reviewer comment:** Vendor DLLs are stored encrypted. I would love to not have to do that because we could update individual DLLs instead of the entire archive. I suppose we could have encrypted each DLL separately, but that would have been pretty awkward.

- **Protection**: Prevents inadvertent BOM removal, line ending changes, or other text transformations
- **Git performance**: Fewer objects to track, faster clones/pulls

  > **Reviewer comment:** This is probably a fair point for the directory-based vendor data (not the mzMLs), because those can have tons of small files. But the biggest ones are already stored in tar.bz2 files. I see one Agilent .d, TOFsulfasMS4GHzDualMode+DADSpectra+UVSignal272-NoProfile.d, got committed outside the tarball probably because updating the tarball would mean a 9mb change when the new .d was only 1mb. So if we're going to go back to tarballing all vendor data (and I'm not convinced we should for formats like Bruker TDF with only 2-4 files per .d), it should be per raw file rather than per vendor. That's really the only infrastructure change that would be needed in this TODO, because the existing infrastructure is per vendor.
## Risks & Mitigation

| Risk | Mitigation |
|------|------------|
| Slower tests (extraction overhead) | Extract once per test run, reuse for multiple tests |
| Extraction failure in CI | Clear error messages, verify extraction tools in CI setup |
| Developer workflow disruption | Document how to add/update test data |
| Debugging difficulty | Keep some uncompressed data for quick inspection, or add extraction utility |
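The "extract once per test run, reuse" mitigation is essentially a per-archive cache. A minimal sketch of that idea (Python for illustration; the names and zip-only support are assumptions, not the project's API):

```python
import shutil
import tempfile
import zipfile

# Cache of archive path -> extracted directory, so each archive is
# unpacked at most once per run even if several tests use it.
_extracted = {}

def get_extracted_dir(archive_path):
    """Return a directory holding the archive's contents, extracting lazily."""
    if archive_path not in _extracted:
        tmp_dir = tempfile.mkdtemp(prefix='vendor_cache_')
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(tmp_dir)
        _extracted[archive_path] = tmp_dir
    return _extracted[archive_path]

def cleanup_extracted():
    """Remove all cached extractions at the end of the run."""
    for tmp_dir in _extracted.values():
        shutil.rmtree(tmp_dir, ignore_errors=True)
    _extracted.clear()
```

Repeated calls for the same archive return the same directory, so only the first test in a run pays the extraction cost.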
## Implementation Plan

### Phase 1: Proof of Concept (4 hours)

1. Compress one `.d` directory (e.g., TOFsulfasMS4GHzDualMode test data)
2. Update `VendorReaderTestHarness.cpp` to handle extraction for that one test
3. Verify the test passes with compressed data
4. Measure compression ratio and test time impact

### Phase 2: Full Implementation (1 day)

1. Compress all vendor test data directories
2. Update all vendor reader tests
3. Update documentation for adding new test data
4. Verify all tests pass on a local machine

### Phase 3: CI Integration (2-4 hours)

1. Ensure CI environments have extraction tools
2. Run the full test suite on CI
3. Monitor for any extraction-related failures
4. Document any CI-specific requirements
## Related

- Similar to how vendor reader DLLs are stored in `pwiz_aux/msrc/utility`
- Related to BOM standardization work (TODO-20251019_utf8_no_bom.md)
- May want to exclude compressed archives from certain Git operations

## Future Considerations

- Could use this approach for other large test data
- Could add a utility script to easily extract/re-compress for debugging
- Could use Git LFS for very large compressed archives if needed
> **Reviewer comment:** We used to zip (tarball) all vendor test data and their reference mzMLs, and I stopped doing that because it was annoying to update, maintain, and diff. There is already architecture in place to use .tar.bz2 files for this. A lot of the old test data is still maintained in tar.bz2 files, and only the updated reference mzMLs are submitted plaintext. I would definitely want to keep doing that for reference mzMLs. The vendor data itself probably would be ok to put back in tarballs, but I'm not sure it's really necessary or useful except for "modify everything in the repository" scripts like you've done here, which are exceedingly unusual. For git itself, there's a gitattributes file that tells it to treat everything in vendor_readers as binary:
> `pwiz\data\vendor_readers.gitattributes`