Skip to content

Commit fd441fa

Browse files
authored
xFail known bad tests on H100 and fix CVEs (#547)
Known issue on H100 (and GH200) with loading checkpoints. Also fixing CVE in ARM container
1 parent 0360d50 commit fd441fa

File tree

3 files changed

+6
-0
lines changed

3 files changed

+6
-0
lines changed

Dockerfile.arm

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -312,6 +312,7 @@ COPY --from=rust-env /usr/local/rustup /usr/local/rustup
312312

313313

314314
# RUN rm -rf /usr/local/cargo /usr/local/rustup
315+
RUN rm -rf /root/.cache/bazel
315316
RUN chmod 777 -R /workspace/bionemo2/
316317

317318
# Transformer engine attention defaults

docs/docs/user-guide/appendix/releasenotes-fw.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,8 @@
2121
* Moved inference script to a new executable `infer_esm2`, and deprecated the inference example in the fine-tuning tutorial.
2222
* Added new Jupyter notebook tutorials for inference and zero-shot protein design. These notebooks can be deployed on the cloud resources as a [brev.dev](https://www.brev.dev/) launchable.
2323

24+
### Known Issues:
25+
* Loading a checkpoint for Geneformer inference on H100 has a known regression in accuracy. Work is in progress to resolve by next release.
2426

2527
## BioNeMo Framework v2.1
2628

sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -260,6 +260,9 @@ def __getitem__(self, idx):
260260
return {"text": self.input_ids[idx], "attention_mask": self.mask[idx]}
261261

262262

263+
@pytest.mark.xfail(
264+
reason="Known issue on H100 GPUs"
265+
)
263266
def test_geneformer_nemo1_v_nemo2_inference_golden_values(
264267
geneformer_config: GeneformerConfig, cells: List[List[str]], seed: int = 42
265268
):

0 commit comments

Comments
 (0)