Consider changing derecho testing to use the shared develop queue #105

@billsacks

Our derecho testing currently uses dedicated 128-core nodes for each test run. This is overkill for our needs: 8 cores should be sufficient.

I tried switching to 8 cores on the shared develop-queue nodes with this diff:

diff --git a/config/derecho.yaml b/config/derecho.yaml
index 7fc0cf2..038533c 100644
--- a/config/derecho.yaml
+++ b/config/derecho.yaml
@@ -1,11 +1,12 @@
 machine:
   name: derecho
- cores_per_node: 128
+ # Use just 8 cores on the develop queue for less resource wastage (the develop queue allows shared nodes) and potentially less queue wait time
+ cores_per_node: 8
  head_node_name: derecho6
  scheduler:
  type: pbs
  account: p93300606
- queue: main
+ queue: develop
  partition: None
 
 test:
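
For reference, here is a rough sketch of the kind of PBS resource request these settings correspond to. The `pbs_directives` helper is hypothetical and is not how the test scripts build their job files (that code is not shown here); the walltime default is an arbitrary placeholder. Only the account, queue, and core count come from the diff above.

```python
# Hypothetical helper, for illustration only: it is NOT part of the test
# scripts. It shows the PBS header a single-chunk, 8-core job on the
# develop queue would typically carry.

def pbs_directives(account, queue, cores, walltime="01:00:00"):
    """Return PBS header lines for a one-chunk job using `cores` CPUs.

    The walltime default is an assumed placeholder, not a value from the
    actual configuration.
    """
    return [
        f"#PBS -A {account}",
        f"#PBS -q {queue}",
        # One chunk with `cores` CPUs and one MPI rank per CPU.
        f"#PBS -l select=1:ncpus={cores}:mpiprocs={cores}",
        f"#PBS -l walltime={walltime}",
    ]


if __name__ == "__main__":
    # Values taken from the diff: account p93300606, queue develop, 8 cores.
    for line in pbs_directives("p93300606", "develop", 8):
        print(line)
```

Requesting `ncpus=8` rather than a full node is what would let the job share a develop node with other jobs.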

This change led to a few test failures:

gfortran-pio2.5.10-O: failure in ESMF_ArrayRedistPerfUTest.F90: slightly exceeds tolerance (1.64e-2 > 1e-2)

gfortran-pio2.5.10-g:

CRASHED: mpi/g: src/Superstructure/Component/tests/ESMF_CompTunnelUTest.F90
   FAIL: mpi/g:   Check time delay on timeout ServiceLoop Actual Component E, ESMF_CompTunnelUTest.F90, line 1381:  Incorrect time delay
   FAIL: mpi/g:   ServiceLoop for the Actual Component E - timeoutFlag, ESMF_CompTunnelUTest.F90, line 1407:  Did not return ESMF_SUCCESS
   FAIL: mpi/g:   Testing timeoutFlag, ESMF_CompTunnelUTest.F90, line 1414:  timeoutFlag wrong

In addition, there were several failures to report results or commit them to the artifacts repo, but at least some of these may have been caused by problems with the derecho qstat command (see #104).

I'd like to try this again to see if we can get testing to work reliably from the develop queue, since this should greatly decrease the cost of our nightly tests.
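
As a back-of-the-envelope estimate of the potential savings, the sketch below compares the cores charged per test run. It assumes that cost scales with cores charged times wall-clock hours and that a run takes the same wall-clock time on 8 cores as it does now; neither assumption has been checked against the actual derecho charging rules or test timings. The function names are made up for this example.

```python
# Rough estimate only. Assumptions (not verified):
#   - cost is proportional to (cores charged) x (wall-clock hours)
#   - a test run takes the same wall-clock time in both setups

def core_hours(cores_charged, wall_hours):
    """Core-hours consumed by one run."""
    return cores_charged * wall_hours


def cost_ratio(old_cores, new_cores, wall_hours=1.0):
    """How many times cheaper the new setup is under the assumptions above."""
    return core_hours(old_cores, wall_hours) / core_hours(new_cores, wall_hours)


if __name__ == "__main__":
    # 128 cores (dedicated node) versus 8 cores (shared develop node).
    print(cost_ratio(128, 8))  # 16.0
```

If runs take longer on 8 cores, the real ratio would be smaller than this estimate.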
