-
Notifications
You must be signed in to change notification settings - Fork 1
Open
Description
We have a reproducible test case from a collaborator trying to use getm within Terra for TCGA data - this happens transiently, but can be reproduced when trying to download at scale from TCGA buckets. I am not sure why it happens as there are simply Google signed urls, but getm doesn't handle the error gracefully, leaving a process open and causing the WDL task to hang indefinitely (costing $ for the user).
In order to help resolve this I wanted to check if the following could be updated in getm:
- when an error like this occurs, have getm exit properly rather than leave the process open
- see if we can add some sort of retries for errors like this - not sure if that needs to be a user provided flag (like up to 3 retries) or if this can just be made to retry within reasonable default limits, but retry would be preferred over fully restarting a WDL
- ideally: add timestamps to the beginning of all of the verbose logging, which helps a lot in the debugging of issues like this across systems
Thank you!
Here is output from getm running in very verbose logging mode which shows the error:
DEBUG:getm.cli:{"progress_class": "ProgressLogger"}
DEBUG:getm.cli:{"message": "validating manifest", "manifest": [{"url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/a03305a8-902d-41a2-8791-08c0dcaed0f7/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240129T200441Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=4c006e811411f156a76...682b", "filepath": "/cromwell_root", "checksum": "92cf29e2761443eb247354da91bdc9f7", "checksum-algorithm": "md5"}, {"url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/c05728b2-70bd-4a9b-a6d6-abc6127f9b5a/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240129T200443Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=4eb6a3cc2bbbb14a428ff80a8c74e3f...bd8", "filepath": "/cromwell_root", "checksum": "1b0e4d4571f949df5e266218a1af2ad4", "checksum-algorithm": "md5"}], "schema": {"type": "array", "items": {"type": "object", "properties": {"url": {"type": "string"}, "filepath": {"type": "string"}, "checksum": {"type": "string"}, "checksum-algorithm": {"type": "string", "enum": ["md5", "gs_crc32c", "s3_etag", "null"]}}, "required": ["url"], "dependencies": {"checksum": ["checksum-algorithm"], "checksum-algorithm": ["checksum"]}}}}
DEBUG:getm.cli:{"multipart_buffer_size": 104857600}
INFO:getm.cli:{"message": "initiating multipart download", "url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/a03305a8-902d-41a2-8791-08c0dcaed0f7/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fs...357dd225636cde32eef08510e01144c226b2ad682b", "expected_checksum": "92cf29e2761443eb247354da91bdc9f7", "checksum_algorithm": "md5"}
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 104.4MiB/s 2.453076s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 102.0MiB/s 5.017502s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 109.7MiB/s 7.001775s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 115.2MiB/s 8.891281s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 115.2MiB/s 11.107452s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 110.4MiB/s 13.914369s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 106.4MiB/s 16.840150s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 103.6MiB/s 19.760566s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 101.5MiB/s 22.693003s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 100.0MiB/s 25.596672s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 98.7MiB/s 28.521763s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 97.7MiB/s 31.447519s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 96.9MiB/s 34.359779s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 95.2MiB/s 37.633454s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 90.8MiB/s 42.295202s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 88.2MiB/s 46.439717s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 89.7MiB/s 48.525774s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 91.2MiB/s 50.532638s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.5MiB/s 52.563134s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 93.3MiB/s 54.882982s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 93.1MiB/s 57.742125s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.9MiB/s 60.629910s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.6MiB/s 63.552936s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.3MiB/s 66.548761s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 92.0MiB/s 69.555787s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.7MiB/s 72.567698s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.5MiB/s 75.553901s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.3MiB/s 78.505067s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.1MiB/s 81.459932s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.1MiB/s 84.325711s
Process URLReaderKeepAlive-1:1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 567, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 525, in _fp_read
data = self._fp.read(chunk_amt)
File "/usr/lib/python3.9/http/client.py", line 458, in read
n = self.readinto(b)
File "/usr/lib/python3.9/http/client.py", line 502, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/getm/reader.py", line 190, in run
bytes_read = handle.readinto(buf[stop: stop + self.chunk_size])
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 737, in readinto
temp = self.read(len(b))
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 593, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 461, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/dist-packages/getm/reader.py", line 194, in run
buf.stop = stop
File "/usr/local/lib/python3.9/dist-packages/getm/concurrent/buffers.py", line 90, in __exit__
self.close()
File "/usr/local/lib/python3.9/dist-packages/getm/concurrent/buffers.py", line 82, in close
sm.close()
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
self._mmap.close()
BufferError: cannot close exported pointers exist
Exception ignored in: <function SharedMemory.__del__ at 0x7e5ca92ba550>
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
self.close()
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
self._mmap.close()
BufferError: cannot close exported pointers exist
Metadata
Metadata
Assignees
Labels
No labels