Hi, a few days ago we encountered the following exception when pulling DVC-tracked data with the `dvc get` command. We were pulling about 800 files of ~4 GB each. The issue may be related to this comment: https://github.com/iterative/dvc/issues/9070#issuecomment-1443852816

```
Traceback (most recent call last):
  File "reproduce_pull.py", line 50, in <module>
    copy(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/generic.py", line 93, in copy
    return _get(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/generic.py", line 241, in _get
    raise result
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/executors.py", line 134, in batch_coros
    result = fut.result()
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/generic.py", line 220, in _get_one_coro
    return await get_coro(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 1224, in _get_file
    body, content_length = await _open_file(range=0)
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 1215, in _open_file
    resp = await self._call_s3(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 348, in _call_s3
    return await _error_wrapper(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 140, in _error_wrapper
    raise err
PermissionError: The difference between the request time and the server's time is too large.
```

We found that the command fails semi-deterministically after about 20 minutes of pulling. We also found that setting the `--jobs` parameter to 10 instead of the default (4 × `cpu_count`) works around the problem.

```python
>>> import os
>>> os.cpu_count()
56
```

I did a little digging and may have found the root cause of the problem:

- In this [SO comment](https://stackoverflow.com/a/40225757) someone mentions that this error can sometimes be caused by asynchronous calls: the request is created much earlier than it is executed, which produces a large difference between the request time and the S3 server time.
- According to this [SO comment](https://stackoverflow.com/a/26862395), the maximum allowed difference between request time and server time is 15 minutes for S3.
- The `dvc-data` package uses `s3fs`, which calls the async `_get_file` (https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L1204C15-L1204C24) during the pull.
- `_get_file` is called asynchronously as many times as the `--jobs` parameter specifies.
- The `body, content_length = await _open_file(range=0)` call inside `_get_file` (https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L1204C15-L1204C24) has to wait once 10 calls are in flight, because aiobotocore by default provides only 10 connections: https://github.com/aio-libs/aiobotocore/issues/651

=> That means that for a large number of jobs (hundreds), a job can wait for a free connection for more than 15 minutes, so by the time its request reaches S3 the request timestamp differs from the server time by more than 15 minutes, which produces the "The difference between the request time and the server's time is too large." error (a minimal timing sketch of this queuing effect follows below).
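To make the queuing effect concrete, here is a minimal, self-contained asyncio sketch (not DVC or s3fs code): a semaphore of 10 stands in for aiobotocore's default connection pool, and each simulated transfer takes a fixed amount of time. The constants `POOL_SIZE`, `N_JOBS` and `TRANSFER_TIME` are made up for the demo; the point is only that the last jobs start running long after they were scheduled, and scaled up to multi-gigabyte transfers that wait can exceed the 15-minute limit.

```python
import asyncio
import time

POOL_SIZE = 10        # aiobotocore's default connection pool size
N_JOBS = 224          # 4 * cpu_count on the machine above (56 CPUs)
TRANSFER_TIME = 1.0   # seconds per simulated download, scaled down for the demo


async def fake_get(sem: asyncio.Semaphore, scheduled_at: float) -> float:
    # The coroutine is scheduled immediately, but has to wait for a pool slot.
    async with sem:
        waited = time.monotonic() - scheduled_at
        await asyncio.sleep(TRANSFER_TIME)  # simulate the actual transfer
        return waited


async def main() -> None:
    sem = asyncio.Semaphore(POOL_SIZE)
    start = time.monotonic()
    waits = await asyncio.gather(*(fake_get(sem, start) for _ in range(N_JOBS)))
    # With 224 jobs and 10 slots the last jobs wait ~22x one transfer time;
    # with real multi-gigabyte downloads that wait can pass 15 minutes.
    print(f"longest wait for a free slot: {max(waits):.1f}s")


asyncio.run(main())
```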
# Reproduction

I was not able to reproduce the issue by simulating the download process, so I needed to pull an actual file from S3. The following code creates an S3 bucket, uploads a file, and tries to copy it from S3 using `dvc-data` with 1000 parallel jobs. If some of the 1000 jobs do not finish within 15 minutes, the request will probably fail. In our case it failed after 17 minutes.

```python
# %%
import logging
from pathlib import Path
from tempfile import TemporaryDirectory

from s3fs import S3FileSystem

from dvc.fs import FileSystem
from dvc.fs.callbacks import Callback
from dvc_objects.fs.generic import copy
from dvc_objects.fs.local import LocalFileSystem

logging.basicConfig(level=logging.INFO)

s3_fs = S3FileSystem(endpoint_url="your-endpoint-url")
dvc_local_fs = LocalFileSystem()

bucket_name = "test-bucket"
no_requests = 1000  # 1000 pull requests in total
no_chars = 1_000_000_000  # ~1.9 GB file
no_jobs = 1000  # 1000 async jobs scheduled to reach the 15 minute diff limit

# %%
if s3_fs.exists(bucket_name):
    logging.info(f"Deleting {bucket_name}")
    s3_fs.rm(bucket_name, recursive=True)

logging.info("Creating bucket")
s3_fs.mkdir(bucket_name)

with TemporaryDirectory() as dir:
    file = Path(dir, "test-file")
    file.write_text(" ".join(["a"] * no_chars))
    with Callback.as_tqdm_callback(
        unit="file",
        desc="Put",
    ) as cb:
        s3_fs.put(dir, bucket_name, recursive=True, callback=cb)
    logging.info(f"File uploaded to {bucket_name}")

# %%
with TemporaryDirectory(dir=".") as dir:
    logging.info("Loading files from s3")
    dvc_s3_fs = FileSystem(s3_fs)
    bucket_dir = s3_fs.ls(bucket_name)[0]
    file = f"{bucket_dir}/test-file"
    src_files = [file] * no_requests
    dst_files = [str(Path(dir, str(idx))) for idx in range(no_requests)]
    with Callback.as_tqdm_callback(
        unit="file", desc="pull", size=len(dst_files)
    ) as cb:
        copy(
            dvc_s3_fs,
            src_files,
            dvc_local_fs,
            dst_files,
            callback=cb,
            batch_size=no_jobs,
        )

# %%
```

# DVC doctor

```
╰─❯ dvc doctor
DVC version: 2.55.0
-------------------
Platform: Python 3.8.10 on Linux-5.15.0-78-generic-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 0.47.5
        dvc_objects = 0.25.0
        dvc_render = 0.3.1
        dvc_task = 0.3.0
        scmrepo = 1.3.1
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.9.2, boto3 = 1.28.17)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: https
Workspace directory: xfs on /dev/mapper/data-srv
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/77821165453cef8903abded466242330
```
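Besides lowering `--jobs`, a possible mitigation I have not verified end to end would be to raise aiobotocore's connection pool above the default 10 so that the scheduled `_get_file` coroutines are not serialized behind 10 connections. `s3fs` forwards `config_kwargs` to botocore's `Config`, where `max_pool_connections` is the relevant option; the value of 64 below is an arbitrary example, and whether this actually avoids the 15-minute gap depends on available bandwidth, since more parallel transfers also make each individual transfer slower.

```python
from s3fs import S3FileSystem

# Untested idea: enlarge aiobotocore's connection pool for the reproduction
# script above so queued downloads spend less time waiting for a free slot.
s3_fs = S3FileSystem(
    endpoint_url="your-endpoint-url",
    config_kwargs={"max_pool_connections": 64},  # botocore default is 10
)
```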