Skip to content

Shouldn't API client pass stream = True to the requests when downloading datasets? #754

@i-aki-y

Description

@i-aki-y

Since the requests package seem to exhaust system RAM as default behavior, I think some api should pass stream = True that allows chunked download. Current implementation hardcode as stream=None (equivalent to False) and this can make the user's system unstable when downloading large datasets.

settings = self._session.merge_environment_settings(http_request.url, {}, None, None, None)

The download_file method in KaggleApi class tries to support chunked downloads but I am not sure this code works as expected because the downloading would be considered complete at this point.

for data in response.iter_content(chunk_size):

And I think the current usage of the kaggle.http_client() outside of the with self.build_kaggle_client() as kaggle: statement is not recommended because resource managed by kaggle object might be closed outside the with statement.

with self.build_kaggle_client() as kaggle:
  ...

download_file(..., kaggle.http_client(), ...)

ex.

with self.build_kaggle_client() as kaggle:
request = ApiDownloadDataFileRequest()
request.competition_name = competition
request.file_name = file_name
response = kaggle.competitions.competition_api_client.download_data_file(request)
url = response.history[0].url
outfile = os.path.join(effective_path, url.split('?')[0].split('/')[-1])
if force or self.download_needed(response, outfile, quiet):
self.download_file(response, outfile, kaggle.http_client(), quiet, not force)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions