-
Notifications
You must be signed in to change notification settings - Fork 169
Adds FileReader
#1499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds FileReader
#1499
Conversation
torchdata/nodes/io/file_read.py
Outdated
import logging | ||
from typing import Any, Dict, Optional | ||
|
||
from smart_open import open |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this open
clash with Python's in-built open
for some other operations?
Is it safer to use it as smart_open.open
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
except Exception as e: | ||
logger.error(f"Error reading {file_path}: {e}") | ||
raise | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add a shutdown method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar question with #1498 - what is needed to be implemented there?
torchdata/nodes/io/file_read.py
Outdated
encoding=None if "b" in self.mode else self.encoding, | ||
transport_params=self.transport_params, | ||
) as f: | ||
content = f.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is this use case for large files
- Does this line load all the contents at once
if yes for 1. and No for 2., then we might want to add a chunked reader or streamer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think that might be out of scope for this PR and we can work on that later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it includes large files. but i have a separate piece of code that does read text file line-by-line, without using this FileReader
. i.e. text streamer.
agree that we those would be great to have as a separate PR.
torchdata/nodes/io/file_read.py
Outdated
|
||
return {self.DATA_KEY: content, self.METADATA_KEY: metadata} | ||
|
||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a retry logic
what if there were some network hiccups, or some other retryable error
Is there a way to find out if the error was because of network (we can retry in that case) or some other factor (file not present, etc., so do not retry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, added in the new commit.
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1499
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit ec10460 with merge base b8eb16f ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
similar to #1498, but for file reading.