Skip to content

Conversation

keunwoochoi
Copy link
Contributor

similar to #1498, but for file reading.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 21, 2025
import logging
from typing import Any, Dict, Optional

from smart_open import open
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this open clash with Python's in-built open for some other operations?
Is it safer to use it as smart_open.open?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

except Exception as e:
logger.error(f"Error reading {file_path}: {e}")
raise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add a shutdown method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar question with #1498 - what is needed to be implemented there?

encoding=None if "b" in self.mode else self.encoding,
transport_params=self.transport_params,
) as f:
content = f.read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is this use case for large files
  2. Does this line load all the contents at once

if yes for 1. and No for 2., then we might want to add a chunked reader or streamer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think that might be out of scope for this PR and we can work on that later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it includes large files. but i have a separate piece of code that does read text file line-by-line, without using this FileReader. i.e. text streamer.

agree that we those would be great to have as a separate PR.


return {self.DATA_KEY: content, self.METADATA_KEY: metadata}

except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a retry logic
what if there were some network hiccups, or some other retryable error
Is there a way to find out if the error was because of network (we can retry in that case) or some other factor (file not present, etc., so do not retry)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, added in the new commit.

Copy link

pytorch-bot bot commented Aug 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1499

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ec10460 with merge base b8eb16f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@keunwoochoi keunwoochoi changed the title draft: Adds FileReader draft: Adds FileReader Aug 31, 2025
@keunwoochoi keunwoochoi changed the title draft: Adds FileReader Adds FileReader Aug 31, 2025
@ramanishsingh ramanishsingh merged commit 9295079 into meta-pytorch:main Sep 9, 2025
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants