Replies: 1 comment
Hey @yliu2702! I'm here to help you out with the issue you're facing. I'm an expert bot that can assist you with bug fixes, answer your questions, and even guide you on how to become a contributor. Let's work together to solve this problem!

To avoid the MemoryError, load the files in batches and use lazy_load so that each document is read one at a time. Here is an example of how to use TextLoader this way:

import os
from tqdm import tqdm
from langchain.document_loaders import TextLoader

def load_texts_from_folder(output_folder, batch_size=10):
    all_docs = []
    # Iterate over the subfolders inside output_folder
    for file_name in tqdm(os.listdir(output_folder), desc="Folders"):
        folder_path = os.path.join(output_folder, file_name)
        if os.path.isdir(folder_path):
            txt_files = sorted([f for f in os.listdir(folder_path) if f.endswith('.txt')])
            # Load the txt files in batches of batch_size
            for i in range(0, len(txt_files), batch_size):
                batch_files = txt_files[i:i + batch_size]
                pbar = tqdm(batch_files, desc=f"Loading files in {file_name}", leave=False)
                for txt_file in pbar:
                    txt_file_path = os.path.join(folder_path, txt_file)
                    try:
                        loader = TextLoader(txt_file_path)
                        docs_lazy = loader.lazy_load()
                        for doc in docs_lazy:
                            all_docs.append(doc)
                            # Process each document as needed
                        pbar.set_postfix(current_file=txt_file)
                    except MemoryError:
                        # Return whatever was loaded before memory ran out
                        print(f"Memory Error occurred while loading: {txt_file_path}")
                        print(f"Documents loaded so far: {len(all_docs)}")
                        return all_docs
    return all_docs

# Usage example
output_folder = '...'
all_docs = load_texts_from_folder(output_folder, batch_size=10)
print(f"Total documents loaded: {len(all_docs)}")

By using lazy_load together with batching, the documents are read one at a time instead of all at once, which should lower peak memory usage.
Checked other resources
Commit to Help
Example Code
Description
I have processed a folder of txt files; the path is temp_storage/file_1/chunked_n.json or temp_storage/file_1/chunked_n.txt. Each file is around 3-4 MB, each subfolder contains fewer than 300 files, and I tried two subfolders under temp_storage. I tried to load a folder of JSON files or a folder of txt files using JSONLoader or TextLoader, but I got a MemoryError in both cases. The loading process may fail after loading 200 or even 120 files. I don't know how to resolve this MemoryError. I even tried to load the docs in batches instead of loading them all at once, but I got the same error.
Can anyone help me with this? Maybe there are parameters in a LangChain function, or suggestions about document processing, that would help. By the way, I tried lazy_load as well, but it also fails.
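For reference, a minimal sketch of the JSON-loading attempt described above. The file path mirrors the layout mentioned, but the jq_schema is an assumption that depends on how the chunked JSON files are structured, and JSONLoader also needs the jq package installed.

from langchain.document_loaders import JSONLoader

# Hypothetical reconstruction of the JSON loading attempt; adjust jq_schema
# to match the actual structure of the chunked_n.json files.
loader = JSONLoader(
    file_path="temp_storage/file_1/chunked_n.json",
    jq_schema=".",        # assumption: extract the whole JSON object per file
    text_content=False,   # allow non-string content extracted by jq
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")

Note that whether lazy_load actually streams documents depends on the individual loader and the installed LangChain version, which may be why it did not help here.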
System Info
pip install langchain
from langchain.document_loaders import JSONLoader, TextLoader