Replies: 1 comment
Hey @yliu2702! I'm here to help you out with the issue you're facing. I'm an expert bot that can assist you with bug fixes, answer your questions, and even guide you on how to become a contributor. Let's work together to solve this problem!

To avoid the MemoryError, load the files in batches and use lazy_load so that each document is read one at a time. Here is an example of how to use TextLoader this way:

import os
from tqdm import tqdm
from langchain.document_loaders import TextLoader

def load_texts_from_folder(output_folder, batch_size=10):
    all_docs = []
    # Iterate over the subfolders inside output_folder
    for file_name in tqdm(os.listdir(output_folder), desc="Folders"):
        folder_path = os.path.join(output_folder, file_name)
        if os.path.isdir(folder_path):
            txt_files = sorted([f for f in os.listdir(folder_path) if f.endswith('.txt')])
            # Load the txt files in batches of batch_size
            for i in range(0, len(txt_files), batch_size):
                batch_files = txt_files[i:i + batch_size]
                pbar = tqdm(batch_files, desc=f"Loading files in {file_name}", leave=False)
                for txt_file in pbar:
                    txt_file_path = os.path.join(folder_path, txt_file)
                    try:
                        loader = TextLoader(txt_file_path)
                        docs_lazy = loader.lazy_load()
                        for doc in docs_lazy:
                            all_docs.append(doc)
                            # Process each document as needed
                        pbar.set_postfix(current_file=txt_file)
                    except MemoryError:
                        # Return whatever was loaded before memory ran out
                        print(f"Memory Error occurred while loading: {txt_file_path}")
                        print(f"Documents loaded so far: {len(all_docs)}")
                        return all_docs
    return all_docs

# Usage example
output_folder = '...'
all_docs = load_texts_from_folder(output_folder, batch_size=10)
print(f"Total documents loaded: {len(all_docs)}")

By using lazy_load together with batching, the documents are read one at a time instead of all at once, which should lower peak memory usage.
Checked other resources
Commit to Help
Example Code
Description
I have processed a folder of txt files; the path is temp_storage/file_1/chunked_n.json or temp_storage/file_1/chunked_n.txt. Each file is around 3-4 MB, each subfolder contains fewer than 300 files, and I tried two subfolders under temp_storage. I tried to load a folder of JSON files or a folder of txt files using JSONLoader or TextLoader, but I got a MemoryError in both cases. The loading process may fail after loading 200 or even 120 files. I don't know how to resolve this MemoryError. I even tried to load the docs in batches instead of loading them all at once, but I got the same error.
Can anyone help me with this? Maybe there are parameters in a LangChain function, or suggestions about document processing, that would help. By the way, I tried lazy_load as well, but it also fails.
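For reference, a minimal sketch of the JSON-loading attempt described above. The file path mirrors the layout mentioned, but the jq_schema is an assumption that depends on how the chunked JSON files are structured, and JSONLoader also needs the jq package installed.

from langchain.document_loaders import JSONLoader

# Hypothetical reconstruction of the JSON loading attempt; adjust jq_schema
# to match the actual structure of the chunked_n.json files.
loader = JSONLoader(
    file_path="temp_storage/file_1/chunked_n.json",
    jq_schema=".",        # assumption: extract the whole JSON object per file
    text_content=False,   # allow non-string content extracted by jq
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")

Note that whether lazy_load actually streams documents depends on the individual loader and the installed LangChain version, which may be why it did not help here.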
System Info
pip install langchain
from langchain.document_loaders import JSONLoader, TextLoader