-
Example Code

from fastapi import FastAPI
from fastapi import UploadFile, File
from langchain_community.document_loaders import CSVLoader

app = FastAPI()

@app.post("/")
async def example(
    file: UploadFile = File(None),
):
    print(type(file))

    from pprint import pprint as pp
    print(file)
    pp(vars(file))
    pp(dir(file))

    loader = CSVLoader(file)
    # Raises an error: CSVLoader expects a file path string, not an UploadFile
    data = loader.load()
    return {}

# run with uvicorn main:app --reload
# http://127.0.0.1:8000
# create a new file `csv.csv`, and upload it via the file picker

Description

Due to requirements in the project, our implementation of LangChain splits the Front-End and Back-End into separate applications. The Front-End makes requests via an AWS API Gateway -> Lambda, while the Back-End handles storage, logging, and everything LangChain via a Docker image -> AWS Lambda. The Front-End is able to provide a file, but because of limitations on AWS Lambda, this file isn't stored locally (with a path); instead it lives in memory as a starlette.datastructures.UploadFile.
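For reference, that UploadFile is essentially a thin wrapper around a SpooledTemporaryFile (kept in memory up to a size threshold), so there is no path on disk to point a loader at. A quick illustration, inside the endpoint above:

    contents = await file.read()  # async read of the raw bytes
    raw = file.file               # the underlying SpooledTemporaryFile (a file-like object, not a path)
    print(type(raw))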
From what I can tell, most document loaders only take a string for the file path, which results in an error on load.
Any idea on how to use this file directly instead of providing a path to the loader? Please note: CSV is just an example; I eventually hope to support many more complex file types.

System Info

System information (running my Docker image locally).
-
🤖 Hey there @pastram-i! 🎉 Good to see you back with another intriguing issue. Let's dive into this one!

Based on the information you've provided, it seems that the current implementation of LangChain's document loaders only supports loading files from the filesystem. However, you can create a custom BlobLoader to handle in-memory files such as the starlette UploadFile. Here's an example of how you might create a new BlobLoader for in-memory files:

from langchain_community.document_loaders.blob_loaders import Blob, BlobLoader
from starlette.datastructures import UploadFile

class InMemoryBlobLoader(BlobLoader):
    def __init__(self, upload_file: UploadFile):
        self.upload_file = upload_file

    def yield_blobs(self):
        # Read the raw bytes from the underlying spooled file (synchronously)
        # and wrap them in a Blob so a parser can consume them.
        yield Blob.from_data(
            self.upload_file.file.read(),
            path=self.upload_file.filename,
        )

This loader reads the uploaded file's bytes and yields them as a single Blob. Then, you can use this new BlobLoader in the GenericLoader:

from langchain_community.document_loaders import GenericLoader
from langchain_community.document_loaders.parsers.pdf import PyPDFParser

# Assuming `upload_file` is an instance of `starlette.datastructures.UploadFile`
blob_loader = InMemoryBlobLoader(upload_file)
blob_parser = PyPDFParser()  # or any other parser you want to use
loader = GenericLoader(blob_loader, blob_parser)
docs = loader.lazy_load()
next(docs)

This code creates a new InMemoryBlobLoader around the uploaded file and feeds it to a GenericLoader, which parses the blob into Document objects. Please note that this is a workaround and might not be the most efficient solution, especially for large files, as it loads the entire file into memory. If you're dealing with large files, you might want to consider other solutions, such as streaming the file or storing it temporarily on disk.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.
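As a rough sketch of the "store it temporarily on disk" option mentioned above (assuming the Lambda /tmp scratch space is available; the helper name is hypothetical), the upload can be spooled to a real file so that path-based loaders such as CSVLoader keep working:

import tempfile
from fastapi import UploadFile
from langchain_community.document_loaders import CSVLoader

async def load_via_tempfile(file: UploadFile):
    # Copy the in-memory upload to a temp file on disk (on AWS Lambda only
    # /tmp is writable), then hand its path to the path-based loader.
    with tempfile.NamedTemporaryFile(dir="/tmp", suffix=".csv") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        loader = CSVLoader(tmp.name)  # CSVLoader expects a file path string
        return loader.load()

The documents are loaded before the with-block exits, so the temporary file is cleaned up immediately afterwards.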
-
Hey @pastram-i! Have you found a solution for this? I'm currently facing the same issue.
-
Hey, I wanted to leave my 2 cents here as well, maybe it will be useful for someone. You don't need to save the file temporarily: since loaders and parsers are separated, you can create a custom BlobLoader that reads the UploadFile directly and feed its blobs to a parser.

import os
from typing import Iterable, Iterator, Union

from fastapi import FastAPI, HTTPException, UploadFile, status
from langchain_community.document_loaders.blob_loaders import BlobLoader
from langchain_community.document_loaders.parsers.txt import TextParser
from langchain_core.document_loaders import BaseLoader, Blob
from langchain_core.documents import Document


class UploadFileBlobLoader(BlobLoader):
    data: Union[bytes, str]
    metadata: dict[str, str]

    def __init__(
        self,
        *,
        file: UploadFile,
    ) -> None:
        # Read the upload's bytes synchronously via the underlying file object
        # and keep the original filename as the document source.
        self.data = file.file.read()
        self.metadata = {}
        if file.filename is not None:
            self.metadata["source"] = file.filename

    def yield_blobs(self) -> Iterable[Blob]:
        yield Blob.from_data(data=self.data, metadata=self.metadata)


class TextDocumentComposer(BaseLoader):
    def __init__(
        self,
        *,
        file: UploadFile,
    ) -> None:
        self.loader = UploadFileBlobLoader(file=file)
        self.parser = TextParser()

    def lazy_load(self) -> Iterator[Document]:
        # Turn each blob from the loader into Document objects via the parser.
        for blob in self.loader.yield_blobs():
            for doc in self.parser.parse(blob=blob):
                yield doc


app = FastAPI()


@app.post("/")
async def upload_file(
    file: UploadFile,
) -> None:
    file_extension = os.path.splitext(file.filename or "")[1].lower()
    match file_extension:
        case ".txt":
            composer = TextDocumentComposer(file=file)
        case _:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail="Unsupported file format",
            )
    docs = composer.load()
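The same pattern extends to other formats by swapping the parser. As an illustrative, untested sketch for PDFs (the PdfDocumentComposer name is hypothetical), reusing the UploadFileBlobLoader above with LangChain's PyPDFParser:

from langchain_community.document_loaders.parsers.pdf import PyPDFParser

class PdfDocumentComposer(BaseLoader):
    def __init__(self, *, file: UploadFile) -> None:
        self.loader = UploadFileBlobLoader(file=file)
        self.parser = PyPDFParser()  # requires the pypdf package

    def lazy_load(self) -> Iterator[Document]:
        # Parse each blob produced by the loader into Document objects.
        for blob in self.loader.yield_blobs():
            yield from self.parser.parse(blob=blob)

A matching case ".pdf": composer = PdfDocumentComposer(file=file) branch in the endpoint would then route PDF uploads through this composer.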
@matisidler - I've settled on a bit of a workaround instead of a solution. I upload to S3 using boto3, then use the S3FileLoader. I've just settled on this solution within the last week, so I haven't tested it thoroughly, but AFAIK it should be file-type agnostic.
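For anyone wanting to try the same route, here is a minimal sketch, assuming boto3 credentials are configured and using an illustrative bucket name and helper function:

import boto3
from fastapi import UploadFile
from langchain_community.document_loaders import S3FileLoader

async def load_via_s3(file: UploadFile, bucket: str = "my-upload-bucket"):
    # Stream the in-memory upload straight to S3 (no local path needed),
    # then let S3FileLoader fetch it back and parse it.
    s3 = boto3.client("s3")
    s3.upload_fileobj(file.file, bucket, file.filename)
    loader = S3FileLoader(bucket, file.filename)
    return loader.load()

S3FileLoader delegates parsing to the unstructured package under the hood, which is what makes this approach largely file-type agnostic.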