RecursiveUrlLoader question #18886

ArturPevko · 2024-03-11T00:14:49Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

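from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

# Crawl the root URL and the pages it links to (up to two levels deep),
# converting each page's raw HTML to plain text.
loader = RecursiveUrlLoader(
    url="https://sotkaonline.ru/",
    max_depth=2,
    extractor=lambda x: Soup(x, "html.parser").text,
)
document = loader.load()
print(document)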

Description

Good afternoon,
I have some code that should load the HTML pages of the website "https://sotkaonline.ru/", but it returns an empty document.
Could you please help me with this issue? I would appreciate any assistance you can provide.
Thank you for your time and attention.

System Info

pip install langchain

dosubot[bot] (bot) · 2024-03-11T00:15:07Z

Hey there @ArturPevko! Nice to see you diving into another challenge with LangChain. How's it going on your end?

Based on the context provided, there could be several reasons why the RecursiveUrlLoader is returning an empty document when trying to load the HTML page of the website "https://sotkaonline.ru/". Here are some potential causes and solutions:

HTTP Status Check: The loader has a condition to check the HTTP response status (check_response_status). If this is set to True and the website returns a status code in the range of 400-599, the loader will skip processing the URL. If "https://sotkaonline.ru/" is returning such an error status, it would lead to an empty document. You can check the HTTP response status of "https://sotkaonline.ru/" to see if this is the issue.
Content Extraction: The loader uses an extractor function to extract document contents from the raw HTML. If this function returns an empty string for the website's content, the document will be ignored. The default extractor is a lambda function that returns the input as is (lambda x: x), but if it's overridden and fails to properly extract content from "https://sotkaonline.ru/", it could result in an empty document. You can ensure that the extractor function works correctly for the site's HTML structure.
Prevent Outside URLs: The loader has a prevent_outside flag that, when set to True, prevents loading URLs that are not children of the root URL. If "https://sotkaonline.ru/" redirects to another domain or the links on the page lead to other domains, and prevent_outside is True, those links would be ignored. You can verify that prevent_outside is appropriately configured for your use case.
Async/Await with aiohttp: If use_async is True, the loader uses aiohttp for asynchronous HTTP requests. Timeouts or aiohttp-specific exceptions could cause the page to fail to load or process. The async path disables SSL verification (ssl=False), so certificate problems are unlikely to be the cause of an empty document. The example code above does not set use_async, so this point applies only if you enable it.
Timeouts: The loader has a timeout parameter for requests. If the website takes longer to respond than the specified timeout, the request will fail, potentially leading to an empty document. You can adjust the timeout value if necessary.
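The first two checks above (response status and extractor output) can be tried outside LangChain. The following is a minimal sketch using only the Python standard library; `fetch`, `extract_text`, and `diagnose` are hypothetical helper names, not part of LangChain, and the hint about JavaScript-rendered pages is an assumption about one common cause, not a confirmed diagnosis for this site.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.request import Request, urlopen


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Plain-text extractor, comparable in spirit to Soup(x, "html.parser").text."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def fetch(url, timeout=10):
    """Return (status_code, body); body is '' when the server answers 4xx/5xx."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.status, response.read().decode(charset, errors="replace")
    except HTTPError as err:
        return err.code, ""


def diagnose(status, html):
    """Map a fetched page to the most likely reason a loader would skip it."""
    if 400 <= status <= 599:
        return f"HTTP {status}: a loader that checks the response status would skip this page"
    if not extract_text(html):
        return "extractor returned no text: the page may be rendered by JavaScript"
    return "ok"


# Example (performs a real network request, so it is left commented out):
# status, html = fetch("https://sotkaonline.ru/")
# print(diagnose(status, html))
```

If `diagnose` reports "ok" for the root page but the loader still returns nothing, the remaining items in the list (prevent_outside, timeouts) are the next things to check.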

In addition, there are two similar issues in the LangChain repository that might be related to your problem:

UTF-8 content is not loaded correctly with RecursiveUrlLoader: This issue suggests setting the encoding of the response to 'utf-8' to ensure that the content is correctly loaded.
RecursiveUrlLoader doesn't include root URL content: This issue suggests a modification to the lazy_load method to fetch and process the root URL content, create a Document object for it, and include it in the list of Documents returned by the lazy_load method.

Please try these solutions and let me know if they help resolve your issue. If not, please provide more information about your setup and the exact error message you're seeing, and I'll be happy to assist further.

maximeperrindev · 2024-03-11T15:39:49Z

@ArturPevko
I tried to load your website recursively, but unfortunately its structure does not seem to be suitable for recursive loading.
I recommend using WebBaseLoader and specifying the URLs you want to load.

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(["https://sotkaonline.ru/", "https://sotkaonline.ru/historia"])
document = loader.load()
print(len(document))

I think that's what you need if you do not want to change the website structure.
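To make "structure not suitable for recursive loading" more concrete: a recursive crawler can only follow links it finds in the fetched HTML. The sketch below is an illustration using the standard library, not RecursiveUrlLoader's actual implementation; `child_links` is a hypothetical helper and the example.com URLs are placeholders. It lists the links under the root URL that a crawler could discover on one page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def child_links(html, page_url, root_url):
    """Return absolute links found in `html` that live under `root_url`."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    found = []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(root_url) and absolute != page_url and absolute not in found:
            found.append(absolute)
    return found


sample = (
    '<a href="/historia">History</a>'
    '<a href="https://other.example.org/x">External</a>'
    '<a href="#top">Top</a>'
    '<button onclick="go()">Navigation driven by JavaScript</button>'
)
links = child_links(sample, "https://example.com/", "https://example.com/")
print(links)
```

If a page's navigation is built by JavaScript or points outside the root URL, a list like this comes back short or empty, and there is nothing for a recursive loader to descend into. In that case, passing explicit URLs to WebBaseLoader as shown above sidesteps link discovery entirely.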

nrjvivek876 · 2024-03-17T02:32:51Z

I tried to crawl a complete website using RecursiveUrlLoader with max_depth=None, but it only crawls the URLs linked from the first page and never goes on to other pages. When I set max_depth to a number, it does crawl other pages. Can anyone tell me how to crawl the complete website?

1 reply

maximeperrindev Mar 18, 2024

@niraj876
RecursiveUrlLoader also relies on the website's structure. It crawls recursively from the first page to every link it contains.
As I said above, you can try loading with WebBaseLoader as a workaround if you cannot control how the website is designed.
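The effect of a depth limit on this kind of recursive crawl can be sketched over an in-memory link graph. This is an illustration only: `crawl` and `SITE` are hypothetical, and the depth convention (root at depth 0, stop once depth reaches max_depth) is an assumption about how the loader counts. In the langchain_community versions I am aware of, max_depth=None appears to fall back to the default of 2 rather than meaning "unlimited", which would match the behavior described above; please verify that against your installed version.

```python
def crawl(start, links, max_depth):
    """Depth-limited recursive traversal of a link graph given as a dict."""
    visited = []

    def visit(url, depth):
        # Stop once the depth limit is reached or the page was already seen.
        if depth >= max_depth or url in visited:
            return
        visited.append(url)
        for child in links.get(url, []):
            visit(child, depth + 1)

    visit(start, 0)
    return visited


# Hypothetical site: each key is a page, each value the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
}

for depth in (1, 2, 3, 4):
    print(depth, crawl("/", SITE, depth))
```

With a limit of 2, only the root and the pages it links to directly are visited; deeper pages appear only as the limit grows. Under the assumptions above, crawling a whole site would mean passing a sufficiently large explicit integer rather than None.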

imuhammadarsalan · 2025-02-20T15:04:51Z

It's not crawling the complete website.
