Recrawl problem in IndexReIndexMonitor_p #687

@henschi

Description

When I use the “Re-Crawl Index Documents” option of IndexReIndexMonitor_p to re-crawl very old documents (fresh_date_dt:[2000-01-01T00:00:00.000Z TO 2010-01-01T00:00:00.000Z]), the same documents still appear in the list of simulated documents when I search again.
I found this error in the log:

I 2025/03/25 18:56:07 SWITCHBOARD * Not Condensed Resource 'http://www.soitu.es/soitu/portada.html': denied, canonical != source; canonical = http://www.soitu.es/; source = http://www.soitu.es/soitu/portada.html

However, if I use the “re-crawl url” option on a single entry in the list of simulated documents

localhost:8090/solr/select?core=collection1&wt=html&start=0&rows=10&q=fresh_date_dt%3A[2000-01-01T00%3A00%3A00.000Z+TO+2010-01-01T00%3A00%3A00.000Z]+AND+(httpstatus_i%3A200)

then that document is removed from the list.
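
For anyone who wants to reproduce the list, the select URL above can also be built programmatically. The following is only a minimal sketch: the class and method names are made up for illustration, and the `http://` scheme in the usage example is an assumption (the URL in this report has none). The host, port, core and field names are taken from the URL above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of YaCy: builds a Solr select URL
// with the same parameters as the one shown in this report.
class SolrQueryUrlSketch {

    static String selectUrl(String base, String core, String query, int rows) {
        // URLEncoder produces form encoding: spaces become '+',
        // ':' becomes %3A, '[' and ']' become %5B and %5D.
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return base + "/solr/select?core=" + core
                + "&wt=html&start=0&rows=" + rows + "&q=" + q;
    }

    public static void main(String[] args) {
        String query = "fresh_date_dt:[2000-01-01T00:00:00.000Z TO 2010-01-01T00:00:00.000Z]"
                + " AND (httpstatus_i:200)";
        // Base URL is an assumption; adjust to the local YaCy instance.
        System.out.println(selectUrl("http://localhost:8090", "collection1", query, 10));
    }
}
```

Unlike the URL above, this sketch also percent-encodes the brackets and parentheses. Both forms should describe the same query.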

I found a small difference in the source.
QuickCrawlLink_p.java -> line 141 -> new CrawlProfile
vs
IndexReIndexMonitor_p.java -> line 152 -> new RecrawlBusyThread -> line 345 -> new CrawlProfile

Both places create a CrawlProfile, but they pass different values for the noindexWhenCanonicalUnequalURL parameter.
Is that intentional, or is it a small mistake?
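
To illustrate why that one flag could explain the different results, here is a minimal sketch of the kind of check the log message suggests. It is not YaCy's actual code: the class and method names are hypothetical, the plain string comparison is an assumption (the real code may normalize URLs first), and which of the two profiles enables the flag is inferred from the log line, not confirmed.

```java
import java.util.Optional;

// Hypothetical sketch, not YaCy's implementation.
class CanonicalCheckSketch {

    // Returns a denial reason when indexing would be refused, otherwise empty.
    static Optional<String> denyReason(String source, String canonical,
                                       boolean noindexWhenCanonicalUnequalURL) {
        if (!noindexWhenCanonicalUnequalURL) {
            return Optional.empty(); // flag off: canonical link is ignored
        }
        if (canonical == null || canonical.equals(source)) {
            return Optional.empty(); // no canonical link, or it matches the source
        }
        return Optional.of("denied, canonical != source; canonical = "
                + canonical + "; source = " + source);
    }

    public static void main(String[] args) {
        String source = "http://www.soitu.es/soitu/portada.html";
        String canonical = "http://www.soitu.es/";
        // Flag enabled (assumed for the bulk re-crawl): document is denied.
        System.out.println(denyReason(source, canonical, true));
        // Flag disabled (assumed for the single "re-crawl url"): document passes.
        System.out.println(denyReason(source, canonical, false));
    }
}
```

If the two profiles really behave like this, a bulk re-crawl with the flag enabled would never refresh a document whose canonical link points elsewhere, so it would keep matching the fresh_date_dt query, while the single-URL re-crawl would update it.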

Labels: bug, crawler