Description
When I use the “Re-Crawl Index Documents” option of IndexReIndexMonitor_p to re-crawl very old documents (fresh_date_dt:[2000-01-01T00:00:00.000Z TO 2010-01-01T00:00:00.000Z]), the same documents still appear in the list of simulated documents when I search again.
I found this message in the log:
I 2025/03/25 18:56:07 SWITCHBOARD * Not Condensed Resource 'http://www.soitu.es/soitu/portada.html': denied, canonical != source; canonical = http://www.soitu.es/; source = http://www.soitu.es/soitu/portada.html
However, if I use the “re-crawl url” option on a document from the list of simulated documents
localhost:8090/solr/select?core=collection1&wt=html&start=0&rows=10&q=fresh_date_dt%3A[2000-01-01T00%3A00%3A00.000Z+TO+2010-01-01T00%3A00%3A00.000Z]+AND+(httpstatus_i%3A200)
that document is then removed from the list.
I found a small difference in the source code:
QuickCrawlLink_p.java -> line 141 -> new CrawlProfile
vs
IndexReIndexMonitor_p.java -> line 152 -> new RecrawlBusyThread -> line 345 new CrawlProfile
A CrawlProfile is created in both places, but the value passed for the parameter noindexWhenCanonicalUnequalURL differs between them.
Is that intentional or a small mistake?
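To illustrate what I suspect is happening, here is a minimal sketch. It is not YaCy's actual code: the class CanonicalCheckSketch and the method shouldIndex are hypothetical names, and I am assuming that the flag simply rejects a document whose canonical URL differs from its source URL, as the log message suggests.

```java
import java.net.URI;

public class CanonicalCheckSketch {

    /**
     * Hypothetical stand-in for the decision that is logged as
     * "denied, canonical != source". Not taken from the YaCy sources.
     */
    static boolean shouldIndex(String source, String canonical,
                               boolean noindexWhenCanonicalUnequalURL) {
        // With the flag off, the canonical link is ignored entirely.
        if (!noindexWhenCanonicalUnequalURL) {
            return true;
        }
        // A page without a canonical link cannot disagree with its source.
        if (canonical == null || canonical.isEmpty()) {
            return true;
        }
        // With the flag on, index only when both URLs are the same.
        return URI.create(source).normalize()
                .equals(URI.create(canonical).normalize());
    }

    public static void main(String[] args) {
        // URLs copied from the log line in this report.
        String source = "http://www.soitu.es/soitu/portada.html";
        String canonical = "http://www.soitu.es/";

        // Flag on: the document is denied.
        System.out.println(shouldIndex(source, canonical, true));
        // Flag off: the document is indexed.
        System.out.println(shouldIndex(source, canonical, false));
    }
}
```

If the two code paths pass different values for this flag, the same URL would be denied by one and accepted by the other, which would match the behavior described above.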