This is a follow-up of #11324 (comment) to formalize #11324 (comment).

The `linkcheck` builder extracts all the links to be checked and puts them in a queue so that a `HyperlinkAvailabilityCheckWorker` can check each link for availability. The number of workers, say *n*, is fixed at the start of the build and does not change. Since we are slowly moving to using `Session` objects instead of standalone requests, at any point in time there should be at most *n* sessions running, one per worker.
Assume that there are two workers and three links to check, say `bar.org/x`, `bar.org/y` and `foo.org`. The queue may look like `Q = ['bar.org/x', 'foo.org', 'bar.org/y']`. Then Worker 1 and Worker 2 check `Q[0]` and `Q[1]` simultaneously. If Worker 2 finishes before Worker 1, then Worker 2 checks `Q[2]`. Ideally, Worker 1 would have checked that link instead, since its session already has a connection to `bar.org` open and no new one would have been needed.
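The cost of this dispatch order can be replayed in a toy, single-threaded sketch (illustrative only, not Sphinx code); here a "connection" is counted whenever a worker's session sees a host for the first time:

```python
# Toy replay of the scenario above. Each worker's session keeps one
# connection per host it has already contacted.
Q = ['bar.org/x', 'foo.org', 'bar.org/y']

def host(url: str) -> str:
    # Crude host extraction, good enough for this toy example.
    return url.split('/')[0]

# Completion order decides who dequeues next. Suppose Worker 2 finishes
# 'foo.org' first and therefore grabs Q[2]:
assignments = {1: ['bar.org/x'], 2: ['foo.org', 'bar.org/y']}
connections = sum(len({host(u) for u in urls})
                  for urls in assignments.values())
print(connections)  # 3: bar.org is opened twice, foo.org once

# Had Worker 1 taken Q[2], its existing bar.org connection would have
# been reused and only 2 connections would have been opened:
better = {1: ['bar.org/x', 'bar.org/y'], 2: ['foo.org']}
connections = sum(len({host(u) for u in urls})
                  for urls in better.values())
print(connections)  # 2
```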
Since the links to check are known before checking starts, one can first pre-process and reorganize them as follows:
- Each link is grouped by its domain so that a single session can check multiple links within the same domain without re-opening a TCP connection each time.
- Links that are "standalone" (the only link for their domain) are checked at the end (or at the beginning), or by a dedicated worker responsible only for them. When the other workers are done and have no "bulk" links left to check, they help that standalone worker.
- Workers 1 to n-1 process the links of the first domain together. If one worker finishes before the others, it helps the standalone worker process the standalone links; otherwise, the workers move on to the other domains and their links.
- Links within a domain should be dispatched to multiple workers only if we don't end up with n-1 workers each processing a single link. For instance, with 5 workers and 5 links on the same domain, it is probably more efficient to use only 2 workers instead of 5, since only 2 TCP connections would be opened instead of 5. We should investigate how to properly split the chunks and delegate them to the workers (that's the hardest part of this feature, IMHO).
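One possible shape for that pre-processing step is sketched below: group links by domain with `urllib.parse`, then pick a worker count per domain with a simple heuristic. The helper names and the `links_per_connection` threshold are assumptions for illustration, not Sphinx's actual implementation.

```python
from collections import defaultdict
from math import ceil
from urllib.parse import urlsplit

def group_by_domain(links):
    # Bucket links by their network location so one session can walk
    # a whole bucket over a single (keep-alive) connection.
    groups = defaultdict(list)
    for link in links:
        groups[urlsplit(link).netloc].append(link)
    return dict(groups)

def workers_for(n_links, n_workers, links_per_connection=4):
    # Heuristic: add a worker (i.e. a connection) only once each
    # existing worker would exceed `links_per_connection` links.
    # 5 links with 5 available workers -> 2 workers, not 5.
    return min(n_workers, max(1, ceil(n_links / links_per_connection)))

links = [
    'https://bar.org/x', 'https://bar.org/y',
    'https://foo.org/', 'https://bar.org/z',
]
groups = group_by_domain(links)
print(sorted(groups))      # ['bar.org', 'foo.org']
print(workers_for(5, 5))   # 2
```

The heuristic is deliberately crude; a real implementation would likely also weigh per-domain rate limits and response times.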
Alternatively, we could assume that there is only one worker responsible for each domain and that the others help only when they have nothing else to do. If there are 15 links for domain 1 and 5 links for domain 2, the flow is as follows:
- t=0 — Worker 1 processes links 1 to 5 of domain 1 and Worker 2 processes all links of domain 2.
- t=1 — Worker 1 processes links 6 to 10 of domain 1 and Worker 2 processes links 11 to 15 of domain 1.
The idea is to balance the blocks as much as possible, so that the number of TCP connections to open is as small as possible and there is no waiting time between two checks.
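The round-based flow above can be sketched as a small scheduler (hypothetical, not Sphinx code): in each round, every domain with remaining links first gets one primary worker, and any workers left idle then help the busiest remaining domain, chunk by chunk.

```python
def schedule(domain_sizes, n_workers, chunk=5):
    """Plan rounds of link checks.

    domain_sizes maps domain -> number of links; the result is one dict
    per round mapping domain -> list of chunk sizes handled that round
    (each chunk is one worker's share).
    """
    remaining = dict(domain_sizes)
    rounds = []
    while any(remaining.values()):
        plan, free = {}, n_workers
        # One primary worker per still-pending domain, busiest first.
        for domain in sorted(remaining, key=remaining.get, reverse=True):
            if remaining[domain] > 0 and free > 0:
                take = min(chunk, remaining[domain])
                plan.setdefault(domain, []).append(take)
                remaining[domain] -= take
                free -= 1
        # Idle workers then help the busiest remaining domain.
        for domain in sorted(remaining, key=remaining.get, reverse=True):
            while remaining[domain] > 0 and free > 0:
                take = min(chunk, remaining[domain])
                plan.setdefault(domain, []).append(take)
                remaining[domain] -= take
                free -= 1
        rounds.append(plan)
    return rounds

# Reproduces the 15/5 example: round 0 splits the workers across the
# two domains, round 1 has both workers finish domain 1 together.
print(schedule({'domain1': 15, 'domain2': 5}, n_workers=2))
# -> [{'domain1': [5], 'domain2': [5]}, {'domain1': [5, 5]}]
```

A PoC along these lines would make it easy to measure how many connections each policy opens before committing to an implementation.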
This feature probably has a very low priority, because the implementation will not be trivial and we might need a PoC to know whether it is really worth implementing. For projects with many links spread across different domains, the current implementation is suitable, but for projects using `sphinx.ext.intersphinx`, this may be of interest. Still, it is good to have an open issue where we can discuss it.
Related: