[Feature Request] Multi-threading #7

Open · gprime31 opened this issue May 17, 2021 · 5 comments

gprime31 commented May 17, 2021

Multi-threading would be great, if it's possible, seeing how it only uses one core at the moment.

gprime31 changed the title from [Feature Request] Multi-hreading to [Feature Request] Multi-threading on May 17, 2021
@arthur4ires

I was wondering how to implement this. If anyone has an idea of the approach and where in the code the changes belong, I can write it and submit a pull request.

@rotemreiss (Owner)

Hi @gprime31, @arthur4ires,

I totally agree that the current way is not good enough. I think that what we can do without too much effort and refactoring is to split the work so that each domain is handled in a different thread.
In my experience, I usually feel the need for multi-threading when scanning a large list that spans a set of domains. Do you think the above solution will serve your needs as well?
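
Purely as an illustration of that idea (none of this is actual Uddup code; `dedup_domain` is a placeholder for the existing single-domain logic): group the input URLs by domain and hand each group to its own worker. One caveat worth noting: for CPU-bound work, CPython threads are limited by the GIL, so a process pool may be needed to actually use multiple cores.

```python
# Sketch only: per-domain workers via a thread pool.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def dedup_domain(urls):
    # Placeholder for the existing single-domain dedup logic.
    return sorted(set(urls))

def dedup_all(urls, max_workers=8):
    # Group URLs by their domain so each group is independent work.
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each domain's URL list is processed by its own worker.
        for deduped in pool.map(dedup_domain, groups.values()):
            results.extend(deduped)
    return results
```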

@arthur4ires

I hadn't envisioned working on each domain independently; I had only envisioned launching multiple processes with Python's multiprocessing lib.

At line 184 is where the loop that iterates over each line of the input file begins.

My idea is to turn the variable that holds the opened file into a real Python list and make it global.

The processes would then remove the values already consumed and add them to a new list that is used to return the clean values.
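
A loose sketch of that shared-work idea under stated assumptions (`clean_url` is a hypothetical stand-in for the per-URL work inside that loop, not an actual Uddup function). Since a plain global list isn't shared across processes, this version uses a multiprocessing queue for the input and a managed list for the output:

```python
import multiprocessing as mp

def clean_url(url):
    # Hypothetical stand-in for the per-URL processing done inside the loop.
    return url.strip()

def worker(in_queue, out_list):
    while True:
        url = in_queue.get()
        if url is None:  # Sentinel: no more work.
            break
        out_list.append(clean_url(url))

def run(path, workers=4):
    with open(path) as f:
        urls = [line.rstrip("\n") for line in f]  # File contents as a real list.

    in_queue = mp.Queue()
    manager = mp.Manager()
    out_list = manager.list()  # Shared list collecting the clean values.

    procs = [mp.Process(target=worker, args=(in_queue, out_list))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for url in urls:
        in_queue.put(url)
    for _ in procs:
        in_queue.put(None)  # One sentinel per worker.
    for p in procs:
        p.join()
    return list(out_list)

if __name__ == "__main__":
    print(run("urls.txt"))
```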

@rotemreiss (Owner)

Hi @arthur4ires,

I actually started working on that a few days ago, and I'm mostly done; I'm just testing and fixing a few bugs that I've found along the way.

From my experience, I usually have a long list of URLs combined from multiple domains (and sub-domains, of course), rather than a very long list of URLs that all belong to the same domain. Therefore I chose to multi-process by base URL, and not by splitting the list of URLs into smaller chunks.

Maybe the approach I went with isn't the most generic and makes a heavy assumption, but it could be tweaked further in the future if needed.
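
For illustration only (assumed names, not the actual implementation): multi-processing by base URL could look roughly like this, with `dedup_group` standing in for Uddup's real dedup logic.

```python
# Sketch: group URLs by base URL, dedup each group in its own process.
from collections import defaultdict
from multiprocessing import Pool
from urllib.parse import urlparse

def dedup_group(urls):
    # Stand-in for Uddup's dedup logic applied to one base URL's URLs.
    return sorted(set(urls))

def dedup_by_base_url(urls, processes=None):
    groups = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        groups[f"{parsed.scheme}://{parsed.netloc}"].append(url)

    with Pool(processes=processes) as pool:
        deduped_groups = pool.map(dedup_group, list(groups.values()))
    return [u for group in deduped_groups for u in group]

if __name__ == "__main__":
    sample = [
        "https://example.com/a?x=1",
        "https://example.com/a?x=2",
        "https://other.example.org/b",
    ]
    print(dedup_by_base_url(sample))
```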

Just some stats so far:
I ran the multi-process version on a list of 54,562 URLs with ~400 unique base URLs.
Without multi-processing (the current version of Uddup): ~207.36 seconds
With multi-processing: ~71.44 seconds

Tested locally on my MacBook Pro (16 cores).

If you'd like to further discuss it with me, you can PM me on Twitter (@2RS3C).

@arthur4ires

The results are indeed much better with multiprocessing.

Did you push this version to any branch here on GitHub? If so, I could help with some of the bugs or the code.

Since you've already started with this approach, I think continuing with it would be more prudent.

An interesting point for me is displaying the results on screen with a verbose option.

That way you can tell more or less where the script is in its run. I sent you a DM on your Twitter.
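
Sketching that verbose idea (a hypothetical option, not something Uddup currently exposes): `imap` yields each group's result as it completes, so progress can be reported while the pool is still working.

```python
# Hypothetical --verbose progress reporting; not an existing Uddup option.
from multiprocessing import Pool

def dedup_group(urls):
    # Stand-in for the per-base-URL dedup work.
    return sorted(set(urls))

def dedup_with_progress(groups, verbose=True):
    total = len(groups)
    results = []
    with Pool() as pool:
        # imap yields each group's result as soon as it is done,
        # so progress can be printed while work is still running.
        for i, deduped in enumerate(pool.imap(dedup_group, groups), start=1):
            if verbose:
                print(f"[{i}/{total}] base URLs processed", flush=True)
            results.extend(deduped)
    return results

if __name__ == "__main__":
    groups = [["https://a.example/x", "https://a.example/x"],
              ["https://b.example/y"]]
    print(dedup_with_progress(groups))
```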
