
Commit cd4ba3b

started to write guide about creating collectors

1 parent c40a483 commit cd4ba3b

File tree

8 files changed: +89 -6 lines changed


README.rst
Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 proxy_py README
-========
+===============

 proxy_py is a program which collects proxies, saves them in a database
 and checks them periodically.

check_from_stdin.py
Lines changed: 52 additions & 0 deletions

@@ -0,0 +1,52 @@
"""
just a helper script for testing proxies
"""
from proxy_py import settings
from models import Proxy
from checkers.base_checker import BaseChecker

import asyncio
import proxy_utils
import sys
import re


# matches four ip octets and a port, separated by any non-digit characters
proxy_find_regex = \
    r"([0-9]{1,3})[^0-9]+([0-9]{1,3})[^0-9]+([0-9]{1,3})[^0-9]+([0-9]{1,3})"\
    r"[^0-9]+([0-9]{1,5})"
semaphore = asyncio.BoundedSemaphore(settings.NUMBER_OF_CONCURRENT_TASKS)
tasks = []


async def check_task(ip, port):
    async with semaphore:
        check_result = False
        # try each supported protocol until the proxy answers on one of them
        for raw_protocol in range(len(Proxy.PROTOCOLS)):
            proxy_url = '{}://{}:{}'.format(
                Proxy.PROTOCOLS[raw_protocol],
                ip,
                port
            )
            check_result, _ = await proxy_utils.check_proxy(proxy_url)
            if check_result:
                break

        print('+' if check_result else '-', end='')


async def main():
    for line in sys.stdin:
        match = re.search(proxy_find_regex, line.strip())
        if match is None:
            # skip lines which don't contain an "ip port" pair
            continue

        groups = match.groups()
        ip = '.'.join(groups[:4])
        port = groups[4]

        tasks.append(asyncio.ensure_future(check_task(ip, port)))

    await asyncio.gather(*tasks)
    print()
    BaseChecker.clean()


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
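For a quick sanity check of the script itself, you can pipe in a single address; the IP and port here are made-up placeholder values:

    echo '127.0.0.1 8080' | python3 check_from_stdin.py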
collectors/collector.py → collectors/abstract_collector.py
File renamed without changes.

collectors/pages_collector.py
Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 from proxy_py import settings
-from collectors.collector import AbstractCollector
+from collectors.abstract_collector import AbstractCollector


 # TODO: save pages to collector state

collectors/web/net/checkerproxy/collector.py
Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-from collectors.collector import AbstractCollector
+from collectors.abstract_collector import AbstractCollector

 import async_requests
 import json

collectors/web/net/free_proxy_list/collector.py
Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-from collectors.collector import AbstractCollector
+from collectors.abstract_collector import AbstractCollector

 import lxml.html
 import lxml.etree

docs/source/guides/guides.rst
Lines changed: 3 additions & 2 deletions

@@ -1,7 +1,8 @@
 proxy_py Guides
-========
+===============

 .. toctree::
-   :maxdepth: 4
+    :maxdepth: 4

+    how_to_create_collector.rst

docs/source/guides/how_to_create_collector.rst
Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@
proxy_py How to create collector
================================

A collector is a class used to parse proxies from a web page or another source. All collectors inherit from `collectors.abstract_collector.AbstractCollector`; there is also `collectors.pages_collector.PagesCollector`, which is used for paginated sources. It's easier to understand through examples.

Simple collector
****************

Let's start with the simplest collector we can imagine. It will collect from the page http://www.89ip.cn/ti.html. As you can see, the form on that page is submitted as a GET request to this URL: http://www.89ip.cn/tqdl.html?num=9999&address=&kill_address=&port=&kill_port=&isp=

First, let's check that these proxies actually work. Copy and paste the list of proxies into a file, say /tmp/proxies, and run this command inside the virtual environment:

.. code-block:: bash

    cat /tmp/proxies | python3 check_from_stdin.py

You'll get something like this:

`++++++++++++++++++++++-+++++-+++++++++++++++++++++++++++-++++++-++++-+++++++++++++++++++++++++++++++--+++++++-+++++++-++-+-+++-+++++++++-+++++++++++++++++++++--++--+-++++++++++++++++-+++--+++-+-+++++++++++++++++--++++++++++++-+++++-+++-++++++++-+++++-+-+++++++-++-+--++++-+++-++++++++++-++++--+++++++-+++++++-++--+++++-+-+++++++++++++++++++++-++-+++-+++--++++--+++-+++++++-+++++++-+++++++++++++++---+++++-+++++++++-+++++-+-++++++++++++-+--+++--+-+-+-++-+++++-+++--++++++-+++++++++++--+-+++-+-++++--+++++--+++++++++-+-+-++++-+-++++++++++++++-++-++++++--+--++++-+-++--++--+++++-++-+++-++++--++--+---------+--+--++--------+++-++-+--++++++++++++++++-+++++++++-+++++++--+--+--+-+-+++---++------------------+--+----------+-+-+--++-+----------+-------+--+------+----+-+--+--++----+--+-++++++-++-+++`

`+` means a proxy that works with at least one protocol, `-` means one that doesn't work at all. The result above is excellent: lots of working proxies.

Note: "working" means the proxy responds within the timeout set in the settings; if you increase the timeout, you'll get more working proxies.
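
A minimal sketch of what such a collector might look like. It assumes the `async_requests.get` helper already used by the existing collectors, that the response object exposes the page body as `response.text`, and that `collect()` may return proxies as "ip:port" strings; check `collectors/abstract_collector.py` for the real interface:

.. code-block:: python

    import re

    import async_requests

    from collectors.abstract_collector import AbstractCollector


    class Collector(AbstractCollector):
        async def collect(self):
            url = 'http://www.89ip.cn/tqdl.html' \
                  '?num=9999&address=&kill_address=&port=&kill_port=&isp='
            # fetch the plain-text page listing the proxies
            response = await async_requests.get(url)
            # grab everything that looks like "ip:port"
            return re.findall(r'\d+\.\d+\.\d+\.\d+:\d+', response.text)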

Paginated collector
*******************

So, let's create a simple paginated collector. For paginated sources we inherit from `collectors.pages_collector.PagesCollector` instead; a sketch follows.
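
A minimal sketch under the same assumptions. The source URL here is a made-up placeholder, and the `pages_count` attribute and `process_page(page_index)` hook are how the pagination API presumably looks; check `collectors/pages_collector.py` for the real interface:

.. code-block:: python

    import re

    import async_requests

    from collectors.pages_collector import PagesCollector


    class Collector(PagesCollector):
        def __init__(self):
            super().__init__()
            # hypothetical source with 10 pages, numbered from 1 on the site
            self.pages_count = 10

        async def process_page(self, page_index):
            url = 'http://example.com/proxies/page/{}'.format(page_index + 1)
            response = await async_requests.get(url)
            # one "ip:port" proxy per line of the page
            return re.findall(r'\d+\.\d+\.\d+\.\d+:\d+', response.text)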
