
Commit 3435a72

some code
1 parent 625d64b commit 3435a72

4 files changed: +161 -4 lines changed

checkers/base_checker.py

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ def set_attr_if_is_not_none(attribute_name, first_obj, second_obj):


 class BaseChecker:
+    # TODO: rewrite using HttpClient
     aiohttp_connector = None

     def __init__(self, url=None, request_type="GET", timeout=None):
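The TODO added here refers to the new HttpClient introduced in this commit (see http_client.py below). A minimal sketch of what that rewrite could look like, assuming the checker only needs the response body and selects the proxy per request; apart from the class name and the __init__ signature shown in the diff, everything in this sketch (notably the fetch helper) is hypothetical:

import http_client


class BaseChecker:
    def __init__(self, url=None, request_type="GET", timeout=None):
        self.url = url
        self.request_type = request_type
        self.timeout = timeout

    async def fetch(self, proxy_address):
        # hypothetical helper: route the check request through HttpClient
        # instead of keeping a class-level aiohttp connector here
        client = http_client.HttpClient()
        client.proxy_address = proxy_address
        if self.timeout is not None:
            client.timeout = self.timeout
        result = await client.request(self.request_type, self.url, None)
        return result.as_text()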

collectors/web/cn/89ip/collector.py

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
from collectors.abstract_collector import AbstractCollector
import http_client


class Collector(AbstractCollector):
    __collector__ = True

    def __init__(self):
        super(Collector, self).__init__()
        self.processing_period = 30 * 60  # 30 minutes
        '''
        floating period means proxy_py will be changing
        period to not make extra requests and handle
        new proxies in time, you don't need to change
        it in most cases
        '''
        # self.floating_processing_period = False

    async def collect(self):
        url = 'http://www.89ip.cn/tqdl.html?num=9999&address=&kill_address=&port=&kill_port=&isp='
        html = await http_client.get_text(url)

        return []
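The committed collect() fetches the page but still returns an empty list. A hedged sketch of the parsing step it is presumably building toward, assuming the endpoint embeds plain ip:port pairs in the HTML (the actual page format is not shown in this commit):

import re

from collectors.abstract_collector import AbstractCollector
import http_client

# assumption: proxies appear as bare "ip:port" strings in the page body
PROXY_RE = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}:\d+')


class Collector(AbstractCollector):
    __collector__ = True

    async def collect(self):
        url = 'http://www.89ip.cn/tqdl.html?num=9999&address=&kill_address=&port=&kill_port=&isp='
        html = await http_client.get_text(url)

        # keep only well-formed ip:port matches found in the page
        return PROXY_RE.findall(html)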

docs/source/guides/how_to_create_collector.rst

Lines changed: 46 additions & 4 deletions
@@ -1,14 +1,21 @@
 proxy_py How to create collector
 ================================

-Collector is a class which is used to parse proxies from web page or another source. All collectors are inherited from `collectors.abstract_collector.AbstractCollector`, also there is `collectors.pages_collector.PagesCollector` which is used for paginated sources. It's better to understand through examples.
+A collector is a class which is used to parse proxies from a web page or another source.
+All collectors inherit from `collectors.abstract_collector.AbstractCollector`;
+there is also `collectors.pages_collector.PagesCollector`, which is used for paginated sources.
+It's easier to understand through examples.

 Simple collector
 ****************

-Let's start with the simplest collector we can imagine, it will be from the page http://www.89ip.cn/ti.html as you can see, it sends form as GET request to this url http://www.89ip.cn/tqdl.html?num=9999&address=&kill_address=&port=&kill_port=&isp=
+Let's start with the simplest collector we can imagine:
+it will collect from the page http://www.89ip.cn/ti.html which, as you can see,
+sends its form as a GET request to this url:
+http://www.89ip.cn/tqdl.html?num=9999&address=&kill_address=&port=&kill_port=&isp=

-Firstly we can try to check that these proxies are really good. Just copy and paste list of proxies to file say /tmp/proxies and run this command inside virtual environment
+First we can check that these proxies are really good.
+Just copy and paste the list of proxies to a file, say /tmp/proxies, and run this command inside the virtual environment:

 .. code-block:: bash

@@ -18,10 +25,45 @@ You're gonna get something like this:

`++++++++++++++++++++++-+++++-+++++++++++++++++++++++++++-++++++-++++-+++++++++++++++++++++++++++++++--+++++++-+++++++-++-+-+++-+++++++++-+++++++++++++++++++++--++--+-++++++++++++++++-+++--+++-+-+++++++++++++++++--++++++++++++-+++++-+++-++++++++-+++++-+-+++++++-++-+--++++-+++-++++++++++-++++--+++++++-+++++++-++--+++++-+-+++++++++++++++++++++-++-+++-+++--++++--+++-+++++++-+++++++-+++++++++++++++---+++++-+++++++++-+++++-+-++++++++++++-+--+++--+-+-+-++-+++++-+++--++++++-+++++++++++--+-+++-+-++++--+++++--+++++++++-+-+-++++-+-++++++++++++++-++-++++++--+--++++-+-++--++--+++++-++-+++-++++--++--+---------+--+--++--------+++-++-+--++++++++++++++++-+++++++++-+++++++--+--+--+-+-+++---++------------------+--+----------+-+-+--++-+----------+-------+--+------+----+-+--+--++----+--+-++++++-++-+++`

-+ means working proxy with at leat one protocol, - means not working, the result above is perfect, so many good proxies.
++ means a working proxy with at least one protocol,
+- means not working; the result above is perfect: so many good proxies.

 Note: working means the proxy responds within the timeout set in settings; if you increase it, you'll get more proxies.

+Alright, let's code it!
+
+We need to place our collector inside the `collectors/web/` directory using the reversed domain path,
+so it will be `collectors/web/cn/89ip/collector.py`
+
+To make a class a collector we need to declare the variable `__collector__`
+
+Note: the file name and the class name don't matter,
+you can declare as many files, and classes in each file, per domain as you want
+
+.. code-block:: python
+
+    from collectors.abstract_collector import AbstractCollector
+
+
+    class Collector(AbstractCollector):
+        __collector__ = True
+
+We can override the default processing period in the constructor like this:
+
+.. code-block:: python
+
+    def __init__(self):
+        super(Collector, self).__init__()
+        self.processing_period = 30 * 60  # 30 minutes
+        '''
+        floating period means proxy_py will be changing
+        the period to not make extra requests and to handle
+        new proxies in time; you don't need to change
+        it in most cases
+        '''
+        # self.floating_processing_period = False
+
+
 Paginated collector
 *******************

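The guide stops at the "Paginated collector" heading in this hunk. For orientation, a hedged sketch of what such a collector might look like; the commit only names `collectors.pages_collector.PagesCollector`, so the attribute `pages_count`, the hook `process_page`, and the example URL below are assumptions, not the library's confirmed API:

from collectors.pages_collector import PagesCollector

import http_client


class Collector(PagesCollector):
    __collector__ = True

    def __init__(self):
        super(Collector, self).__init__()
        # assumed attribute: how many pages the source exposes
        self.pages_count = 10

    async def process_page(self, page_index):
        # assumed hook: fetch a single page and return the proxies found on it
        url = 'http://example.com/proxies?page={}'.format(page_index)
        html = await http_client.get_text(url)
        return []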

http_client.py

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
from proxy_py import settings
from fake_useragent import UserAgent
from aiosocks.connector import ProxyConnector, ProxyClientRequest

import aiohttp


# TODO: complete cookies saving
class HttpClient:
    """
    Simple class for making http requests,
    user-agent is set to a random one in the constructor
    """
    _aiohttp_connector = None

    def __init__(self):
        self.user_agent = UserAgent().random
        self.timeout = 60
        if HttpClient._aiohttp_connector is None:
            HttpClient._aiohttp_connector = ProxyConnector(
                remote_resolve=True,
                limit=settings.NUMBER_OF_SIMULTANEOUS_REQUESTS,
                limit_per_host=settings.NUMBER_OF_SIMULTANEOUS_REQUESTS_PER_HOST,
            )
        self.proxy_address = None

    async def get(self, url):
        """
        send HTTP GET request

        :param url:
        :return:
        """
        return await self.request('GET', url, None)

    async def post(self, url, data):
        """
        send HTTP POST request

        :param url:
        :param data:
        :return:
        """
        raise NotImplementedError()

    async def request(self, method, url, data) -> 'HttpClientResult':
        headers = {
            'User-Agent': self.user_agent,
        }

        async with aiohttp.ClientSession(connector=HttpClient._aiohttp_connector,
                                         connector_owner=False,
                                         request_class=ProxyClientRequest
                                         ) as session:
            async with session.request(method,
                                       url=url,
                                       proxy=self.proxy_address,
                                       timeout=self.timeout,
                                       headers=headers) as response:
                # the result has to be built while the response is still open
                return await HttpClientResult.make(response)

    @staticmethod
    async def clean():
        HttpClient._aiohttp_connector.close()


class HttpClientResult:
    text = None
    aiohttp_response = None

    @staticmethod
    async def make(aiohttp_response):
        obj = HttpClientResult()
        obj.aiohttp_response = aiohttp_response
        obj.text = await obj.aiohttp_response.text()

        return obj

    def as_text(self):
        return self.text


async def get_text(url):
    """
    fast method for getting a GET response's text without creating extra objects

    :param url:
    :return:
    """
    return (await HttpClient().get(url)).as_text()
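A minimal usage sketch of the new client, mirroring how the 89ip collector above calls it; the event-loop wrapper and the example proxy address are illustrative only and not part of the commit:

import asyncio

import http_client


async def main():
    # module-level helper: fetch a page's text through the shared connector
    html = await http_client.get_text('http://example.com/')
    print(len(html))

    # the class itself lets you route a request through a specific proxy
    client = http_client.HttpClient()
    client.proxy_address = 'http://127.0.0.1:8080'  # hypothetical proxy
    result = await client.request('GET', 'http://example.com/', None)
    print(result.as_text()[:100])


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())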
