
Commit 65a2869

add actions support
1 parent 1d85a80 commit 65a2869

File tree

7 files changed: +128 additions, -57 deletions


CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,12 @@
 # Gerapy Pyppeteer Changelog
 
+## 0.2.2 (2021-09-07)
+
+### Features
+
+- add support for executing Python based functions
+- add support for returning script result
+
 ## 0.1.2 (2021-06-20)
 
 ### Bug Fixes & Features

README.md

Lines changed: 69 additions & 42 deletions
@@ -36,7 +36,7 @@ web page which you configured the request as PyppeteerRequest.
 
 GerapyPyppeteer provides some optional settings.
 
-### Concurrency
+### Concurrency
 
 You can directly use Scrapy's setting to set Concurrency of Pyppeteer,
 for example:
@@ -47,7 +47,7 @@ CONCURRENT_REQUESTS = 3
 
 ### Pretend as Real Browser
 
-Some website will detect WebDriver or Headless, GerapyPyppeteer can
+Some websites will detect WebDriver or Headless; GerapyPyppeteer can
 pretend to be Chromium by injecting scripts. This is enabled by default.
 
 You can disable it to speed things up if the website does not detect WebDriver:
@@ -56,15 +56,15 @@ You can close it if website does not detect WebDriver to speed up:
 GERAPY_PYPPETEER_PRETEND = False
 ```
 
-Also you can use `pretend` attribute in `PyppeteerRequest` to overwrite this
+You can also use the `pretend` attribute in `PyppeteerRequest` to override this
 configuration.
 
 ### Logging Level
 
 By default, Pyppeteer will log all the debug messages, so GerapyPyppeteer
 configured the logging level of Pyppeteer to WARNING.
 
-If you want to see more logs from Pyppeteer, you can change the this setting:
+If you want to see more logs from Pyppeteer, you can change this setting:
 
 ```python
 import logging
@@ -82,11 +82,11 @@ GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
 
 ### Headless
 
-By default, Pyppeteer is running in `Headless` mode, you can also
+By default, Pyppeteer runs in headless mode; you can also
 change it to `False` as you need, default is `True`:
 
 ```python
-GERAPY_PYPPETEER_HEADLESS = False
+GERAPY_PYPPETEER_HEADLESS = False
 ```
 
 ### Window Size
@@ -137,19 +137,19 @@ GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['stylesheet', 'script']
 
 The full list of optional resource types:
 
-* document: the Original HTML document
-* stylesheet: CSS files
-* script: JavaScript files
-* image: Images
-* media: Media files such as audios or videos
-* font: Fonts files
-* texttrack: Text Track files
-* xhr: Ajax Requests
-* fetch: Fetch Requests
-* eventsource: Event Source
-* websocket: Websocket
-* manifest: Manifest files
-* other: Other files
+- document: the original HTML document
+- stylesheet: CSS files
+- script: JavaScript files
+- image: images
+- media: media files such as audio or video
+- font: font files
+- texttrack: text track files
+- xhr: Ajax requests
+- fetch: Fetch requests
+- eventsource: Event Source
+- websocket: WebSocket
+- manifest: manifest files
+- other: other files
 
 ### Screenshot
 
@@ -158,7 +158,7 @@ You can get screenshot of loaded page, you can pass `screenshot` args to `Pyppet
 - `type` (str): Specify screenshot type, can be either `jpeg` or `png`. Defaults to `png`.
 - `quality` (int): The quality of the image, between 0-100. Not applicable to `png` image.
 - `fullPage` (bool): When true, take a screenshot of the full scrollable page. Defaults to `False`.
-- `clip` (dict): An object which specifies clipping region of the page. This option should have the following fields:
+- `clip` (dict): An object which specifies the clipping region of the page. This option should have the following fields:
   - `x` (int): x-coordinate of top-left corner of clip area.
   - `y` (int): y-coordinate of top-left corner of clip area.
   - `width` (int): width of clipping area.
@@ -200,41 +200,69 @@ GERAPY_PYPPETEER_SCREENSHOT = {
 
 `PyppeteerRequest` provides args which can override the global settings above.
 
-* url: request url
-* callback: callback
-* wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2",
-  see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
-* wait_for: wait for some element to load, also supports dict
-* script: script to execute
-* proxy: use proxy for this time, like `http://x.x.x.x:x`
-* sleep: time to sleep after loaded, override `GERAPY_PYPPETEER_SLEEP`
-* timeout: load timeout, override `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
-* ignore_resource_types: ignored resource types, override `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
-* pretend: pretend as normal browser, override `GERAPY_PYPPETEER_PRETEND`
-* screenshot: ignored resource types, see
-  https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot,
-  override `GERAPY_PYPPETEER_SCREENSHOT`
+- url: request url
+- callback: callback
+- wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2",
+  see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
+- wait_for: wait for some element to load, also supports dict
+- script: script to execute
+- proxy: use proxy for this time, like `http://x.x.x.x:x`
+- sleep: time to sleep after loaded, override `GERAPY_PYPPETEER_SLEEP`
+- timeout: load timeout, override `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
+- ignore_resource_types: ignored resource types, override `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
+- pretend: pretend to be a normal browser, override `GERAPY_PYPPETEER_PRETEND`
+- screenshot: screenshot config, see
+  https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot,
+  override `GERAPY_PYPPETEER_SCREENSHOT`
 
 For example, you can configure PyppeteerRequest as:
 
 ```python
 from gerapy_pyppeteer import PyppeteerRequest
 
 def parse(self, response):
-    yield PyppeteerRequest(url,
+    yield PyppeteerRequest(url,
                            callback=self.parse_detail,
                            wait_until='domcontentloaded',
                            wait_for='title',
-                           script='() => { console.log(document) }',
+                           script='() => { return {name: "Germey"} }',
                            sleep=2)
 ```
 
 Then Pyppeteer will:
-* wait for document to load
-* wait for title to load
-* execute `console.log(document)` script
-* sleep for 2s
-* return the rendered web page content
+
+- wait for document to load
+- wait for title to load
+- execute the `() => { return {name: "Germey"} }` script
+- sleep for 2s
+- return the rendered web page content
+- return the script result, which you can get from `response.meta['script_result']`
+
+For a waiting mechanism controlled by JavaScript, you can use `await` in `script`, for example:
+
+```python
+js = '''async () => {
+    await new Promise(resolve => setTimeout(resolve, 10000));
+    return {
+        'name': 'Germey'
+    }
+}
+'''
+yield PyppeteerRequest(url, callback=self.parse, script=js)
+```
+
+Then you can get the script result from `response.meta['script_result']`; here the result is `{'name': 'Germey'}`.
+
+If you find the JavaScript awkward to write, you can instead use the `actions` argument to define a Python function that receives the `Page` object, for example:
+
+```python
+async def execute_actions(page: Page):
+    await page.evaluate('() => { document.title = "Hello World"; }')
+    return 1
+
+yield PyppeteerRequest(url, callback=self.parse, actions=execute_actions)
+```
+
+Then you can get the actions result from `response.meta['actions_result']`; here the result is `1`.
@@ -366,4 +394,3 @@ chromiumExecutable = {
 ```
 
 Find your operating system above and modify the Chromium executable path as needed.
-

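As a rough illustration of the `actions` contract described above (not part of this commit): an action is simply an async callable that receives the pyppeteer `Page` and whose return value ends up in `response.meta['actions_result']`. `FakePage` below is a hypothetical stand-in so the sketch runs without a browser:

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for pyppeteer's Page, used only to show the
    shape of an `actions` callable; a real Page runs the JS in Chromium."""

    async def evaluate(self, js):
        self.last_script = js  # record instead of executing


async def execute_actions(page):
    # Same action as in the README example above.
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1


result = asyncio.run(execute_actions(FakePage()))
print(result)  # 1
```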
example/example/settings.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 
 RETRY_HTTP_CODES = [403, 500, 502, 503, 504]
 
-GERAPY_PYPPETEER_HEADLESS = True
+GERAPY_PYPPETEER_HEADLESS = False
 
 LOG_LEVEL = 'DEBUG'

example/example/spiders/book.py

Lines changed: 16 additions & 1 deletion
@@ -4,10 +4,24 @@
 from example.items import BookItem
 from gerapy_pyppeteer import PyppeteerRequest
 import logging
+from pyppeteer.page import Page
 
 logger = logging.getLogger(__name__)
 
 
+js = '''async () => {
+    await new Promise(resolve => setTimeout(resolve, 10000));
+    return {
+        'name': 'Germey'
+    }
+}'''
+
+
+async def execute_action(page: Page):
+    await page.evaluate('() => { document.title = "Hello World"; }')
+    return 1
+
+
 class BookSpider(scrapy.Spider):
     name = 'book'
     allowed_domains = ['spa5.scrape.center']
@@ -20,14 +34,15 @@ def start_requests(self):
         """
         start_url = f'{self.base_url}/page/1'
         logger.info('crawling %s', start_url)
-        yield PyppeteerRequest(start_url, callback=self.parse_index, wait_for='.item .name')
+        yield PyppeteerRequest(start_url, callback=self.parse_index, actions=execute_action, wait_for='.item .name', script=js)
 
     def parse_index(self, response):
         """
         extract books and get next page
         :param response:
         :return:
         """
+        logger.debug('response meta %s', response.meta)
         items = response.css('.item')
         for item in items:
             href = item.css('.top a::attr(href)').extract_first()

gerapy_pyppeteer/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-VERSION = (0, 1, '2')
+VERSION = (0, 2, '2rc1')
 
 version = __version__ = '.'.join(map(str, VERSION))

gerapy_pyppeteer/downloadermiddlewares.py

Lines changed: 13 additions & 1 deletion
@@ -312,11 +312,19 @@ async def _handle_interception(pu_request):
             await browser.close()
             return self._retry(request, 504, spider)
 
+        _actions_result = None
+        # evaluate actions
+        if pyppeteer_meta.get('actions'):
+            _actions = pyppeteer_meta.get('actions')
+            logger.debug('evaluating %s', _actions)
+            _actions_result = await _actions(page)
+
+        _script_result = None
         # evaluate script
         if pyppeteer_meta.get('script'):
             _script = pyppeteer_meta.get('script')
             logger.debug('evaluating %s', _script)
-            await page.evaluate(_script)
+            _script_result = await page.evaluate(_script)
 
         # sleep
         _sleep = self.sleep
@@ -365,6 +373,10 @@ async def _handle_interception(pu_request):
             encoding='utf-8',
             request=request
         )
+        if _script_result:
+            response.meta['script_result'] = _script_result
+        if _actions_result:
+            response.meta['actions_result'] = _actions_result
        if screenshot:
             response.meta['screenshot'] = screenshot
         return response

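A minimal sketch (assumed names, no real browser) of the control flow this hunk adds: `actions` are awaited before `script`, and each result is copied into the response meta only when truthy, so a falsy result such as `0` or `{}` would be silently dropped:

```python
import asyncio


async def gather_results(page, pyppeteer_meta):
    """Mirror of the middleware logic in this diff, reduced to the essentials."""
    meta = {}
    _actions_result = None
    if pyppeteer_meta.get('actions'):
        _actions_result = await pyppeteer_meta['actions'](page)
    _script_result = None
    if pyppeteer_meta.get('script'):
        _script_result = await page.evaluate(pyppeteer_meta['script'])
    # note: truthiness checks, not `is not None`, as in the diff above
    if _script_result:
        meta['script_result'] = _script_result
    if _actions_result:
        meta['actions_result'] = _actions_result
    return meta


class FakePage:
    """Hypothetical Page stub: evaluate() just echoes a canned result."""

    async def evaluate(self, script):
        return {'name': 'Germey'}


async def action(page):
    return 1


meta = asyncio.run(gather_results(FakePage(), {'actions': action, 'script': 'ignored'}))
print(meta)  # {'script_result': {'name': 'Germey'}, 'actions_result': 1}
```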
gerapy_pyppeteer/request.py

Lines changed: 21 additions & 11 deletions
@@ -6,8 +6,8 @@ class PyppeteerRequest(Request):
     """
     Scrapy ``Request`` subclass providing additional arguments
     """
-
-    def __init__(self, url, callback=None, wait_until=None, wait_for=None, script=None, proxy=None,
+
+    def __init__(self, url, callback=None, wait_until=None, wait_for=None, script=None, actions=None, proxy=None,
                  sleep=None, timeout=None, ignore_resource_types=None, pretend=None, screenshot=None, meta=None, *args,
                  **kwargs):
         """
@@ -17,6 +17,7 @@ def __init__(self, url, callback=None, wait_until=None, wait_for=None, script=No
         see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
         :param wait_for: wait for some element to load, also supports dict
         :param script: script to execute
+        :param actions: async function to execute with the `Page` object
         :param proxy: use proxy for this time, like `http://x.x.x.x:x`
         :param sleep: time to sleep after loaded, override `GERAPY_PYPPETEER_SLEEP`
         :param timeout: load timeout, override `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
@@ -31,29 +32,38 @@ def __init__(self, url, callback=None, wait_until=None, wait_for=None, script=No
         # use meta info to save args
         meta = copy.deepcopy(meta) or {}
         pyppeteer_meta = meta.get('pyppeteer') or {}
-
+
         self.wait_until = pyppeteer_meta.get('wait_until') if pyppeteer_meta.get(
             'wait_until') is not None else (wait_until or 'domcontentloaded')
-        self.wait_for = pyppeteer_meta.get('wait_for') if pyppeteer_meta.get('wait_for') is not None else wait_for
-        self.script = pyppeteer_meta.get('script') if pyppeteer_meta.get('script') is not None else script
-        self.sleep = pyppeteer_meta.get('sleep') if pyppeteer_meta.get('sleep') is not None else sleep
-        self.proxy = pyppeteer_meta.get('proxy') if pyppeteer_meta.get('proxy') is not None else proxy
-        self.pretend = pyppeteer_meta.get('pretend') if pyppeteer_meta.get('pretend') is not None else pretend
-        self.timeout = pyppeteer_meta.get('timeout') if pyppeteer_meta.get('timeout') is not None else timeout
+        self.wait_for = pyppeteer_meta.get('wait_for') if pyppeteer_meta.get(
+            'wait_for') is not None else wait_for
+        self.script = pyppeteer_meta.get('script') if pyppeteer_meta.get(
+            'script') is not None else script
+        self.actions = pyppeteer_meta.get('actions') if pyppeteer_meta.get(
+            'actions') is not None else actions
+        self.sleep = pyppeteer_meta.get('sleep') if pyppeteer_meta.get(
+            'sleep') is not None else sleep
+        self.proxy = pyppeteer_meta.get('proxy') if pyppeteer_meta.get(
+            'proxy') is not None else proxy
+        self.pretend = pyppeteer_meta.get('pretend') if pyppeteer_meta.get(
+            'pretend') is not None else pretend
+        self.timeout = pyppeteer_meta.get('timeout') if pyppeteer_meta.get(
+            'timeout') is not None else timeout
         self.ignore_resource_types = pyppeteer_meta.get('ignore_resource_types') if pyppeteer_meta.get(
             'ignore_resource_types') is not None else ignore_resource_types
         self.screenshot = pyppeteer_meta.get('screenshot') if pyppeteer_meta.get(
             'screenshot') is not None else screenshot
-
+
         pyppeteer_meta = meta.setdefault('pyppeteer', {})
         pyppeteer_meta['wait_until'] = self.wait_until
         pyppeteer_meta['wait_for'] = self.wait_for
         pyppeteer_meta['script'] = self.script
+        pyppeteer_meta['actions'] = self.actions
         pyppeteer_meta['sleep'] = self.sleep
         pyppeteer_meta['proxy'] = self.proxy
         pyppeteer_meta['pretend'] = self.pretend
         pyppeteer_meta['timeout'] = self.timeout
         pyppeteer_meta['screenshot'] = self.screenshot
         pyppeteer_meta['ignore_resource_types'] = self.ignore_resource_types
-
+
         super().__init__(url, callback, meta=meta, *args, **kwargs)

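The repeated `pyppeteer_meta.get(...) if ... is not None else ...` pattern above encodes one precedence rule: a value already in `meta['pyppeteer']` beats the constructor argument, which beats the default. A sketch of that rule with an invented helper name (`resolve`; a slight simplification, since the real code uses `wait_until or 'domcontentloaded'` for its one default):

```python
def resolve(pyppeteer_meta, key, arg, default=None):
    # An explicit value in meta['pyppeteer'] wins over the constructor
    # argument, which in turn wins over the default.
    value = pyppeteer_meta.get(key)
    if value is not None:
        return value
    return arg if arg is not None else default


# default applies when nothing is given
print(resolve({}, 'wait_until', None, 'domcontentloaded'))  # domcontentloaded
# constructor argument overrides the default
print(resolve({}, 'wait_until', 'load', 'domcontentloaded'))  # load
# meta['pyppeteer'] overrides both
print(resolve({'wait_until': 'networkidle0'}, 'wait_until', 'load', 'domcontentloaded'))  # networkidle0
```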