@@ -36,7 +36,7 @@ web page which you configured the request as PyppeteerRequest.
GerapyPyppeteer provides some optional settings.
- ### Concurrency
+ ### Concurrency
You can directly use Scrapy's settings to set the concurrency of Pyppeteer,
for example:
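A minimal sketch of such a Scrapy setting in `settings.py` (the value here is only illustrative; tune it to your target site and hardware):

``` python
# settings.py (sketch): limit how many Pyppeteer pages render concurrently
CONCURRENT_REQUESTS = 3
```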
@@ -47,7 +47,7 @@ CONCURRENT_REQUESTS = 3
### Pretend as Real Browser
- Some websites will detect WebDriver or Headless mode; GerapyPyppeteer can
+ Some websites will detect WebDriver or Headless mode; GerapyPyppeteer can
pretend to be Chromium by injecting scripts. This is enabled by default.
You can turn it off to speed things up if the website does not detect WebDriver:
@@ -56,15 +56,15 @@ You can close it if website does not detect WebDriver to speed up:
GERAPY_PYPPETEER_PRETEND = False
```
- Also, you can use the `pretend` attribute of `PyppeteerRequest` to override this
+ Also, you can use the `pretend` attribute of `PyppeteerRequest` to override this
configuration.
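As a sketch (the URL and the spider method shown here are placeholders), overriding the global setting for a single request could look like this:

``` python
from gerapy_pyppeteer import PyppeteerRequest

def start_requests(self):
    # pretend=False turns the stealth scripts off for this request only,
    # overriding the global GERAPY_PYPPETEER_PRETEND setting
    yield PyppeteerRequest('https://example.com', callback=self.parse, pretend=False)
```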
### Logging Level
By default, Pyppeteer logs all of its debug messages, so GerapyPyppeteer
configures Pyppeteer's logging level to WARNING.
- If you want to see more logs from Pyppeteer, you can change this setting:
+ If you want to see more logs from Pyppeteer, you can change this setting:
``` python
import logging
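# The rest of this block is elided by the diff. As a hedged sketch, the setting
# below is assumed to be the one GerapyPyppeteer reads for Pyppeteer's log level:
GERAPY_PYPPETEER_LOGGING_LEVEL = logging.DEBUG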
@@ -82,11 +82,11 @@ GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30
### Headless
- By default, Pyppeteer runs in `Headless` mode; you can also
+ By default, Pyppeteer runs in `Headless` mode; you can also
change it to `False` as you need; the default is `True`:
``` python
- GERAPY_PYPPETEER_HEADLESS = False
+ GERAPY_PYPPETEER_HEADLESS = False
```
### Window Size
@@ -137,19 +137,19 @@ GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['stylesheet', 'script']
All of the optional resource types are listed below (a small usage sketch follows the list):
- * document: the original HTML document
- * stylesheet: CSS files
- * script: JavaScript files
- * image: Images
- * media: Media files such as audio or video
- * font: Font files
- * texttrack: Text Track files
- * xhr: Ajax requests
- * fetch: Fetch requests
- * eventsource: Event Source
- * websocket: WebSocket
- * manifest: Manifest files
- * other: Other files
+ - document: the original HTML document
+ - stylesheet: CSS files
+ - script: JavaScript files
+ - image: Images
+ - media: Media files such as audio or video
+ - font: Font files
+ - texttrack: Text Track files
+ - xhr: Ajax requests
+ - fetch: Fetch requests
+ - eventsource: Event Source
+ - websocket: WebSocket
+ - manifest: Manifest files
+ - other: Other files
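For instance, a sketch that skips heavyweight assets while keeping the HTML and scripts (the chosen types are only an illustration):

``` python
# settings.py (sketch): skip images, media and fonts to speed up rendering
GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES = ['image', 'media', 'font']
```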
### Screenshot
@@ -158,7 +158,7 @@ You can get screenshot of loaded page, you can pass `screenshot` args to `Pyppet
- `type` (str): Specify screenshot type, can be either `jpeg` or `png`. Defaults to `png`.
- `quality` (int): The quality of the image, between 0-100. Not applicable to `png` images.
- `fullPage` (bool): When true, take a screenshot of the full scrollable page. Defaults to `False`.
- - `clip` (dict): An object which specifies the clipping region of the page. This option should have the following fields:
+ - `clip` (dict): An object which specifies the clipping region of the page. This option should have the following fields:
- `x` (int): x-coordinate of top-left corner of clip area.
- `y` (int): y-coordinate of top-left corner of clip area.
- `width` (int): width of clipping area.
@@ -200,41 +200,69 @@ GERAPY_PYPPETEER_SCREENSHOT = {
`PyppeteerRequest` provides args which can override the global settings above.
- * url: request url
- * callback: callback
- * wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2",
- see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
- * wait_for: wait for some element to load, also supports dict
- * script: script to execute
- * proxy: use a proxy for this request, like `http://x.x.x.x:x`
- * sleep: time to sleep after the page is loaded, overrides `GERAPY_PYPPETEER_SLEEP`
- * timeout: load timeout, overrides `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
- * ignore_resource_types: ignored resource types, overrides `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
- * pretend: pretend to be a normal browser, overrides `GERAPY_PYPPETEER_PRETEND`
- * screenshot: screenshot options, see
- https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot,
- overrides `GERAPY_PYPPETEER_SCREENSHOT`
+ - url: request url
+ - callback: callback
+ - wait_until: one of "load", "domcontentloaded", "networkidle0", "networkidle2",
+ see https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.goto, default is `domcontentloaded`
+ - wait_for: wait for some element to load, also supports dict
+ - script: script to execute
+ - proxy: use a proxy for this request, like `http://x.x.x.x:x`
+ - sleep: time to sleep after the page is loaded, overrides `GERAPY_PYPPETEER_SLEEP`
+ - timeout: load timeout, overrides `GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT`
+ - ignore_resource_types: ignored resource types, overrides `GERAPY_PYPPETEER_IGNORE_RESOURCE_TYPES`
+ - pretend: pretend to be a normal browser, overrides `GERAPY_PYPPETEER_PRETEND`
+ - screenshot: screenshot options, see
+ https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.screenshot,
+ overrides `GERAPY_PYPPETEER_SCREENSHOT`
For example, you can configure PyppeteerRequest as:
``` python
from gerapy_pyppeteer import PyppeteerRequest
def parse(self, response):
-     yield PyppeteerRequest(url,
+     yield PyppeteerRequest(url,
          callback=self.parse_detail,
          wait_until='domcontentloaded',
          wait_for='title',
-         script='() => { console.log(document) }',
+         script='() => { return {name: "Germey"} }',
          sleep=2)
```
Then Pyppeteer will:
- * wait for document to load
- * wait for title to load
- * execute `console.log(document)` script
- * sleep for 2s
- * return the rendered web page content
+
+ - wait for document to load
+ - wait for title to load
+ - execute the `() => { return {name: "Germey"} }` script
+ - sleep for 2s
+ - return the rendered web page content; if a screenshot was requested, get it from `response.meta['screenshot']`
+ - return the script execution result, available from `response.meta['script_result']` (see the sketch below)
+
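As a sketch, the callback from the example above could read these values like this (`.get` is used defensively, since the keys are only present when the corresponding feature is used):

``` python
def parse_detail(self, response):
    # the rendered HTML is the normal response body
    html = response.text
    # return value of the executed script, if a script was passed
    script_result = response.meta.get('script_result')
    # screenshot data, only present when screenshot capture is enabled
    screenshot = response.meta.get('screenshot')
    self.logger.info('rendered %d chars, script result: %s, screenshot: %s',
                     len(html), script_result, screenshot is not None)
```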
+ For a waiting mechanism controlled by JavaScript, you can use `await` in `script`, for example:
+
+ ``` python
+ js = '''async () => {
+     await new Promise(resolve => setTimeout(resolve, 10000));
+     return {
+         'name': 'Germey'
+     }
+ }
+ '''
+ yield PyppeteerRequest(url, callback=self.parse, script=js)
+ ```
+
+ Then you can get the script result from `response.meta['script_result']`; the result is `{'name': 'Germey'}`.
+
+ If you find the JavaScript awkward to write, you can use the `actions` argument instead: a Python function that receives the page and runs your logic, for example:
+
+ ``` python
+ from pyppeteer.page import Page
+
+ async def execute_actions(page: Page):
+     await page.evaluate('() => { document.title = "Hello World"; }')
+     return 1
+
+ yield PyppeteerRequest(url, callback=self.parse, actions=execute_actions)
+ ```
+
+ Then you can get the actions result from `response.meta['actions_result']`; the result is `1`.
## Example
@@ -366,4 +394,3 @@ chromiumExecutable = {
```
Find the entry for your own operating system and modify the Chrome or Chromium executable path accordingly.
-
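As a runtime alternative to editing the installed package, a hedged sketch (the platform key and path below are only examples) is to override the entry before the spider starts:

``` python
from pathlib import Path
from pyppeteer import chromium_downloader

# point pyppeteer at a locally installed browser instead of the bundled
# Chromium download; the key ('linux') and path are illustrative
chromium_downloader.chromiumExecutable['linux'] = Path('/usr/bin/google-chrome')
```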