Commit b815fe3 (1 parent: dba86c7)

Docs: How to open a browser

2 files changed: +40 −6 lines

README.md

Lines changed: 20 additions & 3 deletions
````diff
@@ -40,6 +40,7 @@ The crawlPage API has built-in [puppeteer](https://github.com/puppeteer/puppeteer)
 - [Page Instance](#Page-Instance)
 - [life Cycle](#life-Cycle)
 - [onCrawlItemComplete](#onCrawlItemComplete)
+- [Open Browser](#Open-Browser)
 - [Crawl Interface](#Crawl-Interface)
 - [life Cycle](#life-Cycle-1)
 - [onCrawlItemComplete](#onCrawlItemComplete-1)
````
````diff
@@ -163,7 +164,7 @@ myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
     await new Promise((r) => setTimeout(r, 300))

     // Gets the URL of the page image
-    const urls = await page!.$$eval(
+    const urls = await page.$$eval(
       `${elSelectorMap[id - 1]} img`,
       (imgEls) => {
         return imgEls.map((item) => item.src)
````
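The hunk above drops the TypeScript non-null assertion (`page!`), which is invalid syntax in a plain JavaScript example. The callback handed to `page.$$eval` is serialized and runs inside the browser page, so it must be self-contained. As a minimal sketch, the same extraction logic can be exercised without puppeteer by using plain objects in place of `<img>` elements:

```javascript
// The pure extraction callback from the example: map <img>-like elements
// to their src URLs. Identical logic to the arrow function passed to $$eval.
const collectImageSrcs = (imgEls) => imgEls.map((item) => item.src)

// Stand-ins for DOM <img> elements, so the logic runs without a browser.
const fakeImgEls = [
  { src: 'https://example.com/a.png' },
  { src: 'https://example.com/b.png' }
]

console.log(collectImageSrcs(fakeImgEls))
// → [ 'https://example.com/a.png', 'https://example.com/b.png' ]
```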
````diff
@@ -282,13 +283,13 @@ myXCrawl.crawlPage('https://www.example.com').then((res) => {

 #### Browser Instance

-When you call the crawlPage API to crawl pages in the same crawler instance, the browser instance used is the same, because the crawlPage API shares the browser instance within the same crawler instance. It is a headless browser with no UI shell; what it does is bring **all modern web platform features** provided by the browser rendering engine to the code. For specific usage, please refer to [Browser](https://pptr.dev/api/puppeteer.browser).
+When you call the crawlPage API to crawl pages in the same crawler instance, the browser instance used is the same, because the crawlPage API shares the browser instance within the same crawler instance. For specific usage, please refer to [Browser](https://pptr.dev/api/puppeteer.browser).

 **Note:** The browser will keep running, so the process will not terminate. To stop it, execute browser.close(). Do not call browser.close() if you still need to use [crawlPage](#crawlPage) or [page](#page) later, because the crawlPage API shares the browser instance within the same crawler instance.

 #### Page Instance

-When you call the crawlPage API to crawl pages in the same crawler instance, a new page instance is generated from the browser instance each time. It can be used for interactive operations. For specific usage, please refer to [Page](https://pptr.dev/api/puppeteer.page).
+When you call the crawlPage API to crawl pages in the same crawler instance, a new page instance is generated from the browser instance each time. For specific usage, please refer to [Page](https://pptr.dev/api/puppeteer.page).

 The browser instance retains a reference to the page instance. If a page instance is no longer needed, it must be closed explicitly, otherwise it will cause a memory leak.
````
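The note in this hunk says the shared browser keeps the Node.js process alive until `browser.close()` is called, and that nothing may be crawled afterwards. A minimal runnable sketch of that ordering, using a stub in place of the real puppeteer browser (the stub and helper names are illustrative, not x-crawl API):

```javascript
// Stub standing in for the shared puppeteer Browser, so the shutdown
// ordering can be shown without launching Chromium.
function makeStubBrowser() {
  return {
    closed: false,
    async close() { this.closed = true }
  }
}

// Do all crawlPage/page work first, then close the shared browser.
// After close() the browser must not be used again.
async function crawlThenShutdown(browser, doCrawls) {
  await doCrawls()      // all crawling happens before shutdown
  await browser.close() // lets the process terminate
  return browser.closed
}

crawlThenShutdown(makeStubBrowser(), async () => {
  // ...crawlPage calls would go here...
}).then((closed) => console.log('browser closed:', closed))
// → browser closed: true
```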

````diff
@@ -323,6 +324,22 @@ In the onCrawlItemComplete function, you can get the results of each crawled goal

 **Note:** If you need to crawl many pages at one time, you need to use this life cycle function to process the result of each target and close the page instance after each page is crawled. If you do not close the page instances, the program will crash due to too many open pages.

+#### Open Browser
+
+Disable running the browser in headless mode.
+
+```js
+import xCrawl from 'x-crawl'
+
+const myXCrawl = xCrawl({
+  maxRetry: 3,
+  // Cancel running the browser in headless mode
+  crawlPage: { launchBrowser: { headless: false } }
+})
+
+myXCrawl.crawlPage('https://www.example.com').then((res) => {})
+```
+
 ### Crawl Interface

 Crawl interface data through [crawlData()](#crawlData).
````
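The **Note** above says each page instance must be closed inside the life cycle callback when crawling many targets at once. A sketch of that pattern follows; the per-item result shape (`item.data.page`) is an assumption based on these docs, so verify it against your installed x-crawl version:

```javascript
// Hypothetical helper: safely pull the puppeteer page out of one crawl
// result item; returns null when the item carries no page.
function pageOf(item) {
  return (item && item.data && item.data.page) || null
}

// Usage sketch (requires x-crawl installed; not run here):
// import xCrawl from 'x-crawl'
// const myXCrawl = xCrawl({ maxRetry: 3 })
// myXCrawl.crawlPage({
//   targets: ['https://www.example.com/1', 'https://www.example.com/2'],
//   onCrawlItemComplete(item) {
//     const page = pageOf(item)
//     if (page) page.close() // release each tab so many targets don't crash the process
//   }
// })

console.log(pageOf({ data: { page: 'fake-page' } })) // → fake-page
console.log(pageOf({}))                              // → null
```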

docs/cn.md

Lines changed: 20 additions & 3 deletions
````diff
@@ -40,6 +40,7 @@ The crawlPage API has built-in [puppeteer](https://github.com/puppeteer/puppeteer),
 - [page instance](#page-实例)
 - [life cycle](#生命周期)
 - [onCrawlItemComplete](#onCrawlItemComplete)
+- [Open Browser](#打开浏览器)
 - [Crawl Interface](#爬取接口)
 - [life cycle](#生命周期-1)
 - [onCrawlItemComplete](#onCrawlItemComplete-1)
````
````diff
@@ -161,7 +162,7 @@ myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
     await new Promise((r) => setTimeout(r, 300))

     // Gets the URL of the page image
-    const urls = await page!.$$eval(
+    const urls = await page.$$eval(
       `${elSelectorMap[id - 1]} img`,
       (imgEls) => {
         return imgEls.map((item) => item.src)
````
````diff
@@ -281,13 +282,13 @@ myXCrawl.crawlPage('https://www.example.com').then((res) => {

 #### browser instance

-When you call the crawlPage API to crawl pages in the same crawler instance, the browser instance used is the same, because the crawlPage API shares the browser instance within the same crawler instance. It is a headless browser with no UI shell; what it does is bring **all modern web platform features** provided by the browser rendering engine to the code. For specific usage, see [Browser](https://pptr.dev/api/puppeteer.browser)
+When you call the crawlPage API to crawl pages in the same crawler instance, the browser instance used is the same, because the crawlPage API shares the browser instance within the same crawler instance. For specific usage, see [Browser](https://pptr.dev/api/puppeteer.browser)

 **Note:** The browser will keep running, so the process will not terminate. To stop it, execute browser.close(). Do not call browser.close() if you still need to use [crawlPage](#crawlPage) or [page](#page) later, because the crawlPage API shares the browser instance within the same crawler instance.

 #### page instance

-When you call the crawlPage API to crawl pages in the same crawler instance, a new page instance is generated from the browser instance each time. It can be used for interactive operations. For specific usage, see [Page](https://pptr.dev/api/puppeteer.page)
+When you call the crawlPage API to crawl pages in the same crawler instance, a new page instance is generated from the browser instance each time. For specific usage, see [Page](https://pptr.dev/api/puppeteer.page)

 The browser instance internally retains a reference to the page instance. If a page instance is no longer needed, close it yourself, otherwise it will cause a memory leak.
````

````diff
@@ -322,6 +323,22 @@ Life cycle functions of the crawlPage API:

 **Note:** If you need to crawl many pages at one time, you need to use this life cycle function to process the result of each target and close the page instance after each page is crawled. If you do not close the page instances, the program will crash due to too many open pages.

+#### Open Browser
+
+Disable running the browser in headless mode.
+
+```js
+import xCrawl from 'x-crawl'
+
+const myXCrawl = xCrawl({
+  maxRetry: 3,
+  // Cancel running the browser in headless mode
+  crawlPage: { launchBrowser: { headless: false } }
+})
+
+myXCrawl.crawlPage('https://www.example.com').then((res) => {})
+```
+
 ### Crawl Interface

 Crawl interface data through [crawlData()](#crawlData).
````
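The `headless: false` option added in this commit lives under `crawlPage.launchBrowser`, whose options are forwarded to puppeteer's launch call. A small sketch showing the configuration shape as plain data; the `buildCrawlConfig` helper is hypothetical, not x-crawl API:

```javascript
// Hypothetical helper building the xCrawl() options object from the docs:
// headed === true disables headless mode, matching the example above.
function buildCrawlConfig(headed) {
  return {
    maxRetry: 3,
    crawlPage: { launchBrowser: { headless: !headed } }
  }
}

console.log(buildCrawlConfig(true))
// → { maxRetry: 3, crawlPage: { launchBrowser: { headless: false } } }
```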
