Commit c7d7344

Feat: Examples and more

1 parent c83c222 commit c7d7344

File tree

7 files changed: +218 -57 lines changed

README.md

Lines changed: 74 additions & 23 deletions
@@ -51,7 +51,7 @@ The crawlPage API has built-in [puppeteer](https://github.com/puppeteer/puppeteer)
 - [Config Priority](#Config-Priority)
 - [Interval Time](#Interval-Time)
 - [Fail Retry](#Fail-Retry)
-- [Rotate Proxy](#Rotate Proxy)
+- [Rotate Proxy](#Rotate-Proxy)
 - [Custom Device Fingerprint](#Custom-Device-Fingerprint)
 - [Priority Queue](#Priority-Queue)
 - [About Results](#About-Results)
@@ -115,6 +115,8 @@ The crawlPage API has built-in [puppeteer](https://github.com/puppeteer/puppeteer)
 - [API Other](#API-Other)
 - [AnyObject](#AnyObject)
 - [More](#More)
+  - [Community](#Community)
+  - [Issues](#Issues)

 ## Install
@@ -126,14 +128,14 @@ npm install x-crawl

 ## Example

-Take the automatic acquisition of photos of experiences and homes in hawaii every day as an example::
+Take the automatic acquisition of some photos of experiences and homes around the world every day as an example:

 ```js
 // 1.Import module ES/CJS
 import xCrawl from 'x-crawl'

 // 2.Create a crawler instance
-const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })
+const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })

 // 3.Set the crawling task
 /*
@@ -142,27 +144,31 @@ const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })
 */
 myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
   // Call crawlPage API to crawl Page
-  const res = await myXCrawl.crawlPage([
-    'https://zh.airbnb.com/s/hawaii/experiences',
-    'https://zh.airbnb.com/s/hawaii/homes'
-  ])
+  const res = await myXCrawl.crawlPage({
+    targets: [
+      'https://www.airbnb.cn/s/experiences',
+      'https://www.airbnb.cn/s/plus_homes'
+    ],
+    viewport: { width: 1920, height: 1080 }
+  })

   // Store the image URL to targets
   const targets = []
-  const elSelectorMap = ['.c14whb16', '.l196t2l1']
+  const elSelectorMap = ['._fig15y', '._aov0j6']
   for (const item of res) {
     const { id } = item
     const { page } = item.data
-    const boxSelector = elSelectorMap[id - 1]

-    // Wait for the image element to appear
-    await page.waitForSelector(`${boxSelector} img`)
+    // Wait for the page to load
+    await new Promise((r) => setTimeout(r, 300))

-    // Gets the URL of the page's wheel image element
-    const boxHandle = await page.$(boxSelector)
-    const urls = await boxHandle.$$eval('picture img', (imgEls) => {
-      return imgEls.map((item) => item.src)
-    })
+    // Gets the URL of the page image
+    const urls = await page!.$$eval(
+      `${elSelectorMap[id - 1]} img`,
+      (imgEls) => {
+        return imgEls.map((item) => item.src)
+      }
+    )
     targets.push(...urls)

     // Close page
@@ -532,7 +538,7 @@ The intervalTime option defaults to undefined. If there is a setting value, it

 It can avoid crawling failure due to temporary problems, and will wait for the end of this round of crawling targets to crawl again.

-The number of failed retries can be set by creating crawler application instance, advanced usage, and detailed target.
+You can set this in three places: the crawler application instance, advanced usage, and the detailed target.

 ```js
 import xCrawl from 'x-crawl'
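The snippet this hunk leads into is truncated in the diff, so here is a minimal sketch of the three levels it refers to; the `https://www.example.com/...` targets are placeholders, and the option shapes follow the patterns shown elsewhere in this diff:

```js
import xCrawl from 'x-crawl'

// 1. Crawler application instance: applies to every crawl made with it
const myXCrawl = xCrawl({ maxRetry: 3 })

// 2. Advanced usage: applies to all targets of this call
myXCrawl.crawlPage({
  targets: ['https://www.example.com/page-1', 'https://www.example.com/page-2'],
  maxRetry: 6
})

// 3. Detailed target: applies to this one target only
myXCrawl.crawlPage({ url: 'https://www.example.com/page-3', maxRetry: 9 })
```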
@@ -550,7 +556,7 @@ The maxRetry attribute determines how many times to retry.

 With failed retries, custom error times and HTTP status codes, the proxy is automatically rotated for crawling targets.

-You can set the number of failed retries in the three places of creating a crawler application instance, advanced usage, and detailed goals.
+You can set this in three places: the crawler application instance, advanced usage, and the detailed target.

 Take crawlPage as an example:
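The crawlPage example itself falls outside this hunk; a minimal sketch of proxy rotation combined with failure retry, assuming the `proxy.urls` / `switchByHttpStatus` / `switchByErrorCount` option shape from the x-crawl docs and placeholder proxy addresses:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl()

myXCrawl.crawlPage({
  targets: ['https://www.example.com/page-1'],
  maxRetry: 10,
  proxy: {
    // Placeholder proxy addresses; the crawler rotates to the next one on failure
    urls: ['http://localhost:14892', 'http://localhost:28371'],
    // Rotate when a response comes back with one of these HTTP status codes
    switchByHttpStatus: [401, 403],
    // Rotate after this many errors on the current proxy
    switchByErrorCount: 3
  }
})
```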
@@ -615,9 +621,9 @@ myXCrawl.crawlPage({
     'https://www.example.com/page-1',
     'https://www.example.com/page-2',
     'https://www.example.com/page-3',
-    // Unfingerprint for this target
+    // Cancel the fingerprint for this target
     { url: 'https://www.example.com/page-4', fingerprint: null },
-    // Set the fingerprint individually for this target
+    // Set a separate fingerprint for this target
     {
       url: 'https://www.example.com/page-5',
       fingerprint: {
@@ -635,8 +641,9 @@ myXCrawl.crawlPage({
       }
     }
   ],
-  // Set the fingerprint uniformly for this target
+  // Set fingerprints uniformly for the targets of this call
   fingerprints: [
+    // Device fingerprint 1
    {
      maxWidth: 1024,
      maxHeight: 800,
@@ -648,7 +655,7 @@ myXCrawl.crawlPage({
      versions: [
        {
          name: 'Chrome',
-          // browser version
+          // Browser version
          maxMajorVersion: 112,
          minMajorVersion: 100,
          maxMinorVersion: 20,
@@ -663,6 +670,44 @@ myXCrawl.crawlPage({
        }
      ]
    }
+    },
+    // Device fingerprint 2
+    {
+      platform: 'Windows',
+      mobile: 'random',
+      userAgent: {
+        value:
+          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59',
+        versions: [
+          {
+            name: 'Chrome',
+            maxMajorVersion: 91,
+            minMajorVersion: 88,
+            maxMinorVersion: 10,
+            maxPatchVersion: 5615
+          },
+          { name: 'Safari', maxMinorVersion: 36, maxPatchVersion: 2333 },
+          { name: 'Edg', maxMinorVersion: 10, maxPatchVersion: 864 }
+        ]
+      }
+    },
+    // Device fingerprint 3
+    {
+      platform: 'Windows',
+      mobile: 'random',
+      userAgent: {
+        value:
+          'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
+        versions: [
+          {
+            name: 'Firefox',
+            maxMajorVersion: 47,
+            minMajorVersion: 43,
+            maxMinorVersion: 10,
+            maxPatchVersion: 5000
+          }
+        ]
+      }
    }
  ]
 })
@@ -1706,4 +1751,10 @@ export interface AnyObject extends Object {

 ## More

-If you have **problems, needs, good suggestions** please raise **Issues** in https://github.com/coder-hxl/x-crawl/issues.
+### Community
+
+**GitHub Discussions:** you can discuss via [GitHub Discussions](https://github.com/coder-hxl/x-crawl/discussions).
+
+### Issues
+
+If you have questions, needs, or good suggestions, you can raise them at [GitHub Issues](https://github.com/coder-hxl/x-crawl/issues).

assets/cn/crawler-result.png: -53.2 KB

assets/cn/crawler.png: -15.7 KB

assets/en/crawler-result.png: 47.9 KB

assets/en/crawler.png: -22.9 KB

docs/cn.md

Lines changed: 69 additions & 14 deletions
@@ -116,6 +116,8 @@ The crawlPage API has built-in [puppeteer](https://github.com/puppeteer/puppeteer),
 - [API Other](#API-Other)
 - [AnyObject](#AnyObject)
 - [More](#更多)
+  - [Community](#社区)
+  - [Issues](#Issues)

 ## Install
@@ -127,7 +129,7 @@ npm install x-crawl

 ## Example

-Take the daily automatic crawl of the carousel images on a site's home, Chinese animation, and movie pages as an example:
+Take the automatic acquisition of some photos of experiences and homes around the world every day as an example:

 ```js
 // 1. Import module ES/CJS
@@ -139,23 +141,31 @@ const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })
 // 3. Set the crawling task
 // Call the startPolling API to start polling; the callback is invoked once a day
 myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
-  // Call the crawlPage API to crawl the home, Chinese animation, and movie pages
-  const res = await myXCrawl.crawlPage([
-    'https://www.bilibili.com',
-    'https://www.bilibili.com/guochuang',
-    'https://www.bilibili.com/movie'
-  ])
+  // Call the crawlPage API to crawl the pages
+  const res = await myXCrawl.crawlPage({
+    targets: [
+      'https://www.airbnb.cn/s/experiences',
+      'https://www.airbnb.cn/s/plus_homes'
+    ],
+    viewport: { width: 1920, height: 1080 }
+  })

   // Store the image URLs in targets
   const targets = []
-  const elSelectorMap = ['.carousel-inner', '.chief-recom-item', '.bg-item']
+  const elSelectorMap = ['._fig15y', '._aov0j6']
   for (const item of res) {
     const { id } = item
     const { page } = item.data

-    // Get the URLs of the page's carousel image elements
-    const urls = await page.$$eval(`${elSelectorMap[id - 1]} img`, (imgEls) =>
-      imgEls.map((item) => item.src)
+    // Wait for the page to finish loading
+    await new Promise((r) => setTimeout(r, 300))
+
+    // Get the URLs of the page images
+    const urls = await page!.$$eval(
+      `${elSelectorMap[id - 1]} img`,
+      (imgEls) => {
+        return imgEls.map((item) => item.src)
+      }
     )
     targets.push(...urls)

@@ -520,7 +530,7 @@ The intervalTime option defaults to undefined. If a value is set, it will

 It can avoid crawl failures caused by temporary problems, and will wait for this round of crawl targets to finish before crawling the target again.

-The number of failed retries can be set in three places: when creating the crawler application instance, in advanced usage, and in the detailed target
+It can be set in three places: the crawler application instance, advanced usage, and the detailed target

 ```js
 import xCrawl from 'x-crawl'
@@ -538,7 +548,7 @@ The maxRetry attribute determines how many times to retry.

 With failed retries, custom error counts and HTTP status codes, the proxy is automatically rotated for crawl targets.

-The number of failed retries can be set in three places: when creating the crawler application instance, in advanced usage, and in the detailed target
+It can be set in three places: the crawler application instance, advanced usage, and the detailed target

 Take crawlPage as an example:
@@ -625,6 +635,7 @@ myXCrawl.crawlPage({
   ],
   // Set fingerprints uniformly for the targets of this call
   fingerprints: [
+    // Device fingerprint 1
    {
      maxWidth: 1024,
      maxHeight: 800,
@@ -651,6 +662,44 @@ myXCrawl.crawlPage({
        }
      ]
    }
+    },
+    // Device fingerprint 2
+    {
+      platform: 'Windows',
+      mobile: 'random',
+      userAgent: {
+        value:
+          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59',
+        versions: [
+          {
+            name: 'Chrome',
+            maxMajorVersion: 91,
+            minMajorVersion: 88,
+            maxMinorVersion: 10,
+            maxPatchVersion: 5615
+          },
+          { name: 'Safari', maxMinorVersion: 36, maxPatchVersion: 2333 },
+          { name: 'Edg', maxMinorVersion: 10, maxPatchVersion: 864 }
+        ]
+      }
+    },
+    // Device fingerprint 3
+    {
+      platform: 'Windows',
+      mobile: 'random',
+      userAgent: {
+        value:
+          'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
+        versions: [
+          {
+            name: 'Firefox',
+            maxMajorVersion: 47,
+            minMajorVersion: 43,
+            maxMinorVersion: 10,
+            maxPatchVersion: 5000
+          }
+        ]
+      }
    }
  ]
 })
@@ -1693,4 +1742,10 @@ export interface AnyObject extends Object {

 ## More

-If you have **problems, needs, or good suggestions**, please raise **Issues** at https://github.com/coder-hxl/x-crawl/issues.
+### Community
+
+**GitHub Discussions:** you can discuss via [GitHub Discussions](https://github.com/coder-hxl/x-crawl/discussions).
+
+### Issues
+
+If you have **problems, needs, or good suggestions**, you can raise **Issues** at [GitHub Issues](https://github.com/coder-hxl/x-crawl/issues).
