
Commit a541e6b

Docs: Adjust features, descriptions, and default values

1 parent 42ad192
File tree: 3 files changed, +107 -54 lines

README.md: 35 additions & 17 deletions
@@ -1,23 +1,23 @@
- # x-crawl [![npm](https://img.shields.io/npm/v/x-crawl.svg)](https://www.npmjs.com/package/x-crawl) [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/coder-hxl/x-crawl/blob/main/LICENSE)
+ # x-crawl · [![npm](https://img.shields.io/npm/v/x-crawl.svg)](https://www.npmjs.com/package/x-crawl) [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/coder-hxl/x-crawl/blob/main/LICENSE)

  English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)

- x-crawl is a flexible Node.js multifunctional crawler library. Used to crawl pages, crawl interfaces, crawl files, and poll crawls.
+ x-crawl is a flexible Node.js multipurpose crawler library. Usage is flexible, and many built-in functions are available for crawling pages, interfaces, files, etc.

  > If you also like x-crawl, you can give [x-crawl repository](https://github.com/coder-hxl/x-crawl) a star to support it, thank you for your support!

  ## Features

- - **🔥 AsyncSync** - Just change the mode attribute value to switch async or sync crawling mode.
- - **⚙️Multiple functions** - It can crawl pages, crawl interfaces, crawl files and polling crawls, and supports crawling single or multiple.
- - **🖋️ Flexible writing style** - Simple target configuration, detailed target configuration, mixed target array configuration and advanced configuration, the same crawling API can adapt to multiple configurations.
- - **👀Device Fingerprinting** - Zero configuration or custom configuration to avoid fingerprinting to identify and track us from different locations.
- - **⏱️ Interval Crawling** - No interval, fixed interval and random interval can generate or avoid high concurrent crawling.
- - **🔄 Retry on failure** - Global settings, local settings and individual settings, It can avoid crawling failure caused by temporary problems.
+ - **🔥 Asynchronous/Synchronous** - Just change the mode property to toggle between asynchronous and synchronous crawling modes (see the sketch below).
+ - **⚙️ Multiple purposes** - It can crawl pages, interfaces, and files, and run polling crawls, meeting the needs of various scenarios.
+ - **🖋️ Flexible writing style** - The same crawling API adapts to multiple configurations, and each configuration method has its own strengths.
+ - **👀 Device Fingerprinting** - Zero configuration or custom configuration to avoid being identified and tracked via fingerprinting from different locations.
+ - **⏱️ Interval Crawling** - No interval, fixed interval, or random interval, to generate or avoid high-concurrency crawling.
+ - **🔄 Failed Retry** - Avoid crawling failures caused by transient problems, with no limit on the number of retries.
  - **🚀 Priority Queue** - According to the priority of a single crawling target, it can be crawled ahead of other targets.
  - **☁️ Crawl SPA** - Crawl SPA (Single Page Application) to generate pre-rendered content (aka "SSR" (Server Side Rendering)).
- - **⚒️ Controlling Pages** - Headless browsers can submit forms, keystrokes, event actions, generate screenshots of pages, etc.
- - **🧾 Capture Record** - Capture and record crawling results and other information, and highlight reminders on the console.
+ - **⚒️ Control Page** - You can submit forms, enter keyboard input, perform event operations, generate screenshots of the page, etc.
+ - **🧾 Capture Record** - Capture and record the crawled information, and highlight it on the console.
  - **🦾 TypeScript** - Own types, implement complete types through generics.

  ## Relationship with Puppeteer
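
The renamed first feature is easiest to see in code. Below is a minimal sketch of the mode toggle, assuming the library's documented `xCrawl` factory and its `mode`/`intervalTime` options; the concrete values are illustrative:

```ts
import xCrawl from 'x-crawl'

// Toggling between asynchronous and synchronous crawling is a one-property
// change on the application config: 'async' (concurrent) or 'sync' (one by one).
const myXCrawl = xCrawl({
  mode: 'sync',
  intervalTime: { max: 3000, min: 1000 } // random pause between targets
})
```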
@@ -499,9 +499,9 @@ myXCrawl
        url: 'https://www.example.com/page-2',
        fingerprint: {
          maxWidth: 1980,
-         minWidth: 1980,
+         minWidth: 1200,
          maxHeight: 1080,
-         minHeight: 1080,
+         minHeight: 800,
          platform: 'Android'
        }
      }
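
Read in context, this hunk changes one detail target inside a `crawlPage` call so the fingerprint uses a min/max range instead of a fixed size. A self-contained sketch of that shape; the surrounding call is reconstructed and assumed, only the target itself comes from the hunk:

```ts
// Sketch: the viewport size is picked from the min/max ranges below; a fixed
// size would use equal min and max values.
myXCrawl.crawlPage({
  targets: [
    {
      url: 'https://www.example.com/page-2',
      fingerprint: {
        maxWidth: 1980,
        minWidth: 1200, // width chosen between 1200 and 1980
        maxHeight: 1080,
        minHeight: 800, // height chosen between 800 and 1080
        platform: 'Android'
      }
    }
  ]
})
```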
@@ -589,9 +589,16 @@ The larger the value of the priority attribute, the higher the priority in the current crawl queue.

  ### About Results

- For the result, the result of each crawl target is uniformly wrapped with an object that provides information about the result of the crawl target, such as id, result, success or not, maximum retry, number of retries, error information collected, and so on. Automatically determine whether the return value is wrapped in an array depending on the configuration you choose, and the type fits perfectly in TS.
+ Each crawl target generates a detail object with the following properties:

- The id of each object is determined according to the order of crawl targets in your configuration, and if there is a priority used, it will be sorted by priority.
+ - id: Generated according to the order of the crawl targets; if a priority is used, they are generated in priority order
+ - isSuccess: Whether the crawl succeeded
+ - maxRetry: The maximum number of retries for this crawl target
+ - retryCount: The number of times this crawl target has been retried
+ - crawlErrorQueue: The errors collected for this crawl target
+ - data: The data crawled for this crawl target
+
+ For a specific configuration, whether the detail objects are stored in an array (with the array returned) is determined automatically by the configuration method you choose; otherwise the detail object itself is returned. The types already fit perfectly in TypeScript.

  Details about configuration methods and results are as follows: [crawlPage config](#config), [crawlData config](#config-1), [crawlFile config](#config-2).
@@ -1144,7 +1151,6 @@ export interface XCrawlConfig extends CrawlCommonConfig {
  - baseUrl: undefined
  - intervalTime: undefined
  - crawlPage: undefined
- - launchBrowser: undefined

  #### Detail target config
@@ -1170,8 +1176,9 @@ export interface CrawlPageDetailTargetConfig extends CrawlCommonConfig {

  **Default Value**

+ - url: undefined
  - headers: undefined
- - method: undefined
+ - cookies: undefined
  - priority: undefined
  - viewport: undefined
  - fingerprint: undefined
@@ -1192,8 +1199,8 @@ export interface CrawlDataDetailTargetConfig extends CrawlCommonConfig {

  **Default Value**

+ - url: undefined
  - method: 'GET'
-
  - headers: undefined
  - params: undefined
  - data: undefined
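
With `method` defaulting to `'GET'`, only non-GET targets need to set it explicitly. A minimal sketch; the second target's body is illustrative:

```ts
myXCrawl.crawlData({
  targets: [
    // method: 'GET' is the default, so a bare url is enough
    { url: 'https://www.example.com/api-1' },
    // non-GET requests name the method; data carries the request body
    { url: 'https://www.example.com/api-2', method: 'POST', data: { name: 'x-crawl' } }
  ]
})
```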
@@ -1216,6 +1223,7 @@ export interface CrawlFileDetailTargetConfig extends CrawlCommonConfig {

  **Default Value**

+ - url: undefined
  - headers: undefined
  - priority: undefined
  - storeDir: \_\_dirname
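
Since `storeDir` defaults to `__dirname`, downloaded files land next to the running script unless a target overrides it. A minimal sketch, assuming a CommonJS context where `__dirname` is defined; the override path is illustrative:

```ts
import path from 'node:path'

myXCrawl.crawlFile({
  targets: [
    // without storeDir this file would be saved under __dirname (the default above)
    { url: 'https://www.example.com/file-1.jpg', storeDir: path.join(__dirname, 'upload') }
  ]
})
```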
@@ -1248,6 +1256,8 @@ export interface CrawlPageAdvancedConfig extends CrawlCommonConfig {

  **Default Value**

+ - targets: undefined
+
  - intervalTime: undefined
  - fingerprint: undefined
  - headers: undefined
@@ -1271,6 +1281,7 @@ export interface CrawlDataAdvancedConfig<T> extends CrawlCommonConfig {

  **Default Value**

+ - targets: undefined
  - intervalTime: undefined
  - fingerprint: undefined
  - headers: undefined
@@ -1300,6 +1311,7 @@ export interface CrawlFileAdvancedConfig extends CrawlCommonConfig {

  **Default Value**

+ - targets: undefined
  - intervalTime: undefined
  - fingerprint: undefined
  - headers: undefined
@@ -1533,6 +1545,12 @@ export interface CrawlCommonRes {
  }
  ```

+ - id: Generated according to the order of the crawl targets; if a priority is used, they are generated in priority order
+ - isSuccess: Whether the crawl succeeded
+ - maxRetry: The maximum number of retries for this crawl target
+ - retryCount: The number of times this crawl target has been retried
+ - crawlErrorQueue: The errors collected for this crawl target
+
  #### CrawlPageSingleRes

  ```ts
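
From the property list this hunk adds, the shape of `CrawlCommonRes` can be sketched as follows; the concrete types are assumptions, since the declaration itself sits above the hunk:

```ts
// Reconstructed sketch of CrawlCommonRes; the real declaration may differ.
export interface CrawlCommonRes {
  id: number
  isSuccess: boolean
  maxRetry: number
  retryCount: number
  crawlErrorQueue: Error[]
}
```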

docs/cn.md: 37 additions & 20 deletions
@@ -1,23 +1,23 @@
- # x-crawl [![npm](https://img.shields.io/npm/v/x-crawl.svg)](https://www.npmjs.com/package/x-crawl) [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/coder-hxl/x-crawl/blob/main/LICENSE)
+ # x-crawl · [![npm](https://img.shields.io/npm/v/x-crawl.svg)](https://www.npmjs.com/package/x-crawl) [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/coder-hxl/x-crawl/blob/main/LICENSE)

  [English](https://github.com/coder-hxl/x-crawl#x-crawl) | 简体中文

- x-crawl is a flexible Node.js multifunctional crawler library, used for crawling pages, interfaces, and files, and for polling crawls.
+ x-crawl is a flexible Node.js multifunctional crawler library. Usage is flexible, and many built-in functions are available for crawling pages, interfaces, files, etc.

  > If you also like x-crawl, you can give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a star to support it; thanks to everyone for the support!

  ## Features

- - **🔥 Async/Sync** - Just change the mode attribute value to switch between asynchronous and synchronous crawling modes
- - **⚙️ Multiple functions** - Can crawl pages, interfaces, and files, and run polling crawls, supporting single or multiple targets
- - **🖋️ Flexible writing style** - Simple target configuration, detailed target configuration, mixed target array configuration, and advanced configuration; the same crawling API adapts to multiple configurations.
- - **👀 Device Fingerprint** - Zero configuration or custom configuration to avoid being identified and tracked via fingerprinting from different locations
- - **⏱️ Interval Crawling** - No interval, fixed interval, or random interval, to generate or avoid high-concurrency crawling
- - **🔄 Failed Retry** - Global, local, and individual settings, avoiding crawl failures caused by momentary problems
- - **🚀 Priority Queue** - According to the priority of a single crawl target, it can proceed to be crawled ahead of other targets
+ - **🔥 Async/Sync** - Just change the mode property to toggle between asynchronous and synchronous crawling modes
+ - **⚙️ Multiple purposes** - Can crawl pages, interfaces, and files, and run polling crawls, meeting the needs of various scenarios
+ - **🖋️ Flexible writing style** - The same crawling API adapts to multiple configurations, and each configuration method has its own strengths
+ - **👀 Device Fingerprint** - Zero configuration or custom configuration to avoid being identified and tracked via fingerprinting from different locations
+ - **⏱️ Interval Crawling** - No interval, fixed interval, or random interval, to generate or avoid high-concurrency crawling
+ - **🔄 Failed Retry** - Avoid crawl failures caused by transient problems, with no limit on the number of retries
+ - **🚀 Priority Queue** - According to the priority of a single crawl target, it can be crawled ahead of other targets
  - **☁️ Crawl SPA** - Crawl SPAs (Single Page Applications) to generate pre-rendered content (i.e. "SSR" (Server Side Rendering)).
- - **⚒️ Control Page** - Headless browsers can submit forms, enter keyboard input, perform event operations, generate screenshots of the page, etc.
- - **🧾 Capture Record** - Capture and record the crawl results and other information, and highlight them on the console.
+ - **⚒️ Control Page** - You can submit forms, enter keyboard input, perform event operations, generate screenshots of the page, etc.
+ - **🧾 Capture Record** - Capture and record the crawled information, and highlight it on the console.
  - **🦾 TypeScript** - Has types, implementing complete types through generics.

  ## Relationship with puppeteer
@@ -32,7 +32,7 @@ The crawlPage API has [puppeteer](https://github.com/puppeteer/puppeteer) built in,
  - [Create an application](#创建应用)
  - [An example crawler application](#一个爬虫应用实例)
  - [Crawling mode](#爬取模式)
- - [Device fingerprint](#设备指纹)
+ - [Default device fingerprint](#默认设备指纹)
  - [Multiple crawler application instances](#多个爬虫应用实例)
  - [Crawl page](#爬取页面)
  - [browser instance](#browser-实例)
@@ -48,7 +48,7 @@ The crawlPage API has [puppeteer](https://github.com/puppeteer/puppeteer) built in,
  - [onBeforeSaveItemFile](#onBeforeSaveItemFile)
  - [Start polling](#启动轮询)
  - [Configuration priority](#配置优先级)
- - [Device fingerprint](#设备指纹-1)
+ - [Custom device fingerprint](#自定义设备指纹)
  - [Interval time](#间隔时间)
  - [Failed retry](#失败重试)
  - [Priority queue](#优先队列)
@@ -493,9 +493,9 @@ myXCrawl
        url: 'https://www.example.com/page-2',
        fingerprint: {
          maxWidth: 1980,
-         minWidth: 1980,
+         minWidth: 1200,
          maxHeight: 1080,
-         minHeight: 1080,
+         minHeight: 800,
          platform: 'Android'
        }
      }
@@ -581,9 +581,16 @@ The larger the value of the priority property, the higher its precedence in the current crawl queue.

  ### About Results

- For the results, each crawl target's result is uniformly wrapped in an object that provides information about this crawl target's result, such as: id, result, whether it succeeded, maximum retries, number of retries, collected error information, etc. Whether the return value is wrapped in an array is determined automatically by the configuration method you choose, and the types fit perfectly in TS.
+ Each crawl target generates a detail object with the following properties:

- The id of each object is determined by the order of the crawl targets in your configuration; if a priority is used, they are sorted by priority.
+ - id: Generated according to the order of the crawl targets; if a priority is used, they are generated in priority order
+ - isSuccess: Whether the crawl succeeded
+ - maxRetry: The maximum number of retries for this crawl target
+ - retryCount: The number of times this crawl target has been retried
+ - crawlErrorQueue: The errors collected for this crawl target
+ - data: The data crawled for this crawl target
+
+ For a specific configuration, whether the detail objects are stored in an array (with the array returned) is determined automatically by the configuration method you choose; otherwise the detail object itself is returned. The types already fit perfectly in TypeScript.

  See the related configuration methods and result details: [crawlPage config](#配置), [crawlData config](#配置-1), [crawlFile config](#配置-2).
@@ -1135,7 +1142,6 @@ export interface XCrawlConfig extends CrawlCommonConfig {
  - baseUrl: undefined
  - intervalTime: undefined
  - crawlPage: undefined
- - launchBrowser: undefined

  #### Detail target config
@@ -1161,8 +1167,9 @@ export interface CrawlPageDetailTargetConfig extends CrawlCommonConfig {

  **Default Value**

+ - url: undefined
  - headers: undefined
- - method: undefined
+ - cookies: undefined
  - priority: undefined
  - viewport: undefined
  - fingerprint: undefined
@@ -1183,8 +1190,8 @@ export interface CrawlDataDetailTargetConfig extends CrawlCommonConfig {

  **Default Value**

+ - url: undefined
  - method: 'GET'
-
  - headers: undefined
  - params: undefined
  - data: undefined
@@ -1207,6 +1214,7 @@ export interface CrawlFileDetailTargetConfig extends CrawlCommonConfig {

  **Default Value**

+ - url: undefined
  - headers: undefined
  - priority: undefined
  - storeDir: \_\_dirname
@@ -1239,6 +1247,7 @@ export interface CrawlPageAdvancedConfig extends CrawlCommonConfig {

  **Default Value**

+ - targets: undefined
  - intervalTime: undefined
  - fingerprint: undefined
  - headers: undefined
@@ -1262,6 +1271,7 @@ export interface CrawlDataAdvancedConfig<T> extends CrawlCommonConfig {

  **Default Value**

+ - targets: undefined
  - intervalTime: undefined
  - fingerprint: undefined
  - headers: undefined
@@ -1291,6 +1301,7 @@ export interface CrawlFileAdvancedConfig extends CrawlCommonConfig {

  **Default Value**

+ - targets: undefined
  - intervalTime: undefined
  - fingerprint: undefined
  - headers: undefined
@@ -1524,6 +1535,12 @@ export interface CrawlCommonRes {
  }
  ```

+ - id: Generated according to the order of the crawl targets; if a priority is used, they are generated in priority order
+ - isSuccess: Whether the crawl succeeded
+ - maxRetry: The maximum number of retries for this crawl target
+ - retryCount: The number of times this crawl target has been retried
+ - crawlErrorQueue: The errors collected for this crawl target
+
  #### CrawlPageSingleRes

  ```ts
