Crawl the Baidu Index without Selenium or PhantomJS
- flask
- pillow
- numpy
- requests
- lxml
- docker
- Start Splash with Docker
sudo docker pull scrapinghub/splash
sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
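Once the container is running, Splash serves its HTTP API on the mapped port 8050. As a rough sketch (not code from this project), the helper below builds a `render.html` request URL for a page. The name `splash_render_url`, the `wait` default, and the example target are assumptions for illustration only.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Matches the -p 8050:8050 port mapping used when starting the container.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(target_url, wait=0.5, base=SPLASH_BASE):
    """Build a Splash render.html URL that asks Splash to render target_url.

    Hypothetical helper: `wait` is the number of seconds Splash waits
    after the page loads before returning the rendered HTML.
    """
    query = urlencode([("url", target_url), ("wait", wait)])
    return "{}/render.html?{}".format(base, query)


if __name__ == "__main__":
    # Fetching the printed URL (e.g. with requests.get) should return
    # rendered HTML if the container is up.
    print(splash_render_url("http://example.com"))
```

Fetching that URL from the host is a quick way to confirm the container is reachable before moving on to the next steps.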
- Clone the project
git clone https://github.com/Syhen/baidu-index.git
- Set the baidu-index environment variables
- Start the Flask microservice
cd baidu-index/baidu_index/backend
python index.py
- Configure nginx
Configure nginx for the microservice, because Splash cannot resolve localhost.
Then change the domain in baidu_index.core.index.get_res2 to the domain you configured.
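The reason for this step is that Splash runs inside a container, where `localhost` points at the container itself rather than at the host running Flask. The sketch below illustrates the kind of domain swap involved. It is not the code in `get_res2`; the function name `use_public_domain`, the example domain, and the assumption that nginx serves the microservice on the default HTTP port are all hypothetical.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2

# Hosts that resolve to the container itself when seen from inside Splash.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def use_public_domain(url, domain):
    """Replace a local host (and its port) in `url` with `domain`.

    Assumes nginx exposes the microservice under `domain` on the default
    port, so the local port is dropped. URLs that do not point at a local
    host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname not in LOCAL_HOSTS:
        return url
    return urlunsplit(
        (parts.scheme, domain, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # The path and domain here are placeholders, not the project's real ones.
    print(use_public_domain("http://localhost:5000/res2?x=1", "index.example.com"))
```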
- demo
from __future__ import unicode_literals
from requests.cookies import RequestsCookieJar
from baidu_index.core.index import BaiduIndexCrawler
cookies = RequestsCookieJar()
# fill the jar with the cookies from a logged-in Baidu session before crawling
baidu_index_crawler = BaiduIndexCrawler('机器学习', cookies, start_date="2017-01-01", end_date="2017-01-31")
baidu_index_crawler.next()
# 936
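The demo leaves the cookie jar empty and expects it to be filled from a logged-in session. One possible way to do that, sketched under assumptions, is to copy the `Cookie` request header from a logged-in browser and parse it into a dict. The helper name `parse_cookie_header` and the cookie names and values shown are placeholders; which cookies Baidu actually requires is not stated here.

```python
def parse_cookie_header(header):
    """Parse a raw Cookie header such as 'a=1; b=2' into a dict.

    Hypothetical helper: pairs without '=' or without a name are skipped,
    and only the first '=' in each pair separates name from value.
    """
    cookies = {}
    for pair in header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Placeholder header; paste the real one from a logged-in browser session.
    raw = "BDUSS=abc123; BAIDUID=xyz=1"
    print(parse_cookie_header(raw))
```

The resulting dict can then be merged into the jar from the demo with `cookies.update(parse_cookie_header(raw))` before constructing `BaiduIndexCrawler`.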
Commercial use is prohibited!