@@ -5,7 +5,9 @@ equity_penetration_qcc, deployed via scrapy
Project name: project-gravel
Branch: develop_equity_penetration

Currently only the non-login apph5 spider runs
Normal flow: runs only the non-login apph5 spider

External test flow: runs the login spiders: qcc/tyc
```

@@ -33,6 +35,16 @@ https://www.qcc.com
/data/gravel_spiders/equity_penetration_qcc
/data/gravel_spiders/equity_penetration_qcc_login
```
+ External test
```buildoutcfg
Official PC site entry points:
https://www.qcc.com
https://www.tianyancha.com

Output file storage paths:
/data/gravel_spiders/equity_penetration_qcc_test
/data/gravel_spiders/equity_penetration_tyc_test
```

## Collection frequency and collection strategy
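@@ -48,6 +60,7 @@ https://www.qcc.com
### Incremental collection strategy
<!-- Where does new data come from? Is separate collection of new data unnecessary? -->
```text
Kafka consumer topic <qcc_spider_from_lake_ic_new_list>: business registration changes, covering both updated companies and newly registered companies
```
---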
@@ -65,6 +78,8 @@ https://www.qcc.com
```text
equity_penetration_qcc
equity_penetration_qcc_login (login)
equity_penetration_qcc_test (login, external test)
equity_penetration_tyc_test (login, external test)
```

<!--spider_name-->
@@ -83,6 +98,8 @@ equity_penetration_qcc_login (login)
* redis key:
  * qcc
  * qcc_login
  * qcc_test
  * tyc_test

### Priority queue notes
* equity_penetration supports queue priorities
@@ -128,14 +145,36 @@ equity_penetration_qcc_login (login)
### Task parameters
<!-- Describe spider-specific parameters only; common parameters such as spider_name, task_params, task_src, task_result need no description -->

+ direct_flag: boolean; when true, both the region list and the search list issue the company detail request directly inside the spider and return the company detail result
+ area_code: province/city code, e.g. Anhui (AH); Hefei (AH_340100)
+ page: page number
+ search_key: text entered in the search box
+ fid: QCC company id
+ pid: QCC person id
+ direct_flag: go straight to the detail request (no list item is generated)

> General
>+ login_flag: run with login
>+ direct_flag: go straight to the detail request (no list item is generated)

> List
>+ Region
>   + area_code: province/city code, e.g. Anhui (AH); Hefei (AH_340100)
>   + page: page number
>+ Search
>   + search_key: text entered in the search box
>   + company_name: same as search_key; the field name differs by task source, and the two never appear together [task_params.get('search_key') or task_params.get('company_name')]
>   + company_code: one of the four key elements
>   + credit_no: one of the four key elements
>   + company_name_digest: one of the four key elements; assists data cleaning
>   + company_major_type: company type, used for statistics
>   + n_company_status: company status, used for statistics
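
The list parameters above can be read roughly as in the following minimal sketch. The helper names `resolve_search_key` and `split_area_code` are hypothetical and do not come from the project; the assumption that an underscore separates the province part from the city part of `area_code` is inferred only from the `AH` / `AH_340100` examples.

```python
from typing import Optional, Tuple


def resolve_search_key(task_params: dict) -> Optional[str]:
    """Return the search text, whichever field name the task source used.

    Mirrors the expression quoted in the parameter list:
    task_params.get('search_key') or task_params.get('company_name')
    """
    return task_params.get("search_key") or task_params.get("company_name")


def split_area_code(area_code: str) -> Tuple[str, Optional[str]]:
    """Split an area code into (province, city).

    Assumption: the province and city parts are joined by a single
    underscore, as in 'AH_340100'. A bare province code such as 'AH'
    yields None for the city part.
    """
    province, _, city = area_code.partition("_")
    return province, (city or None)
```

Since `search_key` and `company_name` are documented as never appearing together, the `or` fallback simply picks whichever one the task source supplied.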

> Detail
>+ fid: QCC company id
>+ pid: QCC person id

> External test
>- batch_date
>- batch_sequence_num
>- name: legal person name

> Other
>+ ic_flag
>+ change_date: change time
>+ establish_date: time added

## data_type description
<!-- Describe the data_type values that may be produced -->
@@ -156,13 +195,17 @@ equity_penetration_qcc_login (login)

> Note: some example results omit the spider's additional information and contain only the data part

> [List task results](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/list) <br>
> [List task results](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/test/no_login/list) <br>
> Divided into region lists and search lists; see the data_type description for details

> [Company page detail results](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/no_login/company) <br>
> [Company page detail results](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/test/no_login/company)

> [Person page detail results](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/test/login/person)

> [Person page detail results](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/login/person) <br>
### External test
> [qcc](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/test_online/qcc)

> [tyc](http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration/scrapy_spiders/gravel_spiders/spiders/example/test_online/tyc)

## Spider runtime environment
<!-- udm module? scrapy? or other -->
@@ -176,7 +219,7 @@ scrapy
```buildoutcfg
target: node_43,node_42,node_32,node_33,node_29,node_28
project: equity_penetration
spider_name: equity_penetration_qcc,equity_penetration_qcc_login
spider_name: equity_penetration_qcc,equity_penetration_qcc_login,equity_penetration_qcc_test,equity_penetration_tyc_test
```

@@ -228,6 +271,8 @@ task_result=8000 # parameter error
Output file storage paths:
/data/gravel_spiders/equity_penetration_qcc
/data/gravel_spiders/equity_penetration_qcc_login
/data/gravel_spiders/equity_penetration_qcc_test
/data/gravel_spiders/equity_penetration_tyc_test
```
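
All four storage paths listed above follow the pattern `/data/gravel_spiders/<spider_name>`. As a minimal sketch of that convention, the hypothetical helper below (`output_dir` is not a name from the project) derives the directory from the spider name and rejects names that the README does not list.

```python
import posixpath

# Base directory and spider names exactly as listed in the README.
BASE_DIR = "/data/gravel_spiders"
SPIDER_NAMES = (
    "equity_penetration_qcc",
    "equity_penetration_qcc_login",
    "equity_penetration_qcc_test",
    "equity_penetration_tyc_test",
)


def output_dir(spider_name: str) -> str:
    """Return the storage directory for a known spider.

    Hypothetical helper: it only encodes the path pattern visible in
    the README and is not taken from the project code.
    """
    if spider_name not in SPIDER_NAMES:
        raise ValueError(f"unknown spider: {spider_name}")
    return posixpath.join(BASE_DIR, spider_name)
```

For example, `output_dir("equity_penetration_tyc_test")` gives the last path in the list above.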

## Storage directory after aggregation