Last edited by 章一锋 Feb 23, 2023

P12315

Basic information

Data name (Chinese)

12315 的行业分类 (12315 industry classification)

Data name (English)

12315

Source website (crawl entry point)

https://www.12315.cn/cuser/portal/tscase/corperation

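Crawl frequency and strategy

Update strategy for existing data

Rotating refresh; how long one full round takes is not yet known.

Incremental crawl strategy

Each run processes the full task set from start to finish; deduplicating the results yields the incremental data.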
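
The deduplication step described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes each record carries a `credit_no` key and that a content hash decides whether a record is new or changed. The names `fingerprint` and `incremental` are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a record's content (key order does not matter)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def incremental(records, seen: dict) -> list:
    """Return the records that are new or changed relative to `seen`.

    `seen` maps credit_no -> fingerprint from earlier runs (assumed key)
    and is updated in place.
    """
    fresh = []
    for record in records:
        key = record["credit_no"]
        digest = fingerprint(record)
        if seen.get(key) != digest:
            seen[key] = digest
            fresh.append(record)
    return fresh
```

Calling `incremental` on every full run with the same `seen` mapping returns only the rows that were not present, or differed, in earlier runs.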

Crawler

Owner

章一锋

Crawler name

P12315

Code location

http://tech.pingansec.com/granite/project-gravel/-/blob/develop_12315/scrapy_spiders/gravel_spiders/spiders/12315.py

Queue name and address

* redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
* redis port: 6379
* redis db: 7
* redis key: P12315

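Priority queue notes

Task source

Task data: companies both in operation and not in operation
Task import configuration file path: http://tech.pingansec.com/granite/project-gravel/-/blob/develop_12315/app_12315/data_dump/P12315.yml

Task input parameters (sample)

Task sample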

{
  "credit_no": "91110108MA01F9QE45",
  "company_name": "北京智源视界科技有限公司"
}
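
A task in the shape shown above could be serialized and submitted roughly as follows. This is a sketch under assumptions: the page does not say which Redis data structure backs the `P12315` key, so a list fed with `lpush` (as in a redis-py style client) is assumed, and the helper names `build_task` and `submit_task` are hypothetical. Treating both fields as required is also an assumption.

```python
import json

QUEUE_KEY = "P12315"  # Redis key listed in the queue section


def build_task(credit_no: str, company_name: str) -> str:
    """Serialize one task in the shape of the sample (both fields assumed required)."""
    if not credit_no or not company_name:
        raise ValueError("credit_no and company_name are both required")
    task = {"credit_no": credit_no, "company_name": company_name}
    # ensure_ascii=False keeps Chinese company names readable in the queue
    return json.dumps(task, ensure_ascii=False)


def submit_task(client, credit_no: str, company_name: str) -> str:
    """Push a task onto the queue.

    `client` is any object with an lpush(key, value) method, for example a
    redis-py client; a Redis list is an assumption, not confirmed by the page.
    """
    message = build_task(credit_no, company_name)
    client.lpush(QUEUE_KEY, message)
    return message
```

Passing the client in as a parameter keeps the sketch free of connection details and makes it easy to exercise with a stub.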

Task parameter description

data_type description

  detail: indicates detail data

Crawler result superset data

{
  "task_result": 1000,
  "error_msg": "",
  "spider_end_time": "2021-11-04 20:05:52",
  "spider_ip": "10.8.6.30",
  "@version": "1",
  "@timestamp": "2021-11-04T12:06:05.467Z",
  "type": "P12315",
  "spider_name": "P12315",
  "data_type": "detail",
  "http_code": 200,
  "task_params": {
  	"company_name": "张天雨",
  	"credit_no": "92321002MA1PBECC42"
  },
  "data": {
  	"code": 1,
  	"msg": "",
  	"data": {
  		"ODRBRAND": null,
  		"ANADDR": null,
  		"REGNO": "321002600777708",
  		"UNITNAME": null,
  		"stQyname": "",
  		"JYFW": "普通货物道路运输。(依法须经批准的项目,经相关部门批准后方可开展经营活动)",
  		"REGUNITNAME": "扬州市广陵区市场监督管理局",
  		"QYBM": null,
  		"XZQHBM": null,
  		"UBINDTYPENAME": "道路运输业",
  		"HIGHLIGHTTITLE": "****",
  		"S_EXT_NODENUM": "320000",
  		"INDUCOMMBUREID": null,
  		"ADDR": "广陵区湾头镇万福玉器创意园413号",
  		"UNITCODE": null,
  		"NBXH": "92321002MA1PBECC42",
  		"REGSTATE_CN": "存续(在营、开业、在册)",
  		"UBINDTYPE": "54",
  		"SQ": null,
  		"REGUNIT": "321002",
  		"PRIPID": "f8c48efc055ed3028cfef6d790f4d7fd",
  		"INDUCOMMBURENAME": null,
  		"ODRID": null,
  		"INVOPT": "****",
  		"ENTTYPE": "9500",
  		"TEL": "",
  		"ODR": null,
  		"QYWZ": null,
  		"ENTTYPENAME": "个体工商户",
  		"REGSTATECODE": "1"
  	},
  	"redirectUrl": ""
  },
  "spider_start_time": "2021-11-04 20:05:46.424",
  "metadata": {},
  "path": "/data/gravel_spiders/P12315/bdp-c-118_10.8.6.30/30217.json",
  "host": "bdp-ls-002"
}
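
The company fields sit two levels down, under `data.data`. A minimal sketch of reading them defensively is shown below. The function name `extract_company` and the choice of fields are illustrative assumptions; `UBINDTYPE` and `UBINDTYPENAME` appear to hold the industry code and name, judging only from the sample values.

```python
# Field names taken from the sample record; the selection itself is illustrative.
WANTED_FIELDS = (
    "NBXH",
    "REGNO",
    "ENTTYPENAME",
    "UBINDTYPE",
    "UBINDTYPENAME",
    "REGSTATE_CN",
)


def extract_company(record: dict) -> dict:
    """Return selected fields from record["data"]["data"], or {} if absent."""
    body = record.get("data") or {}
    detail = body.get("data") if isinstance(body, dict) else None
    if not isinstance(detail, dict):
        return {}
    return {name: detail[name] for name in WANTED_FIELDS if name in detail}
```

Records whose inner `data` is missing or null yield an empty dict instead of raising, which keeps a downstream cleaning job from failing on partial results.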

Data structure of the actual crawler result

Crawler runtime environment

scrapy

Crawler deployment information

Crawler host: 10.8.6.30
Number of processes: 25
Project name: P12315
Task submission host: 10.8.6.63
Task submission method: crontab

Taskhub address

http://tech.pingansec.com/granite/project-taskhub/-/blob/master/taskhub/config/gravel/config.d/P12315.yml

Taskhub scheduling rules

Tasks are filtered out when task_result is one of the following values:
  - 1000
  - 1101
  - 1102
  - 2001
  - 7000
  - 9300
Tasks with any other value are put into the queue.
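
The rule above amounts to a set-membership check. The sketch below restates it in Python for illustration only; the real rule lives in the Taskhub YAML configuration, and the names `FILTERED_TASK_RESULTS` and `should_enqueue` are hypothetical.

```python
# task_result values that cause a task to be filtered out (from the list above)
FILTERED_TASK_RESULTS = frozenset({1000, 1101, 1102, 2001, 7000, 9300})


def should_enqueue(task_result: int) -> bool:
    """True if a task with this task_result is put into the queue."""
    return task_result not in FILTERED_TASK_RESULTS
```

For example, the sample record with `task_result` 1000 would be filtered, while any value outside the list would be queued.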

Crawler monitoring metric design

Directory of crawler results awaiting collection


/data/gravel_spiders/P12315

Data collection

Owner

```
范召贤
```

Collection method

- [ ] Crawler writes directly to Kafka

- [x] Crawler writes files, which Logstash collects

Storage directory after collection

```
/data2_227/grvael_spider_result/P12315
```

Logstash configuration file name

```
P12315
```

Logstash file collection type

Topic for collected data

```buildoutcfg
general-taxpayer
```

ES log index and filter condition

```buildoutcfg
gravel-spider-data*  spider_name is P12315
```

Monitoring dashboard

Data retention policy


Data cleaning

Owner

李子健

Code location

http://192.168.109.110/granite/project-collie-app/-/tree/master/app_12315

Deployment location

10.8.6.228
/home/collie/product/app_12315

Deployment method and notes

  • crontab + data_pump

  • supervisor + data_pump

  • supervisor + consumer

Data input source

Collected files

Data storage table location

* Database host: bdp-rds-007.mysql.rds.aliyuncs.com
* Table name: utn_ic.company_12315