
Last edited by 章一锋 Feb 23, 2023

P12315

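Basic information

Data name (Chinese)

12315 的行业分类 (industry classification from 12315)

Data name (English)

12315

Source website (collection entry point)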

https://www.12315.cn/cuser/portal/tscase/corperation

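Collection frequency and strategy

Update strategy for existing data

Rotating refresh; how long one full round takes is not yet known.

Incremental collection strategy

Each run processes the whole task set from start to finish; the incremental data is then obtained by deduplicating the results.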
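
The deduplication step can be sketched as follows. This is a minimal illustration, not the project's actual code: the page does not say which key is used for deduplication, so hashing the whole record is an assumption, and `record_key` and `incremental` are hypothetical names.

```python
import hashlib
import json


def record_key(record: dict) -> str:
    """Fingerprint a record as the SHA-1 of its canonical JSON form (assumed key)."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def incremental(records, seen_keys: set) -> list:
    """Return only the records not seen before, adding their keys to seen_keys."""
    fresh = []
    for record in records:
        key = record_key(record)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(record)
    return fresh
```

Because the keys are sorted before hashing, two records with the same fields in a different order count as duplicates; a record whose field values changed counts as new.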

Spider

Owner

章一锋

Spider name

P12315

Code location

http://tech.pingansec.com/granite/project-gravel/-/blob/develop_12315/scrapy_spiders/gravel_spiders/spiders/12315.py

Queue name and address

* redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
* redis port: 6379
* redis db: 7
* redis key: P12315

Priority queue notes

Task source

Task data: companies both in operation and not in operation
Task import config file path: http://tech.pingansec.com/granite/project-gravel/-/blob/develop_12315/app_12315/data_dump/P12315.yml

Task input parameters (sample)

Task sample

{
  "credit_no": "91110108MA01F9QE45",
  "company_name": "北京智源视界科技有限公司"
}
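
A task of this shape could be serialized and queued roughly as below. This is a sketch under assumptions: the page does not state the Redis data type of the queue, so a plain list written with `lpush` is assumed, and `build_task` and `push_task` are hypothetical helpers rather than part of the project.

```python
import json

QUEUE_KEY = "P12315"  # redis key listed in the queue section


def build_task(credit_no: str, company_name: str) -> str:
    """Serialize one task in the same shape as the sample above."""
    task = {"credit_no": credit_no, "company_name": company_name}
    return json.dumps(task, ensure_ascii=False)


def push_task(client, credit_no: str, company_name: str) -> str:
    """Push one task; `client` is any object with a redis-style lpush(key, value)."""
    payload = build_task(credit_no, company_name)
    client.lpush(QUEUE_KEY, payload)  # assumption: the queue is a plain Redis list
    return payload
```

With redis-py, `client` could be created with `redis.Redis.from_url(...)` using the address from the queue section.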

Task parameter description

data_type description

  detail: detail data

Spider result metadata

{
  "task_result": 1000,
  "error_msg": "",
  "spider_end_time": "2021-11-04 20:05:52",
  "spider_ip": "10.8.6.30",
  "@version": "1",
  "@timestamp": "2021-11-04T12:06:05.467Z",
  "type": "P12315",
  "spider_name": "P12315",
  "data_type": "detail",
  "http_code": 200,
  "task_params": {
  	"company_name": "张天雨",
  	"credit_no": "92321002MA1PBECC42"
  },
  "data": {
  	"code": 1,
  	"msg": "",
  	"data": {
  		"ODRBRAND": null,
  		"ANADDR": null,
  		"REGNO": "321002600777708",
  		"UNITNAME": null,
  		"stQyname": "",
  		"JYFW": "普通货物道路运输。(依法须经批准的项目,经相关部门批准后方可开展经营活动)",
  		"REGUNITNAME": "扬州市广陵区市场监督管理局",
  		"QYBM": null,
  		"XZQHBM": null,
  		"UBINDTYPENAME": "道路运输业",
  		"HIGHLIGHTTITLE": "****",
  		"S_EXT_NODENUM": "320000",
  		"INDUCOMMBUREID": null,
  		"ADDR": "广陵区湾头镇万福玉器创意园413号",
  		"UNITCODE": null,
  		"NBXH": "92321002MA1PBECC42",
  		"REGSTATE_CN": "存续(在营、开业、在册)",
  		"UBINDTYPE": "54",
  		"SQ": null,
  		"REGUNIT": "321002",
  		"PRIPID": "f8c48efc055ed3028cfef6d790f4d7fd",
  		"INDUCOMMBURENAME": null,
  		"ODRID": null,
  		"INVOPT": "****",
  		"ENTTYPE": "9500",
  		"TEL": "",
  		"ODR": null,
  		"QYWZ": null,
  		"ENTTYPENAME": "个体工商户",
  		"REGSTATECODE": "1"
  	},
  	"redirectUrl": ""
  },
  "spider_start_time": "2021-11-04 20:05:46.424",
  "metadata": {},
  "path": "/data/gravel_spiders/P12315/bdp-c-118_10.8.6.30/30217.json",
  "host": "bdp-ls-002"
}
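
Since this data stream is about industry classification, a consumer of the record above would mostly care about `UBINDTYPE` and `UBINDTYPENAME`. The sketch below shows one way to pull them out; which fields the cleaning step really uses is not stated on this page, so the selection and the helper name `extract_industry` are assumptions.

```python
def extract_industry(record: dict):
    """Return (credit_no, industry_code, industry_name), or None if there is no payload."""
    payload = (record.get("data") or {}).get("data")
    if not payload:
        return None
    credit_no = (record.get("task_params") or {}).get("credit_no")
    return credit_no, payload.get("UBINDTYPE"), payload.get("UBINDTYPENAME")
```

The `or {}` guards cover records where `data` or `task_params` is missing or null, for example when a request failed.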

Actual data structure of the spider result

Spider runtime environment

scrapy

Spider deployment information

Spider host: 10.8.6.30
Number of processes: 25
Project name: P12315
Task submission host: 10.8.6.63
Task submission method: crontab

Taskhub address

http://tech.pingansec.com/granite/project-taskhub/-/blob/master/taskhub/config/gravel/config.d/P12315.yml

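Taskhub scheduling rules

A task is filtered out when its task_result is one of the following values:
  - 1000
  - 1101
  - 1102
  - 2001
  - 7000
  - 9300
Tasks with any other value are put back into the queue.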
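
The rule above amounts to a simple membership check. A minimal sketch follows; the real rule lives in the Taskhub YAML config linked above, and `FILTERED_RESULTS` and `should_requeue` are hypothetical names used only for illustration.

```python
# task_result values that Taskhub filters out (copied from the list above)
FILTERED_RESULTS = frozenset({1000, 1101, 1102, 2001, 7000, 9300})


def should_requeue(task_result: int) -> bool:
    """True if a task with this task_result goes back into the queue."""
    return task_result not in FILTERED_RESULTS
```

For the sample record on this page, `task_result` is 1000, so that task would be filtered rather than requeued.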

Spider monitoring metrics design

Kibana address for viewing spider run results
https://es-cn-4591blu580004eavf.kibana.elasticsearch.aliyuncs.com:5601/goto/19243c4e0bdd6a69f7dc8376e05ed20d

Directory of spider results awaiting collection

/data/gravel_spiders/P12315

Data aggregation

Owner

```
范召贤
```

Data aggregation method

- [ ] Spider writes directly to Kafka

- [x] Spider writes files, which Logstash collects

Storage directory after aggregation

```
/data2_227/grvael_spider_result/P12315
```

Logstash config file name

```
P12315
```

Logstash file collection type

Aggregation topic

```buildoutcfg
general-taxpayer
```

ES log index and filter condition

```buildoutcfg
gravel-spider-data*  spider_name is P12315
```
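
Outside Kibana, the same filter could be expressed as an Elasticsearch query body like the one below. This is a sketch: it assumes the "is" filter maps to a `match_phrase` clause, as Kibana's filter bar generates, and `INDEX_PATTERN` and `spider_query` are illustrative names.

```python
INDEX_PATTERN = "gravel-spider-data*"  # index pattern listed above


def spider_query(spider_name: str = "P12315") -> dict:
    """Build a search body that keeps only records from one spider."""
    return {
        "query": {
            "bool": {
                "filter": [{"match_phrase": {"spider_name": spider_name}}]
            }
        }
    }
```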

Monitoring dashboard

Data retention policy

Data cleaning

Owner

李子健

Code location

http://192.168.109.110/granite/project-collie-app/-/tree/master/app_12315

Deployment location

10.8.6.228
/home/collie/product/app_12315

Deployment method and notes

  • crontab + data_pump

  • supervisor + data_pump

  • supervisor + consumer

Data input source

Aggregated files

Data storage table location

* Database address: bdp-rds-007.mysql.rds.aliyuncs.com
* Table name: utn_ic.company_12315