Last edited by Liu Zhiqiang Sep 18, 2023

tax_punish

Basic information

Data name (Chinese)

税收违法 (tax violations)

Data name (English)

risk_tax_punish

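Source website (crawl entry point)

http://www.chinatax.gov.cn/chinatax/c101249/n2020011502/index.html

Crawl frequency and strategy

The site has 36 regional entry points; the spider visits and crawls each one separately.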

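Full refresh strategy

The entry URLs of the 36 regions serve as the initial tasks.
Records are updated one at a time.
For now, a full refresh runs once a day.

Incremental crawl strategy

Incremental crawling is not needed for now.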

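Spider name and platform

Spider name: risk_tax_punish
Platform: 国家税务总局-重大违法失信案件信息公布栏 (State Taxation Administration, bulletin board of major tax violation and dishonesty cases)

Owner

袁波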

Code location

Project: http://192.168.109.110/granite/project-gravel/tree/develop_tax_punish_20210611/scrapy_spiders/gravel_spiders/spiders

Entry script: http://192.168.109.110/granite/project-gravel/blob/develop_tax_punish_20210611/scrapy_spiders/gravel_spiders/spiders/tax_punish.py
Implementation modules: http://192.168.109.110/granite/project-gravel/tree/develop_tax_punish_20210611/scrapy_spiders/gravel_spiders/spiders/tax_punish_reqs
(Note: other spiders follow the same layout; each one lives in the same directory or module, named after the spider.)

Queue name and address

  • redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
  • redis port: 6379
  • redis db: 7
  • redis key: risk_tax_punish

Priority queue notes

risk_tax_punish
Note: no special handling; all tasks use the default priority of 10.

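Task source

The URLs of the 36 regions listed on the entry page serve as the initial tasks.

Task input parameters (sample)

{
    "area": "beijing",
    "data_type": "list"
}
Note: Beijing is used as the example here; other regions work the same way.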
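
The initial tasks described above could be generated along the following lines. This is a minimal sketch, not the project's actual code: the `AREAS` list and the `build_initial_tasks` helper are hypothetical names, and only `beijing` comes from this page. The pinyin identifiers of the other 35 regions would have to be filled in from the entry page.

```python
import json

# Hypothetical list of region identifiers (pinyin spelling). Only "beijing"
# appears on this page; the remaining regions must be added by hand.
AREAS = ["beijing"]


def build_initial_tasks(areas):
    """Return one list-type task per region, shaped like the sample input."""
    return [{"area": area, "data_type": "list"} for area in areas]


if __name__ == "__main__":
    # Print each task as a JSON document, ready to be submitted to the queue.
    for task in build_initial_tasks(AREAS):
        print(json.dumps(task, ensure_ascii=False))
```

How the tasks are actually pushed to the `risk_tax_punish` queue is not covered here, since that depends on the Taskhub submission path.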

Task sample

{
    "area": "beijing",
    "data_type": "list",
    "outbound": "tax_punish",
    "routed_count": 1,
    "submitter": "taskhub",
    "group_retry_times": 0,
    "submit_time": "2021-04-15 15:11:02",
    "token_scope": "tax_punish",
    "retry_limits": 2,
    "rt": false,
    "priority": null,
    "task_uuid": "c1577311-f58b-4235-8778-ce08d54df118",
    "retry_times": 0
}
Note: spiders for other regions work the same way; just switch to the pinyin spelling of the region.

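Task parameter description

{
    "area": "beijing",
    "data_type": "list"
}
Note: spiders for other regions work the same way; just switch to the entry URL of the region.

data_type values

list: an entry URL
detail: the data of a single detail page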

Spider result: superset of fields

{
	"data": {
		"tax_punish": {
			"tax_authority": "朝阳区",
			"taxpayer": "北京德高惠众科技有限公司",
			"tax_code": "91110105327231716C",
			"case_nature": "虚开增值税专用发票或者虚开用于骗取出口退税、抵扣税款的其他发票",
			"credit_no": "",
			"org_code": "327231716",
			"company_address": "北京市朝阳区康家沟145号A111",
			"publish_date": "",
			"legal_person": "郭金峰",
			"legal_person_sex": "男",
			"legal_person_code": "610203********5410",
			"legal_person_card": "",
			"financial_officer_name": "",
			"financial_officer_sex": "",
			"financial_officer_code": "",
			"financial_officer_card": "",
			"in_illegal_legal_person": "郭金峰",
			"in_illegal_legal_person_sex": "男",
			"in_illegal_legal_person_code": "610203********5410",
			"in_illegal_legal_person_card": "",
			"real_officer_name": "",
			"real_officer_sex": "",
			"real_officer_code": "",
			"real_officer_card": "",
			"inter_info": "",
			"illegal_facts": "经国家税务总局北京市税务局第二稽查局检查,发现其在2017年11月01日至2018年09月30日期间,主要存在以下问题:对外虚开增值税销项发票50份,金额493.18万元,税额78.91万元。",
			"punish_info": "依照《中华人民共和国税收征收管理法》等相关法律法规的有关规定,出具《已证实虚开通知单》。",
			"url": "http://beijing.chinatax.gov.cn/bjsat/office/jsp/zdsswfaj/wwidquery"
		}
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,
	"data_type": "detail",
	"spider_start_time": "2021-06-18 11:39:51.724",
	"spider_end_time": "2021-06-18 11:39:53.034",
	"task_params": {
		"data_type": "list",
		"area": "beijing"
	},
	"metadata": {
		"current_page": 2
	},
	"spider_name": "risk_tax_punish",
	"spider_ip": "10.8.1.38"
}

Actual spider result data structure

{
	"data": {
		"tax_punish": {
			"tax_authority": "朝阳区",
			"taxpayer": "北京德高惠众科技有限公司",
			"tax_code": "91110105327231716C",
			"case_nature": "虚开增值税专用发票或者虚开用于骗取出口退税、抵扣税款的其他发票",
			"credit_no": "",
			"org_code": "327231716",
			"company_address": "北京市朝阳区康家沟145号A111",
			"publish_date": "",
			"legal_person": "郭金峰",
			"legal_person_sex": "男",
			"legal_person_code": "610203********5410",
			"legal_person_card": "",
			"financial_officer_name": "",
			"financial_officer_sex": "",
			"financial_officer_code": "",
			"financial_officer_card": "",
			"in_illegal_legal_person": "郭金峰",
			"in_illegal_legal_person_sex": "男",
			"in_illegal_legal_person_code": "610203********5410",
			"in_illegal_legal_person_card": "",
			"real_officer_name": "",
			"real_officer_sex": "",
			"real_officer_code": "",
			"real_officer_card": "",
			"inter_info": "",
			"illegal_facts": "经国家税务总局北京市税务局第二稽查局检查,发现其在2017年11月01日至2018年09月30日期间,主要存在以下问题:对外虚开增值税销项发票50份,金额493.18万元,税额78.91万元。",
			"punish_info": "依照《中华人民共和国税收征收管理法》等相关法律法规的有关规定,出具《已证实虚开通知单》。",
			"url": "http://beijing.chinatax.gov.cn/bjsat/office/jsp/zdsswfaj/wwidquery"
		}
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,
	"data_type": "detail",
	"spider_start_time": "2021-06-18 11:39:51.724",
	"spider_end_time": "2021-06-18 11:39:53.034",
	"task_params": {
		"data_type": "list",
		"area": "beijing"
	},
	"metadata": {
		"current_page": 2
	},
	"spider_name": "risk_tax_punish",
	"spider_ip": "10.8.1.38"
}
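
A downstream consumer might read a result record like the one above as follows. This is a sketch under stated assumptions: `summarize_result` and `TIME_FORMAT` are hypothetical names, the timestamp format is inferred from the sample values, and the choice of fields to extract is illustrative only.

```python
import json
from datetime import datetime

# Assumed from the sample: "2021-06-18 11:39:51.724" (millisecond precision).
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def summarize_result(raw):
    """Extract a few key fields from one spider result record (JSON text)."""
    record = json.loads(raw)
    start = datetime.strptime(record["spider_start_time"], TIME_FORMAT)
    end = datetime.strptime(record["spider_end_time"], TIME_FORMAT)
    # The payload sits under data.tax_punish; tolerate it being absent.
    punish = record.get("data", {}).get("tax_punish", {})
    return {
        "taxpayer": punish.get("taxpayer", ""),
        "tax_code": punish.get("tax_code", ""),
        "task_result": record["task_result"],
        "elapsed_seconds": (end - start).total_seconds(),
    }


if __name__ == "__main__":
    # A trimmed copy of the sample record, keeping only the fields used here.
    sample = json.dumps({
        "data": {"tax_punish": {"taxpayer": "北京德高惠众科技有限公司",
                                "tax_code": "91110105327231716C"}},
        "task_result": 1000,
        "spider_start_time": "2021-06-18 11:39:51.724",
        "spider_end_time": "2021-06-18 11:39:53.034",
    })
    print(summarize_result(sample))
```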

Spider runtime environment

scrapy

Spider deployment

10.8.6.62   5 processes

Taskhub location

Configuration code: http://192.168.109.110/granite/project-taskhub/-/tree/master/taskhub/config/gravel/config.d
Note: not yet written.

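Taskhub scheduling rules

task_result=1000    # detail task fetched successfully
task_result=1001    # data that needs further processing; retried
task_result=1101    # id of the detail link not found
task_result=9101    # timeout error; must be retried, currently up to 5 times
task_result=8000    # parameter error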
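
The rules above could be expressed as a small lookup plus a retry check. This is only a sketch: the Taskhub configuration is noted as not yet written, so `TASK_RESULT_MEANING`, `RETRYABLE`, and `should_retry` are hypothetical names, and treating exactly 1001 and 9101 as retryable is an assumption drawn from the comments above.

```python
# Meanings copied from the rules listed above.
TASK_RESULT_MEANING = {
    1000: "detail task fetched successfully",
    1001: "needs further processing; retried",
    1101: "id of the detail link not found",
    9101: "timeout error; retried",
    8000: "parameter error",
}

# Assumption: only the codes whose comments mention a retry are retryable.
RETRYABLE = {1001, 9101}


def should_retry(task_result, retry_times, retry_limit):
    """Return True if the task should be resubmitted.

    retry_times is the number of retries already made; retry_limit is the
    maximum allowed (the rules mention 5 for timeouts).
    """
    return task_result in RETRYABLE and retry_times < retry_limit


if __name__ == "__main__":
    for code in sorted(TASK_RESULT_MEANING):
        print(code, TASK_RESULT_MEANING[code], should_retry(code, 0, 5))
```

Keeping the limit as an explicit argument avoids hard-coding a value, since the task sample on this page carries its own `retry_limits` field.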

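Spider monitoring metrics design

Index: tax_punish_spider_log-*
Spider name: risk_tax_punish
Monitoring frequency: TBD
Monitoring start and end time: TBD
Alert condition: TBD
Alert group: TBD
Alert template: [Alert] TBD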

Directory of spider results awaiting collection

/data/gravel_spiders/risk_tax_punish

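Data collection

Owner

范召贤

Collection method

  • The spider writes directly to Kafka

  • The spider writes files that Logstash collects

Storage directory after collection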

/data2_227/grvael_spider_result/risk_tax_punish

Logstash configuration file names

project-deploy/logstash/10.8.6.246/conf.d/collie_spider_data_to_kfk.conf (writes to the topic)
project-deploy/logstash/10.8.6.229/conf.d/grvael_spider_to_es.conf (writes to ES)

Logstash file collection type

type=>"risk_tax_punish"

Topic for collected data

topic_id => "public-company-spider-data"

ES log index and filter conditions

index => "public-company-spider-data-%{log_date}"

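Monitoring dashboard

Data retention policy


Data cleaning

Owner

Code location

Deployment location

Deployment method and notes

  • [ ]
  • [ ]
  • [ ]

Data input source

Data storage table location

  • Database address:
  • Table name: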