Last edited by 蒋家升 Aug 11, 2022

institution

Basic information

Public institution (事业单位) crawler
institution
search_key is the search entry point; it is normally the unified social credit code.

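Data name (Chinese)

事业单位 (public institution)

Data name (English)

institution

Source website (collection entry point)

Official website, desktop (PC) entry point:
http://search.gjsy.gov.cn/wsss/view
Storage path for collected files:
/data/gravel_spiders/institution

Collection frequency and strategy

Existing-data update strategy

Currently one full update round over all existing entities

Incremental collection strategy

1. Newly established entities
2. Supplemented (backfilled) entities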

Crawler

Public institution (institution)

Owner

蒋家升

Crawler name

institution

Code location

Project URL: http://tech.pingansec.com/granite/project-gravel/-/tree/develop_institution/

Queue name and address

  • redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
  • redis port: 6379
  • redis db: 7
  • redis key:
    • institution

Priority queue notes

  • The institution queue supports priorities

Task source

taskhub: full entity information

Task input parameters (sample). search_key, credit_no and company_name are required.

{
    "province": "SAX",                         -- province
    "company_status": "废止",                   -- registration status
    "company_name": "乐东黎族自治县第二小学",      -- public institution name
    "credit_no": "12468843428892871M",         -- unified social credit code
    "submit_time": "2021-06-16 19:29:44",      -- submission time
    "search_key": "12468843428892871M"         -- search keyword, usually the same as the unified social credit code
}
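
The required-field rule for task input can be checked before a task is queued. The following is a minimal sketch only; the helper name `missing_required_fields` is hypothetical and not part of the crawler code, and treating empty strings as missing is an assumption.

```python
from typing import List

# Required by the task input spec on this page.
REQUIRED_FIELDS = ("search_key", "credit_no", "company_name")


def missing_required_fields(task: dict) -> List[str]:
    """Return the required fields that are absent or blank in a task dict.

    Hypothetical helper: blank strings are treated as missing (an assumption).
    """
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(task.get(name) or "").strip()
    ]


sample_task = {
    "province": "SAX",
    "company_status": "废止",
    "company_name": "乐东黎族自治县第二小学",
    "credit_no": "12468843428892871M",
    "submit_time": "2021-06-16 19:29:44",
    "search_key": "12468843428892871M",
}

# An empty list means the task carries every required field.
print(missing_required_fields(sample_task))
```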

Task sample

{
    "province": "SAX",
    "company_status": "废止",
    "company_name": "乐东黎族自治县第二小学",
    "credit_no": "12468843428892871M",
    "submit_time": "2021-06-16 19:29:44",
    "search_key": "12468843428892871M"
}

Task parameter description

{
    "province": "SAX",                         -- province
    "company_status": "废止",                   -- registration status
    "company_name": "乐东黎族自治县第二小学",      -- public institution name
    "credit_no": "12468843428892871M",         -- unified social credit code
    "submit_time": "2021-06-16 19:29:44",      -- submission time
    "search_key": "12468843428892871M"         -- search keyword
}

data_type description

detail: detail information

Crawler result data (full record)

{
  "companyinfo_item":
  {
    "company_name": "乐东黎族自治县第二小学",
    "credit_no": "12468843428892871M",
    "business_scope": "实施小学义务教育,促进基础教育发展;小学学历教育和相关社会服务。",
    "company_address": "乐东黎族自治县抱由镇吉祥路",
    "legal_person": "罗人鹏",
    "capital_source": "财政补助(全额拨款)",
    "capital": "35.2万元",
    "organizer": "乐东黎族自治县教育局",
    "company_status": "已废止",
    "authority": "乐东黎族自治县事业单位登记管理局",
    "operation_startdate": "",
    "operation_enddate": "",
    "company_type": "事业单位",
    "province": "海南省",
    "province_code": 1024,
    "province_short": "HAIN",
    "data_status": 1,
    "data_source": 1,
    "create_time": "2022-04-12 21:28:57.442503",
    "lastupdatetime": "2022-04-12 21:28:57.442503"
  },
  "result_code": "00000111",
  "mark": "",
  "http_code": 200,
  "error_msg": "",
  "task_result": 1000,
  "data_type": "detail",
  "spider_start_time": "2022-04-12 21:28:26.602",
  "spider_end_time": "2022-04-12 21:29:00.671507",
  "task_params":
  {
    "province": "SAX",
    "company_status": "废止",
    "company_name": "乐东黎族自治县第二小学",
    "credit_no": "12468843428892871M",
    "submit_time": "2021-06-16 19:29:44",
    "search_key": "12468843428892871M"
  },
  "metadata":
  {},
  "spider_name": "institution",
  "spider_ip": "192.168.108.74"
}

Actual data structure of the crawler result

{
  "companyinfo_item":
  {
    "company_name": "乐东黎族自治县第二小学",   -- public institution name
    "credit_no": "12468843428892871M",  -- unified social credit code
    "business_scope": "实施小学义务教育,促进基础教育发展;小学学历教育和相关社会服务。",  -- purpose and business scope
    "company_address": "乐东黎族自治县抱由镇吉祥路",  -- registered address
    "legal_person": "罗人鹏",  -- name of the legal representative
    "capital_source": "财政补助(全额拨款)",  -- funding source
    "capital": "35.2万元",  -- start-up capital
    "organizer": "乐东黎族自治县教育局",  -- organizing (sponsoring) unit
    "company_status": "已废止",  -- institution status
    "authority": "乐东黎族自治县事业单位登记管理局",  -- registration authority
    "operation_startdate": "",  -- operation start date
    "operation_enddate": "",  -- operation end date
    "company_type": "事业单位",  -- registration type; fixed value 事业单位 (public institution)
    "province": "海南省",  -- province
    "province_code": 1024,  -- province (numeric code)
    "province_short": "HAIN",  -- province (English abbreviation)
    "data_status": 1,
    "data_source": 1,
    "create_time": "2022-04-12 21:28:57.442503",
    "lastupdatetime": "2022-04-12 21:28:57.442503"  -- time the data was last updated
  },
  "result_code": "00000111",
  "mark": "",
  "http_code": 200,
  "error_msg": "",
  "task_result": 1000,
  "data_type": "detail",
  "spider_start_time": "2022-04-12 21:28:26.602",
  "spider_end_time": "2022-04-12 21:29:00.671507",
  "task_params":
  {
    "province": "SAX",
    "company_status": "废止",
    "company_name": "乐东黎族自治县第二小学",
    "credit_no": "12468843428892871M",
    "submit_time": "2021-06-16 19:29:44",
    "search_key": "12468843428892871M"
  },
  "metadata":
  {},
  "spider_name": "institution",
  "spider_ip": "192.168.108.74"
}
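
As a rough illustration of how a consumer might read the result structure above, the sketch below pulls a few top-level fields and computes the crawl duration from spider_start_time and spider_end_time. The function names are hypothetical, and the timestamp format is inferred from the sample values only.

```python
from datetime import datetime

# Format inferred from the sample record; %f accepts 1 to 6 fractional digits,
# so both "26.602" and "00.671507" parse.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def crawl_duration_seconds(result: dict) -> float:
    """Seconds elapsed between spider_start_time and spider_end_time."""
    start = datetime.strptime(result["spider_start_time"], TIME_FORMAT)
    end = datetime.strptime(result["spider_end_time"], TIME_FORMAT)
    return (end - start).total_seconds()


def summarize_result(result: dict) -> dict:
    """Collect the fields a monitor would most likely look at (hypothetical)."""
    item = result.get("companyinfo_item") or {}
    return {
        "credit_no": item.get("credit_no"),
        "data_type": result.get("data_type"),
        "task_result": result.get("task_result"),
        "duration_seconds": crawl_duration_seconds(result),
    }


# Trimmed copy of the sample record on this page.
sample_result = {
    "companyinfo_item": {"credit_no": "12468843428892871M"},
    "task_result": 1000,
    "data_type": "detail",
    "spider_start_time": "2022-04-12 21:28:26.602",
    "spider_end_time": "2022-04-12 21:29:00.671507",
}

print(summarize_result(sample_result))
```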

Crawler runtime environment

scrapy

Crawler deployment information

target: 10.8.6.51
spider_name: institution
5 processes

Taskhub

Task submission

Task submission address: http://10.8.6.222:18518/task/

Task submission example: curl -L -X POST 'http://10.8.6.222:8526/task/' -H 'Content-Type: application/json' --data-raw '{"spider_name": "institution","province": "SAX","company_status": "废止","company_name": "乐东黎族自治县第二小学","credit_no": "12468843428892871M","submit_time": "2021-06-16 19:29:44","search_key": "12468843428892871M","company_major_type": 4}'
The request body is the task_params with "spider_name": "institution" added (the example above also sets "company_major_type": 4).
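
For callers that submit from Python rather than curl, the same request can be assembled with the standard library. This is a sketch under assumptions: `build_submit_request` is a hypothetical helper, the URL is passed in by the caller because this page lists more than one port, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_submit_request(task_params: dict, url: str) -> urllib.request.Request:
    """Build a POST request whose body is task_params plus spider_name.

    Hypothetical helper; it mirrors the curl example but does not send anything.
    """
    payload = dict(task_params, spider_name="institution")
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


task_params = {
    "province": "SAX",
    "company_status": "废止",
    "company_name": "乐东黎族自治县第二小学",
    "credit_no": "12468843428892871M",
    "submit_time": "2021-06-16 19:29:44",
    "search_key": "12468843428892871M",
}

request = build_submit_request(task_params, "http://10.8.6.222:18518/task/")
print(request.get_method(), request.full_url)

# To actually submit (requires network access to the Taskhub host):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```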

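Taskhub retry scheduling rules

task_result=1000    # detail fetched successfully
task_result=1101    # no result found
task_result=9101    # timeout error; the task must be retried, currently up to 5 times
task_result=8000    # parameter error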
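
The retry rule above could be expressed roughly as follows. This is not Taskhub's actual scheduler code; `should_retry` is a hypothetical name, and counting retries separately from the first attempt is an assumption.

```python
# Codes and meanings as listed on this page.
TASK_RESULT_MEANINGS = {
    1000: "detail fetched successfully",
    1101: "no result found",
    9101: "timeout error",
    8000: "parameter error",
}

RETRYABLE_RESULTS = {9101}  # only timeouts are retried
MAX_RETRIES = 5             # "currently up to 5 times"


def should_retry(task_result: int, retries_so_far: int) -> bool:
    """True when the task should be rescheduled.

    Assumption: retries_so_far excludes the initial attempt.
    """
    return task_result in RETRYABLE_RESULTS and retries_so_far < MAX_RETRIES


for code in (1000, 1101, 9101, 8000):
    print(code, TASK_RESULT_MEANINGS[code], should_retry(code, retries_so_far=0))
```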

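Crawler monitoring metrics design

(Observe first; to be filled in)
Index:
Monitoring frequency:
Monitoring start and end time:
Alert condition:
Alert group:
Alert content: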

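Data aggregation

Owner

Aggregation method

  • Crawler writes directly to Kafka

  • Crawler writes files, which Logstash collects

Crawler result directory

Directory after aggregation

Logstash config file name

Logstash file collection type

Aggregation topic

topic_id => ""

ES log index and filter conditions

Monitoring dashboard

Data retention policy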


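Data cleaning

Owner

Code location

Deployment location

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data input source

Data storage table location

  • Database address:
  • Table name: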