# **Basic Information**

```
Company
```

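## Data Name (Chinese)

<!-- The standard Chinese name of this dataset, used as the canonical name in later communication, e.g. 工商公示股东信息 (publicly registered shareholder information), 失信被执行人 (dishonest judgment debtors), 一般纳税人 (general taxpayers) -->

```
公司 (Company)
```
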
## Data Name (English)

<!-- The English name; every later reference to the English name follows this one, e.g. partner, shixin, general_taxpayer -->

```
itjuzi_company
```

## Source Website (Crawl Entry Point)

<!-- The entry URL for crawling. A bare site domain is not enough; give the specific data entry point on the site -->

```
Detail page: https://www.itjuzi.com/company/1
```

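## Crawl Frequency and Strategy

<!-- Describe how often this data is updated: the update frequency and strategy for existing data, and the crawl frequency and strategy for new data -->

### Existing-Data Update Strategy

<!-- No update needed? Full update every day? Record-by-record polling? How long does one full round take? Or something else? -->

```
https://www.itjuzi.com/investfirm/35985

Detail-page IDs run from 1 to 39002402 and keep incrementing
Fewer than 200,000 companies are expected
```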
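
The sequential-ID strategy above can be sketched as a small task generator. This is only an illustration, not the spider's actual scheduling code: the names `detail_url` and `iter_company_tasks` are hypothetical, the URL pattern is taken from the crawl entry point, and the task keys mirror the `task_params` shown in the spider result sample.

```python
import json
from typing import Dict, Iterator

# URL pattern taken from the crawl entry point in this document.
DETAIL_URL = "https://www.itjuzi.com/company/{company_id}"
# Upper bound quoted in this document; it grows as new companies are added.
MAX_KNOWN_ID = 39002402


def detail_url(company_id: int) -> str:
    """Build the detail-page URL for one company ID."""
    return DETAIL_URL.format(company_id=company_id)


def iter_company_tasks(start_id: int = 1, end_id: int = MAX_KNOWN_ID) -> Iterator[Dict[str, str]]:
    """Yield one task per candidate company ID in [start_id, end_id]."""
    if start_id < 1 or end_id < start_id:
        raise ValueError("expected 1 <= start_id <= end_id")
    for company_id in range(start_id, end_id + 1):
        # Assumed task shape: platform plus the company ID as a string.
        yield {"platform": "itjuzi_company", "company_id": str(company_id)}


if __name__ == "__main__":
    for task in iter_company_tasks(1, 3):
        print(json.dumps(task), detail_url(int(task["company_id"])))
```

Since fewer than 200,000 companies are expected across roughly 39 million candidate IDs, most IDs presumably do not resolve to a company page, so a consumer of these tasks would need to treat misses as a normal outcome.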

### Incremental Crawl Strategy

<!-- Where does new data come from? Is no separate crawl for new data needed? -->

```
To be updated
```

---

# **Spider**

```
itjuzi_company
```

## Owner

```
袁波
```

## Spider Name

```
itjuzi_company
```

<!--spider_name-->

## Code Location

```
Project URL: http://192.168.109.110/granite/project-gravel/-/tree/itjuzi_20211119/scrapy_spiders/gravel_spiders/spiders/itjuzi_reqs
```

## Queue Name and Address

<!-- redis host, port, db, key, and priority notes -->

* redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
* redis port: 6379
* redis db: 7
* redis key:
  * itjuzi_company:10
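
The queue settings above can be read programmatically. The sketch below uses only the Python standard library to split a `redis://` URL into host, port, db and password, and to split a queue key of the form `name:number`. It is an illustration under stated assumptions: the helper names are hypothetical, the URL in the demo is a placeholder rather than the real address, and reading the `:10` suffix as a priority is a guess based on the "priority notes" hint in the template.

```python
from typing import NamedTuple, Optional, Tuple
from urllib.parse import urlsplit


class RedisTarget(NamedTuple):
    host: str
    port: int
    db: int
    password: Optional[str]


def parse_redis_url(url: str) -> RedisTarget:
    """Split a redis:// URL into host, port, db index and password."""
    parts = urlsplit(url)
    if parts.scheme != "redis" or not parts.hostname:
        raise ValueError(f"not a redis URL: {url!r}")
    # The path component carries the db index, e.g. "/7"; default to db 0.
    db = int(parts.path.lstrip("/") or 0)
    return RedisTarget(parts.hostname, parts.port or 6379, db, parts.password)


def split_queue_key(key: str) -> Tuple[str, Optional[int]]:
    """Split 'name:number' into (name, number).

    Assumption: the numeric suffix is a priority; keys without a suffix
    are returned with None.
    """
    name, sep, suffix = key.rpartition(":")
    if not sep:
        return key, None
    return name, int(suffix)


if __name__ == "__main__":
    # Placeholder URL, not the production address.
    print(parse_redis_url("redis://:secret@localhost:6379/7"))
    print(split_queue_key("itjuzi_company:10"))
```

`urlsplit` separates the credentials from the host at the last `@`, so a password that itself contains `@` still parses into the expected host.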

## Task Input Parameters (Sample)

```json
{
  "platform": "itjuzi_company",
  "invest_event_id": "22568"
}
```
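
A consumer of these tasks would typically validate each payload before crawling. The sketch below is a hypothetical validator, not code from the project: `parse_task` and its checks are assumptions. The ID field name is a parameter because the sample above uses `invest_event_id`, while the `task_params` echoed in the spider result sample uses `company_id`.

```python
import json
from typing import Dict

SPIDER_PLATFORM = "itjuzi_company"


def parse_task(raw: str, id_field: str = "company_id") -> Dict[str, str]:
    """Validate a raw JSON task and return its platform and ID."""
    task = json.loads(raw)
    if not isinstance(task, dict):
        raise ValueError("task must be a JSON object")
    if task.get("platform") != SPIDER_PLATFORM:
        raise ValueError(f"unexpected platform: {task.get('platform')!r}")
    value = task.get(id_field)
    # IDs appear as strings of digits in both samples in this document.
    if not isinstance(value, str) or not value.isdigit():
        raise ValueError(f"{id_field} must be a string of digits")
    return {"platform": task["platform"], id_field: value}


if __name__ == "__main__":
    raw = '{"platform": "itjuzi_company", "invest_event_id": "22568"}'
    print(parse_task(raw, id_field="invest_event_id"))
```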

## data_type Description

<!-- Description of the data_type values that may be produced -->

```
data_type: detail # result data type
```

## Actual Spider Result Data Structure

<!-- May match the super data; results for different data_type values may have different structures; the super data combines the results of all data_type values -->

```json
{
  "data": [{
    "company_detail": {
      "company_id": "35985",
      "company_short_name": "摩拜单车",
      "finance_rounds": "已被收购",
      "one_word_desc": "共享自行车服务提供商",
      "logo_url": "https://cdn.itjuzi.com/images/7678bf2f47323d846e7acfc4e6917ec6.jpg",
      "weibo": "https:\\u002F\\u002Fweibo.com\\u002Fu\\u002F6038290538",
      "wechat_official_account": "Mobike_sharing_bike",
      "official_website": "http://www.mobike.com/",
      "phone": "\"400-811-7799\"",
      "email": "\"bd@mobike.com\"",
      "office_address": "\"北京市海淀区学院路甲5号2幢平房B北-3042室\"",
      "development_stage": "成长发展期",
      "company_status": "运营中",
      "finance_demand": "不需要融资",
      "primary_industry": ["汽车交通"],
      "secondary_industry": ["交通出行"],
      "industry_label": ["自行车", "交通出行", "共享单车", "共享出行", "出行服务", "行", "腾讯系", "Google系", "连续获投", "这些公司和摩拜单车、OFO一样提供共享单车服务", "腾讯在2016年的投资事件", "创新工场在2016年的投资事件", "科技部公布2017年独角兽名单", "16位华人投资家的“点睛之笔”"],
      "company_desc": "摩拜单车是一家互联网短途出行解决方案,是无桩借还车模式的智能硬件,旨在让用户无需办卡,只需下载摩拜单车App完成注册、扫码解锁、支付、还车的全过程服务。2020年12月14日晚,摩拜App、摩拜微信小程序将停止服务和运营。目前,摩拜单车已接入美团App。",
      "company_ic_fullname": "北京摩拜科技有限公司",
      "office_province": "北京",
      "office_city": "海淀区",
      "establish_date": "2015-1",
      "team_scale": "300-1000",
      "related_org_list": [],
      "product_list": [{
        "product_name": "摩拜单车",
        "product_desc": "帮助每一个人更便捷地完成城市短途出行"
      }, {
        "product_name": "摩拜单车中国",
        "product_desc": "摩拜单车各种好玩的消息,都能在这里找到。 如果有单车使用过程中的问题,请找对应的城市号报障或拨打我们的客服热线哦~"
      }],
      "member_list": [{
        "rank": 1,
        "member_id": "13905",
        "company_id": "35985",
        "company_type": "company",
        "company_short_name": "摩拜单车",
        "is_demission": "在职",
        "name": "在职",
        "position": "总裁",
        "individual_resume": "胡玮炜,摩拜单车联合创始人、总裁,前GeekCar极客汽车创始人、CEO,资深媒体人。行走于汽车江湖多年,职业贯穿汽车厂商、财经类报纸、都市类媒体、网络媒体和一线杂志,以灵秀气质、犀利笔锋和上下求索的精神见长。曾服务于上汽乘用车、 《每日经济新闻》、《新京报》、 腾讯、《IT经理世界》、《商业价值》,而后创业。"
      }, {
        "rank": 2,
        "member_id": "69144",
        "company_id": "35985",
        "company_type": "company",
        "company_short_name": "摩拜单车",
        "is_demission": "在职",
        "name": "在职",
        "position": "CMO",
        "individual_resume": "郑顺景,摩拜单车CMO首席营销官,原特斯拉中国区第一任总经理。"
      }, {
        "rank": 3,
        "member_id": "3448",
        "company_id": "35985",
        "company_type": "company",
        "company_short_name": "摩拜单车",
        "is_demission": "在职",
        "name": "在职",
        "position": "董事长",
        "individual_resume": "王兴,美团网创始人及CEO。连续创业者,此前曾创办校内网、海内网、饭否等。2001年毕业于清华大学,2003年放弃美国学业回国创业立校内网,06被千橡集团收购;2007年创办饭否网;2010年创办团购网站美团网。"
      }, {
        "rank": 4,
        "member_id": "28990",
        "company_id": "35985",
        "company_type": "company",
        "company_short_name": "摩拜单车",
        "is_demission": "在职",
        "name": "在职",
        "position": "总经理",
        "individual_resume": "王慧文,美团网联合创始人、副总裁;前人人网联合创始人。王慧文与王兴从两人2004年创办的第一个项目开始,王慧文就跟随着王兴,从校内、饭否、海内、到如今的美团。是王兴创业以来最忠实的伙伴。"
      }, {
        "rank": 5,
        "member_id": "29590",
        "company_id": "35985",
        "company_type": "company",
        "company_short_name": "摩拜单车",
        "is_demission": "已离职",
        "name": "王晓峰",
        "position": "原CEO",
        "individual_resume": "王晓峰,北京摩拜科技有限公司CEO,曾担任Uber上海总经理、腾讯副总经理、Coty销售总监、Google中国华东渠道负责人等,还曾在宝洁先后担任各种销售岗位销售各种产品 从纸尿裤到SK II 从品客薯片到洗衣粉 。"
      }, {
        "rank": 6,
        "member_id": "8630",
        "company_id": "35985",
        "company_type": "company",
        "company_short_name": "摩拜单车",
        "is_demission": "已离职",
        "name": "李斌",
        "position": "原董事长",
        "individual_resume": "李斌,易车网创始人、总裁。毕业于北京大学社会学系,辅修法律及计算机。在大三的时候,曾在中国青年报当记者;1996年初创办北京南极科技发展有限公司,1997年创办北京科文书业信息技术有限公司;2000年6月,李斌创立了易车服务网。"
      }, {
        "rank": 7,
        "member_id": "87180",
        "company_id": "35985",
        "company_type": "company",
        "company_short_name": "摩拜单车",
        "is_demission": "已离职",
        "name": "夏一平",
        "position": "原CTO",
        "individual_resume": "夏一平,集度汽车CEO。曾担任摩拜单车联合创始人兼首席技术官。夏一平是国内车联网领域资深的产品技术专家. 曾在福特、菲亚特克莱斯勒的车联网产品研发部门负责产品和技术的研发, 拥有国内外发明, 实用新型和软件著作权专利等20多项。夏一平本科毕业于南京邮电大学通信工程专业。"
      }]
    }
  }],
  "http_code": 200,
  "error_msg": "",
  "task_result": 1000,
  "data_type": "detail",
  "spider_start_time": "2021-11-23 22:24:34.850",
  "spider_end_time": "2021-11-23 22:24:35.754",
  "task_params": {
    "platform": "itjuzi_company",
    "company_id": "35985"
  },
  "metadata": {},
  "spider_name": "itjuzi_company",
  "spider_ip": "10.8.1.54",
  "proxy_ip": ""
}
```
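
A downstream consumer might read this structure along the following lines. This is a minimal sketch, not the project's cleaning code: the function names are hypothetical, and treating `task_result == 1000` together with `http_code == 200` as success is an assumption drawn only from the sample above.

```python
from datetime import datetime
from typing import Any, Dict, List

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # matches "2021-11-23 22:24:34.850"
SUCCESS_CODE = 1000  # assumption: the value seen in the sample marks success


def is_success(result: Dict[str, Any]) -> bool:
    """Return True when the result looks like a successful crawl."""
    return result.get("task_result") == SUCCESS_CODE and result.get("http_code") == 200


def crawl_seconds(result: Dict[str, Any]) -> float:
    """Elapsed time between spider_start_time and spider_end_time."""
    start = datetime.strptime(result["spider_start_time"], TIME_FORMAT)
    end = datetime.strptime(result["spider_end_time"], TIME_FORMAT)
    return (end - start).total_seconds()


def current_members(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect members whose is_demission is "在职" (currently employed)."""
    members: List[Dict[str, Any]] = []
    for item in result.get("data", []):
        detail = item.get("company_detail", {})
        for member in detail.get("member_list", []):
            if member.get("is_demission") == "在职":
                members.append(member)
    return members


if __name__ == "__main__":
    # Trimmed copy of the sample result, keeping only the fields used here.
    sample = {
        "data": [{"company_detail": {"member_list": [
            {"rank": 1, "is_demission": "在职", "position": "总裁"},
            {"rank": 5, "is_demission": "已离职", "position": "原CEO"},
        ]}}],
        "http_code": 200,
        "task_result": 1000,
        "spider_start_time": "2021-11-23 22:24:34.850",
        "spider_end_time": "2021-11-23 22:24:35.754",
    }
    print(is_success(sample), crawl_seconds(sample))
    print([m["position"] for m in current_members(sample)])
```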

## Spider Runtime Environment

<!-- udm module? scrapy? Or something else? -->

```buildoutcfg
scrapy
```

## Spider Deployment

<!-- Which machines is it deployed on? How many processes per machine? What is the project name? -->

```
Machine and collie user for the crontab job: to be added
Spider deployment machine: 10.8.6.75, 10 processes
```

## Taskhub Address

```
Not needed for now
```

## Taskhub Scheduling Rules

```
```

## Spider Monitoring Metrics Design

<!-- Which metrics show that the spider is running normally? What are the alerting rules? -->

```
To be completed
```

## Spider Result Directory (Pending Collection)

```
/data/gravel_spiders/itjuzi_company
```

---

# **Data Collection**

## Owner

```
范召贤
```

## Collection Method

- [ ] Spider writes directly to Kafka
- [x] Spider writes files; Logstash collects them

## Spider Result Directory

```
/data/gravel_spiders/itjuzi_company
```
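
With the file-based method selected above, the spider side only has to append results to files under this directory. The sketch below shows one way that could look; it is not the project's pipeline. The one-JSON-object-per-line layout and the per-day file name are assumptions, and `append_result` is a hypothetical helper.

```python
import json
import os
from datetime import date
from typing import Any, Dict

RESULT_DIR = "/data/gravel_spiders/itjuzi_company"  # directory named in this document


def append_result(result: Dict[str, Any], result_dir: str = RESULT_DIR) -> str:
    """Append one result as a single JSON line and return the file path."""
    os.makedirs(result_dir, exist_ok=True)
    # Hypothetical naming scheme: one file per calendar day.
    path = os.path.join(result_dir, f"{date.today():%Y%m%d}.log")
    with open(path, "a", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Chinese text readable in the file.
        fh.write(json.dumps(result, ensure_ascii=False) + "\n")
    return path
```

Writing one object per line keeps each record self-contained, which suits a line-oriented file collector.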

## Storage Directory After Collection

```
/data2_227/grvael_spider_result/itjuzi_company
```

## Logstash Configuration File Names

```
project-deploy/logstash/10.8.6.246/conf.d/collie_spider_data_to_kfk.conf (writes to the topic)
project-deploy/logstash/10.8.6.229/conf.d/grvael/grvael_spider_to_es.conf (writes to ES)
```

## Logstash File Collection type

```
type=>"itjuzi_company"
```

## Collection Topic

```
topic_id => "general-taxpayer"
```

## ES Log Index and Filter Conditions

```
index => "gravel-spider-data-%{log_date}"
```
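
The `%{log_date}` part of the index name is a field reference that Logstash fills in from each event. To find the concrete index for a given day, the substitution can be emulated as below. This is a sketch for illustration only: `resolve_index` is a hypothetical helper, and the `log_date` value in the demo is made up, since this document does not state the field's format.

```python
import re
from typing import Mapping

INDEX_PATTERN = "gravel-spider-data-%{log_date}"  # pattern from the output config above

_FIELD_REF = re.compile(r"%\{([^}]+)\}")


def resolve_index(pattern: str, event: Mapping[str, object]) -> str:
    """Replace each %{field} reference with the event's value for that field."""
    def substitute(match):
        field = match.group(1)
        # Design choice of this sketch: unknown fields are left untouched.
        return str(event[field]) if field in event else match.group(0)

    return _FIELD_REF.sub(substitute, pattern)


if __name__ == "__main__":
    # The date format here is an assumption for demonstration.
    print(resolve_index(INDEX_PATTERN, {"log_date": "2021.11.23"}))
```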

## Monitoring Dashboard

## Data Retention Policy

---

# **Data Cleaning**

## Owner

## Code Location

## Deployment Location

<!-- Machine and production code path -->

## Deployment Method and Notes

<!-- How to run it and the run command, supervisor configuration, supervisor program, etc. -->

- [ ] crontab + data_pump
- [ ] supervisor + data_pump
- [ ] supervisor + consumer

## Data Input Source

<!-- From Kafka or from the collected files? Which topic group? -->

## Data Storage Table Location

* Database address:
* Table name: