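First, install a Python environment (obvious, I know; look up how on Baidu, and I'll add instructions later when I have time). Then install Scrapy with pip and create a project with the scrapy command: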
    pip install scrapy
    scrapy startproject projectname
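projectname is the name of the project you want to create.
The command generates the project skeleton; the part that matters here is the spiders directory.
Spider files go inside spiders (the __init__.py file there only marks the folder as a Python package).
Start by creating a .py file to hold the spider. Here is the code; I'll explain it piece by piece.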
    import json
    import re
    import urllib.parse

    import scrapy


    class JobScrapy(scrapy.Spider):
        name = '51job'
        allowed_domains = ['www.51job.com', 'search.51job.com']
        start_urls = ['https://search.51job.com/']
        page = 1
        pagesize = 0
        jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
        # List-page URL for the first job type, page 1 (built once, at class definition).
        urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
                + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
                '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
        url = "search.51job.com"

        def __init__(self, value, fileName, **kwargs):
            # Let scrapy.Spider finish its own setup before adding attributes.
            super().__init__(**kwargs)
            self.value = value
            self.fileName = fileName
            self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

        def parse(self, response):
            # Callback for start_urls: skip the home page and request the list page.
            urls = self.urls
            # dont_filter=True allows a page that was already requested to be crawled again.
            yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

        def fond_parse(self, response):
            print(response)

        def closed(self, reason):
            # Called by Scrapy when the spider finishes; release the output file.
            self.fp.close()
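One thing worth noting about the class above: the list-page URL is assembled once, while the class body is evaluated, so the page number and job type stay at their initial values. Below is a minimal sketch of a helper that rebuilds the URL on demand. The URL layout is copied from the spider; the names `LIST_URL_TEMPLATE`, `build_list_url`, and `iter_list_urls` are made up for illustration, and the meaning of each path segment is assumed from its position, not confirmed by 51job.

```python
# Template copied from the spider's `urls` attribute, with the job-type code
# and page number turned into placeholders (assumed meaning of those segments).
LIST_URL_TEMPLATE = (
    "https://search.51job.com/list/000000,000000,{jobtype},00,9,99,+,2,{page}.html"
    "?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99"
    "&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
)

# Same job-type codes as the spider's `jobtype` list.
JOB_TYPES = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']


def build_list_url(jobtype, page):
    """Return the list-page URL for one job-type code and a 1-based page number."""
    return LIST_URL_TEMPLATE.format(jobtype=jobtype, page=page)


def iter_list_urls(max_page):
    """Yield list-page URLs for every job type, pages 1 through max_page."""
    for jobtype in JOB_TYPES:
        for page in range(1, max_page + 1):
            yield build_list_url(jobtype, page)
```

Inside a spider, each URL from `iter_list_urls` could be passed to `scrapy.Request` in place of the fixed `self.urls` value.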
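Start with the class itself: it inherits from scrapy.Spider, the base class for spiders, which are one of Scrapy's components.
name is the spider's name. It is what you pass to scrapy crawl when starting the spider (here, scrapy crawl 51job); it does not have to match the file name. Because __init__ takes value and fileName, those are supplied on the command line with -a value=... -a fileName=...
start_urls lists the first page(s) to crawl.
allowed_domains lists every domain the spider may crawl; requests to any other domain are filtered out.
parse is the default callback for the responses to start_urls, so this is where the home page's response gets handled. (When I was starting out I crawled the home page first; really, the start URL should just be the target page so the data can be extracted directly, but I never got around to changing it.) Inside a callback you can locate elements with regular expressions, XPath, and similar methods.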
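    yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
scrapy.Request is an ordinary request. The method defaults to GET; it can be changed to POST with method='POST', or you can use FormRequest to submit form data.
dont_filter=True lets the request bypass the duplicate filter, so a page that has already been requested can be crawled again.
callback is the method that receives the response.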
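A callback can go on processing the data or request new pages, for example crawling the detail pages after a list page. I'll write up the later processing steps another time.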
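One more item for the to-do list: 51job detail pages are currently protected by a slider CAPTCHA. I'll look into handling it when I have time, along with spoofing the User-Agent.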