Getting Started with Python Web Scraping: The Scrapy Framework

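First install a Python environment (obviously; search online for how, and I'll add instructions when I have time). Then install Scrapy with pip and create a project with the scrapy command: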

pip install scrapy 

scrapy startproject projectname

projectname is the name of the project you want to create.

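The project structure looks like this (the standard layout generated by scrapy startproject):

projectname/
    scrapy.cfg
    projectname/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py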

Spider files go in the spiders folder (the __init__.py file just marks the folder as a Python package).

Start by creating a .py file for the spider. I'll paste the code first and explain it step by step.

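import scrapy
import urllib.parse
import json
import re


class JobScrapy(scrapy.Spider):
    # Spider name; pass this to `scrapy crawl` to run the spider
    name = '51job'
    # Requests to domains outside this list are filtered out
    allowed_domains = ['www.51job.com', 'search.51job.com']
    # First page to crawl; its response is handed to parse()
    start_urls = ['https://search.51job.com/']
    page = 1
    pagesize = 0
    jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
    # Listing URL for the first job type and the current page number
    urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
            + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
            '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')

    url = "search.51job.com"

    def __init__(self, value, fileName, **kwargs):
        # Let scrapy.Spider finish its own initialisation first
        super().__init__(**kwargs)
        self.value = value
        self.fileName = fileName
        # Output file, opened here so later callbacks can write to it
        self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

    def parse(self, response):
        urls = self.urls
        # dont_filter=True allows an already-requested page to be crawled again
        yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

    def fond_parse(self, response):
        print(response)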

Let's walk through the class first. It inherits from scrapy.Spider, the base class for spiders in Scrapy.

The name attribute is the spider's name. The name you give when launching the crawl has to match it.
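Since the spider's __init__ also takes value and fileName, those have to be supplied at launch as well. Scrapy passes spider arguments with the -a option of scrapy crawl. Here is a small sketch that assembles such a command; the helper build_crawl_command and the argument values "python" and "jobs" are made up for illustration.

```python
import shlex


def build_crawl_command(spider_name, **spider_args):
    """Build the argv list for `scrapy crawl`, one `-a key=value` per spider argument."""
    cmd = ["scrapy", "crawl", spider_name]
    for key, val in spider_args.items():
        cmd += ["-a", f"{key}={val}"]
    return cmd


# "51job" must equal the spider's `name`; the two argument values are placeholders.
command = build_crawl_command("51job", value="python", fileName="jobs")
print(shlex.join(command))
```

Run the printed command from the project root (the folder that contains scrapy.cfg), or pass the list to subprocess.run.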

The start_urls attribute lists the first page(s) to crawl.

The allowed_domains attribute lists every domain the spider may crawl; requests to any other domain are filtered out.
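To get a feel for what that filtering means, here is a simplified approximation using only the standard library. It is not Scrapy's actual implementation; is_allowed is a hypothetical helper, and jobs.51job.com is used purely as an example of a host that is not in the list.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["www.51job.com", "search.51job.com"]
print(is_allowed("https://search.51job.com/list/1.html", allowed))
print(is_allowed("https://jobs.51job.com/some-page.html", allowed))
```

Note that listing www.51job.com does not cover sibling subdomains. If you want everything under the site, listing the bare domain 51job.com is the usual choice.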

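The parse method is the callback for the start_urls responses. This is where the response for the home page is handled (when I was first learning I crawled the home page; really this URL should just be the target page so the data can be extracted directly, but I never got around to changing it). Elements can be located with regular expressions, XPath, and similar methods.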
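As a rough illustration of the regular-expression route, the sketch below pulls links and titles out of a made-up HTML snippet. The markup, the example.com URLs, and the extract_jobs helper are all assumptions; the real 51job pages will look different.

```python
import re

# Invented markup standing in for a listing page.
SAMPLE_HTML = """
<div class="el"><a href="https://example.com/job/1.html" title="Python Developer">Python Developer</a></div>
<div class="el"><a href="https://example.com/job/2.html" title="Data Engineer">Data Engineer</a></div>
"""

# Named groups capture the link target and the title attribute.
JOB_LINK = re.compile(r'<a href="(?P<url>[^"]+)" title="(?P<title>[^"]+)"')


def extract_jobs(html):
    """Return one dict with 'url' and 'title' per matching link."""
    return [m.groupdict() for m in JOB_LINK.finditer(html)]


for job in extract_jobs(SAMPLE_HTML):
    print(job["title"], job["url"])
```

Inside a real callback you would pass response.text to such a function. The XPath route for the same markup would be something like response.xpath('//div[@class="el"]/a/@href').getall().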

yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

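scrapy.Request is an ordinary request and defaults to GET. It can be switched to POST with the method argument, and FormRequest is available for form submissions.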

dont_filter=True allows a page that has already been requested to be crawled again.
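The idea behind the flag can be sketched with a plain set of seen URLs. This is a deliberate simplification: Scrapy's real duplicate filter works on a fingerprint of the whole request rather than the raw URL string, and SeenFilter is a hypothetical class for illustration only.

```python
class SeenFilter:
    """Toy duplicate filter: remembers URLs and rejects repeats unless told not to."""

    def __init__(self):
        self._seen = set()

    def should_fetch(self, url, dont_filter=False):
        if dont_filter:
            # Skip the duplicate check entirely.
            return True
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


seen = SeenFilter()
first = seen.should_fetch("https://search.51job.com/")
second = seen.should_fetch("https://search.51job.com/")
forced = seen.should_fetch("https://search.51job.com/", dont_filter=True)
print(first, second, forced)
```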

callback is the method that is called with the response.

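A callback can go on processing the data or request new pages, for example crawling the detail pages after crawling a list page. I'll cover the further processing in a later post.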

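One more thing for later: 51job detail pages currently sit behind a slider CAPTCHA. I'll look into handling that when I have time, along with spoofing the User-Agent.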
