## 一.数据解析

### 1.xpath解析(各个爬虫语言通用)
#### (1)环境安装
```
pip install lxml
```
#### (2)解析原理
- 获取页面源码数据
- 实例化etree对象,将页面源码数据加载到该对象中
- 调用该对象的xpath方法进行指定标签的定位(xpath方法必须结合xpath表达式进行标签的定位和内容的捕获)
#### (3)xpath语法(返回值是一个列表)
```
属性定位    /  相当于 > (在开头时表示从根节点开始)
            // 相当于 ' '(表示任意多个层级)
            @  表示属性 例://div[@class="song"]
索引定位    //ul/li[2] (索引从1开始)
逻辑运算    //a[@href='' and @class='du']  与
            //a[@href='' | @class='du']   或
模糊匹配    //div[contains(@class,'ng')]
            //div[starts-with(@class,'ng')]
取文本      //div/text()   直系文本内容
            //div//text()  非直系文本内容(返回列表)
取属性      //div/@href
```
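下面是一个最小示例,用一段假设的html片段演示上述几类xpath表达式(仅作语法示意):

```python
from lxml import etree

# 假设的一段页面源码,仅用于演示xpath语法
html = '''
<div class="song">
    <ul>
        <li><a href="" class="du">李清照</a></li>
        <li><a href="/wang">王安石</a></li>
    </ul>
</div>
'''
tree = etree.HTML(html)

print(tree.xpath('//div[@class="song"]//li'))       # 属性定位 + 任意层级
print(tree.xpath('//ul/li[2]/a/text()'))            # 索引定位,从1开始
print(tree.xpath('//a[@href="" and @class="du"]'))  # 逻辑运算(与)
print(tree.xpath('//div[contains(@class,"ng")]'))   # 模糊匹配
print(tree.xpath('//div//text()'))                  # 非直系文本内容
print(tree.xpath('//ul/li/a/@href'))                # 取属性
```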
#### (4)案例
##### 案例一:58同城二手房数据爬取
```python
import requests
from lxml import etree

url = 'https://bj.58.com/changping/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0000-1cc0-306c-511ad17612b3&ClickID=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
# 获取页面源码数据
origin_data = requests.get(url=url, headers=headers).text
# 实例化etree对象,定位每套房源的标题和价格
tree = etree.HTML(origin_data)
title_price_list = tree.xpath('//ul[@class="house-list-wrap"]/li/div[2]/h2/a/text() | //ul[@class="house-list-wrap"]/li/div[3]//text()')
with open('./文件夹1/fangyuan.txt', 'w', encoding='utf-8') as f:
    for title_price in title_price_list:
        f.write(title_price)
print("over")
```
###### *注:区别解析的数据源是整个源码还是局部数据*
```
源码数据  tree.xpath('//ul...')  # 以//开头,从整个文档定位
局部数据  li.xpath('./ul...')    # 以.开头,从当前标签开始定位
```
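一个最小示例(html片段为演示假设),展示源码定位与局部定位的配合:

```python
from lxml import etree

# 假设的页面片段,仅用于演示
html = '<ul><li><a>标题1</a></li><li><a>标题2</a></li></ul>'
tree = etree.HTML(html)

li_list = tree.xpath('//ul/li')    # 源码数据:以//开头,从整个文档定位
for li in li_list:
    print(li.xpath('./a/text()'))  # 局部数据:以.开头,从当前li标签定位
```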
##### 测试xpath语法的正确性
###### 方式一:xpath.crx(xpath插件)
```
找到浏览器的 更多工具 > 拓展程序
开启开发者模式
将xpath.crx拖动到浏览器中
xpath插件启动快捷键:ctrl+shift+x
作用:用于测试xpath语法的正确性
```
###### 方式二:浏览器自带(在开发者工具的Elements面板中按ctrl+f,可直接输入xpath表达式检索)
##### 案例二:4k网爬取图片
```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
page_num = int(input("请输入要爬取的页数:"))
for page in range(1, page_num + 1):
    # 第一页的url没有页码后缀,其余页为index_页码.html
    if page == 1:
        url = 'http://pic.netbian.com/4kyingshi/index.html'
    else:
        url = 'http://pic.netbian.com/4kyingshi/index_%d.html' % page
    origin_data = requests.get(url=url, headers=headers).text
    tree = etree.HTML(origin_data)
    a_list = tree.xpath('//ul[@class="clearfix"]/li/a')
    for a in a_list:
        name = a.xpath('./b/text()')[0]
        # 页面为gbk编码,转码解决中文乱码
        name = name.encode('iso-8859-1').decode('gbk')
        picture_url = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        picture = requests.get(url=picture_url, headers=headers).content
        picture_name = './文件夹2/' + name + '.jpg'
        with open(picture_name, 'wb') as f:
            f.write(picture)
print('over!!!')
```
###### 中文乱码问题
```
方式一: response.encoding='gbk'
方式二: name=name.encode('iso-8859-1').decode('utf-8')
注:gbk/utf-8需按页面实际编码选择
```
###### 数据来源问题
```
etree.HTML()   # 处理网络数据
etree.parse()  # 处理本地数据
```
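两者的调用方式可对照下面的小示例(本地文件test.html为假设):

```python
from lxml import etree

# 网络数据:传入页面源码字符串
tree = etree.HTML('<div><p>hello</p></div>')
print(tree.xpath('//p/text()'))   # ['hello']

# 本地数据:传入文件路径(假设当前目录下存在test.html)
# tree = etree.parse('./test.html')
# print(tree.xpath('//p/text()'))
```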
##### 案例3:爬取煎蛋网图片
```python
import requests
import base64
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://jandan.net/ooxx'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# 每张图片的真实地址以base64密文存放在span标签中
span_list = tree.xpath('//span[@class="img-hash"]/text()')
for span in span_list:
    src = 'http:' + base64.b64decode(span).decode('utf-8')
    picture_data = requests.get(url=src, headers=headers).content
    name = './文件夹3/' + src.split('/')[-1]
    with open(name, 'wb') as f:
        f.write(picture_data)
print('over!!!')
```
###### ##反爬机制3:base64
在response返回的数据中,所有图片的src都是相同的,但每张图片都有一个span标签,其中存储着一串加密字符串;同时发现一个jandan_load_img函数,故猜测通过此函数可以由该加密字符串得到图片地址.
排查过程:
1. 全局搜索jandan_load_img函数
2. 发现此函数中用到了jdtPGUg7oYxbEGFASovweZE267FFvm5aYz
3. 全局搜索jdtPGUg7oYxbEGFASovweZE267FFvm5aYz
4. 发现该函数的最后用到了base64_decode

故断定该加密字符串用base64解密即可得到图片地址
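可用一小段代码验证这一解密流程(图片地址为假设,密文由示例现场生成):

```python
import base64

# 假设的图片地址:先模拟网站的加密,再按案例中的方式解密
raw = '//wx3.sinaimg.cn/mw600/example.jpg'
img_hash = base64.b64encode(raw.encode('utf-8')).decode('utf-8')

src = 'http:' + base64.b64decode(img_hash).decode('utf-8')
print(src)   # http://wx3.sinaimg.cn/mw600/example.jpg
```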
##### 案例4:站长素材简历爬取
```python
import requests
import random
from lxml import etree

headers = {
    'Connection': 'close',  # 反爬机制4:防止连接池占满
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free.html'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
src_list = tree.xpath('//div[@id="main"]/div/div/a/@href')
for src in src_list:
    filename = './文件夹4/' + src.split('/')[-1].split('.')[0] + '.rar'
    print(filename)
    # 进入详情页,随机选择一个下载地址
    down_page_data = requests.get(url=src, headers=headers).text
    down_tree = etree.HTML(down_page_data)
    down_list = down_tree.xpath('//div[@id="down"]/div[2]/ul/li/a/@href')
    res = random.choice(down_list)
    print(res)
    jianli = requests.get(url=res, headers=headers).content
    with open(filename, 'wb') as f:
        f.write(jianli)
print('over!!!')
```
###### ##反爬机制4:Connection
经典错误
```
HTTPConnectionPool(host:xx) Max retries exceeded with url
```
原因
```
1.每次数据传输前客户端都要和服务端建立TCP连接。为了节省传输消耗,requests默认keep-alive,即连接一次传输多次;然而如果连接迟迟不断开,连接池占满后便无法产生新的连接对象,导致请求无法发送
2.IP被封
3.请求频率太频繁
```
解决
```
1.设置请求头中Connection的值为close,每次成功后断开连接
2.更换请求IP
3.每次请求之间使用sleep进行请求间隔
```
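下面是综合这三种方案的示意写法(代理IP与url列表均为占位示例,需替换为可用值):

```python
import time
import random
import requests

headers = {
    'Connection': 'close',  # 方案1:每次请求成功后断开连接
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
# 方案2:维护代理池,每次随机更换IP(以下为占位地址)
proxy_pool = [{'https': '116.197.134.153:80'}, {'https': '103.224.100.43:8080'}]

urls = ['http://sc.chinaz.com/jianli/free.html']  # 假设的待爬url列表
for url in urls:
    data = requests.get(url=url, headers=headers,
                        proxies=random.choice(proxy_pool)).text
    time.sleep(1)  # 方案3:用sleep控制请求频率
```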
##### 案例5:解析所有的城市名称
```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# 热门城市
hot_list = tree.xpath('//div[@class="row"]/div/div[1]/div/text() | //div[@class="row"]/div/div[1]/div[@class="bottom"]/ul[@class="unstyled"]/li/a/text()')
with open('./文件夹1/city.txt', 'w', encoding='utf-8') as f:
    for hot in hot_list:
        f.write(hot.strip())
    # 全部城市
    common_list = tree.xpath('//div[@class="row"]/div/div[2]/div[1]/text() | //div[@class="row"]/div/div[2]/div[2]/ul//text()')
    for common in common_list:
        f.write(common.strip())
print('over!!!')
```
##### 案例6:图片懒加载,站长素材婚纱照
```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://sc.chinaz.com/tupian/hunsha.html'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
div_list = tree.xpath('//div[@id="container"]/div')
for div in div_list:
    # 页面为utf-8编码,转码解决中文乱码
    title = div.xpath('./p/a/text()')[0].encode('iso-8859-1').decode('utf-8')
    name = './文件夹1/' + title + '.jpg'
    # 进入详情页取大图地址
    photo_url = div.xpath('./div/a/@href')[0]
    detail_data = requests.get(url=photo_url, headers=headers).text
    detail_tree = etree.HTML(detail_data)
    url_it = detail_tree.xpath('//div[@class="imga"]/a/img/@src')[0]
    picture = requests.get(url=url_it, headers=headers).content
    with open(name, 'wb') as f:
        f.write(picture)
print('over!!!')
```
###### ##反爬机制5:代理IP
使用
```python
import requests
import random

# 代理池(免费代理时效性差,以下为当时可用的示例地址)
proxie = [
    {'https': '116.197.134.153:80'},
    {'https': '103.224.100.43:8080'},
    {'https': '222.74.237.246:808'}
]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'https://www.baidu.com/s?wd=ip'
origin_data = requests.get(url=url, headers=headers, proxies=random.choice(proxie)).text
with open('./ip.html', 'w', encoding='utf-8') as f:
    f.write(origin_data)
print('over!!!')
```
常用代理网站
```
www.goubanjia.com
快代理
西祠代理
```
代理知识
```
透明:对方知道使用了代理,且知道真实IP
匿名:对方知道使用了代理,不知道真实IP
高匿:对方不知道使用了代理,更不知道真实IP
```
*注:代理IP的类型必须和请求url的协议头保持一致*
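例如(IP为占位示例):

```python
# 代理类型必须与请求url的协议一致
https_proxy = {'https': '116.197.134.153:80'}  # 用于 https:// 开头的url
http_proxy = {'http': '116.197.134.153:80'}    # 用于 http:// 开头的url
```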
*https://www.55xia.com 下载电影*
*排查顺序:动态加载 → url加密 → element*