I've recently wanted to learn PyNGL and PyNIO, but I don't always have an internet connection, so I picked up some web scraping and crawled the Examples from the official Tutorial pages to keep for offline use. There don't seem to be many scraping posts on the forum; maybe the veterans have long moved past this, so I'm taking the chance to post mine and ask for comments and corrections. Thanks! :)
##################################
What we scrape: 1. the explanatory text of each Tutorial page
2. the program output images (saved as .png)
3. the scripts (.py)
##################################
What you need: 1. Linux + Python 3.x
2. requests
3. BeautifulSoup (from bs4)
4. PIL (Pillow)
5. BytesIO (from the standard-library io module)
6. os
7. time
(requests, bs4, Pillow and the lxml parser can all be installed with pip)
###################################
Approach: this is a hand-rolled scrape with requests + BeautifulSoup; more advanced tools include Scrapy, Selenium, ...
#########################
def get_soup(url):
    try:
        response = requests.get(url)
        response.encoding = response.apparent_encoding  # guess the right encoding from the content
        text = response.text
        print(response.status_code)                     # 200 means success
        soup = BeautifulSoup(text, 'lxml')
        return soup
    except Exception:
        print("ERROR")
Making the soup:
1. response = requests.get(url, headers=headers) sends a browser-like request; headers is a dict, usually carrying browser information (e.g. a User-Agent) to improve the odds of success. You can copy it from the Network tab of the browser's developer tools (a small sketch follows this list).
2. response.encoding = response.apparent_encoding sets the encoding guessed from the content.
3. response.text is the HTML source, matching the Elements tab in the developer tools.
4. Print the status code; 200 means success.
5. BeautifulSoup parses the HTML so information is easy to extract; this uses the lxml parser, which must be installed (other parsers also work).
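For illustration, a minimal sketch of passing such a headers dict to requests.get(); the User-Agent string here is just a placeholder, not a value the Tutorial requires:

import requests

# Hypothetical browser-like header; real values can be copied from
# the developer tools' Network tab.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.pyngl.ucar.edu/Tutorial/', headers=headers)
print(response.status_code)  # 200 means the request succeeded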
#############################################
def get_urls(url=tutorial_url):
    soup = get_soup(url)
    info = soup.find_all('tr')           # the Example links live in table rows
    contents = {}
    for infos in info:
        try:
            intro = infos.get_text().strip().split('\n')
            name = intro[0].replace(' ', '_')
            link = tutorial_url + infos.a.attrs['href']
            contents[name] = link
            os.mkdir(path + '/' + name)  # one folder per Example
            os.chdir(path + '/' + name)
            with open(name + '.txt', 'a') as f:
                for word in intro:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("URL ERROR")
    return contents
Extract the Example URLs and save them in the contents dict:
1. Right-click an element of interest on the page and choose Inspect to find where it sits in the HTML. info = soup.find_all('tr') collects every <tr> tag, which is where the URLs we want live.
2. In intro = infos.get_text().strip().split('\n'), get_text() returns the text content, including the description of each Example.
3. In link = tutorial_url + infos.a.attrs['href'], .a picks out the tag <a href='url'>, .attrs returns a dict, and ['href'] looks up the value (see the sketch after this list).
4. os.mkdir and os.chdir create a folder for each Example and move into it to save the files.
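As a self-contained illustration of get_text() and .attrs, here is a toy row whose structure is assumed to resemble the Tutorial's table (the file name is made up):

from bs4 import BeautifulSoup

html = "<table><tr><td><a href='example1.shtml'>Example 1</a></td></tr></table>"
row = BeautifulSoup(html, 'lxml').tr   # attribute access returns the first <tr>
print(row.get_text())                  # -> Example 1
print(row.a.attrs['href'])             # -> example1.shtml
#####################################################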
def get_img_urls(url):
    soup = get_soup(url)
    imgs = soup.find_all('td')
    urls = []
    for img in imgs:
        try:
            link = raw_url + img.a.attrs['href']  # href is relative, so prepend the site root
            urls.append(link)
        except Exception:
            print("IMG URL ERROR")
    return urls
Get the URLs of each Example's results; these are image URLs, used to download the images. Same idea as above.
#############################################
def get_img_content(urls, dir_path):
    for index, url in enumerate(urls):
        try:
            response = requests.get(url)
            time.sleep(0.02)                             # be polite; avoid an IP ban
            img = Image.open(BytesIO(response.content))  # open the raw bytes as an image
            file = dir_path + '/' + 'output_' + str(index) + '.png'
            img.save(file)
        except Exception:
            print("IMG STORE ERROR")
Save the image content:
1. response holds the HTTP response
2. time.sleep(0.02) pauses briefly so the IP doesn't get banned
3. response.content is the image as a byte string
4. BytesIO wraps the bytes in a file-like stream
5. PIL.Image.open() opens that stream as an image
6. save() writes it to disk
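Incidentally, if no re-encoding is needed, the bytes can be written straight to disk without PIL; a minimal sketch (the image URL is hypothetical):

import requests

response = requests.get('https://www.pyngl.ucar.edu/Tutorial/some_plot.png')  # hypothetical URL
with open('output_0.png', 'wb') as f:
    f.write(response.content)  # raw bytes, no decode/re-encode step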
#############################################
def get_content(contents, path=path):
    names = list(contents.keys())
    urls = list(contents.values())
    for name, url in zip(names, urls):
        try:
            soup = get_soup(url)
            main = soup.find('div', class_='main')  # first matching <div>
            text = main.get_text().strip().split('\n')
            start = text.index(' 0. #')             # where the wanted text begins
            end = -3                                # drop the last three lines
            img_urls = get_img_urls(url)
            dir_path = path + '/' + name
            os.chdir(dir_path)
            with open(name + '.txt', 'a') as f:
                for word in text[start:end]:
                    f.write(word)
                    f.write('\n')
            get_img_content(img_urls, dir_path)
        except Exception:
            print("CONTENT ERROR")
Filter out the information we need and store it:
1. In main = soup.find('div', class_='main'), .find returns only the first match, equivalent to .find_all() with the keyword limit=1. class_ (note the underscore) refers to the CSS class and can also be written in dict form (see the sketch below).
2. The rest of the code mostly wrangles strings and looks bloated; I've only just started programming, so my code is what it is, sorry.
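For reference, a self-contained sketch showing the equivalent spellings of that lookup (toy HTML, not the real page):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='main'>tutorial text</div>", 'lxml')
a = soup.find('div', class_='main')                  # keyword form
b = soup.find('div', attrs={'class': 'main'})        # dict form
c = soup.find_all('div', class_='main', limit=1)[0]  # find_all with limit=1
print(a == b == c)                                   # -> True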
#############################################
def get_script(path=path, script_url=script_url):
    for i in range(101, 112):
        try:
            # str(i)[1:] yields '01' ... '11', the number embedded in each script URL
            url = script_url.replace(' ', str(i)[1:])
            try:
                if str(i)[1:].index('0') == 0:  # '01'-'09': drop the leading zero
                    name = 'Example_' + str(i)[-1]
                else:                           # '10'
                    name = 'Example_' + str(i)[1:]
            except ValueError:                  # '11' contains no '0' at all
                name = 'Example_' + str(i)[1:]
            filename = path + '/' + name + '/' + name + '.py'
            response = requests.get(url)
            text = response.text.split('\n')
            with open(filename, 'w') as f:
                for word in text:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("SCRIPT ERROR")
Fetch the scripts:
1. Looking at the script URLs, only the number changes from one Example to the next, so a loop takes care of all of them.
2. Same idea as before, plus a few string tricks; my string handling is clumsy, so bear with me (a tidier variant is sketched below).
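A tidier way to build the same URLs and names, producing output identical to the loop above (a sketch, not the original code):

script_url = 'https://www.pyngl.ucar.edu/Examples/Scripts/ngl p.py'

for n in range(1, 12):
    suffix = '%02d' % n                    # '01' ... '11'
    url = script_url.replace(' ', suffix)  # .../ngl01.py ... .../ngl11.py
    name = 'Example_%d' % n                # 'Example_1' ... 'Example_11'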
#############################################
Full code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
from PIL import Image
from io import BytesIO
import os
import time

raw_url = 'https://www.pyngl.ucar.edu'
tutorial_url = 'https://www.pyngl.ucar.edu/Tutorial/'
# the space in script_url is a placeholder for each example's number
script_url = 'https://www.pyngl.ucar.edu/Examples/Scripts/ngl p.py'
path = '/media/wangweiyi/FILE/Documents/python/pyngl/example'


def get_soup(url):
    try:
        response = requests.get(url)
        response.encoding = response.apparent_encoding
        text = response.text
        print(response.status_code)
        soup = BeautifulSoup(text, 'lxml')
        return soup
    except Exception:
        print("ERROR")


def get_urls(url=tutorial_url):
    soup = get_soup(url)
    info = soup.find_all('tr')
    contents = {}
    for infos in info:
        try:
            intro = infos.get_text().strip().split('\n')
            name = intro[0].replace(' ', '_')
            link = tutorial_url + infos.a.attrs['href']
            contents[name] = link
            os.mkdir(path + '/' + name)
            os.chdir(path + '/' + name)
            with open(name + '.txt', 'a') as f:
                for word in intro:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("URL ERROR")
    return contents


def get_img_urls(url):
    soup = get_soup(url)
    imgs = soup.find_all('td')
    urls = []
    for img in imgs:
        try:
            link = raw_url + img.a.attrs['href']
            urls.append(link)
        except Exception:
            print("IMG URL ERROR")
    return urls


def get_img_content(urls, dir_path):
    for index, url in enumerate(urls):
        try:
            response = requests.get(url)
            time.sleep(0.02)
            img = Image.open(BytesIO(response.content))
            file = dir_path + '/' + 'output_' + str(index) + '.png'
            img.save(file)
        except Exception:
            print("IMG STORE ERROR")


def get_content(contents, path=path):
    names = list(contents.keys())
    urls = list(contents.values())
    for name, url in zip(names, urls):
        try:
            soup = get_soup(url)
            main = soup.find('div', class_='main')
            text = main.get_text().strip().split('\n')
            start = text.index(' 0. #')
            end = -3
            img_urls = get_img_urls(url)
            dir_path = path + '/' + name
            os.chdir(dir_path)
            with open(name + '.txt', 'a') as f:
                for word in text[start:end]:
                    f.write(word)
                    f.write('\n')
            get_img_content(img_urls, dir_path)
        except Exception:
            print("CONTENT ERROR")


def get_script(path=path, script_url=script_url):
    for i in range(101, 112):
        try:
            url = script_url.replace(' ', str(i)[1:])
            try:
                if str(i)[1:].index('0') == 0:
                    name = 'Example_' + str(i)[-1]
                else:
                    name = 'Example_' + str(i)[1:]
            except ValueError:
                name = 'Example_' + str(i)[1:]
            filename = path + '/' + name + '/' + name + '.py'
            response = requests.get(url)
            text = response.text.split('\n')
            with open(filename, 'w') as f:
                for word in text:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("SCRIPT ERROR")


def main(tutorial_url=tutorial_url, script_url=script_url, path=path):
    contents = get_urls(tutorial_url)
    get_content(contents)
    get_script(path, script_url)


if __name__ == '__main__':
    main()