A small web-scraping exercise: scraping the PyNGL/PyNIO Tutorial Examples

8828 · Posted 2017-8-2 12:02:36


Last edited by 8828 on 2017-8-13 18:16

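I recently wanted to learn PyNGL and PyNIO but had no internet connection, so I picked up some web scraping and downloaded the Examples from the official Tutorial to use offline. I looked through the scraping threads on this forum and there aren't many; perhaps the veterans here already have this down cold. So I'm taking the chance to post this and ask for your criticism and pointers. Thanks! :)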

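# Some of the text disappeared when I posted; I'll fill it in after dinner :-|
##################################
What gets scraped: 1. The explanatory text on each page
                   2. The program output (.jpg)
                   3. The scripts (.py)
##################################
Prerequisites: 1. Linux, Python 3.x
               2. requests
               3. BeautifulSoup (the bs4 package)
               4. PIL (Pillow)
               5. BytesIO (from the standard-library io module)
               6. os
               7. time
###################################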

Approach: this is a purely hand-written scraper; more advanced tools include scrapy, selenium, and so on.


#########################

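import requests
from bs4 import BeautifulSoup


def get_soup(url):
    """Fetch url and return the parsed page, or None if anything fails."""
    try:
        response = requests.get(url)
        # Guess the encoding from the body so non-ASCII text decodes correctly.
        response.encoding = response.apparent_encoding
        text = response.text
        print(response.status_code)  # 200 means success
        # The 'lxml' parser requires the lxml package to be installed.
        soup = BeautifulSoup(text, 'lxml')
        return soup
    except Exception:
        print("ERROR")
        return None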


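Making the soup:
1. response = requests.get(url, [headers]) sends a request the way a browser would. headers is a dict that usually carries browser information, which raises the chance that the request succeeds; the values can be looked up in the Network tab of the browser's developer tools.
2. Set the encoding.
3. Get the HTML code, which for a static page corresponds to the Elements tab of the developer tools.
4. Print the status code; 200 means success.
5. Process the HTML with BeautifulSoup so that information is easy to extract. This requires the lxml library (other parsers are also available).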
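
The five steps above can be sketched with the optional headers filled in. This is only a minimal sketch under a few assumptions that are not in the original post: the User-Agent string is a placeholder (copy a real one from your browser's Network tab), the `timeout` and `fetch` parameters are added here so the function can be tried without a network connection, and the built-in "html.parser" is used in place of lxml to avoid an extra dependency.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder value (an assumption); replace it with your browser's own string.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def get_soup_with_headers(url, headers=HEADERS, timeout=10, fetch=requests.get):
    """Fetch url with browser-like headers and return the parsed page.

    Returns None when the request fails. `fetch` is a hypothetical hook
    that defaults to requests.get and can be swapped out for testing.
    """
    try:
        response = fetch(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx status codes
    except requests.RequestException as err:
        print("ERROR:", err)
        return None
    # Guess the encoding from the body, as in the original get_soup.
    response.encoding = response.apparent_encoding
    return BeautifulSoup(response.text, "html.parser")
```

Catching requests.RequestException instead of using a bare except keeps Ctrl-C working and still covers connection errors, timeouts, and bad status codes.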

#############################################

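import os


# tutorial_url and path are module-level variables; their definitions
# are missing from the post as published.
def get_urls(url=tutorial_url):
    """Map each example name to its page URL and create one folder per example."""
    soup = get_soup(url)
    info = soup.find_all('tr')
    contents = {}
    for infos in info:
        try:
            # The first line of the row text is the example name.
            intro = infos.get_text().strip().split('\n')
            name = intro[0].replace(' ', '_')
            link = tutorial_url + infos.a.attrs['href']
            contents[name] = link
            os.mkdir(path + '/' + name)
            os.chdir(path + '/' + name)
            # Save the row's description text next to the example.
            with open(name + '.txt', 'a') as f:
                for word in intro:
                    f.write(word)
                    f.write('\n')
        except Exception:
            # Rows without a link, or folders that already exist, end up here.
            print("URL ERROR")
    return contents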


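Extract the Example URLs and store them in the contents dict:
1. Right-click an element of interest on the page and choose Inspect to find its position in the HTML code. info = soup.find_all('tr') finds every <tr> tag; these hold the URLs we want.
2. In intro = infos.get_text().strip().split('\n'), the get_text() method returns the text content, which includes the description of the Example.
3. In link = tutorial_url + infos.a.attrs['href'], .a is the first <a href=' url '> tag inside the row, .attrs returns a dict, and ['href'] gets the value.
4. os.mkdir and os.chdir create a separate folder for each Example and switch into it to save files.
#####################################################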
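
The link-extraction and folder steps described above can be tried offline on a small HTML snippet. This is a sketch rather than the original method: the HTML, the base URL, and the function names are made up for illustration; it names each example after the link text instead of the first line of the row; it uses urllib.parse.urljoin rather than string concatenation so that both relative and absolute hrefs resolve correctly; and it uses os.makedirs(..., exist_ok=True) so that a second run does not fail on folders that already exist.

```python
import os
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up base URL and table, for illustration only.
BASE_URL = "http://example.invalid/tutorial/"
HTML = """
<table>
  <tr><th>Example</th><th>Description</th></tr>
  <tr><td><a href="example01.shtml">Example 1</a></td><td>Contour plot</td></tr>
  <tr><td><a href="/abs/ex2.shtml">Example 2</a></td><td>Vector plot</td></tr>
</table>
"""


def collect_links(html, base_url):
    """Return {example_name: absolute_url} for every table row with a link."""
    soup = BeautifulSoup(html, "html.parser")
    contents = {}
    for row in soup.find_all("tr"):
        anchor = row.find("a")
        if anchor is None or not anchor.has_attr("href"):
            continue  # header rows and rows without links are skipped
        name = anchor.get_text().strip().replace(" ", "_")
        contents[name] = urljoin(base_url, anchor["href"])
    return contents


def make_example_dirs(names, root):
    """Create one folder per example under root and return their paths."""
    paths = {}
    for name in names:
        paths[name] = os.path.join(root, name)
        os.makedirs(paths[name], exist_ok=True)  # no error if it exists
    return paths
```

Returning the folder paths means later steps can write to os.path.join(folder, filename) directly, without changing the working directory with os.chdir.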

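# raw_url is another module-level variable whose definition is missing
# from the post as published.
def get_img_urls(url):
    """Return the URLs of the result images linked from an example page."""
    soup = get_soup(url)
    imgs = soup.find_all('td')
    urls = []
    for img in imgs:
        try:
            link = raw_url + img.a.attrs['href']
            urls.append(link)
        except Exception:
            # Cells without a link raise here and are skipped.
            print("IMG URL ERROR")
    return urls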


Get the URLs of the results in each Example. These are image URLs, used to download the images. The idea is the same as above.
#############################################

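import time
from io import BytesIO

from PIL import Image


def get_img_content(urls, dir_path):
    """Download each image and save it as output_<index>.png in dir_path."""
    for index, url in enumerate(urls):
        try:
            response = requests.get(url)
            time.sleep(0.02)  # brief pause between requests
            # Wrap the raw bytes in a file-like object so PIL can read them.
            img = Image.open(BytesIO(response.content))
            file = dir_path + '/' + 'output_' + str(index) + '.png'
            img.save(file)
        except Exception:
            print("IMG STORE ERROR")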


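Saving the images:
1. response holds the response to the request
2. time.sleep(0.02) pauses briefly, to avoid getting the IP banned
3. response.content is the image as raw bytes
4. BytesIO wraps those bytes in an in-memory file-like object
5. PIL.Image.open() opens that byte stream
6. save() writes the image to disk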
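
Steps 3 to 6 can be checked without a network connection by generating the image bytes locally. This is a minimal sketch, assuming Pillow is installed; `save_image_bytes` and `make_sample_jpeg` are hypothetical helper names, and the tiny red JPEG merely stands in for `response.content`.

```python
from io import BytesIO
from pathlib import Path

from PIL import Image


def save_image_bytes(data, out_path):
    """Decode raw image bytes and write them to out_path.

    Pillow chooses the output format from the file extension, so JPEG
    bytes saved to a ".png" path are re-encoded as PNG.
    """
    img = Image.open(BytesIO(data))
    img.save(out_path)
    return img.size


def make_sample_jpeg(width=4, height=3):
    """Build a tiny in-memory JPEG to stand in for response.content."""
    buffer = BytesIO()
    Image.new("RGB", (width, height), "red").save(buffer, format="JPEG")
    return buffer.getvalue()
```

For example, save_image_bytes(make_sample_jpeg(), Path("output_0.png")) writes a PNG file even though the input bytes are JPEG. If no format conversion is needed, writing response.content straight to a file opened in 'wb' mode is a simpler alternative.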
#############################################

def get_content(contents, path=path):
    """Append each example's page text to its .txt file and save its images."""
    for name, url in contents.items():
        try:
            soup = get_soup(url)
            main = soup.find('div', class_='main')
            text = main.get_text().strip().split('\n')
            # Keep the lines from the ' 0. #' marker up to, but not
            # including, the last three lines of the page text.
            start = text.index(' 0. #')
            end = -3
            img_urls = get_img_urls(url)
            dir_path = path + '/' + name
            os.chdir(dir_path)
            with open(name + '.txt', 'a') as f:
                for word in text[start:end]:
                    f.write(word)
                    f.write('\n')
            get_img_content(img_urls, dir_path)
        except Exception:
            print("CONTENT ERROR")


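Filter out the needed information and store it
1. In main = soup.find('div', class_='main'), .find returns the first match. It behaves like .find_all() with the keyword limit=1, except that it returns the tag itself rather than a one-element list. class_ corresponds to the CSS class and can also be written in dict form.
2. The rest of the code is mostly string handling and looks bloated. I have only just gotten started with programming and my coding skill is limited, sorry.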
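
The relationship between .find, .find_all(limit=1), and the dict form mentioned in point 1 can be seen on a small snippet. The HTML below is made up for illustration, and "html.parser" is used so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two matching <div> tags, for illustration only.
HTML = """
<div class="nav">menu</div>
<div class="main">first main block</div>
<div class="main">second main block</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# .find returns the first matching tag (or None if nothing matches).
first = soup.find("div", class_="main")

# .find_all with limit=1 returns a list holding that same first match.
limited = soup.find_all("div", class_="main", limit=1)

# The dict form avoids the trailing underscore in class_.
by_dict = soup.find("div", attrs={"class": "main"})

print(first.get_text())
print(len(limited), limited[0].get_text())
print(by_dict.get_text())
```

Because .find returns None when nothing matches, checking the result before calling .get_text() gives a clearer failure than relying on the surrounding try/except.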
#############################################