I've recently wanted to learn PyNGL and PyNIO, but I don't always have an internet connection, so I picked up some web scraping and crawled the Examples from the official Tutorial pages to keep for offline use. There don't seem to be many scraping posts on the forum; maybe the veterans have long moved past this, so I'm taking the chance to post mine and ask for comments and corrections. Thanks! :)
##################################
What we scrape: 1. the explanatory text of each Tutorial page
2. the program output images (saved as .png)
3. the scripts (.py)
##################################
What you need: 1. Linux + Python 3.x
2. requests
3. BeautifulSoup (from bs4)
4. PIL (Pillow)
5. BytesIO (from the standard-library io module)
6. os
7. time
(requests, bs4, Pillow and the lxml parser can all be installed with pip)
###################################
Approach: this is a hand-rolled scrape with requests + BeautifulSoup; more advanced tools include Scrapy, Selenium, ...
#########################
def get_soup(url):
    try:
        response = requests.get(url)
        response.encoding = response.apparent_encoding  # guess the right encoding from the content
        text = response.text
        print(response.status_code)                     # 200 means success
        soup = BeautifulSoup(text, 'lxml')
        return soup
    except Exception:
        print("ERROR")
Making the soup:
1. response = requests.get(url, headers=headers) sends a browser-like request; headers is a dict, usually carrying browser information (e.g. a User-Agent) to improve the odds of success. You can copy it from the Network tab of the browser's developer tools (a small sketch follows this list).
2. response.encoding = response.apparent_encoding sets the encoding guessed from the content.
3. response.text is the HTML source, matching the Elements tab in the developer tools.
4. Print the status code; 200 means success.
5. BeautifulSoup parses the HTML so information is easy to extract; this uses the lxml parser, which must be installed (other parsers also work).
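For illustration, a minimal sketch of passing such a headers dict to requests.get(); the User-Agent string here is just a placeholder, not a value the Tutorial requires:

import requests

# Hypothetical browser-like header; real values can be copied from
# the developer tools' Network tab.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.pyngl.ucar.edu/Tutorial/', headers=headers)
print(response.status_code)  # 200 means the request succeeded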
#############################################
def get_urls(url=tutorial_url):
    soup = get_soup(url)
    info = soup.find_all('tr')           # the Example links live in table rows
    contents = {}
    for infos in info:
        try:
            intro = infos.get_text().strip().split('\n')
            name = intro[0].replace(' ', '_')
            link = tutorial_url + infos.a.attrs['href']
            contents[name] = link
            os.mkdir(path + '/' + name)  # one folder per Example
            os.chdir(path + '/' + name)
            with open(name + '.txt', 'a') as f:
                for word in intro:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("URL ERROR")
    return contents
Extract the Example URLs and save them in the contents dict:
1. Right-click an element of interest on the page and choose Inspect to find where it sits in the HTML. info = soup.find_all('tr') collects every <tr> tag, which is where the URLs we want live.
2. In intro = infos.get_text().strip().split('\n'), get_text() returns the text content, including the description of each Example.
3. In link = tutorial_url + infos.a.attrs['href'], .a picks out the tag <a href='url'>, .attrs returns a dict, and ['href'] looks up the value (see the sketch after this list).
4. os.mkdir and os.chdir create a folder for each Example and move into it to save the files.
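As a self-contained illustration of get_text() and .attrs, here is a toy row whose structure is assumed to resemble the Tutorial's table (the file name is made up):

from bs4 import BeautifulSoup

html = "<table><tr><td><a href='example1.shtml'>Example 1</a></td></tr></table>"
row = BeautifulSoup(html, 'lxml').tr   # attribute access returns the first <tr>
print(row.get_text())                  # -> Example 1
print(row.a.attrs['href'])             # -> example1.shtml
#####################################################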
def get_img_urls(url):
    soup = get_soup(url)
    imgs = soup.find_all('td')
    urls = []
    for img in imgs:
        try:
            link = raw_url + img.a.attrs['href']  # href is relative, so prepend the site root
            urls.append(link)
        except Exception:
            print("IMG URL ERROR")
    return urls
Get the URLs of each Example's results; these are image URLs, used to download the images. Same idea as above.
#############################################
def get_img_content(urls, dir_path):
    for index, url in enumerate(urls):
        try:
            response = requests.get(url)
            time.sleep(0.02)                             # be polite; avoid an IP ban
            img = Image.open(BytesIO(response.content))  # open the raw bytes as an image
            file = dir_path + '/' + 'output_' + str(index) + '.png'
            img.save(file)
        except Exception:
            print("IMG STORE ERROR")
Save the image content:
1. response holds the HTTP response
2. time.sleep(0.02) pauses briefly so the IP doesn't get banned
3. response.content is the image as a byte string
4. BytesIO wraps the bytes in a file-like stream
5. PIL.Image.open() opens that stream as an image
6. save() writes it to disk
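Incidentally, if no re-encoding is needed, the bytes can be written straight to disk without PIL; a minimal sketch (the image URL is hypothetical):

import requests

response = requests.get('https://www.pyngl.ucar.edu/Tutorial/some_plot.png')  # hypothetical URL
with open('output_0.png', 'wb') as f:
    f.write(response.content)  # raw bytes, no decode/re-encode step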
#############################################
def get_content(contents, path=path):
    names = list(contents.keys())
    urls = list(contents.values())
    for name, url in zip(names, urls):
        try:
            soup = get_soup(url)
            main = soup.find('div', class_='main')  # first matching <div>
            text = main.get_text().strip().split('\n')
            start = text.index(' 0. #')             # where the wanted text begins
            end = -3                                # drop the last three lines
            img_urls = get_img_urls(url)
            dir_path = path + '/' + name
            os.chdir(dir_path)
            with open(name + '.txt', 'a') as f:
                for word in text[start:end]:
                    f.write(word)
                    f.write('\n')
            get_img_content(img_urls, dir_path)
        except Exception:
            print("CONTENT ERROR")
Filter out the information we need and store it:
1. In main = soup.find('div', class_='main'), .find returns only the first match, equivalent to .find_all() with the keyword limit=1. class_ (note the underscore) refers to the CSS class and can also be written in dict form (see the sketch below).
2. The rest of the code mostly wrangles strings and looks bloated; I've only just started programming, so my code is what it is, sorry.
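For reference, a self-contained sketch showing the equivalent spellings of that lookup (toy HTML, not the real page):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='main'>tutorial text</div>", 'lxml')
a = soup.find('div', class_='main')                  # keyword form
b = soup.find('div', attrs={'class': 'main'})        # dict form
c = soup.find_all('div', class_='main', limit=1)[0]  # find_all with limit=1
print(a == b == c)                                   # -> True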
#############################################
def get_script(path=path, script_url=script_url):
    for i in range(101, 112):
        try:
            # str(i)[1:] yields '01' ... '11', the number embedded in each script URL
            url = script_url.replace(' ', str(i)[1:])
            try:
                if str(i)[1:].index('0') == 0:  # '01'-'09': drop the leading zero
                    name = 'Example_' + str(i)[-1]
                else:                           # '10'
                    name = 'Example_' + str(i)[1:]
            except ValueError:                  # '11' contains no '0' at all
                name = 'Example_' + str(i)[1:]
            filename = path + '/' + name + '/' + name + '.py'
            response = requests.get(url)
            text = response.text.split('\n')
            with open(filename, 'w') as f:
                for word in text:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("SCRIPT ERROR")
Fetch the scripts:
1. Looking at the script URLs, only the number changes from one Example to the next, so a loop takes care of all of them.
2. Same idea as before, plus a few string tricks; my string handling is clumsy, so bear with me (a tidier variant is sketched below).
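A tidier way to build the same URLs and names, producing output identical to the loop above (a sketch, not the original code):

script_url = 'https://www.pyngl.ucar.edu/Examples/Scripts/ngl p.py'

for n in range(1, 12):
    suffix = '%02d' % n                    # '01' ... '11'
    url = script_url.replace(' ', suffix)  # .../ngl01.py ... .../ngl11.py
    name = 'Example_%d' % n                # 'Example_1' ... 'Example_11'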
#############################################
Full code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
from PIL import Image
from io import BytesIO
import os
import time

raw_url = 'https://www.pyngl.ucar.edu'
tutorial_url = 'https://www.pyngl.ucar.edu/Tutorial/'
# the space in script_url is a placeholder for each example's number
script_url = 'https://www.pyngl.ucar.edu/Examples/Scripts/ngl p.py'
path = '/media/wangweiyi/FILE/Documents/python/pyngl/example'


def get_soup(url):
    try:
        response = requests.get(url)
        response.encoding = response.apparent_encoding
        text = response.text
        print(response.status_code)
        soup = BeautifulSoup(text, 'lxml')
        return soup
    except Exception:
        print("ERROR")


def get_urls(url=tutorial_url):
    soup = get_soup(url)
    info = soup.find_all('tr')
    contents = {}
    for infos in info:
        try:
            intro = infos.get_text().strip().split('\n')
            name = intro[0].replace(' ', '_')
            link = tutorial_url + infos.a.attrs['href']
            contents[name] = link
            os.mkdir(path + '/' + name)
            os.chdir(path + '/' + name)
            with open(name + '.txt', 'a') as f:
                for word in intro:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("URL ERROR")
    return contents


def get_img_urls(url):
    soup = get_soup(url)
    imgs = soup.find_all('td')
    urls = []
    for img in imgs:
        try:
            link = raw_url + img.a.attrs['href']
            urls.append(link)
        except Exception:
            print("IMG URL ERROR")
    return urls


def get_img_content(urls, dir_path):
    for index, url in enumerate(urls):
        try:
            response = requests.get(url)
            time.sleep(0.02)
            img = Image.open(BytesIO(response.content))
            file = dir_path + '/' + 'output_' + str(index) + '.png'
            img.save(file)
        except Exception:
            print("IMG STORE ERROR")


def get_content(contents, path=path):
    names = list(contents.keys())
    urls = list(contents.values())
    for name, url in zip(names, urls):
        try:
            soup = get_soup(url)
            main = soup.find('div', class_='main')
            text = main.get_text().strip().split('\n')
            start = text.index(' 0. #')
            end = -3
            img_urls = get_img_urls(url)
            dir_path = path + '/' + name
            os.chdir(dir_path)
            with open(name + '.txt', 'a') as f:
                for word in text[start:end]:
                    f.write(word)
                    f.write('\n')
            get_img_content(img_urls, dir_path)
        except Exception:
            print("CONTENT ERROR")


def get_script(path=path, script_url=script_url):
    for i in range(101, 112):
        try:
            url = script_url.replace(' ', str(i)[1:])
            try:
                if str(i)[1:].index('0') == 0:
                    name = 'Example_' + str(i)[-1]
                else:
                    name = 'Example_' + str(i)[1:]
            except ValueError:
                name = 'Example_' + str(i)[1:]
            filename = path + '/' + name + '/' + name + '.py'
            response = requests.get(url)
            text = response.text.split('\n')
            with open(filename, 'w') as f:
                for word in text:
                    f.write(word)
                    f.write('\n')
        except Exception:
            print("SCRIPT ERROR")


def main(tutorial_url=tutorial_url, script_url=script_url, path=path):
    contents = get_urls(tutorial_url)
    get_content(contents)
    get_script(path, script_url)


if __name__ == '__main__':
    main()