0001-01

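Crawler Website Analysis

The difficulty of classification lies in the diversity of websites. This diversity includes: diverse authentication mechanisms (login flows are complex, carry a lot of information, involve many page redirects, and are sometimes accompanied by CAPTCHA entry, which is very common); information lookup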

Dianping (大众点评) Crawler

http://stackoverflow.com/questions/23937933/could-not-run-curl-config-errno-2-no-such-file-or-directory-when-installing
# encoding=utf-8
import urllib2
from bs4 import BeautifulSoup
# Request header information
header = {'User-Agent':'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'}
# {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0"}
# Test brand_name
brand_name=["星巴
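The excerpt above sets a browser-like User-Agent header before requesting pages. As a rough sketch of that one step, the snippet below builds such a request with Python 3's urllib.request rather than the Python 2 urllib2 module used in the post. The names build_request and fetch_html and the example.com URL are hypothetical and only serve the illustration; they are not taken from the original code.

```python
import urllib.request

# User-Agent string copied from the excerpt above.
USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(request, timeout=10):
    """Send the request and decode the body (performs real network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder URL (assumption); building the request does not touch the network.
req = build_request("http://example.com/search?q=coffee")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Keeping request construction separate from fetch_html makes it possible to check the headers without sending anything; fetch_html is only needed once a real target URL is known.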

0001-01-01

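Python Notes: Extracting Hyperlinks from a Web Page

To extract the hyperlinks from a web page, it is convenient to first read the page content and then parse it with BeautifulSoup. However, I noticed a problem: if you directly extract the a tag's h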
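The read-then-parse approach described above can be sketched roughly as follows. To stay free of third-party packages, this sketch swaps in the standard library's html.parser in place of BeautifulSoup. The names LinkCollector and extract_links, the sample HTML, and the choice to resolve relative links against a base URL are all assumptions made for the illustration, not something the post states.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip <a> tags with a missing or empty href
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url=""):
    """Return all hyperlinks found in an HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page (assumption) with one relative and one absolute link.
sample = """
<a href="/about">About</a>
<a href="http://example.com/post/1">Post</a>
<a name="top">No link here</a>
"""
print(extract_links(sample, "http://example.com/"))
```

With BeautifulSoup itself, a comparable result should be obtainable with `[a["href"] for a in soup.find_all("a", href=True)]`, where `href=True` skips a tags that have no href attribute.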

0001-01-01

A Python Cmd Example: A Web Crawler Application

Without further ado, here is the code:
# encoding=utf-8
import os
import multiprocessing
from cmd import Cmd
import commands
from mycrawler.dbUtil import DbUtil
import signal
# Download monitor
def run_download_watch():
    os.system("gnome-terminal -x bash -c 'python ./download_process.py' ")
# Download files
def run_download():
    os.system("gnome-terminal -x bash -c 'python ./download.py' ")
# Crawler
def run_spider(arg):
    for i in range(len(arg)):
        os.system("gnome-terminal -x bash -c 'scrapy
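Since the excerpt stops before the Cmd subclass appears, here is a minimal sketch of how a cmd.Cmd console might dispatch such tasks. Everything in it is an assumption for illustration: the class name CrawlerShell, the command names, the injectable runner, and the guess that the truncated scrapy call is a `scrapy crawl NAME` invocation. By default it only records the commands it would launch (a dry run) instead of opening terminals.

```python
import io
import shlex
from cmd import Cmd


class CrawlerShell(Cmd):
    """Hypothetical console that dispatches crawler tasks by command name."""

    prompt = "crawler> "

    def __init__(self, runner=None, **kwargs):
        super().__init__(**kwargs)
        self.launched = []
        # runner receives an argv list; the default only records it (dry run).
        self.runner = runner or self.launched.append

    def do_spider(self, arg):
        """spider NAME [NAME ...]: start one crawl per spider name."""
        names = shlex.split(arg)
        if not names:
            self.stdout.write("usage: spider NAME [NAME ...]\n")
            return
        for name in names:
            # Assumed command line; the original call is cut off in the excerpt.
            self.runner(["scrapy", "crawl", name])

    def do_download(self, arg):
        """download: start the download script named in the excerpt."""
        self.runner(["python", "./download.py"])

    def do_quit(self, arg):
        """quit: leave the console."""
        return True  # a true return value stops Cmd.cmdloop()


# Drive the shell programmatically; cmdloop() would read from stdin instead.
shell = CrawlerShell(stdout=io.StringIO())
shell.onecmd("spider shops reviews")
shell.onecmd("download")
print(shell.launched)
```

To launch real processes, one option is to pass `runner=subprocess.Popen` and call `shell.cmdloop()`; the dry-run default keeps the dispatch logic checkable without scrapy or a desktop terminal installed.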

0001-01-01

百里求一's Blog

Observe, think, learn
