PYTHON - a Scrapy crawling example

I tested crawling some 58.com pages myself.

A few things I noticed while testing:
Python syntax matters and takes practice: indentation and whitespace must be consistent, and basics such as if and for need to be solid.
XPath also takes practice (a small hands-on example follows this list); see http://yangchao.me/2014/03/scrapy-xpath/
Read more of the Scrapy docs; see http://scrapy-chs.readthedocs.org/zh_CN/0.24/
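
To practice XPath outside a full crawl, Scrapy's Selector can be driven by hand from a Python shell. A minimal sketch (the HTML fragment is made up for illustration):

from scrapy.selector import Selector

# a made-up fragment mimicking a 58.com listing link
html = '<div class="su_con"><a class="t" href="http://shop.58.com/123">shop</a></div>'
sel = Selector(text=html)
print(sel.xpath('//a[@class="t"]/@href').extract())   # [u'http://shop.58.com/123']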

The next goal is to raise the difficulty of the crawl; see http://blog.javachen.com/2014/06/08/using-scrapy-to-cralw-zhihu.html



#Create the pip58 project
[root@centos6 soft]# cd /usr/local/soft/
[root@centos6 soft]# scrapy startproject pip58
[root@centos6 soft]# ls -l pip58
total 8
drwxr-xr-x. 3 root root 4096 Jun 28 21:03 pip58
-rw-r--r--. 1 root root  254 Jun 28 12:57 scrapy.cfg

[root@centos6 pip58]# cd /usr/local/soft/pip58/pip58
[root@centos6 pip58]# ls -l
total 32
-rw-r--r--. 1 root root    0 Jun 27 10:02 __init__.py
-rw-r--r--. 1 root root  126 Jun 28 13:53 __init__.pyc
-rw-r--r--. 1 root root  342 Jun 28 21:02 items.py
-rw-r--r--. 1 root root  381 Jun 28 21:03 items.pyc
-rw-r--r--. 1 root root  558 Jun 28 21:03 pipelines.py
-rw-r--r--. 1 root root  963 Jun 28 21:03 pipelines.pyc
-rw-r--r--. 1 root root 2976 Jun 28 13:30 settings.py
-rw-r--r--. 1 root root  307 Jun 28 13:53 settings.pyc
drwxr-xr-x. 2 root root 4096 Jun 28 21:20 spiders

#Edit items.py
[root@centos6 pip58]# vi items.py
[root@centos6 pip58]# cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Pip58Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
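
Item fields behave like dict keys; a quick sanity check from a Python shell in the project root (the value is hypothetical):

from pip58.items import Pip58Item

item = Pip58Item()
item['url'] = 'http://shop.58.com/123'   # fine: url is a declared Field
print(item['url'])
# item['name'] = '...' would raise KeyError, since no such Field is declared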


#Edit pipelines.py
[root@centos6 pip58]# vi pipelines.py
[root@centos6 pip58]# cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Pip58Pipeline(object):
    def __init__(self):
        self.mfile = open('58pip.html', 'w')

    def process_item(self, item, spider):
        # write one scraped URL per line
        self.mfile.writelines(item['url'] + '\n')
        return item   # return the item so any later pipeline still sees it

    def close_spider(self, spider):
        self.mfile.close()


#Edit settings.py
[root@centos6 pip58]# vi settings.py
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'pip58.pipelines.Pip58Pipeline': 1,
}
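
The integer is the pipeline's order within the 0-1000 range; lower numbers run earlier, which only matters once several pipelines are registered. A sketch (ValidatePipeline is hypothetical):

ITEM_PIPELINES = {
    'pip58.pipelines.ValidatePipeline': 100,   # hypothetical: would run first
    'pip58.pipelines.Pip58Pipeline': 300,      # then the file writer
}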

#Edit pip58Spider.py
[root@centos6 pip58]# vi spiders/pip58Spider.py
# -*- coding: utf-8 -*-
# author: hxmupdata
#create time : 2015-06-28

#########################################################

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pip58.items import Pip58Item


class Pip58Spider(CrawlSpider):
    name = "pip58"
    allowed_domains = ["58.com"]
    start_urls = ["http://nj.58.com/xianlin/shutong/pn3/?key=%E4%BB%99%E6%9E%97%20%E4%BF%AE%E9%A9%AC%E6%A1%B6&cmcskey=%E4%BB%99%E6%9E%97%20%E4%BF%AE%E9%A9%AC%E6%A1%B6&final=1&specialtype=gls&&&nearby=xianlin&PGTID=152219790188149198143820269&ClickID=2"]
    # follow detail links ('shtml') found under <a class="t"> on listing pages
    rules = (Rule(SgmlLinkExtractor(allow=('shtml',), restrict_xpaths=('//a[@class="t"]',)),
                  callback='parse_shtml', follow=True),)

    def parse_shtml(self, response):
        sel = Selector(response)
        for divA in sel.xpath('//div[@class="su_con"]/a'):
            urlItem = Pip58Item()   # fresh item per link, not one shared instance
            urlItem['url'] = ''.join(divA.xpath('.//@href').extract())
            yield urlItem
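
Note that in Scrapy 1.0 the scrapy.contrib.* paths and SgmlLinkExtractor still work but are deprecated, which is likely what produced the single WARNING counted in the crawl stats below. A sketch of the same rule with the newer imports:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# same rule, minus the deprecated sgml extractor
rules = (Rule(LinkExtractor(allow=(r'shtml',), restrict_xpaths=('//a[@class="t"]',)),
              callback='parse_shtml', follow=True),)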

#Run the spider
[root@centos6 pip58]# cd /usr/local/soft/pip58/
[root@centos6 pip58]# ls
pip58  scrapy.cfg
[root@centos6 pip58]# scrapy crawl pip58

2015-06-28 22:03:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 20556,
 'downloader/request_count': 36,
 'downloader/request_method_count/GET': 36,
 'downloader/response_bytes': 493423,
 'downloader/response_count': 36,
 'downloader/response_status_count/200': 36,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 28, 14, 3, 1, 718955),
 'item_scraped_count': 140,
 'log_count/DEBUG': 177,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 36,
 'scheduler/dequeued': 36,
 'scheduler/dequeued/memory': 36,
 'scheduler/enqueued': 36,
 'scheduler/enqueued/memory': 36,
 'start_time': datetime.datetime(2015, 6, 28, 14, 2, 59, 934541)}
2015-06-28 22:03:01 [scrapy] INFO: Spider closed (finished)

#Check the results
[root@centos6 pip58]# tail -n 30 58pip.html 
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
http://shop.58.com/30291875478449727
#
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
http://shop.58.com/36999010217430547
#
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
http://shop.58.com/30249891956123967
#
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
http://shop.58.com/30600991417391935
#
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
http://shop.58.com/59127697489677332
#
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
http://shop.58.com/30600991417391935
#
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
http://shop.58.com/58466253604664896
#
http://c.58cdn.com.cn/ui7/help/pf_xx_v1.html
http://about.58.com/345.html
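
Every listing page links to the same help pages, hence all the repeats above. A minimal de-duplication pipeline sketch (not part of the original project) that could be added to pipelines.py and registered in ITEM_PIPELINES with a number below Pip58Pipeline's so it runs first:

from scrapy.exceptions import DropItem

class DedupUrlPipeline(object):
    """Drop items whose URL was already seen during this crawl."""
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen:
            raise DropItem('duplicate url: %s' % item['url'])
        self.seen.add(item['url'])
        return item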



python - scraping images with Scrapy

Reimplementing http://www.cnblogs.com/JohnnyShy/p/4132113.html as an exercise.


#Pick a directory
[root@centos6 ~]# cd /usr/local/soft/
#Create a new project
[root@centos6 soft]# scrapy startproject moko
[root@centos6 soft]# cd moko
[root@centos6 moko]# ls
moko  scrapy.cfg
[root@centos6 moko]# cd moko
[root@centos6 moko]# ls
__init__.py  __init__.pyc  items.py  items.pyc  pipelines.py  pipelines.pyc  settings.py  settings.pyc  spiders

#Edit items.py
[root@centos6 moko]# vi items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MokoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()

#Edit pipelines.py
[root@centos6 moko]# vi pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MokoPipeline(object):
    def __init__(self):
        self.mfile = open('test.html', 'w')

    def process_item(self, item, spider):
        # write an <img> tag per URL so test.html renders the pictures
        self.mfile.writelines('<img src="' + item['url'] + '"/>' + '\n')
        return item

    def close_spider(self, spider):
        self.mfile.close()


#Edit settings.py

[root@centos6 moko]# vi settings.py
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'moko.pipelines.MokoPipeline': 1,
}


#Create the spider
[root@centos6 moko]# vi spiders/mokospider.py

# -*- coding: utf-8 -*-
#File name: spiders/mokospider.py
#Author:Jhonny Zhang
#mail:veinyy@163.com
#create Time : 2014-11-29
#############################################################################

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from moko.items import MokoItem


class MokoSpider(CrawlSpider):
    name = "moko"
    allowed_domains = ["moko.cc"]
    start_urls = ["http://www.moko.cc/post/aaronsky/list.html"]
    # follow every post page and scan it for lazy-loaded images
    rules = (Rule(SgmlLinkExtractor(allow=(r'/post/\d*\.html',)), callback='parse_img', follow=True),)

    def parse_img(self, response):
        sel = Selector(response)
        for divs in sel.xpath('//div[@class="pic dBd"]'):
            srcs = divs.xpath('.//img/@src2').extract()   # src2 carries the real image URL
            if srcs:                                      # guard against divs without it
                urlItem = MokoItem()
                urlItem['url'] = srcs[0]
                yield urlItem

#cd into the moko project directory and run the spider
[root@centos6 moko]# cd /usr/local/soft/moko/
[root@centos6 moko]# scrapy crawl moko

2015-06-28 12:34:14 [scrapy] DEBUG: Scraped from <200 http://www.moko.cc/post/1098595.html>
None
2015-06-28 12:34:14 [scrapy] DEBUG: Scraped from <200 http://www.moko.cc/post/1098595.html>
None
2015-06-28 12:34:14 [scrapy] DEBUG: Scraped from <200 http://www.moko.cc/post/1091764.html>
None
2015-06-28 12:34:14 [scrapy] DEBUG: Scraped from <200 http://www.moko.cc/post/1091764.html>
None
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/moko/post/6.html)
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/post/1098157.html)
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/post/1098157.html)
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/post/1098269.html)
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/post/1098269.html)
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/post/1098157.html)
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/post/1098157.html)
2015-06-28 12:34:14 [scrapy] DEBUG: Crawled (200)  (referer: http://www.moko.cc/post/1098269.html)

#Check the results
[root@centos6 moko]# ls
moko  scrapy.cfg  test.html

Then happily open test.html to see the result.
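
Writing <img> tags into test.html is only a quick way to eyeball the URLs. To actually download the files, Scrapy ships an images pipeline; a settings sketch, assuming Scrapy 1.0's module path (it expects the item to carry an image_urls list field, and Pillow to be installed):

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/usr/local/soft/moko/images'   # hypothetical download directory

The spider would then yield items with image_urls=[srcs[0]] instead of the url field used above.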

PYTHON - installing Scrapy

Scrapy is a web crawling framework written in Python.

The install went as follows.

Note: easy_install must be the copy that belongs to Python 2.7.6, and pip has to be pinned to a specific version: easy_install -U pip==1.3.1.


[root@centos6 soft]# cat /etc/centos-release 
CentOS release 6.6 (Final)
[root@centos6 soft]# python -V
Python 2.7.6


#Install setuptools, which provides easy_install

[root@centos6 soft]# wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz

--2015-06-27 12:57:49--  https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
Resolving pypi.python.org... 103.245.222.223
Connecting to pypi.python.org|103.245.222.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 796957 (778K) [application/octet-stream]
Saving to: “setuptools-1.4.2.tar.gz”

100%[====================================================================================>] 796,957      238K/s   in 3.3s    

2015-06-27 12:57:53 (238 KB/s) - “setuptools-1.4.2.tar.gz” saved [796957/796957]

[root@centos6 soft]# tar -zxvf setuptools-1.4.2.tar.gz

[root@centos6 soft]# cd setuptools-1.4.2

[root@centos6 setuptools-1.4.2]# python setup.py install

#If /usr/bin/easy_install already exists, check whether it belongs to Python 2.7.6; if not, do the following:

[root@centos6 soft]# find / -name easy_install
/usr/local/soft/.pyenv/shims/easy_install
/usr/local/soft/.pyenv/pyenv.d/exec/pip-rehash/easy_install
/usr/local/soft/.pyenv/versions/2.7.6/bin/easy_install
/usr/bin/easy_install

[root@centos6 soft]# mv /usr/bin/easy_install /usr/bin/easy_install_2.6.6

[root@centos6 soft]# ln -s /usr/local/soft/.pyenv/versions/2.7.6/bin/easy_install /usr/bin/easy_install

#Install pip

[root@centos6 soft]# easy_install pip
Searching for pip
Best match: pip 7.0.3
Adding pip 7.0.3 to easy-install.pth file
Installing pip script to /usr/local/soft/.pyenv/versions/2.7.6/bin
Installing pip3.4 script to /usr/local/soft/.pyenv/versions/2.7.6/bin
Installing pip3 script to /usr/local/soft/.pyenv/versions/2.7.6/bin

Using /usr/local/soft/.pyenv/versions/2.7.6/lib/python2.7/site-packages
Processing dependencies for pip
Finished processing dependencies for pip

[root@centos6 soft]# curl https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py | python
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1388k  100 1388k    0     0  35657      0  0:00:39  0:00:39 --:--:-- 79358
/tmp/tmpb011GA/pip.zip/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
Requirement already up-to-date: pip in ./.pyenv/versions/2.7.6/lib/python2.7/site-packages

#If pip fails to install scrapy like this, the pip version doesn't match: the /usr/bin/pip wrapper script was generated for pip 1.3.1, so pkg_resources demands exactly that version. Reinstall pip==1.3.1 as the message suggests.

[root@centos6 soft]# pip install scrapy 
Traceback (most recent call last):
  File "/usr/bin/pip", line 5, in 
    from pkg_resources import load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 2797, in 
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 576, in resolve
pkg_resources.DistributionNotFound: pip==1.3.1

[root@centos6 soft]# easy_install -U pip==1.3.1
Searching for pip==1.3.1
Reading https://pypi.python.org/simple/pip/
Best match: pip 1.3.1
Downloading https://pypi.python.org/packages/source/p/pip/pip-1.3.1.tar.gz#md5=cbb27a191cebc58997c4da8513863153
Processing pip-1.3.1.tar.gz
Writing /tmp/easy_install-OuhrAD/pip-1.3.1/setup.cfg
Running pip-1.3.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-OuhrAD/pip-1.3.1/egg-dist-tmp-bizzQ0
warning: no files found matching '*.html' under directory 'docs'
warning: no previously-included files matching '*.txt' found under directory 'docs/_build'
no previously-included directories found matching 'docs/_build/_sources'
Adding pip 1.3.1 to easy-install.pth file
Installing pip script to /usr/local/soft/.pyenv/versions/2.7.6/bin
Installing pip-2.7 script to /usr/local/soft/.pyenv/versions/2.7.6/bin

Installed /usr/local/soft/.pyenv/versions/2.7.6/lib/python2.7/site-packages/pip-1.3.1-py2.7.egg
Processing dependencies for pip==1.3.1
Finished processing dependencies for pip==1.3.1

[root@centos6 soft]# pip install scrapy 
Downloading/unpacking scrapy
  Running setup.py egg_info for package scrapy

#After the install succeeds, check that everything is OK

[root@centos6 soft]# scrapy -h
Scrapy 1.0.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands      
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy  -h" to see more info about a command

python - version management

CentOS 6.6 ships with Python 2.6.6 by default.


[root@centos6 soft]# cat /etc/centos-release 
CentOS release 6.6 (Final)
[root@centos6 soft]# python -V
Python 2.6.6

Some software needs Python 2.7+, and that calls for version management.
I recommend pyenv for this.

Some preparation is needed before installing pyenv:


[root@centos6 soft]# yum groupinstall -y development
[root@centos6 soft]# yum install -y zlib-devel openssl-devel sqlite-devel bzip2-devel

To install pyenv, first cd into a directory of your choice:


[root@centos6 soft]# cd /usr/local/soft
[root@centos6 soft]# git clone git://github.com/yyuu/pyenv.git .pyenv

Configure pyenv's environment variables:


[root@centos6 soft]# vi /etc/profile.d/pyenv.sh

#!/bin/bash
export PYENV_ROOT="/usr/local/soft/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval"$(pyenv init -)"

Apply it. Careful: as I first wrote it the space after eval was missing (eval"$(pyenv init -)"), and sourcing then fails like this:


[root@centos6 soft]# source /etc/profile.d/pyenv.sh
-bash: evalexport PATH="/usr/local/soft/.pyenv/shims:${PATH}"
export PYENV_SHELL=bash
source '/usr/local/soft/.pyenv/libexec/../completions/pyenv.bash'
pyenv rehash 2>/dev/null
pyenv() {
  local command
  command="$1"
  if [ "$#" -gt 0 ]; then
    shift
  fi

  case "$command" in
  rehash|shell)
    eval "`pyenv "sh-$command" "$@"`";;
  *)
    command pyenv "$command" "$@";;
  esac
}: No such file or directory

With the space in place, sourcing succeeds silently. Check pyenv:


[root@centos6 soft]# pyenv versions
* system (set by /usr/local/soft/.pyenv/version)

[root@centos6 soft]# pyenv -h
pyenv 20150601-4-g1140634
Usage: pyenv <command> [<args>]

Some useful pyenv commands are:
   commands    List all available pyenv commands
   local       Set or show the local application-specific Python version
   global      Set or show the global Python version
   shell       Set or show the shell-specific Python version
   install     Install a Python version using python-build
   uninstall   Uninstall a specific Python version
   rehash      Rehash pyenv shims (run this after installing executables)
   version     Show the current Python version and its origin
   versions    List all Python versions available to pyenv
   which       Display the full path to an executable
   whence      List all Python versions that contain the given executable

See `pyenv help <command>' for information on a specific command.
For full documentation, see: https://github.com/yyuu/pyenv#readme

Install Python 2.7:


[root@centos6 soft]# pyenv install 2.7.6
Downloading Python-2.7.6.tgz...
-> https://yyuu.github.io/pythons/99c6860b70977befa1590029fae092ddb18db1d69ae67e8b9385b66ed104ba58
Installing Python-2.7.6...
patching file ./Modules/readline.c
patching file ./Lib/site.py
WARNING: The Python readline extension was not compiled. Missing the GNU readline lib?
Installing pip from https://bootstrap.pypa.io/get-pip.py...
Installed Python-2.7.6 to /usr/local/soft/.pyenv/versions/2.7.6

[root@centos6 soft]# pyenv versions
* system (set by /usr/local/soft/.pyenv/version)
  2.7.6
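
To actually switch to it, use the commands from the help output above: pyenv global 2.7.6 followed by pyenv rehash makes it the default everywhere, after which python -V should report 2.7.6; pyenv local 2.7.6 does the same for a single directory.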

Docker notes - private registry

A private registry is well worth having: you can host your own images and manage them conveniently.

I ran into quite a few questions during setup:

how to start the registry

why an image must be tagged before pushing

whether a push sends all images or just one

why docker never asked me to log in when pushing

how to see which images are on the registry

registry configuration: the default storage location seems easy to lose

It didn't work at first; the notes below were filled in the next day.

Some reliable blog posts I found:

http://www.tuicool.com/articles/7V7vYn

http://www.vpsee.com/2013/11/build-your-own-docker-private-regsitry-service/

http://seanlook.com/2014/11/13/deploy-private-docker-registry-with-nginx-ssl/

After some effort it finally worked. The process:

Install


yum install docker-registry

Start it; on the first run, docker pulls the registry image



[root@centos6 ~]# docker run -d -p 5001:5000 registry

3272ca0f72c3f29664bf5677fd71a163c1540ea014a5a89f1d0bdf8e8c5dfbeb

[root@centos6 ~]# docker images

Switch to another machine

Set other_args="--insecure-registry 192.168.131.147:5001 --iptables=false", where 192.168.131.147:5001 is the machine above; --insecure-registry is needed because the docker daemon refuses plain-HTTP registries by default.


[root@centos6 ~]# vi /etc/sysconfig/docker

[root@centos6 ~]# cat /etc/sysconfig/docker
# /etc/sysconfig/docker
#
# Other arguments to pass to the docker daemon process
# These will be parsed by the sysv initscript and appended
# to the arguments list passed to docker -d

other_args="--insecure-registry 192.168.131.147:5001 --iptables=false"
DOCKER_CERT_PATH=/etc/docker

Push
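
(This also answers the tagging question above: a push target must be named host:port/name so docker knows which registry to talk to, and that name is exactly what docker tag creates, e.g. docker tag custom_base 192.168.131.147:5001/custom_base.)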


[root@centos6 ~]# docker push 192.168.131.147:5001/custom_base 
The push refers to a repository [192.168.131.147:5001/custom_base] (len: 1)
Sending image list
Pushing repository 192.168.131.147:5001/custom_base (1 tags)
f1b10cd84249: Image successfully pushed 
c852f6d61e65: Image successfully pushed 
7322fbe74aa5: Image successfully pushed 
725d907c8a7d: Image successfully pushed 
Pushing tag for rev [725d907c8a7d] on {http://192.168.131.147:5001/v1/repositories/custom_base/tags/test}
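
Two more of the earlier questions, answered in passing: the old v1 registry exposes a search endpoint, so curl http://192.168.131.147:5001/v1/search should list what has been pushed; and, assuming the stock docker-registry image, data lives under /tmp/registry inside the container by default, so mount a volume, e.g. docker run -d -p 5001:5000 -v /opt/registry:/tmp/registry registry, if pushed images are to survive container removal. The stock registry also ships with no authentication configured, which is why push never asked for a login.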