[Original] Recommending a powerful PHP scraping library: phpQuery



Anson, 16-08-03

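I recently came across a very capable PHP scraping library, phpQuery. As the name suggests, Tobiasz Cudnik wrote it as a PHP counterpart to jQuery's selector engine. The core code is about 5,700 lines.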

About phpQuery:

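It borrows the idea of jQuery selectors. With these selectors you can pull the content of any node out of HTML, XHTML, XML, or PHP pages without writing your own regular expressions, which makes it convenient to scrape data from other sites' pages.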

Project page: https://code.google.com/p/phpquery/

GitHub: https://github.com/TobiaszCudnik/phpquery

Core file:

phpQuery.zip

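Core syntax: pq('.info h1')->text(); selects the h1 elements nested inside elements with class info and returns their text content;

pq('.info')->html(); returns the inner HTML of the element with class info;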

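The selector syntax is essentially the same as jQuery's. The one difference is that method calls are chained with '->' instead of '.', because in PHP '.' is the string concatenation operator.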
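To make the two calls above concrete, here is a minimal sketch run against an inline HTML string. The markup and variable names are made up for illustration, and it assumes phpQuery.php sits next to the script and that your PHP version is one the library supports.

```php
<?php
// Assumption: phpQuery.php (the library's core file) is in the same directory.
require 'phpQuery.php';

// Hypothetical markup, invented for this example.
$html = '<div class="info"><h1>Hello <em>phpQuery</em></h1><p>Intro text</p></div>';

// Load the markup; the newly created document becomes the default for pq().
phpQuery::newDocumentHTML($html);

// Text content of the h1 inside .info (tags stripped).
$title = pq('.info h1')->text();

// Inner HTML of .info (tags kept).
$inner = pq('.info')->html();

echo $title, "\n";
echo $inner, "\n";
```

text() is the one to use when you only want the visible string; html() is useful when you need to keep links or formatting inside the selected node.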

phpQuery offers the following 10 ways to load a document:

phpQuery::newDocument($html, $contentType = null)
// Creates a new document from markup. If $contentType is not given, it is auto-detected from the markup; if detection fails, text/html in UTF-8 is used.
 
phpQuery::newDocumentFile($file, $contentType = null) 
// Creates a new document from a file; works the same way as newDocument

phpQuery::newDocumentHTML($html, $charset = 'utf-8')
phpQuery::newDocumentXHTML($html, $charset = 'utf-8')
phpQuery::newDocumentXML($html, $charset = 'utf-8')
phpQuery::newDocumentPHP($html, $contentType = null) 

phpQuery::newDocumentFileHTML($file, $charset = 'utf-8')
phpQuery::newDocumentFileXHTML($file, $charset = 'utf-8')
phpQuery::newDocumentFileXML($file, $charset = 'utf-8')
phpQuery::newDocumentFilePHP($file, $contentType)
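As a rough sketch of how the string-based loaders in this list are used, the example below loads one HTML string and one XML string and queries each through the object the loader returns. The sample markup and variable names are assumptions made for illustration. The File variants take a path or URL in place of the markup string.

```php
<?php
// Assumption: phpQuery.php (the library's core file) is in the same directory.
require 'phpQuery.php';

// Hypothetical sample markup.
$html = '<ul><li>one</li><li>two</li></ul>';
$xml  = '<root><item>a</item><item>b</item><item>c</item></root>';

// Each loader returns a phpQuery object bound to its own document.
$htmlDoc = phpQuery::newDocumentHTML($html, 'utf-8');
$xmlDoc  = phpQuery::newDocumentXML($xml, 'utf-8');

// Query each document explicitly through its own object,
// so it does not matter which one pq() currently defaults to.
$liCount   = $htmlDoc->find('li')->size();
$itemCount = $xmlDoc->find('item')->size();

echo "li: $liCount, item: $itemCount\n";
```

Keeping the returned object around is handy once a script works with more than one document at a time.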

In practice, the last four are the ones used most often:

newDocumentFileHTML example: scraping the 9 article titles on the tp0.top blog home page

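<?php

header('content-type:text/html;charset=utf-8');

include 'phpQuery.php'; // the phpQuery core file, in the same directory

$url = "http://www.tp0.top";

// Fetch the page and make it the default document for pq()
$dom = phpQuery::newDocumentFileHTML($url);

// Collect the text of the first 9 article-title links (indexes 0 to 8)
$arr = array();
for ($i = 0; $i < 9; $i++) {
	$arr[$i] = pq('.wz h3 a:eq('.$i.')')->text();
}

var_dump($arr);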

Output:

[Screenshot: var_dump output of the title array]
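The loop above hard-codes how many titles to fetch. As an alternative sketch, a phpQuery result can be iterated with foreach, which picks up however many links match. The markup here is a made-up stand-in that only imitates the '.wz h3 a' structure, so the example does not depend on the live site.

```php
<?php
// Assumption: phpQuery.php (the library's core file) is in the same directory.
require 'phpQuery.php';

// Hypothetical markup imitating the structure the '.wz h3 a' selector expects.
$html = '<div class="wz"><h3><a href="/a">First post</a></h3></div>'
      . '<div class="wz"><h3><a href="/b">Second post</a></h3></div>';

phpQuery::newDocumentHTML($html);

$titles = array();
foreach (pq('.wz h3 a') as $link) {
    // Each $link is a plain DOM element; wrap it with pq() to use phpQuery methods.
    $titles[] = array(
        'title' => pq($link)->text(),
        'href'  => pq($link)->attr('href'),
    );
}

print_r($titles);
```

This also avoids re-running the selector once per index, and it keeps working if the page later shows more or fewer articles.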

Next, the steps for reading an XML file with phpQuery.

newDocumentFileXML example: extracting data from info.xml

XML file:

<?xml version="1.0" encoding="utf-8"?> 
<root> 
  <contact> 
     <name>张三</name> 
     <age>22</age> 
  </contact> 
  <contact> 
     <name>王五</name> 
     <age>18</age> 
  </contact> 
</root>

PHP code:

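<?php

header('content-type:text/html;charset=utf-8');

include 'phpQuery.php'; // the phpQuery core file, in the same directory

$url = "info.xml";

// Parse the file as XML and make it the default document for pq()
$dom = phpQuery::newDocumentFileXML($url);

// Collect the text of both <name> elements (indexes 0 and 1)
$arr = array();
for ($i = 0; $i < 2; $i++) {
	$arr[] = pq('contact name:eq('.$i.')')->text();
}

var_dump($arr);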

Result:

[Screenshot: var_dump output of the name array]
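The example above reads only the names. As a hedged sketch of going one step further, the code below walks each contact element and pairs its name with its age. It loads the same XML from a string with newDocumentXML instead of from info.xml so that it is self-contained. The $contacts array layout is an assumption for illustration.

```php
<?php
// Assumption: phpQuery.php (the library's core file) is in the same directory.
require 'phpQuery.php';

// Same content as info.xml above, inlined as a string.
$xml = '<?xml version="1.0" encoding="utf-8"?>'
     . '<root>'
     . '<contact><name>张三</name><age>22</age></contact>'
     . '<contact><name>王五</name><age>18</age></contact>'
     . '</root>';

phpQuery::newDocumentXML($xml, 'utf-8');

$contacts = array();
foreach (pq('contact') as $contact) {
    // Search inside the current <contact> only, so name and age stay paired.
    $contacts[] = array(
        'name' => pq($contact)->find('name')->text(),
        'age'  => (int) pq($contact)->find('age')->text(),
    );
}

print_r($contacts);
```

Scoping find() to each contact is what keeps a name from being matched with another record's age if some record is missing a field.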