Goutte: Web Scraping Made Easy

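Scraping means extracting only the information you need from a website.
In PHP, the usual approach is to fetch the page as a string with file_get_contents and then pull out the parts you need with preg_match or similar functions.
That works when the page structure is simple, but it becomes tedious as soon as the markup gets a little more complex.
With Goutte you select elements the same way you would with CSS selectors, which makes the job much easier.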
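
For comparison, here is a minimal sketch of the regex-based approach described above. The HTML string is a hypothetical stand-in for what file_get_contents($url) would return, and preg_match_all is used here as one of the "similar functions" to capture every link at once. The pattern assumes double-quoted href attributes, so it is only an illustration, not a robust parser.

```php
<?php
// Hypothetical HTML standing in for the result of file_get_contents($url).
$html = '<ul><li><a href="/a.htm">Apple</a></li>'
      . '<li><a href="/b.htm">Banana</a></li></ul>';

// Capture the href value and the inner content of every <a> tag.
// Assumption: href is double-quoted; other markup styles would need a different pattern.
$pattern = '/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is';

$links = array();
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the href, $m[2] is whatever sits between <a> and </a>.
        $links[] = array("href" => $m[1], "text" => trim(strip_tags($m[2])));
    }
}

print_r($links);
```

Even for this small case the pattern has to anticipate attribute order and quoting, which is the kind of fragility a selector-based tool avoids.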

GitHub - FriendsOfPHP/Goutte: Goutte, a simple PHP Web Scraper

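Download goutte-xxxxx.phar from the repository above and save it in a folder of your choice.
The latest release for PHP 5.3 is goutte-v1.0.7.phar, and the latest for PHP 5.4 is goutte-v2.0.4.phar.
goutte-v1.0.7.phar works correctly in my environment, so that is the one I use.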

Basics
Hierarchy

Basics

The following sample collects the link (href) and the text of every a tag on the page specified by $url.
The results are stored in $links.

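<?php
// Load the Goutte phar (adjust the path to where you saved it).
require_once "goutte-v1.0.7.phar";
use Goutte\Client;

$url = "http://...../sample.htm";
$client = new Client();
// Fetch the page; $crawler wraps the parsed document.
$crawler = $client->request("GET", $url);

$links = array();
// Visit every <a> element and collect its href attribute and text.
$crawler->filter("a")->each(function ($node) use (&$links) {
    $h = $node->attr("href");
    $t = $node->text();
    $links[] = array("href" => $h, "text" => $t);
});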

The filter("a") call selects <a> elements.
Here are some more examples.
filter("#apple") selects <* id="apple">, that is, any element whose id is apple.
filter("div.apple") selects <div class="apple">.
filter("div.apple>p") selects the p elements that are direct children of <div class="apple">.
filter("th,td") selects both <th> and <td>.
The syntax is the same as CSS selectors.
$node->attr("href") returns the link target. Likewise, attr("class") returns the class.
$node->text() returns the text between <a> and </a>.

Hierarchy

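The following sample collects the cell text of every table on the page specified by $url, walking each table tag, then its tr tags, then their th and td tags.
This is useful when the tags have no id or class: you fetch everything and then use only the nth item.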

<?php
// Load the Goutte phar (adjust the path to where you saved it).
require_once "goutte-v1.0.7.phar";
use Goutte\Client;

$url = "http://...../sample.htm";
$client = new Client();
$crawler = $client->request("GET", $url);

$arraytable = array();
// Outer loop: one entry per <table>.
$crawler->filter("table")->each(function ($nodetable) use (&$arraytable) {
    $arraytr = array();
    // Middle loop: one entry per <tr> inside this table.
    $nodetable->filter("tr")->each(function ($nodetr) use (&$arraytr) {
        $arraytd = array();
        // Inner loop: the text of each <th> or <td> in this row.
        $nodetr->filter("th,td")->each(function ($nodetd) use (&$arraytd) {
            $t = $nodetd->text();
            $arraytd[] = $t;
        });
        $arraytr[] = $arraytd;
    });
    $arraytable[] = $arraytr;
});

$arraytable[0][1][0] gives the text of the first td in the second tr of the first table.
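
To make the indexing concrete, the sketch below uses a hand-written array in the shape the sample above would produce. The table contents (Name, Price, Apple, 100) are hypothetical and not taken from any real page. It also shows one possible way to guard against an index that does not exist.

```php
<?php
// Hypothetical result for a page containing a single table
// with a header row and one data row.
$arraytable = array(
    array(                       // table 0
        array("Name", "Price"),  // tr 0 (th cells)
        array("Apple", "100"),   // tr 1 (td cells)
    ),
);

// Indexes are [table][row][cell], all zero-based.
$first = $arraytable[0][1][0];
echo $first, "\n";

// Check that the table, row, and cell exist before reading them.
$missing = isset($arraytable[0][2][0]) ? $arraytable[0][2][0] : null;
var_dump($missing);
```

Since pages change over time, checking with isset before reading a fixed position avoids notices when a row or cell is no longer there.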