A Handy Data Scraping Tool: PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser is a simple HTML DOM parser implemented in PHP.

Description, Requirement & Features

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML very easily.
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.
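
To make the selector and invalid-HTML points a bit more concrete, here is a minimal sketch. It assumes the library file is saved as simple_html_dom.php next to the script (that path is an assumption), and the markup and variable names are made up for illustration. The list items are deliberately left unclosed to show that sloppy markup can still be queried with a jQuery-style descendant selector.

```php
<?php
// Assumed location of the library file; adjust to your setup.
require_once 'simple_html_dom.php';

// Hypothetical markup: the <li> elements are never closed.
$markup = '<ul id="menu">'
        . '<li class="item"><a href="/a">A</a>'
        . '<li class="item"><a href="/b">B</a>'
        . '</ul>';

$html = str_get_html($markup);

// Descendant selector, similar to jQuery: anchors inside li.item inside ul#menu.
$links = array();
foreach ($html->find('ul#menu li.item a') as $anchor) {
    $links[] = $anchor->href;
}

print_r($links);
```

The same find() call works on a DOM loaded from a URL or a file; only the loading step changes.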

Download & Documents

Quick Start

  • How to get HTML elements?
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images and print their src attributes
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links and print their href attributes
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';

  • How to modify HTML elements?
// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Set a class attribute on the second div (index 1)
$html->find('div', 1)->class = 'bar';

// Replace the inner content of the div whose id is "hello"
$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

  • Extract contents from HTML
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;

  • Scraping Slashdot!
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks and collect title, intro and details of each
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
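
One caveat with the scraping loop: find() with an index returns null when nothing matches, so reading ->plaintext straight off the result breaks as soon as a block lacks one of the expected children (and real sites change their markup often). Below is a rough sketch of a more defensive variant. The helper name firstText and the sample markup are hypothetical, chosen only for illustration, and the library path is again an assumption. An inline string is used instead of a live page so the result does not depend on any site's current HTML.

```php
<?php
// Assumed location of the library file; adjust to your setup.
require_once 'simple_html_dom.php';

// Hypothetical helper: text of the first match under $node, or null if none.
function firstText($node, $selector)
{
    $match = $node->find($selector, 0);
    return $match === null ? null : trim($match->plaintext);
}

// Made-up sample: the second article has no intro, neither has details.
$sample = '<div class="article">'
        . '<div class="title"> First </div><div class="intro">Intro text</div>'
        . '</div>'
        . '<div class="article"><div class="title">Second</div></div>';

$html = str_get_html($sample);

$articles = array();
foreach ($html->find('div.article') as $article) {
    $articles[] = array(
        'title'   => firstText($article, 'div.title'),
        'intro'   => firstText($article, 'div.intro'),
        'details' => firstText($article, 'div.details'),
    );
}

// Release the parsed DOM once the plain strings have been copied out.
$html->clear();

print_r($articles);
```

Missing fields come back as null instead of triggering an error, which leaves the decision to skip or keep incomplete items to the calling code.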

 

Author

Ludis

Posted on

2014-09-19

Updated on

2014-09-19
