A PHP HTML parser, similar to PHP Simple HTML DOM Parser but several times faster

Overview

HtmlParser


A PHP HTML parsing tool, similar to PHP Simple HTML DOM Parser. Because it is built on PHP's dom extension, it parses HTML several times faster than PHP Simple HTML DOM Parser.

Note: the HTML must be UTF-8 encoded. If it is not, convert it to UTF-8 first.
If you run into garbled characters, see: http://www.fwolf.com/blog/post/314
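
One way to meet the UTF-8 requirement is to convert the markup before handing it to the parser. The sketch below assumes the mbstring extension is installed and that the source encoding is known in advance (GBK is used purely as an example). `toUtf8` is a hypothetical helper, not part of html-parser.

```php
<?php
// Hypothetical helper: return $html as UTF-8, converting from $from only when needed.
function toUtf8(string $html, string $from): string
{
    // Bytes that already form valid UTF-8 are left untouched.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $from);
}

// The text "测试" written as raw GBK bytes (assumed input for this demo).
$gbk  = "<div id=\"test1\">\xB2\xE2\xCA\xD4</div>";
$utf8 = toUtf8($gbk, 'GBK');

echo $utf8, "\n";
```

The validity check is only a heuristic: a short non-UTF-8 string can occasionally look like valid UTF-8, so when the source encoding is certain it may be simpler to call `mb_convert_encoding` unconditionally.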

Composer is now supported

"require": {"bupt1987/html-parser": "dev-master"}

Load the Composer autoloader
require 'vendor/autoload.php';

================================================================================

Example
<?php
require 'vendor/autoload.php';

$html = '<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>test</title>
  </head>
  <body>
    <p class="test_class test_class1">p1</p>
    <p class="test_class test_class2">p2</p>
    <p class="test_class test_class3">p3</p>
    <div id="test1">测试1</div>
  </body>
</html>';
$html_dom = new \HtmlParser\ParserDom($html);
$p_array = $html_dom->find('p.test_class');
$p1 = $html_dom->find('p.test_class1',0);
$div = $html_dom->find('div#test1',0);
foreach ($p_array as $p){
	echo $p->getPlainText() . "\n";
}
echo $div->getPlainText() . "\n";
echo $p1->getPlainText() . "\n";
echo $p1->getAttr('class') . "\n";
echo "show html:\n";
echo $div->innerHtml() . "\n";
echo $div->outerHtml() . "\n";
?>

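Basic usage

// $html is a \HtmlParser\ParserDom instance
// Find all a tags
$ret = $html->find('a');

// Find the first a tag
$ret = $html->find('a', 0);

// Find the last a tag
$ret = $html->find('a', -1);

// Find all div tags that have an id attribute
$ret = $html->find('div[id]');

// Find all div tags whose id attribute is foo
$ret = $html->find('div[id=foo]');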

Advanced usage

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchor and image tags
$ret = $html->find('a, img');

// Find all anchors and images that have a title attribute
$ret = $html->find('a[title], img[title]');

Descendant selectors

// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> with class=hello
$es = $html->find('table.hello td');

// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');

Nested selectors

// Find all <li> in <ul> 
foreach($html->find('ul') as $ul) 
{
       foreach($ul->find('li') as $li) 
       {
             // do something...
       }
}

// Find first <li> in first <ul> 
$e = $html->find('ul', 0)->find('li', 0);

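Attribute filters

The following attribute selector operators are supported:

Filter	Description
[attribute]	Matches elements that have the specified attribute.
[!attribute]	Matches elements that do not have the specified attribute.
[attribute=value]	Matches elements whose attribute equals the given value.
[attribute!=value]	Matches elements whose attribute does not equal the given value.
[attribute^=value]	Matches elements whose attribute value starts with the given value.
[attribute$=value]	Matches elements whose attribute value ends with the given value.
[attribute*=value]	Matches elements whose attribute value contains the given value.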
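
Since the parser sits on top of PHP's dom extension, one plausible way to think about these filters is as XPath 1.0 predicates. The sketch below is an illustration only, not the library's actual implementation: `attrFilterToXPath` is a hypothetical helper, it assumes values contain no quote characters, and the reading of `!=` as "not equal or attribute absent" is an assumption.

```php
<?php
// Hypothetical helper: turn one attribute filter into an XPath 1.0 predicate.
// Assumes the value contains no quote characters.
function attrFilterToXPath(string $filter): string
{
    if (!preg_match('/^\[(!?)([-\w]+)(?:([!^$*]?=)(.*))?\]$/', $filter, $m)) {
        throw new InvalidArgumentException("Unsupported filter: $filter");
    }
    $negated = $m[1] === '!';
    $name    = $m[2];
    $op      = $m[3] ?? '';
    $value   = $m[4] ?? '';

    if ($op === '') {
        // [attribute] or [!attribute]
        return $negated ? "not(@$name)" : "@$name";
    }
    if ($negated) {
        throw new InvalidArgumentException("Unsupported filter: $filter");
    }
    switch ($op) {
        case '=':
            return "@$name='$value'";
        case '!=':
            return "not(@$name='$value')";
        case '^=':
            return "starts-with(@$name,'$value')";
        case '*=':
            return "contains(@$name,'$value')";
        default:
            // '$=': XPath 1.0 has no ends-with(), so compare the tail of the value.
            return "substring(@$name, string-length(@$name) - string-length('$value') + 1)='$value'";
    }
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body>'
    . '<p class="test_class test_class1">p1</p>'
    . '<p class="test_class test_class2">p2</p>'
    . '<div id="test1">d</div>'
    . '</body></html>'
);
$xpath = new DOMXPath($doc);

// Count the elements inside <body> that satisfy each filter.
foreach (['[class$=class2]', '[class*=test_class]', '[!class]'] as $filter) {
    $nodes = $xpath->query('//body//*[' . attrFilterToXPath($filter) . ']');
    echo $filter, ' => ', $nodes->length, "\n";
}
```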

DOM extension usage

Access the underlying DOM node to get more functionality from the dom extension. For details, see: http://php.net/manual/zh/book.dom.php

/**
 * @var \DOMNode
 */
$oHtml->node

$oHtml->node->childNodes
$oHtml->node->parentNode
$oHtml->node->firstChild
$oHtml->node->lastChild
and so on...
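
Because `$oHtml->node` is a plain `\DOMNode`, the standard DOM properties apply to it. The sketch below builds a node directly with `DOMDocument` rather than through html-parser, so that it runs without the library. `$node` is a stand-in for `$oHtml->node`, and the markup is made up for the demo.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul id="menu"><li>one</li><li>two</li></ul></body></html>');

// Stand-in for $oHtml->node: any \DOMNode exposes the same properties.
$node = $doc->getElementsByTagName('ul')->item(0);

echo $node->parentNode->nodeName, "\n";    // name of the enclosing element
echo $node->firstChild->textContent, "\n"; // text of the first <li>
echo $node->lastChild->textContent, "\n";  // text of the last <li>

// childNodes is a DOMNodeList and can be iterated directly.
foreach ($node->childNodes as $child) {
    echo $child->nodeName, ': ', $child->textContent, "\n";
}
```

In real markup, whitespace between tags produces text nodes, so `firstChild` is not always an element. Checking `$child->nodeType === XML_ELEMENT_NODE` is a common guard.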

Comments
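  • Performance question

    Hi, I ran a test with xhprof and found that html-parser does beat simple_html_dom on both speed and memory usage. However, when I tested PHP 5.5's DOMDocument directly, it turned out to be far faster still. Why is that?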

    bupt1987/html-parser:

    Overall Summary 
    Total Incl. Wall Time (microsec):   3,038 microsecs
    Total Incl. CPU (microsecs):    3,039 microsecs
    Total Incl. MemUse (bytes): 223,928 bytes
    Total Incl. PeakMemUse (bytes): 231,896 bytes
    Number of Function Calls:   272
    

    PHP5.5 DOMDocument:

    Overall Summary 
    Total Incl. Wall Time (microsec):   418 microsecs
    Total Incl. CPU (microsecs):    419 microsecs
    Total Incl. MemUse (bytes): 8,616 bytes
    Total Incl. PeakMemUse (bytes): 15,464 bytes
    Number of Function Calls:   16
    
    opened by wozzup 8
  • Garbled characters when using the parser

    I want to use this parser to extract the title from an article's HTML page. This is how I use it:

    $config = array(
        'indent' => TRUE,
        'output-xhtml' => TRUE,
        'wrap' => '200'
    );
    $html = new HtmlParserModel();
    $html->parseStr($str, $config, 'utf8');
    $result = $html->find('p');
    foreach ($result as $value) {
        // Print the content to inspect it, then match it with a regex
        echo $value->getPlainText();
    }

    Everything it prints is garbled, and I don't know what to do.

    opened by xiangjihan 6
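  • Parsed content is displayed incorrectly

    // Summary content
    $desc = $jianDom->find('div.jianji');
    // Author name
    $authorName = $jianDom->find('div.title',0);

    echo $desc[0]->innerHtml(). "\n". $authorName->innerHtml()."\n";

    The output differs from what the web page shows.

    Very strange... probably an encoding problem. Looking forward to a fix or a reply.

    Thanks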

    opened by lovesuzhou 1
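  • Descendant selection fails

    I don't know why. The page I'm scraping has a list that I need to extract. Its structure is roughly as follows: `

    .......
    xxxxx
    xxxxx
    xxxxx
    .......
    xxxxx
    xxxxx
    xxxxx
    ` When I select with #list dd, no elements are matched; only selecting dd directly retrieves the list. Even body dd matches nothing.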
    opened by VirensCn 1
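  • fix: php73 "Compilation failed"

    Fixes the PHP 7.3 error "Compilation failed: invalid range in character class at offset 4". After the change, older PHP versions do not report errors either.

    Error checks across PHP versions:

    • Before the change: https://3v4l.org/No7Pc
    • After the change: https://3v4l.org/MVjqi

    Feel free to verify this further.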

    opened by XinRoom 0
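
    An error of this kind typically appears when a hyphen sits between a shorthand class such as \w and another character inside brackets. PHP 7.3 switched to PCRE2, which rejects that as an invalid range. The patterns below are a hypothetical illustration of the safe forms, not the pattern changed in this pull request.

```php
<?php
// Hypothetical patterns. On PHP 7.3+ (PCRE2) '/^[\w-.]+$/' fails to compile,
// because "\w-." is read as a range whose start is a class.
// Escaping the hyphen, or moving it to the end of the class, avoids that
// and still compiles on older PHP versions.
$escaped = '/^[\w\-.]+$/';
$atEnd   = '/^[\w.-]+$/';

foreach ([$escaped, $atEnd] as $pattern) {
    var_dump(preg_match($pattern, 'test_class-1.b'));
}
```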
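  • Filled in the missing PHP Simple DOM Parser features

    I'm a devoted user of PHP Simple DOM Parser, but its authors no longer maintain it and it has a lot of bugs. Your rewrite is great, so I wrote the remaining parts to cover all of PHP Simple DOM Parser's features. It has tested OK so far. I've opened a pull request; if you think the code is good, feel free to merge it.

    Detailed usage: http://shinbonlin.github.io/html-parser/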

    opened by terrylinooo 0
  • Add setAttr() and save()

    Add setAttr() to support setting attributes

    Adds a setAttr() method that mirrors the way getAttr() reads values. Example:

    $tag = $tag_blocks->find('a.post-tag', 0);
    $tag->setAttr('href', 'your url');
    // test
    echo $tag->getAttr('href');
    // result: your url

    Add function save()

    Adds save(), which stores the modified DOM and exports it as a string.

    For example:

    $dom_html = new ParserDom($get_dom_html);
    echo $dom_html->save();

    opened by terrylinooo 0
  • Add magic method __get()

    Adds the magic method __get() so that this library works like the original it is modeled on, PHP Simple DOM Parser. You can now write $example->plaintext, $example->outertext, and $example->innertext.

    opened by terrylinooo 0
Releases(v3.0.0)
Owner
俊杰jerry
lazy lazy lazy