A PHP HTML parser similar to PHP Simple HTML DOM Parser, but several times faster

俊杰jerry

Last update: Dec 29, 2022

Overview

HtmlParser

A PHP HTML parsing tool similar to PHP Simple HTML DOM Parser. Because it is built on PHP's dom extension, it parses HTML several times faster than PHP Simple HTML DOM Parser.

Note: the HTML must be UTF-8 encoded; if it is not, convert it to UTF-8 first.
If you run into garbled characters, see: http://www.fwolf.com/blog/post/314
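
The UTF-8 requirement above can be met with a small conversion step before parsing. The following is only a sketch: `toUtf8()` is a hypothetical helper that is not part of html-parser, it assumes the mbstring extension is loaded, and it assumes you already know the page's source encoding (GB18030 is used here purely as an example).

```php
<?php
// Hypothetical helper (not part of html-parser): normalise markup to UTF-8
// before handing it to the parser. Requires the mbstring extension.
function toUtf8(string $html, string $sourceEncoding): string
{
    // Input that is already valid UTF-8 is returned unchanged.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}

// Simulate a page delivered in GB18030, then convert it back.
$legacy = mb_convert_encoding('<p>测试</p>', 'GB18030', 'UTF-8');
$utf8 = toUtf8($legacy, 'GB18030');
echo $utf8, "\n";
```

The converted string is what you would pass to the parser's constructor.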

Composer is now supported. Add the package to composer.json:

"require": {"bupt1987/html-parser": "dev-master"}

Load the Composer autoloader:
require 'vendor/autoload.php';

================================================================================

Example

<?php
require 'vendor/autoload.php';

$html = '<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>test</title>
  </head>
  <body>
    <p class="test_class test_class1">p1</p>
    <p class="test_class test_class2">p2</p>
    <p class="test_class test_class3">p3</p>
    <div id="test1">测试1</div>
  </body>
</html>';
$html_dom = new \HtmlParser\ParserDom($html);
$p_array = $html_dom->find('p.test_class');
$p1 = $html_dom->find('p.test_class1',0);
$div = $html_dom->find('div#test1',0);
foreach ($p_array as $p){
	echo $p->getPlainText() . "\n";
}
echo $div->getPlainText() . "\n";
echo $p1->getPlainText() . "\n";
echo $p1->getAttr('class') . "\n";
echo "show html:\n";
echo $div->innerHtml() . "\n";
echo $div->outerHtml() . "\n";
?>

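Basic usage

// In the snippets below, $html is the parser object (called $html_dom in the example above)

// Find all a elements
$ret = $html->find('a');

// Find the first a element
$ret = $html->find('a', 0);

// Find the last a element
$ret = $html->find('a', -1);

// Find all div elements that have an id attribute
$ret = $html->find('div[id]');

// Find all div elements whose id attribute is foo
$ret = $html->find('div[id=foo]');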
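
The index argument described above (0 for the first match, -1 for the last) can be illustrated with a plain array. This is a sketch only: `pickByIndex()` is a hypothetical helper, not the library's code, and returning null for an out-of-range index is an assumption made here for the illustration.

```php
<?php
// Hypothetical helper, not the library's code: resolve an index the way
// find($selector, $idx) is described above (0 = first, -1 = last).
function pickByIndex(array $items, int $idx)
{
    if ($idx < 0) {
        $idx += count($items); // -1 becomes the last position
    }
    return $items[$idx] ?? null; // null when the index is out of range
}

$links = ['first', 'second', 'third'];
echo pickByIndex($links, 0), "\n";  // first
echo pickByIndex($links, -1), "\n"; // third
```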

Advanced usage

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchor and image elements
$ret = $html->find('a, img');

// Find all anchors and images that have a "title" attribute
$ret = $html->find('a[title], img[title]');
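
Since the parser is built on PHP's dom extension, it can help to see how a class selector and a selector list could be written as XPath queries. The sketch below uses only the built-in DOMDocument and DOMXPath classes; the XPath expressions are one common translation and are an assumption here, not necessarily what html-parser generates internally.

```php
<?php
// Sketch with PHP's built-in DOM extension only; html-parser is not loaded.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div><a class="foo bar" href="#">link</a>' .
    '<img class="foobar" src="x.png"><p class="foo">text</p></div>'
);
$xpath = new DOMXPath($doc);

// ".foo": match the class token exactly, so class="foobar" does not match.
$byClass = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " foo ")]'
);

// "a, img": a selector list becomes an XPath union.
$anchorsAndImages = $xpath->query('//a | //img');

echo $byClass->length, "\n";          // 2
echo $anchorsAndImages->length, "\n"; // 2
```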

Descendant selectors

// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> with class=hello
$es = $html->find('table.hello td');

// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');

Nested selectors

// Find all <li> in <ul>
foreach ($html->find('ul') as $ul) {
    foreach ($ul->find('li') as $li) {
        // do something...
    }
}

// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);
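
The nested pattern above, where a second find() searches only inside an element returned by the first, corresponds to a context-node query in the dom extension. A minimal sketch with the built-in classes follows; it illustrates the idea and is not the library's implementation.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<ul><li>a</li><li>b</li></ul><ul><li>c</li></ul>');
$xpath = new DOMXPath($doc);

$texts = [];
foreach ($xpath->query('//ul') as $ul) {
    // The leading "." keeps the search inside the current <ul>.
    foreach ($xpath->query('.//li', $ul) as $li) {
        $texts[] = $li->textContent;
    }
}
echo implode(',', $texts), "\n"; // a,b,c

// First <li> of the first <ul>, comparable to find('ul', 0)->find('li', 0).
$firstUl = $xpath->query('//ul')->item(0);
$firstLi = $xpath->query('.//li', $firstUl)->item(0);
echo $firstLi->textContent, "\n"; // a
```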

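Attribute filters

The following attribute selector operators are supported:

Filter	Description
[attribute]	Matches elements that have the specified attribute.
[!attribute]	Matches elements that do not have the specified attribute.
[attribute=value]	Matches elements whose attribute equals the specified value.
[attribute!=value]	Matches elements whose attribute does not equal the specified value.
[attribute^=value]	Matches elements whose attribute value starts with the specified value.
[attribute$=value]	Matches elements whose attribute value ends with the specified value.
[attribute*=value]	Matches elements whose attribute value contains the specified value.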
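
Each filter in the table has a natural XPath 1.0 counterpart, which is what the dom extension evaluates. The mapping below is a sketch under assumptions: the expressions are written by hand for this illustration, and treating `!=` as "attribute missing or different" is an assumed reading that html-parser may or may not share.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<a href="https://example.com/a.png" title="t">1</a>' .
    '<a href="http://example.org/b.jpg">2</a>' .
    '<a>3</a>'
);
$xpath = new DOMXPath($doc);

// CSS-style filter => hand-written XPath 1.0 equivalent (illustrative only).
$filters = [
    '[href]'          => '//a[@href]',
    '[!href]'         => '//a[not(@href)]',
    '[title=t]'       => '//a[@title="t"]',
    '[title!=t]'      => '//a[not(@title="t")]',
    '[href^=https]'   => '//a[starts-with(@href, "https")]',
    // XPath 1.0 has no ends-with(), so compare the trailing substring.
    '[href$=.jpg]'    => '//a[substring(@href, string-length(@href) - 3) = ".jpg"]',
    '[href*=example]' => '//a[contains(@href, "example")]',
];

foreach ($filters as $css => $expr) {
    printf("%-16s %d\n", $css, $xpath->query($expr)->length);
}
```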

DOM extension usage

Access the underlying DOM node to do more through the dom extension. For details, see: http://php.net/manual/zh/book.dom.php

/**
 * @var \DOMNode
 */
$oHtml->node

$oHtml->node->childNodes
$oHtml->node->parentNode
$oHtml->node->firstChild
$oHtml->node->lastChild
and so on...
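
The properties listed above are standard DOMNode members. As a sketch of what they return, the example below obtains a DOMNode straight from DOMDocument as a stand-in for `$oHtml->node`; html-parser itself is not loaded here.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div id="box"><span>one</span><span>two</span></div>');

// Stand-in for $oHtml->node: any DOMNode behaves the same way.
$node = $doc->getElementsByTagName('div')->item(0);

echo $node->parentNode->nodeName, "\n";    // body (loadHTML adds html/body)
echo $node->firstChild->textContent, "\n"; // one
echo $node->lastChild->textContent, "\n";  // two

$names = [];
foreach ($node->childNodes as $child) {
    $names[] = $child->nodeName;
}
echo implode(',', $names), "\n";           // span,span
```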

Comments

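Performance question

Hi, I ran a test with xhprof and found that html-parser is indeed faster and uses less memory than simple_html_dom. However, I also tested PHP 5.5's DOMDocument directly and found it to be far more efficient still. Why is that?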

bupt1987/html-parser:

Overall Summary 
Total Incl. Wall Time (microsec):   3,038 microsecs
Total Incl. CPU (microsecs):    3,039 microsecs
Total Incl. MemUse (bytes): 223,928 bytes
Total Incl. PeakMemUse (bytes): 231,896 bytes
Number of Function Calls:   272

PHP5.5 DOMDocument:

Overall Summary 
Total Incl. Wall Time (microsec):   418 microsecs
Total Incl. CPU (microsecs):    419 microsecs
Total Incl. MemUse (bytes): 8,616 bytes
Total Incl. PeakMemUse (bytes): 15,464 bytes
Number of Function Calls:   16

opened by wozzup 8

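Garbled characters when using the parser

I want to use this parser to extract the title from an article's HTML page. This is how I use it:

$config = array('indent' => TRUE, 'output-xhtml' => TRUE, 'wrap' => '200');
$html = new HtmlParserModel();
$html->parseStr($str, $config, 'utf8');
$result = $html->find('p');
foreach ($result as $value) {
    // I want to print the content first, then match it with a regex
    echo $value->getPlainText();
}

Everything it prints is garbled, and I don't know what to do.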

opened by xiangjihan 6

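Parsed content is displayed incorrectly

// Summary content
$desc = $jianDom->find('div.jianji');
// Author name
$authorName = $jianDom->find('div.title',0);

echo $desc[0]->innerHtml(). "\n". $authorName->innerHtml()."\n";

The displayed content differs from what is on the web page.

Very strange. I guess it is an encoding problem. Looking forward to a reply with a fix.

Thanks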

opened by lovesuzhou 1

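Descendant selection fails

I don't know why. The page I am scraping has a list that I need to extract. Its structure is roughly as follows:

.......
xxxxx
xxxxx
xxxxx
.......
xxxxx
xxxxx
xxxxx

When I select with #list dd, no elements are matched; the list items are only captured when I select dd directly. Even body dd matches nothing.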

opened by VirensCn 1
fix: php73 "Compilation failed"
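fix: php73 "Compilation failed: invalid range in character class at offset 4". After the change, older PHP versions do not raise errors either.

Error check across PHP versions:

Before the change: https://3v4l.org/No7Pc

After the change: https://3v4l.org/MVjqi

Please verify further if needed.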
opened by XinRoom 0
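
For context on this error: PHP 7.3 switched to PCRE2, which rejects a hyphen placed between a class shorthand such as \w and another character inside a character class (for example `[\w-.]`). The sketch below uses a hypothetical pattern, since the pattern changed in this fix is not shown here; escaping the hyphen is accepted by both older and newer PHP versions.

```php
<?php
// Hypothetical pattern for illustration only.
// Under PCRE2 (PHP 7.3+), [\w-.] fails to compile with "invalid range in
// character class". Escaping the hyphen compiles on old and new versions.
$portable = '/^[\w\-.]+$/';

var_dump(preg_match($portable, 'test_class-1.x')); // int(1)
var_dump(preg_match($portable, 'has space'));      // int(0)
```

Moving the hyphen to the end of the class (`[\w.-]`) is an equivalent alternative.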
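Filled in the missing PHP Simple DOM Parser features

I am a long-time fan of PHP Simple DOM Parser, but its authors no longer maintain it and it has a lot of bugs. Your rewrite is great, so I wrote the remaining parts to cover all of PHP Simple DOM Parser's features, and they pass my tests so far. I have opened a pull request; if you find the code useful, feel free to merge it.

Detailed usage: http://shinbonlin.github.io/html-parser/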

opened by terrylinooo 0
Add setAttr() and save()

Add setAttr() to support setting attributes

Adds a setAttr() method that mirrors the way getAttr() reads values. For example:

$tag = $tag_blocks->find('a.post-tag', 0);
$tag->setAttr('href', 'your url');
// test
echo $tag->getAttr('href');
// result: your url

Add function save()

Adds save(), which stores the modified DOM so that it can be exported as a string.

For example:

$dom_html = new ParserDom($get_dom_html);
echo $dom_html->save();

opened by terrylinooo 0
Add Magic function __get()

Add a magic __get() method so that this library works like the original PHP Simple DOM Parser. You can now write $example->plaintext, $example->outertext, and $example->innertext.

opened by terrylinooo 0

Releases(v3.0.0)

v3.0.0(Apr 21, 2016)

v2.2.1(Jun 23, 2015)

Fixed a bug where outerHtml output Unicode-encoded text.

v2.2(Jun 10, 2015)

Added the innerHtml and outerHtml methods.

v2.1(May 21, 2015)

Owner

俊杰jerry

lazy lazy lazy
