A PHP HTML parser, similar to PHP Simple HTML DOM Parser but several times faster

Overview

HtmlParser

A PHP HTML parsing tool, similar to PHP Simple HTML DOM Parser. Because it is built on PHP's dom extension, it parses HTML several times faster than PHP Simple HTML DOM Parser.

Note: the HTML must be UTF-8 encoded; if it is not, convert it to UTF-8 first.
If you run into garbled characters, see: http://www.fwolf.com/blog/post/314
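
The conversion step mentioned above could look roughly like the sketch below. It uses PHP's mbstring extension only; the helper name toUtf8 and its $sourceEncoding parameter are hypothetical and not part of html-parser, and the caller is assumed to know (or guess) the source encoding. The type declarations assume PHP 7 or later, and the file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of html-parser): return $html as UTF-8.
// $sourceEncoding is the encoding the caller believes the input uses.
function toUtf8(string $html, string $sourceEncoding): string
{
    // Leave the input untouched when it is already valid UTF-8.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}

// Simulate a GBK-encoded page by converting a UTF-8 literal to GBK bytes.
$gbkHtml = mb_convert_encoding('<div id="test1">测试1</div>', 'GBK', 'UTF-8');

// Convert back to UTF-8 before handing the markup to a parser.
$utf8Html = toUtf8($gbkHtml, 'GBK');
echo $utf8Html . "\n";
```

The converted string can then be passed to the parser the same way the Example section does, for instance new \HtmlParser\ParserDom($utf8Html).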

Composer is now supported:

"require": {"bupt1987/html-parser": "dev-master"}

Load the Composer autoloader:
require 'vendor/autoload.php';

================================================================================

Example
<?php
require 'vendor/autoload.php';

$html = '<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>test</title>
  </head>
  <body>
    <p class="test_class test_class1">p1</p>
    <p class="test_class test_class2">p2</p>
    <p class="test_class test_class3">p3</p>
    <div id="test1">测试1</div>
  </body>
</html>';
$html_dom = new \HtmlParser\ParserDom($html);
$p_array = $html_dom->find('p.test_class');
$p1 = $html_dom->find('p.test_class1',0);
$div = $html_dom->find('div#test1',0);
foreach ($p_array as $p){
	echo $p->getPlainText() . "\n";
}
echo $div->getPlainText() . "\n";
echo $p1->getPlainText() . "\n";
echo $p1->getAttr('class') . "\n";
echo "show html:\n";
echo $div->innerHtml() . "\n";
echo $div->outerHtml() . "\n";
?>

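Basic usage

// In the snippets below, $html is a \HtmlParser\ParserDom instance

// Find all <a> tags
$ret = $html->find('a');

// Find the first <a> element
$ret = $html->find('a', 0);

// Find the last <a> element
$ret = $html->find('a', -1);

// Find all <div> tags that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> tags whose id attribute equals foo
$ret = $html->find('div[id=foo]');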

Advanced usage

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images that have a "title" attribute
$ret = $html->find('a[title], img[title]');

Descendant selectors

// Find all <li> in <ul> 
$es = $html->find('ul li');

// Find Nested <div> tags
$es = $html->find('div div div'); 

// Find all <td> in <table> which class=hello 
$es = $html->find('table.hello td');

// Find all <td> tags with attribute align=center inside <table> tags
$es = $html->find('table td[align=center]'); 

Nested selectors

// Find all <li> in <ul>
foreach ($html->find('ul') as $ul) {
	foreach ($ul->find('li') as $li) {
		// do something...
	}
}

// Find first <li> in first <ul> 
$e = $html->find('ul', 0)->find('li', 0);

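Attribute filters

The following attribute selector operators are supported:

Filter	Description
[attribute]	Matches elements that have the specified attribute.
[!attribute]	Matches elements that do not have the specified attribute.
[attribute=value]	Matches elements whose attribute equals the given value.
[attribute!=value]	Matches elements whose attribute does not equal the given value.
[attribute^=value]	Matches elements whose attribute value starts with the given value.
[attribute$=value]	Matches elements whose attribute value ends with the given value.
[attribute*=value]	Matches elements whose attribute value contains the given value.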
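
To make the semantics of these filters concrete, here is a minimal sketch that expresses several of them as XPath 1.0 predicates using only PHP's built-in DOMDocument and DOMXPath. It is an illustration of what each filter means, not a description of how html-parser translates selectors internally; the sample markup and the $filters mapping are made up for this example.

```php
<?php
// Sample markup (hypothetical) with three anchors that differ in attributes.
$html = '<html><body>'
    . '<a href="https://example.com/a.html" title="first">one</a>'
    . '<a href="/local/b.pdf">two</a>'
    . '<a name="anchor">three</a>'
    . '</body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Filter notation from the table => one possible XPath 1.0 equivalent.
$filters = [
    '[href]'        => '//a[@href]',
    '[!href]'       => '//a[not(@href)]',
    '[title=first]' => '//a[@title="first"]',
    '[href^=https]' => '//a[starts-with(@href, "https")]',
    // XPath 1.0 has no ends-with(); compare the last 4 characters instead.
    '[href$=.pdf]'  => '//a[substring(@href, string-length(@href) - 3) = ".pdf"]',
    '[href*=local]' => '//a[contains(@href, "local")]',
];

$counts = [];
foreach ($filters as $filter => $expression) {
    $counts[$filter] = $xpath->query($expression)->length;
    echo $filter . ' matches ' . $counts[$filter] . " element(s)\n";
}
```

With this sample, only the [href] filter matches two anchors; every other filter in the mapping matches exactly one.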

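DOM extension usage

The underlying DOM node is exposed, so more functionality is available through the dom extension. For details, see: http://php.net/manual/zh/book.dom.php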

/**
 * @var \DOMNode
 */
$oHtml->node

$oHtml->node->childNodes
$oHtml->node->parentNode
$oHtml->node->firstChild
$oHtml->node->lastChild
and so on...
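
Since node is a \DOMNode, anything the dom extension offers can be used on it. The sketch below shows the same properties on a node obtained directly from DOMDocument, so it runs without html-parser installed; with the library, the assumption is that $oHtml->node could stand in for $ul. The markup is made up for the example.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul id="list"><li>a</li><li>b</li></ul></body></html>');

// Stand-in for $oHtml->node: a plain DOMNode for the <ul> element.
$ul = $doc->getElementsByTagName('ul')->item(0);

// Walk the direct children and collect the text of element nodes only.
$texts = [];
foreach ($ul->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $texts[] = $child->textContent;
    }
}
echo implode(',', $texts) . "\n";

// Navigate to the parent and to the first and last child.
$parentName = $ul->parentNode->nodeName;
$firstText = $ul->firstChild->textContent;
$lastText = $ul->lastChild->textContent;
echo $parentName . ' ' . $firstText . ' ' . $lastText . "\n";

// Attributes can be changed through the DOM API as well.
$ul->setAttribute('class', 'menu');
$serialized = $doc->saveHTML($ul);
echo $serialized . "\n";
```

The setAttribute call is plain DOM API usage; whether html-parser offers its own wrapper for modifying attributes is not stated in this README.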

Issues
  • Is version 2.1 a stable release?

    Hello, is version 2.1 a stable release? Thanks.

    opened by lagolas 11
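  • Performance question

    Hi, I ran a test with xhprof and found that html-parser is indeed faster and uses less memory than simple_html_dom. However, when I tested PHP 5.5's DOMDocument directly, it turned out to be far more efficient still. Why is that?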

    bupt1987/html-parser:

    Overall Summary 
    Total Incl. Wall Time (microsec):   3,038 microsecs
    Total Incl. CPU (microsecs):    3,039 microsecs
    Total Incl. MemUse (bytes): 223,928 bytes
    Total Incl. PeakMemUse (bytes): 231,896 bytes
    Number of Function Calls:   272
    

    PHP5.5 DOMDocument:

    Overall Summary 
    Total Incl. Wall Time (microsec):   418 microsecs
    Total Incl. CPU (microsecs):    419 microsecs
    Total Incl. MemUse (bytes): 8,616 bytes
    Total Incl. PeakMemUse (bytes): 15,464 bytes
    Number of Function Calls:   16
    
    opened by wozzup 8
  • Garbled characters when using the parser

    I want to use this parser to extract the title from an article's HTML page. My usage:

    $config = array('indent' => TRUE, 'output-xhtml' => TRUE, 'wrap' => '200');
    $html = new HtmlParserModel();
    $html->parseStr($str, $config, 'utf8');
    $result = $html->find('p');
    foreach ($result as $value) {
        // print the content to inspect it, then match it with a regex
        echo $value->getPlainText();
    }

    Everything that gets printed is garbled, and I don't know what to do.

    opened by xiangjihan 6
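  • Could you add a way to get the content of a tag pair?

    This works well for HTML parsing. Could you consider adding functions for extracting content and attribute values?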

    opened by qizhihere 2
  • A suggestion

    This is a neat, lightweight little library. Why not merge it into a single file?

    opened by wenpeng 2
  • What is the minimum required PHP version?

    I see that composer.json specifies "php": ">=5.4.0". My production environment runs PHP 5.3.10. I had no problems using your earlier HtmlParserModel. Can I use the current version directly?

    opened by iyaozhen 2
  • Why is the text returned by outerHtml() Unicode-encoded?

    Hello, why is the text returned by outerHtml() Unicode-encoded?

    首期预告:女神拼了!范冰冰 When fetched with innerHtml, the same text comes out as normal characters. Thanks.

    opened by lagolas 2
  • Can this library directly replace the simple html dom parser library?

    I would like to replace the simple html dom parser library with this one directly. Does this library implement all of the methods in simple html dom parser?

    Is that feasible?

    opened by sursir 2
  • Can't find or parse meta tags

    Can't find or parse meta tags

    opened by 5452 1
  • How do I modify a node's attributes?

    opened by dijing1987 1
  • Incompatible with my PHP version (7.3)

    Error message: preg_match_all(): Compilation failed: invalid range in character class at offset 4

    opened by gclinux 5
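  • Hello, garbled Chinese characters

    The code is really impressive, but I found that garbled characters show up easily. I am sure the string is UTF-8, yet the output is still garbled. It would be great if you could update the library when you have time.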

    opened by hahadaba 1
  • Why does tbody trigger a continue?

    opened by yjysanshu 1
Releases(v3.0.0)
Owner
俊杰jerry
lazy lazy lazy