Simple and fast HTML parser

Last update: Dec 30, 2022

Related tags

Scraping html-parser xml-parser

Overview

DiDOM

README in Russian

DiDOM - simple and fast HTML parser.

Installation
Quick start
Creating new document
Search for elements
Verify if element exists
Search in element
Supported selectors
Output
Working with elements
Working with cache
Miscellaneous
Comparison with other parsers

Installation

To install DiDOM run the command:

composer require imangazaliev/didom

Quick start

use DiDom\Document;

$document = new Document('http://www.news.com/', true);

$posts = $document->find('.post');

foreach($posts as $post) {
    echo $post->text(), "\n";
}

Creating new document

DiDom can load HTML in several ways:

With constructor

// the first parameter is a string with HTML
$document = new Document($html);

// file path
$document = new Document('page.html', true);

// or URL
$document = new Document('http://www.example.com/', true);

The second parameter indicates whether the first parameter is a file path or URL to load rather than a markup string. The default is false.

Signature:

__construct($string = null, $isFile = false, $encoding = 'UTF-8', $type = Document::TYPE_HTML)

$string - an HTML or XML string or a file path.

$isFile - indicates that the first parameter is a path to a file.

$encoding - the document encoding.

$type - the document type (HTML - Document::TYPE_HTML, XML - Document::TYPE_XML).
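
The constructor examples above only load HTML. As a minimal sketch of the fourth parameter, the following loads an XML string instead; the sample markup is hypothetical, and the `require` line assumes DiDOM was installed with Composer:

```php
<?php
require 'vendor/autoload.php'; // assumption: DiDOM installed via Composer

use DiDom\Document;

// Hypothetical sample data, not taken from the README
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<library><book>Foo</book><book>Bar</book></library>';

// $isFile is false because $xml holds markup, not a path;
// Document::TYPE_XML makes the parser treat the input as XML
$document = new Document($xml, false, 'UTF-8', Document::TYPE_XML);

$books = $document->find('book');

foreach ($books as $book) {
    echo $book->text(), "\n";
}
```

Passing every argument explicitly keeps the call readable, since $type is the last of four positional parameters.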

With separate methods

$document = new Document();

$document->loadHtml($html);

$document->loadHtmlFile('page.html');

$document->loadHtmlFile('http://www.example.com/');

There are two methods available for loading XML: loadXml and loadXmlFile.

These methods accept additional options:

$document->loadHtml($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$document->loadHtmlFile($url, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$document->loadXml($xml, LIBXML_PARSEHUGE);
$document->loadXmlFile($url, LIBXML_PARSEHUGE);

Search for elements

DiDOM accepts a CSS selector or an XPath expression for search. Pass the expression as the first parameter and specify its type in the second one (the default type is Query::TYPE_CSS):

With method `find()`:

use DiDom\Document;
use DiDom\Query;

...

// CSS selector
$posts = $document->find('.post');

// XPath
$posts = $document->find("//div[contains(@class, 'post')]", Query::TYPE_XPATH);

If elements matching the expression are found, the method returns an array of DiDom\Element instances; otherwise it returns an empty array. To get an array of DOMElement objects instead, pass false as the third parameter.
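
The third parameter is described but not demonstrated, so here is a small sketch of the difference between the two return types. The markup is a hypothetical example, and the `require` line assumes a Composer install:

```php
<?php
require 'vendor/autoload.php'; // assumption: DiDOM installed via Composer

use DiDom\Document;
use DiDom\Query;

// Hypothetical markup for illustration
$document = new Document('<div class="post">First</div><div class="post">Second</div>');

// Default behaviour: results are wrapped in DiDom\Element
$elements = $document->find('.post');

// Third argument false: results are native DOMElement nodes
$nodes = $document->find('.post', Query::TYPE_CSS, false);

echo get_class($elements[0]), "\n";
echo get_class($nodes[0]), "\n";

// Native nodes expose the standard DOM API, e.g. textContent
echo $nodes[0]->textContent, "\n";
```

Unwrapped nodes can be handy when the result is handed to code that already works with PHP's DOM extension.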

With magic method `__invoke()`:

$posts = $document('.post');

Warning: using this method is discouraged. It was deprecated and then removed in version 2.0 (see Releases).

With method `xpath()`:

$posts = $document->xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]");

You can do search inside an element:

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

Verify if element exists

To verify that an element exists, use the has() method:

if ($document->has('.post')) {
    // code
}

If you need to check that an element exists and then get it:

if ($document->has('.post')) {
    $elements = $document->find('.post');
    // code
}

but it would be faster like this:

if (count($elements = $document->find('.post')) > 0) {
    // code
}

because in the first case it makes two queries.

Search in element

Methods find(), first(), xpath(), has(), count() are available in Element too.

Example:

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

Method `findInDocument()`

If you change, replace, or remove an element that was found in another element, the document will not be changed. This happens because the find() method of the Element class (and, consequently, the first() and xpath() methods) creates a new document to search in.

To search for elements in the source document, you must use the methods findInDocument() and firstInDocument():

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head')->firstInDocument('title')->remove();

Warning: the findInDocument() and firstInDocument() methods work only for elements that belong to a document and for elements created via new Element(...). If an element does not belong to a document, a LogicException is thrown.

Supported selectors

DiDom supports search by:

tag
class, ID, name and value of an attribute
pseudo-classes:
- first-, last-, nth-child
- empty and not-empty
- contains
- has

// all links
$document->find('a');

// any element with id = "foo" and "bar" class
$document->find('#foo.bar');

// any element with attribute "name"
$document->find('[name]');
// the same as
$document->find('*[name]');

// input field with the name "foo"
$document->find('input[name=foo]');
$document->find('input[name=\'bar\']');
$document->find('input[name="baz"]');

// any element that has an attribute starting with "data-" and the value "foo"
$document->find('*[^data-=foo]');

// all links starting with https
$document->find('a[href^=https]');

// all images with the extension png
$document->find('img[src$=png]');

// all links containing the string "example.com"
$document->find('a[href*=example.com]');

// text of the links with "foo" class
$document->find('a.foo::text');

// address and title of all the links with "bar" class
$document->find('a.bar::attr(href|title)');
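
The list above names several pseudo-classes, but the examples do not use them. The following sketch shows a few of them on hypothetical markup; it assumes a Composer install and the usual CSS-like spelling of each pseudo-class:

```php
<?php
require 'vendor/autoload.php'; // assumption: DiDOM installed via Composer

use DiDom\Document;

// Hypothetical markup for illustration
$html = '<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>'
      . '<div class="card"><a href="/more">More</a></div>'
      . '<div class="card">No link here</div>';

$document = new Document($html);

// positional pseudo-classes
$first  = $document->find('li:first-child');
$second = $document->find('li:nth-child(2)');
$last   = $document->find('li:last-child');

// elements whose text contains the given string
$withBar = $document->find('li:contains(Bar)');

// elements that have a matching descendant
$cardsWithLink = $document->find('div:has(a)');

echo $first[0]->text(), ' ', $second[0]->text(), ' ', $last[0]->text(), "\n";
echo count($withBar), ' ', count($cardsWithLink), "\n";
```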

Output

Getting HTML

With method `html()`:

$posts = $document->find('.post');

echo $posts[0]->html();

Casting to string:

$html = (string) $posts[0];

Formatting HTML output

$html = $document->format()->html();

An element does not have format() method, so if you need to output formatted HTML of the element, then first you have to convert it to a document:

$html = $element->toDocument()->format()->html();

Inner HTML

$innerHtml = $element->innerHtml();

Document does not have the method innerHtml(), therefore, if you need to get inner HTML of a document, convert it into an element first:

$innerHtml = $document->toElement()->innerHtml();

Getting XML

echo $document->xml();

echo $document->first('book')->xml();

Getting content

$posts = $document->find('.post');

echo $posts[0]->text();

Creating a new element

Creating an instance of the class

use DiDom\Element;

$element = new Element('span', 'Hello');

// Outputs "<span>Hello</span>"
echo $element->html();

The first parameter is the name of the element, the second is its value (optional), and the third is an array of the element's attributes (optional).

An example of creating an element with attributes:

$attributes = ['name' => 'description', 'placeholder' => 'Enter description of item'];

$element = new Element('textarea', 'Text', $attributes);

An element can be created from an instance of the class DOMElement:

use DiDom\Element;
use DOMElement;

$domElement = new DOMElement('span', 'Hello');

$element = new Element($domElement);

Using the method `createElement`

$document = new Document($html);

$element = $document->createElement('span', 'Hello');

Getting the name of an element

// magic property; removed in version 2.0, which provides tagName() instead
$element->tag;

Getting parent element

$document = new Document($html);

$input = $document->find('input[name=email]')[0];

var_dump($input->parent());

Getting sibling elements

$document = new Document($html);

$item = $document->find('ul.menu > li')[1];

var_dump($item->previousSibling());

var_dump($item->nextSibling());

Getting the child elements

$html = '<div>Foo<span>Bar</span><!--Baz--></div>';

$document = new Document($html);

$div = $document->first('div');

// element node (DOMElement)
// string(3) "Bar"
var_dump($div->child(1)->text());

// text node (DOMText)
// string(3) "Foo"
var_dump($div->firstChild()->text());

// comment node (DOMComment)
// string(3) "Baz"
var_dump($div->lastChild()->text());

// array(3) { ... }
var_dump($div->children());

Getting document

$document = new Document($html);

$element = $document->find('input[name=email]')[0];

// renamed to ownerDocument() in version 2.0
$document2 = $element->getDocument();

// bool(true)
var_dump($document->is($document2));

Working with element attributes

Creating/updating an attribute

With method `setAttribute`:

$element->setAttribute('name', 'username');

With method `attr`:

$element->attr('name', 'username');

With magic method `__set`:

$element->name = 'username';

Getting value of an attribute

With method `getAttribute`:

$username = $element->getAttribute('value');

With method `attr`:

$username = $element->attr('value');

With magic method `__get`:

$username = $element->name;

Returns null if attribute is not found.
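
Because a missing attribute yields null, callers usually need a fallback. A minimal sketch, using a hypothetical input element and assuming a Composer install:

```php
<?php
require 'vendor/autoload.php'; // assumption: DiDOM installed via Composer

use DiDom\Document;

// Hypothetical markup: the input has "name" and "value" but no "placeholder"
$document = new Document('<input name="email" value="me@example.com">');

$input = $document->first('input');

$value       = $input->getAttribute('value');
$placeholder = $input->getAttribute('placeholder'); // attribute is absent

echo $value, "\n";

// the null-coalescing operator supplies a fallback for the missing attribute
echo $placeholder ?? '(no placeholder)', "\n";
```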

Verify if attribute exists

With method `hasAttribute`:

if ($element->hasAttribute('name')) {
    // code
}

With magic method `__isset`:

if (isset($element->name)) {
    // code
}

Removing attribute:

With method `removeAttribute`:

$element->removeAttribute('name');

With magic method `__unset`:

unset($element->name);
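
The release notes for 1.13 (see Releases) list Element::classes() and Element::style() for class and inline-style manipulation. The sketch below assumes version 1.13 or later and assumes the helper method names add(), remove(), contains(), setProperty() and getProperty(); check them against the installed version before relying on them:

```php
<?php
require 'vendor/autoload.php'; // assumption: DiDOM 1.13+ installed via Composer

use DiDom\Element;

$element = new Element('div', 'Hello', ['class' => 'note']);

// class manipulation (method names are assumptions, see the note above)
$element->classes()->add('highlighted');
$element->classes()->remove('note');
$hasClass = $element->classes()->contains('highlighted');

// inline style manipulation (method names are assumptions as well)
$element->style()->setProperty('color', 'red');
$color = $element->style()->getProperty('color');

var_dump($hasClass);
echo $color, "\n";
echo $element->html(), "\n";
```

These helpers avoid splitting and rejoining the class and style attribute strings by hand.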

Comparing elements

$element  = new Element('span', 'hello');
$element2 = new Element('span', 'hello');

// bool(true)
var_dump($element->is($element));

// bool(false)
var_dump($element->is($element2));

Appending child elements

$list = new Element('ul');

$item = new Element('li', 'Item 1');

$list->appendChild($item);

$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($items);

Adding a child element

$list = new Element('ul');

$item = new Element('li', 'Item 1');
$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($item);
$list->appendChild($items);
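
appendChild() always places the new node inside the target element. To place a node next to an element instead, release 1.16 (see Releases) lists Node::insertSiblingBefore() and Node::insertSiblingAfter(). A minimal sketch, assuming version 1.16 or later and a hypothetical three-paragraph document:

```php
<?php
require 'vendor/autoload.php'; // assumption: DiDOM 1.16+ installed via Composer

use DiDom\Document;
use DiDom\Element;

// Hypothetical markup for illustration
$document = new Document('<p>One</p><p>Two</p><p>Three</p>');

$banner = new Element('div', 'Banner');

// insert the div as a sibling right after the second paragraph,
// not as a child of it
$document->find('p')[1]->insertSiblingAfter($banner);

$html = $document->first('body')->innerHtml();

echo $html, "\n";
```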

Replacing element

$element = new Element('span', 'hello');

$document->find('.post')[0]->replace($element);

Warning: you can replace only those elements that were found directly in the document:

// nothing will happen
$document->first('head')->first('title')->replace($title);

// but this will do
$document->first('head title')->replace($title);

More about this in the section Search in element.

Removing element

$document->find('.post')[0]->remove();

Warning: you can remove only those elements that were found directly in the document:

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head title')->remove();

More about this in the section Search in element.

Working with cache

The cache is an array of XPath expressions that were converted from CSS selectors.

Getting from cache

use DiDom\Query;

...

$xpath    = Query::compile('h2');
$compiled = Query::getCompiled();

// array('h2' => '//h2')
var_dump($compiled);

Cache setting

Query::setCompiled(['h2' => '//h2']);

Miscellaneous

`preserveWhiteSpace`

By default, whitespace preserving is disabled.

You can enable the preserveWhiteSpace option before loading the document:

$document = new Document();

$document->preserveWhiteSpace();

$document->loadXml($xml);

`count`

The count() method counts children that match the selector:

// prints the number of links in the document
echo $document->count('a');

// prints the number of items in the list
echo $document->first('ul')->count('li');

`matches`

Returns true if the node matches the selector:

$element->matches('div#content');

// strict match
// returns true only if the element is a div whose id equals "content" and nothing else
// if the element has any other attributes, the method returns false
$element->matches('div#content', true);

`isElementNode`

Checks whether the node is an element node (DOMElement):

$element->isElementNode();

`isTextNode`

Checks whether an element is a text node (DOMText):

$element->isTextNode();

`isCommentNode`

Checks whether the element is a comment (DOMComment):

$element->isCommentNode();

Comparison with other parsers

Comments

How do I insert an element that is not a child?
There is a list of paragraphs:

<p>... <p> <p>...<p> ..

After the Nth paragraph I need to insert an element (div) with the given content and save the result:

$dom = new \DiDom\Document($text);
$child = new \DiDom\Element('div', '[inread=100]');
$selector = $dom->find('p:nth-child(4)')[0]->appendChild($child);
return $dom->html();

This creates the structure:

<p>... <div>[inread=100]</div></p>

But what I need is:

<p>...</p> <div>[inread=100]</div>

I found no hints in the README.

Is it possible to do what I need?
opened by KarelWintersky 19

Cannot Modify The Element From Document

Sample Code:

$document = new Document('<div id="content"><div class="a">b</div><div class="a">c</div></div>');
$contentDom = $document->find('#content')[0];
$contentDom->find('.a')[0]->remove();
echo $contentDom->innerHtml();

Expected Result:

<div class="a">c</div>

Result:

<div class="a">b</div><div class="a">c</div>

opened by shtse8 14

DiDom\Document::load expects parameter 4 to be integer, string given

Hello,

I developed a plugin for WordPress using DiDom as its DOM parser. It reads the content of the generated HTML page and modifies some parts of it.

It works great, but for some people, it doesn't, and it is always the same error message: DiDom\Document::load expects parameter 4 to be integer, string given.

It's very hard for me to propose a solution to those users, even for me, this error message is quite cryptic. So basically, would it be possible to enhance this error message, to understand exactly what is wrong? This error is internal to DiDom, so it would be good to know what is wrong with the input, basically that the HTML markup is wrong or something else. Right now I have no idea except trying another parser, but since DiDom works fast and nicely, I would rather spend a bit more time here understanding how to go around this issue.

Thanks a lot :)

opened by jordymeow 11

file_get_contents(https://.....com): failed to open stream !!

Hey, I'm trying to parse this URL 'eloquentbyexample.com' and it fails with this exception: file_get_contents(https://eloquentbyexample.com): failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found ' in /..../vendor/imangazaliev/didom/src/DiDom/Document.php:252. Any idea how to solve this? Thanks!

opened by atefBB 11

Multiple Elements

Hi,

I'm looking to pull down SVG images and other elements from the page and loop through them. How do I go about this?

    $names = ['a', 'b', 'c', 'd'];

    foreach ($names as $name) {
        $name = $name;
    }

    $document = new Document('http://www.website.com/link/' . $name, true);

    $posts = $document->find('.classname');
    // $title = $document->find('.title'); ????????
    // Does it need to go here e.g $icon = $document->find('img[src$=svg]');

    foreach($posts as $post) {
        echo $post->text(), "\n";
        // How do i loop through the images & titles?
    }

Any help is much appreciated.

Thanks Jake.

opened by JakeHenshall 11

lastChild(), nextSibling() and previousSibling() return nothing

Here is your example:

$html = '
<ul>
    <li>Foo</li>
    <li>Bar</li>
    <li>Baz</li>
</ul>
';
$document = new Document($html);
$list = $document->first('ul');
// string(3) "Baz"
echo '<br><b>$list->child(2)->text():</b><br>'.highlight_string(print_r($list->child(2)->text(), true), true).'<br>';
// string(3) "Foo"
echo '<br><b>$list->firstChild()->text():</b><br>'.highlight_string(print_r($list->firstChild()->text(), true), true).'<br>';
// string(3) "Baz" - returns nothing
echo '<br><b>$list->lastChild()->text():</b><br>'.highlight_string(print_r($list->lastChild()->text(), true), true).'<br>';

$document = new Document($html);
$item = $document->find('ul > li')[1];
echo '<br><b>$item->previousSibling():</b><br>'.highlight_string(print_r($item->previousSibling()->text(), true), true).'<br>';
echo '<br><b>$item->nextSibling():</b><br>'.highlight_string(print_r($item->nextSibling()->text(), true), true).'<br>';

$list->lastChild()->text() returns nothing; nextSibling() and previousSibling() return nothing either.

opened by Grafs 9

Catchable fatal error

$item = $document->find('ul.menu > li')[1];
// previous element
var_dump($item->previousSibling());

Catchable fatal error: Argument 1 passed to DiDom\Element::setNode() must be an instance of DOMElement, instance of DOMText given, called in \vendor\imangazaliev\didom\src\DiDom\Element.php on line 32 and defined in \vendor\imangazaliev\didom\src\DiDom\Element.php on line 452

opened by Grafs 9

Encoding issue

HI

I'm using your lib to parse some HTML pages. When the task is run as crontab job, everything is OK. But once I try to parse the same page via interactive action in browser, it parses the source page in wrong encoding. Any recommendations to fix it? Attaching an example of print_r of some piece of parsed page

opened by Andrewkha 9

Can't set attribute in loop

Code example:

$html = new Document($file_name, true);
foreach ( $html->find($selector)[0]->find('img') as $element ) {
    $element->src = self::embed($path);
}

Expected: <img src="'data:image/jpg;base64,base64_encode_output" /> Got old src attr value.

opened by DarkPreacher 8

Select by ng-attr

There is the following list:

<div class="panel-heading">
	<select class="form-control"
		ng-init="rId = 'H255RC51833'"
		ng-model="rId">
			<option value="H255RC51833">Полулюкс 2местный</option>
			<option value="H255RC51834">Люкс 2местный</option>
			<option value="H255RC51829">Стандартный 1местный</option>
			<option value="H255RC51830">Стандартный 2местный</option>
			<option value="H255RC51831">Стандартный Улучшенный 1 местный</option>
			<option value="H255RC51832">Стандартный Улучшенный 2 местный</option>
			<option value="H255RC51835">Коттедж 2 местный с 1 спальней</option>
			<option value="H255RC51836">Коттедж 3 местный с 2 спальнями</option>
			<option value="H255RC51837">Коттедж 5 местный с 3 спальнями</option>
	</select>
</div>

whose unique parameter is ng-model="rId".

I tried to get it in different ways, but nothing works.

$document()->find('select.form-control option'); // this returns all the selects on the page
$document()->find('select[ng-model="rId"] option'); // selects nothing
$document()->find('select::attr(ng-model=rId) option'); // selects all the selects

Could it fail because the attribute name ng-model contains a hyphen?

opened by loveorigami 6

I can't install it (tried many times)

I tried to install it on my XAMPP, but it always fails with an error:

Problem 1 - The requested package imangazaliev/didom No version set (parsed as 1.0.0) is satisfiable by imangazaliev/didom[No version set (parsed as 1.0.0)] but these conflict with your requirements or minimum-stability.

Installation failed, reverting ./composer.json to its original content.

Can you help me fix it?

opened by HeriAzhar 6

trim() expects parameter 1 to be string, boolean given (0)

[TypeError] trim() expects parameter 1 to be string, boolean given (0) /vendor/imangazaliev/didom/src/DiDom/Query.php:108
#0: trim(boolean) /vendor/imangazaliev/didom/src/DiDom/Query.php:108
#1: DiDom\Query::parseAndConvertSelector(string, string) /vendor/imangazaliev/didom/src/DiDom/Query.php:74
#2: DiDom\Query::cssToXpath(string) /vendor/imangazaliev/didom/src/DiDom/Query.php:53
#3: DiDom\Query::compile(string, string) /vendor/imangazaliev/didom/src/DiDom/Document.php:402
#4: DiDom\Document->find(string)

opened by vaajnur 1

find does not work in CML files with Cyrillic tags

For example, this selector does not work: $document->find("Номер1С"); A sample file for testing: https://dropmefiles.com/GpWXl Code:

<КоммерческаяИнформация xmlns="urn:1C.ru:commerceml_3" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ВерсияСхемы="3.1" ДатаФормирования="2022-08-31T18:42:51" Ид="1">
    <Контейнер>
        <Документ>
            <Ид>64749</Ид>
            <НомерВерсии>AAAAAAAkbfQ=</НомерВерсии>
            <ПометкаУдаления>false</ПометкаУдаления>
            <Номер>64749</Номер>
            <Номер1С>0000-064749</Номер1С>
            <Дата>2022-08-23</Дата>
            <Дата1С>2022-08-23</Дата1С>
            <Время>00:00:00</Время>
            <ХозОперация>Заказ товара</ХозОперация>
        </Документ>
    </Контейнер>
</КоммерческаяИнформация>

opened by lyrmin 0

Parsing a string inside an element

Hi everyone, maybe someone can help. There is a string like this:

Link1 and Link2

How can I parse the string with find to get an array of objects like this: [ 0 => 'Link1', 1 => 'and', 2 => 'Link2' ]

opened by staixe 0

How do I get only the child elements (without a recursive search)? (An analogue of children(selector) from jQuery?)

Imagine we have a table, and inside its cells there are more tables.

$table = new \DiDom\Element('table');

$table->setInnerHtml('
<tr>
    <td>Строка 1 Ячейка 1</td>
    <td>Строка 1 Ячейка 2</td>
    <td>
        <table>
            <tr>
                <td>Субтаблица: Строка 1 Ячейка 1</td>
                <td>Субтаблица: Строка 1 Ячейка 2</td>
            </tr>
            <tr>
                <td>Субтаблица: Строка 2 Ячейка 1</td>
                <td>Субтаблица: Строка 2 Ячейка 2</td>
            </tr>
        </table>
    </td>
</tr>
<tr>
    <td>Строка 2 Ячейка 1</td>
    <td>Строка 2 Ячейка 2</td>
</tr>
');

How do we get only the <td> elements of the $table element itself, but not of its child elements?

The find('tr > td') method gives us all the nested <td> elements:

$tds = $table->find('tr > td');

$result = [];

foreach ($tds as $td) {
    $result[] = $td->text();
}

$result:

Array
(
    [0] => Строка 1 Ячейка 1
    [1] => Строка 1 Ячейка 2
    [2] => 
                Субтаблица: Строка 1 Ячейка 1
                Субтаблица: Строка 1 Ячейка 2
            
                Субтаблица: Строка 2 Ячейка 1
                Субтаблица: Строка 2 Ячейка 2
    
    [3] => Субтаблица: Строка 1 Ячейка 1
    [4] => Субтаблица: Строка 1 Ячейка 2
    [5] => Субтаблица: Строка 2 Ячейка 1
    [6] => Субтаблица: Строка 2 Ячейка 2
    [7] => Строка 2 Ячейка 1
    [8] => Строка 2 Ячейка 2
)

But what we need is:

Array
(
    [0] => Строка 1 Ячейка 1
    [1] => Строка 1 Ячейка 2
    [2] => 
                Субтаблица: Строка 1 Ячейка 1
                Субтаблица: Строка 1 Ячейка 2
            
                Субтаблица: Строка 2 Ячейка 1
                Субтаблица: Строка 2 Ячейка 2

    [3] => Строка 2 Ячейка 1
    [4] => Строка 2 Ячейка 2
)

opened by rusproject 0

Releases(2.0)

2.0(May 8, 2022)
Breaking changes

Minimum PHP version bumped to 7.2

Remove the __invoke method from Document, Element and DocumentFragment, which was deprecated earlier

Remove magic property Element::$tag. Use tagName() method instead

Rename Element::getDocument() to ownerDocument()

What's new

Add Node::setInnerXml() method (i. e. for Element and DocumentFragment too)

1.18(Jul 27, 2021)
Fix a bug where calling Element::previousSibling() with a selector returned the previous sibling even when no element matched

1.17(Jul 26, 2021)
Add support of multiple pseudoclasses (#125)

1.16.4(Jul 26, 2021)
Handle nested pseudo-classes with expression correctly

1.16.3(Feb 9, 2021)
Fix parsing of a style property in "style" attribute when the value contains a colon

1.16.1(Nov 16, 2020)
Fix deprecation notice in PHP 8 for libxml_disable_entity_loader

1.16(May 12, 2020)
Add Node::insertSiblingBefore() and Node::insertSiblingAfter() methods for inserting sibling nodes

1.15(Mar 13, 2020)
Add support of document fragments

1.14.1(Jan 17, 2019)
Fix an exception when selecting comment element with XPath

Add support of DOMCdataSection nodes

Add methods createTextNode(), createComment(), createCdataSection() to the Document class

1.14(Dec 17, 2018)
Add Element::innerXml() method

1.13(Mar 13, 2018)
Add Element::outerHtml() method

Add Element::prependChild() method

Add Element::insertBefore() and Element::insertAfter() methods

Add Element::style() method for more convenient inline styles manipulation

Add Element::classes() method for more convenient class manipulation

1.12(Mar 13, 2018)
Many fixes and improvements

1.11.1(Aug 26, 2017)
Fix bug with unregistered PHP functions in XPath in Document::has() and Document::count() methods

1.11(Aug 13, 2017)
Add Element::isElementNode() method

Add ability to retrieve only specific attributes in Element::attributes() method

Add Element::removeAllAttributes() method

Add ability to specify selector and node type in Element::previousSibling() and Element::nextSibling() methods

Add Element::previousSiblings() and Element::nextSiblings() methods

Many minor fixes and improvements

1.10.6(Jul 21, 2017)
Fix bug with XML document loading

1.10.5(Jun 19, 2017)
Fix issue #85

1.10.4(Jun 11, 2017)
Use mb_convert_encoding in the Encoder if it is available

1.10.3(May 24, 2017)
Add Element::removeChild() and Element::removeChildren() methods

Fix bug in Element::matches() method

Element::matches() method now returns false if node is not DOMElement

Add Element::hasChildren() method

1.10.2(May 16, 2017)
Fix bug in setInnerHtml: can't rewrite existing content

Throw InvalidSelectorException instead of InvalidArgumentException when selector is empty

1.10.1(May 14, 2017)
Fix attributes ends-with XPath

Method Element::matches() now can check children nodes

1.10(May 14, 2017)
Fix HTML saving mechanism

Throw InvalidSelectorException instead of RuntimeException in Query class

1.9.1(Feb 2, 2017)
Add ability to search in owner document using current node as context

Bugs fixed

1.9.0(Jan 31, 2017)
Method appendChild() now returns appended node(s)

Add ability to search elements in context

1.8.8(Nov 26, 2016)
Bugs fixed

1.8.7(Nov 26, 2016)
Add getLineNo method to Element

1.8.6(Oct 30, 2016)
Fix issue #55

1.8.5(Oct 25, 2016)
Add support of DOMComment

1.8.4(Oct 22, 2016)
Add ability to create an element by selector

Add closest method

1.8.3(Oct 19, 2016)
Add method Element::isTextNode()

Many minor fixes

1.8.2(Oct 19, 2016)
Add ability to check that element matches selector

Add ability to count nodes by selector

Many minor fixes


