A configurable and extensible PHP web spider

Overview


Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP-Spider requires v4 or v5. If you are stuck with v3, you can still use PHP-Spider v0.4.x. The reason is a BC break in EventDispatcher v5, which we needed to adopt to keep up with modern frameworks.

PHP-Spider Features

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as Domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication. See example.
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy
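
The politeness policy from the feature list is wired up as an event listener in example/example_complex.php. A minimal sketch, assuming the PolitenessPolicyListener class and SPIDER_CRAWL_PRE_REQUEST event names from that example (verify them against your installed version):

```php
// Sketch: enforce a minimum delay between requests (politeness policy).
// The class and event names below are assumptions taken from
// example/example_complex.php; verify them in your installed version.
use VDB\Spider\EventListener\PolitenessPolicyListener;
use VDB\Spider\Event\SpiderEvents;

// Wait at least 450 ms between two requests to the same domain
$politenessPolicyListener = new PolitenessPolicyListener(450);
$spider->getDownloader()->getDispatcher()->addListener(
    SpiderEvents::SPIDER_CRAWL_PRE_REQUEST,
    [$politenessPolicyListener, 'onCrawlPreRequest']
);
```

The constructor argument is the minimum delay in milliseconds between two requests to the same domain.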

This spider does not support JavaScript.

Installation

The easiest way to install PHP-Spider is with Composer: run composer require vdb/php-spider. Find it on Packagist.

Usage

This is a very simple example; the code can be found in example/example_simple.php. For a more complete, real-world example with logging, caching and filters, see example/example_complex.php.

First create the spider

$spider = new Spider('http://www.dmoz.org');

Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>

$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
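
The feature list also mentions discovery based on CSS selectors. A sketch of the same rule as a CSS selector, assuming a CssSelectorDiscoverer class with the same set() API (the class name is an assumption; check the Discoverer namespace of your installed version):

```php
// Sketch: the same discovery rule expressed as a CSS selector.
// CssSelectorDiscoverer is assumed from the feature list; verify it
// exists in your version before relying on it.
use VDB\Spider\Discoverer\CssSelectorDiscoverer;

$spider->getDiscovererSet()->set(new CssSelectorDiscoverer("div#catalogs a"));
```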

Set some sane options for this example. In this case, we only get the first 10 items from the start page.

$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;
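
You can also narrow what gets queued with the bundled prefetch URI filters. A sketch, assuming the AllowedSchemeFilter and AllowedHostsFilter classes from the bundled filter set (verify the names in the Filter\Prefetch namespace of your installed version):

```php
// Sketch: restrict discovery to http(s) URIs on the seed host.
// Filter class names are assumed from the bundled prefetch filter set.
use VDB\Spider\Filter\Prefetch\AllowedSchemeFilter;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(['http', 'https']));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(['http://www.dmoz.org']));
```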

Add a listener to collect stats from the Spider and the QueueManager. Other components dispatch events you can subscribe to as well.

$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

Execute the crawl

$spider->crawl();

When crawling is done, we can get some info about the crawl:

echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());

Finally, we can do some processing on the downloaded resources. In this example, we echo the title of each resource:

echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}
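
By default the resources iterated above live in memory. For larger crawls, the feature list also mentions a file persistence handler; a sketch, assuming a FileSerializedResourcePersistenceHandler class from the bundled handlers (the class name is an assumption; check the PersistenceHandler namespace of your installed version):

```php
// Sketch: persist downloaded resources to disk instead of memory,
// so large crawls don't exhaust RAM. The handler class name is assumed.
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;

$spider->getDownloader()->setPersistenceHandler(
    new FileSerializedResourcePersistenceHandler(__DIR__ . '/results')
);
```

Set the handler before calling $spider->crawl().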

Contributing

Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request. The Symfony documentation contains an excellent guide on how to do that properly: Submitting a Patch.

There are a few requirements for a Pull Request to be accepted:

  • Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides;
  • Prove that the code works with unit tests and that coverage remains 100%;

Note: An easy way to check whether your code conforms to the PHP-Spider coding standards is to run the script bin/static-analysis, which is part of this repo. It runs the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.

Note: To run PHPUnit with coverage, and to check that coverage == 100%, you can run bin/coverage-enforce.

Support

For things like reporting bugs and requesting features it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)

License

PHP-Spider is licensed under the MIT license.

Comments
  • Upgrading to support Symfony 6.

    Hello,

    do you have plans to upgrade the package for supporting "symfony/finder": "^6.0"?

    It would be awesome if so. Because I have to upgrade one package to support latest Laravel version but can't do this because of "symfony/finder": "^3.0.0||^4.0.0||^5.0.0" in your composer.json

    Thank you in advance

    opened by DmitrySidorenkoShim 4
  • Migrating from PSR-0 to PSR-4 autoloading

    Hey @mvdbos

    here is the mentioned update for PHP Spider. I've run the tests and they all passed. Would be great if you could review it and make sure it's up to your standards.

    Cheers, Peter

    opened by spekulatius 4
  • suitable as link checker?

    Is this suitable to use as a base for developing a link checker?

    I've given it a quick go, but can't find an easy way to get the response code for each link found.

    I'd be wanting to create a report of 404's, 500's etc that need attention. 200's for the sitemap. 301's and 302's that maybe need fixing...

    opened by bobemoe 4
  • Wrong Event Arguments

    @mvdbos i think your "Simplify build and clean up use statements" commit introduced some evil changes. :)

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/Downloader/Downloader.php#L8

    Symfony\Contracts\EventDispatcher\EventDispatcherInterface does not exist in Symfony 3. Why not use the default Symfony\Component\EventDispatcher\EventDispatcherInterface?

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/Downloader/Downloader.php#L94

    => Wrong argument order (event name has to be the first argument)

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/QueueManager/InMemoryQueueManager.php#L85

    => Wrong argument order (event name has to be the first argument)

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/Spider.php#L230

    => Wrong argument order (event name has to be the first argument)

    opened by solverat 4
  • Cannot retrieve href attribute of source page

    I am trying to retrieve all href attributes of the top categories on the start page, like 'Arts', 'Business', 'Computers' and so on, but I cannot get it to work.

    <aside class="arts" xpath="1">
            <div id="home-cat-arts" class="category arts mobile" onclick="window.location.href='/Arts/'">
                <h2 class="top-cat"><a href="/Arts/">Arts</a></h2>
                <h3 class="sub-cat"><a href="/Arts/Movies/">Movies</a>, 
                                    <a href="/Arts/Television/">Television</a>, 
                                    <a href="/Arts/Music/">Music</a>...</h3>
            </div>
    </aside>
    

    I tried the following:

    // Create Spider
    $spider = new Spider('http://dmoztools.net');
    
    // Add a URI discoverer. Without it, the spider does nothing.
     $spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//section[@id='category-section']//h2[@class='top-cat']"));
    
    // Set some sane options for this example. In this case, we only get the first 10 items from the start page.
    $spider->getDiscovererSet()->maxDepth = 1;
    $spider->getQueueManager()->maxQueueSize = 10;
    
    // Let's add something to enable us to stop the script
    $spider->getDispatcher()->addListener(
        SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
        function (Event $event) {
            echo "\nCrawl aborted by user.\n";
            exit();
        }
    );
    
    // Add a listener to collect stats from the Spider and the QueueManager.
    // There are more components that dispatch events you can use.
    $statsHandler = new StatsHandler();
    $spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
    $spider->getDispatcher()->addSubscriber($statsHandler);
    
    // Execute crawl
    $spider->crawl();
    
    // Build a report
    echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
    echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
    echo "\n  FAILED:    " . count($statsHandler->getFailed());
    echo "\n  PERSISTED:    " . count($statsHandler->getPersisted());
    
    // Finally we could do some processing on the downloaded resources
    echo "\n\nDOWNLOADED RESOURCES: ";
    foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
        echo "\n - " . $resource->getCrawler()->filterXpath('//a/@href')->text();
    }
    
    

    A little help could be very nice!

    opened by taifunorkan 3
  • "EventDispatcher::dispatch() must be an object, string given"

    Hello @mvdbos,

    I'm about to use php-spider as part of a Laravel 7 project and got this error when starting a crawl:

    Argument 1 passed to Symfony\Component\EventDispatcher\EventDispatcher::dispatch() must be an object, string given, called in /var/www/dev.project.com/vendor/vdb/php-spider/src/VDB/Spider/QueueManager/InMemoryQueueManager.php on line 88

    I started researching it more and I guessed that it comes from a breaking change introduced with the upgrade of symfony/event-dispatcher from v4.4.5 to v5.0.5.

    I've checked a bit more and found out that my version of php-spider is actually reverted from v0.4.2 down to v0.2. It's because v0.2 didn't require event-dispatcher and was therefore matching my set of requirements.

    I've looked closer at the error and found that switching the parameters on line 87 & 88 in the InMemoryQueueManager-class fixed it. I've prepared a PR and started writing this issue as I've found the old issue about this: https://github.com/mvdbos/php-spider/issues/61 haha. I could have solved it quicker :)

    It would be great if you could let me know if the PR works or you or needs further tweaks.

    Just in case - Interesting versions / packages:

    PHP 7.2.24
    
    laravel/framework                     v7.2.2     
    guzzle/guzzle                         v3.8.1       
    symfony/event-dispatcher              v5.0.5            
    symfony/event-dispatcher-contracts    v2.0.1     
    vdb/php-spider                        v0.2              
    vdb/uri                               v0.2              
    

    Cheers, Peter

    opened by spekulatius 3
  • Some feature questions

    Hi, I have some questions regarding the features of this crawler, which are not covered by the documentation.

    1. Does php-spider support JavaScript (content and URLs generated via JavaScript)?
    2. Does php-spider follow robots.txt files?
    3. Is php-spider able to leverage a sitemap?
    4. Is it possible to crawl sites that require authentication?
    opened by koolma 3
  • is php-spider an abandoned repo?

    Hi @mvdbos,

    because it's a little bit complicated to get in contact with you, I'll try it with an issue. :)

    php-spider is still the greatest web spider out there, but the latest release uses Guzzle 3, which reached EOL months ago. Is there any chance of a new release any time soon?

    it would be great to hear from you and hopefully you get a chance to reply.

    thanks

    opened by solverat 3
  • install problems

    I'm not a PhD in php here but composer is having issues installing this on my machine. Very interested in seeing what you have put together with your spider...

    I don't have curl installed as I'm just testing this on my local Windows 7 machine running xampp, but when I run the command in composer

    php composer.phar require vdb/php-spider
    Please provide a version constraint for the way/generators requirement: 0.1.*

    I'm told the

    vdb/uri dev-master
    can't be found.

    I've tried installing this from a clone, but no luck...

    any thoughts?

    Update Just re-shifted some things and now I'm getting this error in the example_simple.php

    Fatal error: Call to undefined function VDB\Spider\pcntl_signal() in C:\xampp\htdocs\test\src\VDB\Spider\Spider.php on line 98

    0 - Backlog 
    opened by aggied 3
  • Allowing guzzle 7

    Hello @mvdbos

    in my Laravel sitemap generator I use your PHP spider. While making it work for Laravel 8 I stumbled upon a conflict in the required versions of guzzle.

    Here is an update of the composer.json to allow Guzzle v7. I've run the unit tests and examples; both worked and didn't indicate any issues with the update. Do you think it's good for merging?

    Cheers, Peter

    opened by spekulatius 2
  • Is it possible to count/track the number of links pointing to a page?

    Hello @mvdbos

    really neat library. I'm using it to crawl my own site and build a comprehensive sitemap from it. I was wondering, is it possible to count/track the number of links pointing to a page?

    Cheers, Peter

    opened by spekulatius 2
  • Pass Query Parameters

    how do I pass query params with all crawled links?

    Sample for this https://example.com?locale=en https://example.com/blog/details/1?locale=en

    I want this to handle crawling multi-language sites

    Can you help me please?

    opened by MahmoudSaidHaggag 1
  • Follow only internal redirects

    Hello @mvdbos

    I haven't found time to look into the robots.txt filter discussed in the other issue. Sorry! I stumbled on a new question you might be able to shine some light on:

    I'm trying to filter out URLs that have been redirected externally. I'm keen to implement a PostFetchFilter to keep it all within the spider. I was wondering if it is possible to get the final URL (after redirects) in a PostFetchFilter? It seems like only the original URL is part of the Resource.

    Appreciate any ideas on how you would approach this.

    Cheers, Peter

    opened by spekulatius 2
  • Robots.txt filtering?

    Hello @mvdbos,

    I hope you are doing well!

    I was wondering what your approach (if any) is to using the spider with robots.txt pattern for filtering?

    The UriFilter seems to support only allow rules and no disallow rules, and hence wouldn't cover the most common use case for robots files.

    I thought I'd ask whether you have ideas on this before I build something myself.

    Thank you in advance,

    Peter

    opened by spekulatius 6