A configurable and extensible PHP web spider

Overview


Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP-Spider requires v4 or v5. If you are stuck with v3, you can still use PHP-Spider v0.4.x. The reason is a BC break in EventDispatcher v5, which we needed to adopt to keep up with modern frameworks.

PHP-Spider Features

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as Domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication. See example.
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy
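
The politeness policy from the feature list is wired up as an event listener in example/example_complex.php. A minimal sketch, assuming the PolitenessPolicyListener class and SPIDER_CRAWL_PRE_REQUEST event names from that example (verify them against your installed version):

```php
// Sketch: enforce a minimum delay between requests (politeness policy).
// The class and event names below are assumptions taken from
// example/example_complex.php; verify them in your installed version.
use VDB\Spider\EventListener\PolitenessPolicyListener;
use VDB\Spider\Event\SpiderEvents;

// Wait at least 450 ms between two requests to the same domain
$politenessPolicyListener = new PolitenessPolicyListener(450);
$spider->getDownloader()->getDispatcher()->addListener(
    SpiderEvents::SPIDER_CRAWL_PRE_REQUEST,
    [$politenessPolicyListener, 'onCrawlPreRequest']
);
```

The constructor argument is the minimum delay in milliseconds between two requests to the same domain.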

This spider does not support JavaScript.

Installation

The easiest way to install PHP-Spider is with Composer: run composer require vdb/php-spider. Find it on Packagist.

Usage

This is a very simple example; the code can be found in example/example_simple.php. For a more complete, real-world example with logging, caching and filters, see example/example_complex.php.

First create the spider

$spider = new Spider('http://www.dmoz.org');

Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>

$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
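
The feature list also mentions discovery based on CSS selectors. A sketch of the same rule as a CSS selector, assuming a CssSelectorDiscoverer class with the same set() API (the class name is an assumption; check the Discoverer namespace of your installed version):

```php
// Sketch: the same discovery rule expressed as a CSS selector.
// CssSelectorDiscoverer is assumed from the feature list; verify it
// exists in your version before relying on it.
use VDB\Spider\Discoverer\CssSelectorDiscoverer;

$spider->getDiscovererSet()->set(new CssSelectorDiscoverer("div#catalogs a"));
```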

Set some sane options for this example. In this case, we only get the first 10 items from the start page.

$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;
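
You can also narrow what gets queued with the bundled prefetch URI filters. A sketch, assuming the AllowedSchemeFilter and AllowedHostsFilter classes from the bundled filter set (verify the names in the Filter\Prefetch namespace of your installed version):

```php
// Sketch: restrict discovery to http(s) URIs on the seed host.
// Filter class names are assumed from the bundled prefetch filter set.
use VDB\Spider\Filter\Prefetch\AllowedSchemeFilter;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(['http', 'https']));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(['http://www.dmoz.org']));
```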

Add a listener to collect stats from the Spider and the QueueManager. Other components dispatch events you can subscribe to as well.

$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

Execute the crawl

$spider->crawl();

When crawling is done, we can get some info about the crawl:

echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());

Finally, we can do some processing on the downloaded resources. In this example, we echo the title of each resource:

echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}
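
By default the resources iterated above live in memory. For larger crawls, the feature list also mentions a file persistence handler; a sketch, assuming a FileSerializedResourcePersistenceHandler class from the bundled handlers (the class name is an assumption; check the PersistenceHandler namespace of your installed version):

```php
// Sketch: persist downloaded resources to disk instead of memory,
// so large crawls don't exhaust RAM. The handler class name is assumed.
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;

$spider->getDownloader()->setPersistenceHandler(
    new FileSerializedResourcePersistenceHandler(__DIR__ . '/results')
);
```

Set the handler before calling $spider->crawl().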

Contributing

Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request. The Symfony documentation contains an excellent guide on how to do that properly: Submitting a Patch.

There are a few requirements for a Pull Request to be accepted:

  • Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides;
  • Prove that the code works with unit tests and that coverage remains 100%;

Note: An easy way to check whether your code conforms to the PHP-Spider coding standards is to run the script bin/static-analysis, which is part of this repo. It runs the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.

Note: To run PHPUnit with coverage, and to check that coverage == 100%, you can run bin/coverage-enforce.

Support

For things like reporting bugs and requesting features it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)

License

PHP-Spider is licensed under the MIT license.

Comments
  • Upgrading to support Symfony 6.

    Hello,

    do you have plans to upgrade the package for supporting "symfony/finder": "^6.0"?

    It would be awesome if so. Because I have to upgrade one package to support latest Laravel version but can't do this because of "symfony/finder": "^3.0.0||^4.0.0||^5.0.0" in your composer.json

    Thank you in advance

    opened by DmitrySidorenkoShim 4
  • Migrating from PSR-0 to PSR-4 autoloading

    Hey @mvdbos

    here is the mentioned update for PHP Spider. I've run the tests and they all passed. Would be great if you could review it and make sure it's up to your standards.

    Cheers, Peter

    opened by spekulatius 4
  • suitable as link checker?

    Is this suitable to use as a base for developing a link checker?

    I've given it a quick go, but can't find an easy way to get the response code for each link found.

    I'd be wanting to create a report of 404's, 500's etc that need attention. 200's for the sitemap. 301's and 302's that maybe need fixing...

    opened by bobemoe 4
  • Wrong Event Arguments

    @mvdbos i think your "Simplify build and clean up use statements" commit introduced some evil changes. :)

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/Downloader/Downloader.php#L8

    Symfony\Contracts\EventDispatcher\EventDispatcherInterface does not exist in Symfony 3. Why not use the default Symfony\Component\EventDispatcher\EventDispatcherInterface?

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/Downloader/Downloader.php#L94

    => Wrong argument order (event name has to be the first argument)

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/QueueManager/InMemoryQueueManager.php#L85

    => Wrong argument order (event name has to be the first argument)

    https://github.com/mvdbos/php-spider/blob/b8447e6fa4f8051b35178f3a1905de764dd9625d/src/VDB/Spider/Spider.php#L230

    => Wrong argument order (event name has to be the first argument)

    opened by solverat 4
  • Cannot retrieve href attribute of source page

    I am trying to retrieve all href attributes of the top categories on the start page, like 'Arts', 'Business', 'Computers' and so on, but I cannot get it to work.

    <aside class="arts" xpath="1">
            <div id="home-cat-arts" class="category arts mobile" onclick="window.location.href='/Arts/'">
                <h2 class="top-cat"><a href="/Arts/">Arts</a></h2>
                <h3 class="sub-cat"><a href="/Arts/Movies/">Movies</a>, 
                                    <a href="/Arts/Television/">Television</a>, 
                                    <a href="/Arts/Music/">Music</a>...</h3>
            </div>
    </aside>
    

    I tried the following:

    // Create Spider
    $spider = new Spider('http://dmoztools.net');
    
    // Add a URI discoverer. Without it, the spider does nothing.
     $spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//section[@id='category-section']//h2[@class='top-cat']"));
    
    // Set some sane options for this example. In this case, we only get the first 10 items from the start page.
    $spider->getDiscovererSet()->maxDepth = 1;
    $spider->getQueueManager()->maxQueueSize = 10;
    
    // Let's add something to enable us to stop the script
    $spider->getDispatcher()->addListener(
        SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
        function (Event $event) {
            echo "\nCrawl aborted by user.\n";
            exit();
        }
    );
    
    // Add a listener to collect stats from the Spider and the QueueManager.
    // There are more components that dispatch events you can use.
    $statsHandler = new StatsHandler();
    $spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
    $spider->getDispatcher()->addSubscriber($statsHandler);
    
    // Execute crawl
    $spider->crawl();
    
    // Build a report
    echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
    echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
    echo "\n  FAILED:    " . count($statsHandler->getFailed());
    echo "\n  PERSISTED:    " . count($statsHandler->getPersisted());
    
    // Finally we could do some processing on the downloaded resources
    echo "\n\nDOWNLOADED RESOURCES: ";
    foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
        echo "\n - " . $resource->getCrawler()->filterXpath('//a/@href')->text();
    }
    
    

    A little help could be very nice!

    opened by taifunorkan 3
  • "EventDispatcher::dispatch() must be an object, string given"

    Hello @mvdbos,

    I'm about to use php-spider as part of a Laravel 7 project and got this error when starting a crawl:

    Argument 1 passed to Symfony\Component\EventDispatcher\EventDispatcher::dispatch() must be an object, string given, called in /var/www/dev.project.com/vendor/vdb/php-spider/src/VDB/Spider/QueueManager/InMemoryQueueManager.php on line 88

    I started researching it more and I guessed that it comes from a breaking change introduced with the upgrade of symfony/event-dispatcher from v4.4.5 to v5.0.5.

    I've checked a bit more and found out that my version of php-spider is actually reverted from v0.4.2 down to v0.2. It's because v0.2 didn't require event-dispatcher and was therefore matching my set of requirements.

    I've looked closer at the error and found that switching the parameters on line 87 & 88 in the InMemoryQueueManager-class fixed it. I've prepared a PR and started writing this issue as I've found the old issue about this: https://github.com/mvdbos/php-spider/issues/61 haha. I could have solved it quicker :)

    It would be great if you could let me know if the PR works or you or needs further tweaks.

    Just in case - Interesting versions / packages:

    PHP 7.2.24
    
    laravel/framework                     v7.2.2     
    guzzle/guzzle                         v3.8.1       
    symfony/event-dispatcher              v5.0.5            
    symfony/event-dispatcher-contracts    v2.0.1     
    vdb/php-spider                        v0.2              
    vdb/uri                               v0.2              
    

    Cheers, Peter

    opened by spekulatius 3
  • Some feature questions

    Hi, I have some questions regarding the features of this crawler, which are not covered by the documentation.

    1. Does php-spider support JavaScript (content and URLs generated via JavaScript)?
    2. Does php-spider follow robots.txt files?
    3. Is php-spider able to leverage a sitemap?
    4. Is it possible to crawl sites that require authentication?
    opened by koolma 3
  • is php-spider an abandoned repo?

    Hi @mvdbos,

    because it's a little bit complicated to get in contact with you, I'll try it with an issue. :)

    php-spider is still the greatest web spider out there, but the latest release uses Guzzle 3, which reached EOL months ago. Is there any chance of a new release any time soon?

    it would be great to hear from you and hopefully you get a chance to reply.

    thanks

    opened by solverat 3
  • install problems

    I'm not a PhD in php here but composer is having issues installing this on my machine. Very interested in seeing what you have put together with your spider...

    I don't have curl installed as I'm just testing this on my local Windows 7 machine running xampp, but when I run the command in composer

    php composer.phar require vdb/php-spider
    Please provide a version constraint for the way/generators requirement: 0.1.*

    I'm told the

    vdb/uri dev-master
    can't be found.

    I've tried installing this from a clone, but no luck...

    any thoughts?

    Update Just re-shifted some things and now I'm getting this error in the example_simple.php

    Fatal error: Call to undefined function VDB\Spider\pcntl_signal() in C:\xampp\htdocs\test\src\VDB\Spider\Spider.php on line 98

    0 - Backlog 
    opened by aggied 3
  • Allowing guzzle 7

    Hello @mvdbos

    in my Laravel sitemap generator I use your PHP spider. While making it work for Laravel 8 I stumbled upon a conflict in the required versions of guzzle.

    Here is an update of the composer.json to allow Guzzle v7. I've run the unit tests and examples; both worked and didn't indicate any issues with the update. Do you think it's good for merging?

    Cheers, Peter

    opened by spekulatius 2
  • Is it possible to count/track the number of links pointing to a page?

    Hello @mvdbos

    really neat library. I'm using it to crawl my own site and build a comprehensive sitemap from it. I was wondering, is it possible to count/track the number of links pointing to a page?

    Cheers, Peter

    opened by spekulatius 2
  • Pass Query Parameters

    how do I pass query params with all crawled links?

    Sample for this https://example.com?locale=en https://example.com/blog/details/1?locale=en

    I want this to handle crawling multi-language sites

    Can you help me please?

    opened by MahmoudSaidHaggag 1
  • Follow only internal redirects

    Hello @mvdbos

    I haven't found time to look into the robots.txt filter discussed in the other issue. Sorry! I stumbled on a new question you might be able to shine some light on:

    I'm trying to filter out URLs that have been redirected externally. I'm keen to implement a PostFetchFilter to keep it all within the spider. I was wondering if it is possible to get the final URL (after redirects) in a PostFetchFilter? It seems like only the original URL is part of the Resource.

    Appreciate any ideas on how you would approach this.

    Cheers, Peter

    opened by spekulatius 2
  • Robots.txt filtering?

    Hello @mvdbos,

    I hope you are doing well!

    I was wondering what your approach (if any) is to using the spider with robots.txt pattern for filtering?

    The UriFilter seems to support only allow rules and no disallow rules, and hence wouldn't cover the most common use case for robots files.

    I thought I'd ask whether you have ideas on this before I build something myself.

    Thank you in advance,

    Peter

    opened by spekulatius 6