PHP Scraper - a highly opinionated web-interface for PHP

Overview

PHP Scraper

An opinionated & limited way to scrape the web using PHP. The main goal is to get stuff done rather than getting distracted by XPath selectors, preparing data structures, etc. You can just "go to a website" and get an array with all details relevant to your scraping project.

Under the hood, it uses Goutte and a few other packages. See composer.json.

Sponsors

Want to sponsor this project? Contact me.

Examples

Here are a few examples of how the library works. More examples are on the project website.

Get the Title of a Website

All scraping functionality can be accessed either as a function call or as a property call. For title scraping, this looks as follows:

$web = new \spekulatius\phpscraper();

$web->go('https://google.com');

// Returns "Google"
echo $web->title;

// Also returns "Google"
echo $web->title();
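
Dual access like this is typically built on PHP's magic `__get` method. The following is a minimal, self-contained sketch of the idea, not the library's actual code; the class `TinyPage` and its regex-based title lookup are assumptions made up for illustration.

```php
<?php

// Hypothetical illustration only: TinyPage is not part of PHP Scraper.
final class TinyPage
{
    public function __construct(private string $html)
    {
    }

    // Regular function call: $page->title()
    public function title(): ?string
    {
        if (preg_match('~<title>(.*?)</title>~is', $this->html, $m) === 1) {
            return trim($m[1]);
        }

        return null;
    }

    // Property call: $page->title is forwarded to $page->title()
    public function __get(string $name): mixed
    {
        if (method_exists($this, $name)) {
            return $this->$name();
        }

        throw new LogicException("Unknown property: {$name}");
    }
}

$page = new TinyPage('<html><head><title>Example</title></head></html>');

echo $page->title, PHP_EOL;   // property call, prints "Example"
echo $page->title(), PHP_EOL; // function call, prints "Example"
```

The sketch requires PHP 8.0 or later because of constructor promotion and the `mixed` return type.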

Scrape the Images from a Website

Scraping the images including the attributes of the img-tags:

$web = new \spekulatius\phpscraper();

/**
 * Navigate to the test page.
 *
 * This page contains the image "cat.jpg" twice:
 * once with a relative path and once with an absolute path.
 */
$web->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html');

var_dump($web->imagesWithDetails);
/**
 * Contains:
 *
 * [
 *     'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *     'alt' => 'absolute path',
 *     'width' => null,
 *     'height' => null,
 * ],
 * [
 *     'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *     'alt' => 'relative path',
 *     'width' => null,
 *     'height' => null,
 * ]
 */

See the full documentation for more information and examples.

Comments
  • Bump jeremykendall/php-domain-parser to version 6

    I'm trying to integrate this into an app that is already locked to psr/log:^3.

    jeremykendall/php-domain-parser version 5 is locked to psr/log:^1

    There are 2 tests that fail when bumping to the new version. I can fix and submit a PR.

    opened by tacman 10
  • Versioning and Changelog

    Hi

    It's a great project; we use the library and it's great :+1: Thanks for all the work!

    We are not sure what the versioning concept of this library looks like, and we could not find information about what has changed (a changelog); there are only tags: https://github.com/spekulatius/PHPScraper/tags. So it would be nice to read something about what the versioning strategy is and when breaking changes are expected.

    Of course semver would be nice, with a 1.0.0 release and a CHANGELOG.md and UPGRADE.md. My personal example would look like this https://github.com/nadar/quill-delta-parser/blob/master/CHANGELOG.md and this https://github.com/nadar/quill-delta-parser/blob/master/UPGRADE.md

    Thanks and keep up the great work!

    opened by nadar 8
  • [Proposal] Add HTTP proxy support

    I'm working on a project that needs to change proxies constantly. Looking at the Goutte client, the underlying implementation already supports this.

    Research

    1. HttpBrowser, which Goutte is based on, supports a custom client implementing the HttpClientInterface (source code)
    ...
    class HttpBrowser extends AbstractBrowser
    {
        private $client;
    
        public function __construct(HttpClientInterface $client = null, History $history = null, CookieJar $cookieJar = null)
    ...
            $this->client = $client ?? HttpClient::create();
    ...
    
    2. HttpClient::create() supports a proxy via the $defaultOptions parameter, which gets passed to the selected HttpClient (source code)
            public static function create(array $defaultOptions = [], int $maxHostConnections = 6, int $maxPendingPushes = 50): HttpClientInterface
    

    Implementation details

    The idea is to expose this functionality through a setProxy function in the core class. The library will continue to dynamically select the httpClient accordingly.

    I just made a POC in my fork with all necessary code to make this work (a6589da).

    public function setProxy(string $proxy)
    {
        $httpClient = HttpClient::create([
            'proxy' => $proxy
        ]);
    
        $this->client = new Client($httpClient);
    
        return $this;
    }
    

    How to use

    $web = new phpscraper;
    $web->__call('setProxy', [
        'http://user:[email protected]:3128',
    ]);
    

    If this feature gets approved, I will open a PR with it. If anything needs to be changed, let me know.

    opened by nathabonfim59 6
  • Getting Deprecated: strpos() error in PHP 8.1.8

    I am using this example https://phpscraper.de/examples/extract-keywords.html with PHP 8.1.8

    Deprecated: strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated in /Users/khanakia/D1/www/php/scrap_php/vendor/spekulatius/phpscraper/src/phpscraper.php on line 858
    PHP Deprecated:  strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated in /Users/khanakia/D1/www/php/scrap_php/vendor/spekulatius/phpscraper/src/phpscraper.php on line 859
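
The notice appears because PHP 8.1 deprecates passing null to the string parameters of built-in functions such as strpos(). A minimal sketch of the usual guard follows; the helper name `containsKeyword` is hypothetical and is not the code at the reported lines of phpscraper.php.

```php
<?php

// Hypothetical helper; not taken from phpscraper.php.
function containsKeyword(?string $haystack, string $needle): bool
{
    // Coalesce null to an empty string before calling strpos(),
    // which avoids the PHP 8.1 deprecation notice.
    return strpos($haystack ?? '', $needle) !== false;
}

var_dump(containsKeyword(null, 'php'));          // bool(false)
var_dump(containsKeyword('php scraper', 'php')); // bool(true)
```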
    
    opened by khanakia 5
  • Make public function to access Client (Goutte)

    I wanted to know the response code of the url but could not get access.

    Looking at Goutte under the hood, I think I could get it if I could do something like:

    $web->getClient()->getInternalResponse()->getStatusCode()
    

    So, I would like a function to get access to the $client, like:

    public function getClient()
    {
       return $this->client;
    }
    
    opened by amurrell 4
  • SSL connect error

    Hi

    I'm getting an

    SSL connect error for "https://bbc.in/3cNMnkw".

    when connecting to some websites. I'm writing a TweetDeck-style system that scrapes the site a tweet links to and picks up its details so that a nicer link can be generated. Any ideas on how to solve this? Most sites work fine; it's just a few that fail.

    Also, I'm sometimes getting "This browser is no longer supported" coming through as the description from the page. I notice you are passing 'Mozilla/5.0 (compatible; PHP Scraper/0.x; +https://phpscraper.de)' as the user agent. Is it worth altering the user agent? I don't think I can pass it in as a parameter anywhere, from what I can see.

    Thanks.

    opened by tonybyng 4
  • TypeError

    When I run the sample code:

    $web = new \Spekulatius\PHPScraper\PHPScraper();
    $web->go('https://www.google.com/');
    echo $web->title;
    

    It returns:

    Spekulatius\PHPScraper\Core::setHttpClient(): Argument #1 ($httpClient) must be of type Symfony\Component\HttpClient\CurlHttpClient, Symfony\Component\HttpClient\NativeHttpClient given, called in C:\www\web-crawer\vendor\spekulatius\phpscraper\src\PHPScraper.php on line 108
    

    Environments

    PHP: 8.1.13
    PHPScraper: 1.0.1

    opened by alanx15a2 3
  • deprecate magic properties / methods

    In my branch I've removed the magic __get and __call methods, and moved what was core into the phpscraper, so now there is only one class.

    After a while I got tired of find/replace, so I created a rector rule to change the properties to methods.

    Is there a demo repository that I can use to test my branch? Tests are passing, except for the ones related to internal/external links, which I'll address in another issue.

    opened by tacman 3
  • charset in headers method

    Is this method used? Where is charset() defined?

        /**
         * Get the header collected as an array
         *
         * @return array
         */
        public function headers()
        {
            return [
                'charset' => $this->charset(),
                'contentType' => $this->contentType(),
                'viewport' => $this->viewport(),
                'canonical' => $this->canonical(),
                'csrfToken' => $this->csrfToken(),
            ];
        }
    
    opened by tacman 3
  • SSL certificate problem

    Symfony\Component\HttpClient\Exception\TransportException: SSL certificate problem: unable to get local issuer certificate for ***

    I'm getting an error when trying to get data from a link that doesn't have SSL.

    opened by datlechin 3
  • composer require fails on mac due to strange characters in filenames in the /tests directory

    Problem

    Running composer require spekulatius/phpscraper fails on macOS 10.15.5 with the following:

    [RuntimeException]
    Failed to extract spekulatius/phpscraper: (50) '/usr/bin/unzip' -qq '/xxx/xxx/vendor/composer/tmp-123dcc14bbc3272649fdd489b0ecc9a3' -d '/xxx/xxx/vendor/composer/74030a62'

    error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/tests/resources/assets/katze-+?-++-+?.jpg
            Illegal byte sequence
    error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/tests/resources/assets/???.jpg
            Illegal byte sequence
    error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/websites/test-pages/assets/katze-+?-++-+?.jpg
            Illegal byte sequence
    error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/websites/test-pages/assets/???.jpg
            Illegal byte sequence
    

    It seems like the unzip command within composer fails due to the strange characters.

    opened by BrettGregson 3
  • What location is PHPScraper based on?

    Hi @spekulatius,

    Some users of my https://github.com/datlechin/flarum-link-preview, which uses PHPScraper, see the scraped content in another language. Is the website content based on where the website is hosted?

    https://discuss.flarum.org/d/30011-link-preview/178

    opened by datlechin 1
  • Idea: Directly exposing received headers

    Thanks, so it actually depends on the header. So far headers haven't been processed much. Exposing them would be beneficial in general. What do you think?
    

    Originally posted by @spekulatius in https://github.com/spekulatius/PHPScraper/pull/164#discussion_r1045676088

    opened by spekulatius 1
  • Idea: Implement low-level util to access the web.

    E.g.

    // GET request
    $web->get('https://...');
    
    // POST request
    $response = $web->post('https://...', [
      'param' => 'first param',
    ]);
    
    // ...
    

    This could be done either directly in PHPScraper or built upon another specialized lib such as Symfony HTTP. Exposing the functionality of the existing dependency sounds like a reasonable way to go, if the idea is of interest.

    opened by spekulatius 1
  • Parsing structured data (microdata)

    #16 proposes adding support for JSONLD.

    JSONLD isn't the only option: structured data can also be provided in the microdata notation, and the good news is that there is a project which parses microdata and converts it to the same data structure as JSONLD: https://github.com/yusufkandemir/microdata-parser

    So it should be possible to use both and treat it just like an additional JSONLD block!

    A first test:

    $jsonlddata = \YusufKandemir\MicrodataParser\Microdata::fromHTML($web->client->getResponse()->getContent(), $web->currentUrl())->toArray();
    

    Internally this project uses its own DOM document class derived from DOMDocument. It has a function to import a DOMDocument, but Symfony's response class doesn't allow access to the DOMDocument.

    I did a small test, but couldn't figure out how to pass the DOMDocument without reparsing. My attempt, which didn't work:

    $dom = new \DOMDocument('1.0', 'UTF-8');
    $dom->importNode($web->filterFirst('//*')->getNode(0), true);
    $jsonlddata = \YusufKandemir\MicrodataParser\Microdata::fromDOMDocument($dom);
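
One possible reason the attempt yields an empty document is that DOMDocument::importNode() only returns a copy owned by the target document; the copy is not part of the tree until it is appended. A minimal sketch using only PHP's DOM extension follows. The sample markup is made up, and whether Microdata::fromDOMDocument() then accepts the result has not been verified here.

```php
<?php

// Source document standing in for the crawler's DOM (sample markup is an assumption).
$source = new DOMDocument('1.0', 'UTF-8');
$source->loadHTML(
    '<html><body><div itemscope itemtype="https://schema.org/Person">'
    . '<span itemprop="name">Ada</span></div></body></html>'
);

$target = new DOMDocument('1.0', 'UTF-8');

// importNode() returns a detached copy; appendChild() attaches it to $target.
$copy = $target->importNode($source->documentElement, true);
$target->appendChild($copy);

echo $target->saveHTML();
```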
    

    What makes sense? Adding separate PHPScraper functions for JSONLD and microdata? Or mixing both automatically? (my opinion: mixing)

    What should support for microdata look like? Adding the other project to PHPScraper? Extending the existing classes or porting the whole functionality to PHPScraper?

    opened by eposjk 2
  • get http status code

    How do I check if the scraped page is an error page? It would be helpful to have the http status for that - something like

    $web->go('https://httpstat.us/404');
    echo $web->status;     // prints 404
    

    Just before posting, I found the solution myself:

    $web->client->getResponse()->getStatusCode()
    

    It would be great to have this added to the documentation. Or maybe add $web->status as shortcut?

    opened by eposjk 7
Releases(1.0.0)
Owner
Peter Thaleikis
Software engineer focused on solutions using open source and simply filling in the gaps to fulfill the requirements.
Goutte, a simple PHP Web Scraper

Goutte, a simple PHP Web Scraper Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extrac

null 9.1k Jan 1, 2023
The most integrated web scraper package for Laravel.

Laravel Scavenger The most integrated web scraper package for Laravel. Top Features Scavenger provides the following features and more out-the-box. Ea

Reliq Arts 134 Jan 4, 2023
Library for Rapid (Web) Crawler and Scraper Development

Library for Rapid (Web) Crawler and Scraper Development This package provides kind of a framework and a lot of ready to use, so-called steps, that you

crwlr.software 60 Nov 30, 2022
PHP scraper for ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

It can scrape ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

null 1 Mar 24, 2022
Extractor (scraper, crawler, parser) of products from Allegro

Extractor (scraper, crawler, parser) of products from Allegro

Daniel Yatsura 1 May 11, 2022
On-Page SEO Crawler Tool with Interface

upzon I developed this project with PHP & MYSQL and python. If you have basic python and php knowledge, it is quite simple to use this program. I'm us

null 5 Oct 27, 2021
A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5. If you are st

Matthijs van den Bos 1.3k Dec 28, 2022
A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tes

Symfony 2.7k Dec 31, 2022
Roach is a complete web scraping toolkit for PHP

Roach A complete web scraping toolkit for PHP About Roach is a complete web scraping toolkit for PHP. It is heavily inspired (read: a shameless clo

Roach PHP 1.1k Jan 3, 2023
Get info from any web service or page

Embed PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scrapping the html, etc). It's compatible with any web

Oscar Otero 1.9k Jan 1, 2023
This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own video player.

XVideos PornHub RedTube API This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own

null 57 Dec 16, 2022
A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit.

s3n Search-Scan-Save-Notify A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit. It is based on PH

Aamer 11 Nov 8, 2022
Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution.

Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. Available for PHP 7.3, 7.4, 8.0.

null 68 Dec 27, 2022
Property page web scrapper

Property page web scrapper This tool was built to experiment with extracting features for property pages on websites like booking.com and Airbnb. Thi

Vaugen Wake 2 Feb 24, 2022
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

crawlerdetect.io About CrawlerDetect CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. Current

Mark Beech 1.7k Dec 30, 2022
:spider: The progressive PHP crawler framework! 优雅的渐进式PHP采集框架。

QueryList QueryList is a simple, elegant, extensible PHP Web Scraper (crawler/spider) ,based on phpQuery. API Documentation 中文文档 Features Have the sam

Jaeger(黄杰) 2.5k Dec 27, 2022
PHP Discord Webcrawler to log all messages from a Discord Chat.

Disco the Ripper was created to rip all messages from a Discord specific channel into JSON via CLI and help people to investigate some servers who has awkward channels before they get deleted.

Daniel Reis 46 Sep 21, 2022
PHP DOM Manipulation toolkit.

phpQuery The PHP DOM Manipulation toolkit. Motivation I'm working currently with PHP, and I've missed using something like jQuery in PHP to manipulate

João Eduardo Fornazari 1 Nov 26, 2021