Goutte, a simple PHP Web Scraper

Last update: Jul 2, 2022

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

Requirements

Goutte depends on PHP 7.1+.

Installation

Add fabpot/goutte as a require dependency in your composer.json file:

composer require fabpot/goutte

Usage

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\HttpBrowser):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

// Go to the symfony.com website
$crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).
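
As a rough illustration of what a Crawler object offers, the sketch below builds one from an inline HTML string instead of a live response, so it runs without network access. It assumes the symfony/dom-crawler and symfony/css-selector packages are installed through Composer; the markup, the autoloader path, and the variable names are made up for the example.

```php
<?php
// Assumption: dependencies were installed with Composer; adjust the path to your project.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a fetched page.
$html = <<<'HTML'
<html><body>
  <h2><a href="/first">First post</a></h2>
  <h2><a href="/second">Second post</a></h2>
</body></html>
HTML;

$crawler = new Crawler($html);

// filter() takes a CSS selector; each() returns the closure results as an array.
$titles = $crawler->filter('h2 > a')->each(function (Crawler $node) {
    return $node->text();
});

// attr() reads an attribute from the first node of the selection.
$firstHref = $crawler->filter('h2 > a')->first()->attr('href');

print implode("\n", $titles)."\n";
print $firstHref."\n";
```

The same filter(), each(), text(), and attr() calls apply to the Crawler returned by request().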

To use your own HTTP settings, you may create and pass an HttpClient instance to Goutte. For example, to add a 60 second request timeout:

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

$client = new Client(HttpClient::create(['timeout' => 60]));

Click on links:

// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);

Extract data:

// Get the latest posts in this category and display their titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});
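
When several values are needed per node, the DomCrawler extract() method may be more convenient than a closure. The following is a minimal sketch under the assumption that symfony/dom-crawler and symfony/css-selector are available through Composer; the HTML string is a hypothetical stand-in for a real page.

```php
<?php
// Assumption: dependencies were installed with Composer; adjust the path to your project.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup used instead of a live response.
$html = '<html><body>'
    .'<h2><a href="/first">First post</a></h2>'
    .'<h2><a href="/second">Second post</a></h2>'
    .'</body></html>';

$crawler = new Crawler($html);

// extract() collects the listed attributes for every matched node;
// the special name "_text" stands for the node's text content.
$rows = $crawler->filter('h2 > a')->extract(['_text', 'href']);

foreach ($rows as [$title, $href]) {
    print $title.' => '.$href."\n";
}
```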

Submit forms:

$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'fabpot', 'password' => 'xxxxxx']);
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

More Information

Read the documentation of the BrowserKit, DomCrawler, and HttpClient Symfony Components for more information about what you can do with Goutte.

Pronunciation

Goutte is pronounced "goot", i.e. it rhymes with "boot" and not with "out".

Technical Information

Goutte is a thin wrapper around the following Symfony Components: BrowserKit, CssSelector, DomCrawler, and HttpClient.
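
Since the Goutte Client extends Symfony's HttpBrowser, roughly the same calls should be available on the Symfony class itself. The sketch below is an illustration rather than an official migration path; it assumes symfony/browser-kit and symfony/http-client are installed via Composer, and the request is left commented out so that the snippet runs without network access.

```php
<?php
// Assumption: dependencies were installed with Composer; adjust the path to your project.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// HttpBrowser is the BrowserKit class that Goutte\Client extends.
// The 60 second timeout mirrors the HttpClient option shown under Usage.
$browser = new HttpBrowser(HttpClient::create(['timeout' => 60]));

// Uncomment to fetch a page (requires network access):
// $crawler = $browser->request('GET', 'https://www.symfony.com/blog/');
// $crawler->filter('h2 > a')->each(function ($node) {
//     print $node->text()."\n";
// });
```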

License

Goutte is licensed under the MIT license.

GitHub

https://github.com/FriendsOfPHP/Goutte
Comments
  • 1. Upgrades Guzzle to Guzzle 4

    Two initiatives are working towards goals that will require Goutte in Drupal 8's composer.json. One is adding Behat to Drupal 8. The other is re-architecting Drupal's WebTestBase on top of Mink. Drupal 8 currently uses Guzzle 4, and we'd rather not rely on Guzzle 3 and Guzzle 4 at the same time. This updates Goutte to use Guzzle 4. It will require a new 2.0 branch, as it bumps the PHP requirement to 5.4. I'm happy to help maintain this version.

    Reviewed by larowlan at 2014-04-17 08:50
  • 2. H: Complete form on ASP.NET WebForms website

    Hello,

    I'm trying to get data from here: https://wyobiz.wy.gov/Business/FilingSearch.aspx

    I'm trying to check whether a business name is available. However, this website uses ASP.NET Web Forms and has these fields:

                '__VIEWSTATE' => '',
                '__VIEWSTATEGENERATOR' => '9E6EC73D',
                '__EVENTVALIDATION' => '',
    

    Is it possible to submit this form somehow and get returned data?

    Thank you.

    My code:

            $crawler = $client->request('GET', 'https://wyobiz.wy.gov/Business/FilingSearch.aspx');
            $form = $crawler->selectButton('Search')->form();
            $formValues = $form->getValues();
            $crawler = $client->submit($form, array(
                '__VIEWSTATE' => $formValues['__VIEWSTATE'],
                '__VIEWSTATEGENERATOR' => $formValues['__VIEWSTATEGENERATOR'],
                '__EVENTVALIDATION' => $formValues['__EVENTVALIDATION'],
                'ctl00$MainContent$myScriptManager' => 'MainContent_myScriptManager',
                'ctl00$MainContent$txtFilingName' => 'Google',
                'ctl00$MainContent$searchOpt' => 'chkSearchStartWith',
                'ctl00$MainContent$txtFilingID' => null,
            ));
            echo $crawler->html();
    

    Error:

    The current node list is empty.

    Reviewed by kironet at 2017-03-21 04:46
  • 3. How to upload file with Guzzle client?

    I'm trying to do BDD testing on an upload method. I'm using Behat with Mink in a Symfony2 project, and I'm trying to use the Guzzle client to send a form with a file.

    $url = $this->minkParameters["base_url"] . '/' . ltrim($url, '/'); 
    $file = new \Symfony\Component\HttpFoundation\File\UploadedFile($path, "video");
    $fields = json_encode($table->getColumnsHash()[0]); //array("user" => "test")
    $this->client->request("POST", $url, array('content-type' => 'multipart/form-data'), array($file), array(), $fields);
    

    I receive this error:

    Call to undefined method GuzzleHttp\Stream\Stream::addFile()
    

    Where am I going wrong? Thank you.

    Reviewed by stuzzo at 2015-01-12 17:00
  • 4. Allow to override Curl options passed to Guzzle request

    Currently curl options are hardcoded in Client.php, but sometimes more options have to be set (increase timeout or max. redirects).

    Code from Client.php:

     $guzzleRequest->getCurlOptions()
            ->set(CURLOPT_FOLLOWLOCATION, false)
            ->set(CURLOPT_MAXREDIRS, 0)
            ->set(CURLOPT_TIMEOUT, 30);
    
    Reviewed by antonbabenko at 2012-07-10 20:45
  • 5. Add support for other types of CookieJar, eg. FileCookieJar and SessionCookieJar

    When using the standard CookieJar the cookies only last the duration of the PHP script. When the PHP script terminates, the cookies will have been lost. To maintain the cookies across script executions, I updated Goutte to support FileCookieJar and SessionCookieJar.

    Reviewed by hoducha at 2015-11-23 14:52
  • 6. PHP 8 support

    Goutte is one of the few Drupal 9.1 dependencies left that does not yet support PHP 8. It would be great to start running CI with PHP 8 to see what needs to be changed.

    Reviewed by goba at 2020-10-13 13:47
  • 7. Parameters are being ignored when sending a GET request.

    In order to scrape data that change dynamically according to GET parameters in a website, I need to create a GET request with parameters such as:

    //a=a&b=b
    $params = array(
                'a' => 'a',
                'b' => 'b',
    );
    
    $crawler = $client->request('GET', 'http://www.example.com/', $params);
    

    Then the complete URL should be formed by the URI (http://www.example.com/) plus the Request parameters (a=a&b=b) so that the request is sent to the correct URL.

    Reviewed by lucianodelucchi at 2013-07-08 09:49
  • 8. Removing nodes

    Hi,

    Is there any way using Goutte to remove nodes? I'm building a parser but I need to remove header and footer elements, along with others.

    I've briefly looked at the source code for Symfony Crawler and I found the reduce function but I'm unsure on how to use it.

    Reviewed by duncanmcclean at 2019-04-09 09:56
  • 9. Get images source from specific div.

    Hi, I'm trying to get source of images which are in a div by id="demo". Here's what I did:

    $crawler->filter('#demo img')->each(function ($node) {
        $src = $node->attr('src');
        echo $src . '<br>';
    });
    

    It doesn't work. However, if you try getting any other attribute, such as the image height or image class, it works fine. It only fails when you pass src to attr().

    Reviewed by sam-deepweb at 2017-06-22 05:03
  • 10. Fixes for #146

    Demonstrates the bug in #146. Basically, Guzzle 4.0 doesn't support boolean header values, so for the special case of Https, we convert that to 'on'. Not sure if anything is needed at all.

    Reviewed by larowlan at 2014-05-09 06:45
  • 11. Recent "src/" update breaks Composer install?

    I'm trying to install the latest Goutte version via Composer and I keep getting an error that "Goutte\Client" can't be found. I noticed that there was a recent change that pulled the code out of the "src/" directory. When Composer builds the autoloader, it still tries to include this directory in the namespace mapping:

    'Goutte' => $vendorDir . '/fabpot/goutte/src/',

    If I manually remove it, I get this exception: https://gist.github.com/2704894

    Seems like something's broken... not sure what, though.

    Reviewed by enygma at 2012-05-15 20:36
  • 12. syntax error, unexpected token ")"

    This is the error I get while creating a new instance of the Client. It works locally, but not in production. I use PHP 8.1 locally and 8.0.19 in production, so it should not be a syntax problem. When I checked the production Laravel logs, they showed this: [2022-06-22 14:22:19] local.ERROR: syntax error, unexpected token ")" {"userId":2,"exception":"[object] (ParseError(code: 0): syntax error, unexpected token ")" at /home/749128.cloudwaysapps.com/maxxhthfnk/public_html/vendor/symfony/http-client/CurlHttpClient.php:68)

    Reviewed by BenOussama180 at 2022-06-23 08:25
  • 13. SSL routines:tls_process_ske_dhe:dh key too small

    I want to crawl a website that uses a lower SSL security level, and I get this error:

    Error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small for example.com

    I followed these instructions, without success: how-to-set-lower-ssl-security-level. I then restarted the PHP service with the following command: $ sudo service php8.1-fpm restart

    I also used this code to disable the SSL check: $client = new Client(HttpClient::create(['timeout' => 60,'verify_peer' => false, 'verify_host' => false]));

    How can I fix this issue?

    Reviewed by radinmehr at 2022-05-09 05:50
  • 14. how to filter class like this

    How can I filter by a class attribute like this: product-item product-item--list 1/3--tablet-and-up 1/4--desk

    This is the page I am trying to scrape data from: https://select.eg/en/collections/%D9%85%D9%88%D8%A8%D8%A7%D9%8A%D9%84-%D8%A7%D8%A8%D9%84

    Reviewed by aroon9002ahmed at 2022-04-15 22:20
  • 15. Why not archiving this project?

    If all this does now is extend the Symfony BrowserKit client, why not just archive it? I was so confused by this library, goutte-driver, and mink-extension; the whole thing does not actually work.

    I think we need a new factory in mink-extension that instantiates this BrowserKit client version, as opposed to all this wrapper madness.

    I've created a PR and made goutte-driver work, but it is completely misleading, and parameters are actually not being sent either. I'm happy to fix the issue, but I will need some light shed on it and on the best way to tackle it.

    Reviewed by joshlopes at 2022-03-25 16:54
Related tags
Blackfire Player is a powerful Web Crawling, Web Testing, and Web Scraper application. It provides a nice DSL to crawl HTTP services, assert responses, and extract data from HTML/XML/JSON responses.

Jun 16, 2022
PHP Scraper - an highly opinionated web-interface for PHP

PHP Scraper An opinionated & limited way to scrape the web using PHP. The main goal is to get stuff done instead of getting distracted with xPath sele

Jun 22, 2022
The most integrated web scraper package for Laravel.

Laravel Scavenger The most integrated web scraper package for Laravel. Top Features Scavenger provides the following features and more out-the-box. Ea

Jun 20, 2022
Library for Rapid (Web) Crawler and Scraper Development

Library for Rapid (Web) Crawler and Scraper Development This package provides kind of a framework and a lot of ready to use, so-called steps, that you

Jun 28, 2022
PHP scraper for ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

It can scrape ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

Mar 24, 2022
Extractor (scraper, crawler, parser) of products from Allegro

May 11, 2022
A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5. If you are st

Jun 28, 2022
A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tes

Jun 25, 2022
Get info from any web service or page

Embed PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scrapping the html, etc). It's compatible with any web

Jun 29, 2022
This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own video player.

XVideos PornHub RedTube API This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own

Jul 2, 2022
A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit.

s3n Search-Scan-Save-Notify A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit. It is based on PH

Apr 15, 2022
Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution.

Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. Available for PHP 7.3, 7.4, 8.0.

May 12, 2022
Property page web scrapper

Property page web scrapper This tool was built to experiment with extracting features for property pages on websites like booking.com and Airbnb. Thi

Feb 24, 2022
Simple and fast HTML parser

DiDOM README на русском DiDOM - simple and fast HTML parser. Contents Installation Quick start Creating new document Search for elements Verify if ele

Jun 25, 2022
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

crawlerdetect.io About CrawlerDetect CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. Current

Jun 25, 2022
:spider: The progressive PHP crawler framework! 优雅的渐进式PHP采集框架。

QueryList QueryList is a simple, elegant, extensible PHP Web Scraper (crawler/spider) ,based on phpQuery. API Documentation 中文文档 Features Have the sam

Jun 28, 2022
PHP Discord Webcrawler to log all messages from a Discord Chat.

Disco the Ripper was created to rip all messages from a Discord specific channel into JSON via CLI and help people to investigate some servers who has awkward channels before they get deleted.

Jun 9, 2022
PHP DOM Manipulation toolkit.

phpQuery The PHP DOM Manipulation toolkit. Motivation I'm working currently with PHP, and I've missed using something like jQuery in PHP to manipulate

Nov 26, 2021
This Project is for digikala.com scrapping challenge of 2021 blackfriday using php/laravel/horizon

Objective This script is intended for finding the hidden treasure, A scraping challenge by digikala for 2021 black Friday Prerequisites Php mysql redi

Dec 22, 2021