Goutte, a simple PHP Web Scraper

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

Requirements

Goutte depends on PHP 7.1+.

Installation

Add fabpot/goutte as a require dependency in your composer.json file:

composer require fabpot/goutte

Usage

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\HttpBrowser):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

// Go to the symfony.com website
$crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

To use your own HTTP settings, you may create and pass an HttpClient instance to Goutte. For example, to add a 60 second request timeout:

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

$client = new Client(HttpClient::create(['timeout' => 60]));

Click on links:

// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);

Extract data:

// Get the latest posts in this category and display their titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});

Submit forms:

$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'fabpot', 'password' => 'xxxxxx']);
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

More Information

Read the documentation of the BrowserKit, DomCrawler, and HttpClient Symfony Components for more information about what you can do with Goutte.

Pronunciation

Goutte is pronounced goot i.e. it rhymes with boot and not out.

Technical Information

Goutte is a thin wrapper around the following Symfony Components: BrowserKit, CssSelector, DomCrawler, and HttpClient.

License

Goutte is licensed under the MIT license.

Comments
  • Upgrades Guzzle to Guzzle 4

    Two initiatives are working towards goals that will require Goutte in Drupal 8's composer.json. One is adding Behat to Drupal 8. The other is re-architecting Drupal's WebTestBase on top of Mink. Drupal 8 currently uses Guzzle 4, and we'd rather not rely on Guzzle 3 and Guzzle 4 at the same time. This updates Goutte to use Guzzle 4. It will require a new 2.0 branch, as it bumps the PHP requirement to 5.4. I'm happy to help maintain this version.

    opened by larowlan 16
  • H: Complete form on ASP.NET WebForms website

    Hello,

    I'm trying to get data from here: https://wyobiz.wy.gov/Business/FilingSearch.aspx

    I'm trying to check whether a business name is available. But this website uses ASP.NET Web Forms and has these fields:

                '__VIEWSTATE' => '',
                '__VIEWSTATEGENERATOR' => '9E6EC73D',
                '__EVENTVALIDATION' => '',
    

    Is it possible to submit this form somehow and get returned data?

    Thank you.

    My code:

            $crawler = $client->request('GET', 'https://wyobiz.wy.gov/Business/FilingSearch.aspx');
            $form = $crawler->selectButton('Search')->form();
            $formValues = $form->getValues();
            $crawler = $client->submit($form, array(
                '__VIEWSTATE' => $formValues['__VIEWSTATE'],
                '__VIEWSTATEGENERATOR' => $formValues['__VIEWSTATEGENERATOR'],
                '__EVENTVALIDATION' => $formValues['__EVENTVALIDATION'],
                'ctl00$MainContent$myScriptManager' => 'MainContent_myScriptManager',
                'ctl00$MainContent$txtFilingName' => 'Google',
                'ctl00$MainContent$searchOpt' => 'chkSearchStartWith',
                'ctl00$MainContent$txtFilingID' => null,
            ));
            echo $crawler->html();
    

    Error:

    The current node list is empty.

    opened by kironet 15
  • How to upload file with Guzzle client?

    I'm trying to do BDD testing on an upload method. I'm using Behat with Mink in a Symfony2 project. I'm trying to use the Guzzle client to send a form with a file.

    $url = $this->minkParameters["base_url"] . '/' . ltrim($url, '/'); 
    $file = new \Symfony\Component\HttpFoundation\File\UploadedFile($path, "video");
    $fields = json_encode($table->getColumnsHash()[0]); //array("user" => "test")
    $this->client->request("POST", $url, array('content-type' => 'multipart/form-data'), array($file), array(), $fields);
    

    I receive this error:

    Call to undefined method GuzzleHttp\Stream\Stream::addFile()
    

    Where am I going wrong? Thank you.

    opened by stuzzo 15
  • Allow to override Curl options passed to Guzzle request

    Currently curl options are hardcoded in Client.php, but sometimes more options have to be set (increase timeout or max. redirects).

    Code from Client.php:

     $guzzleRequest->getCurlOptions()
            ->set(CURLOPT_FOLLOWLOCATION, false)
            ->set(CURLOPT_MAXREDIRS, 0)
            ->set(CURLOPT_TIMEOUT, 30);
    
    opened by antonbabenko 14
  • Add support for other types of CookieJar, eg. FileCookieJar and SessionCookieJar

    When using the standard CookieJar the cookies only last the duration of the PHP script. When the PHP script terminates, the cookies will have been lost. To maintain the cookies across script executions, I updated Goutte to support FileCookieJar and SessionCookieJar.

    opened by hoducha 13
  • PHP 8 support

    Goutte is one of the few Drupal 9.1 dependencies left that does not yet support PHP 8. It would be great to start running CI with PHP 8 to see what needs to be changed.

    opened by goba 10
  • Parameters are being ignored when sending a GET request.

    In order to scrape data that change dynamically according to GET parameters in a website, I need to create a GET request with parameters such as:

    //a=a&b=b
    $params = array(
                'a' => 'a',
                'b' => 'b',
    );
    
    $crawler = $client->request('GET', 'http://www.example.com/', $params);
    

    Then the complete URL should be formed by the URI (http://www.example.com/) plus the Request parameters (a=a&b=b) so that the request is sent to the correct URL.

    opened by lucianodelucchi 10
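
A possible workaround, assuming the parameters are indeed dropped for GET requests, is to build the full URL before calling request(). The sketch below uses only PHP's built-in http_build_query(); the helper name withQuery is hypothetical and not part of Goutte, and URL fragments (#...) are not handled.

```php
<?php
// Hypothetical helper (not part of Goutte): appends query parameters to a URL.
function withQuery(string $url, array $params): string
{
    if ($params === []) {
        return $url;
    }

    // Use "?" for the first parameter, "&" if the URL already has a query string.
    $separator = strpos($url, '?') === false ? '?' : '&';

    // RFC 3986 encoding emits %20 for spaces instead of "+".
    return $url . $separator . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

$url = withQuery('http://www.example.com/', ['a' => 'a', 'b' => 'b']);
echo $url, "\n"; // prints http://www.example.com/?a=a&b=b
```

The resulting string can then be passed as the URI, for example $client->request('GET', $url).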
  • Removing nodes

    Hi,

    Is there any way using Goutte to remove nodes? I'm building a parser but I need to remove header and footer elements, along with others.

    I've briefly looked at the source code for Symfony Crawler and I found the reduce function but I'm unsure on how to use it.

    opened by duncanmcclean 9
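
One way to approach this relies on the fact that the DomCrawler Crawler wraps standard PHP DOMNode objects, so nodes can be detached with the DOM extension's removeChild(). The sketch below uses a plain DOMDocument so that it runs without Goutte. With a Crawler, the same removeChild() call could be applied to each node obtained by iterating the result of filter(). The helper name removeElements and the sample HTML are hypothetical.

```php
<?php
// Hypothetical helper: removes every element with one of the given tag names.
function removeElements(DOMDocument $doc, array $tagNames): void
{
    foreach ($tagNames as $tagName) {
        // Copy the live node list into an array first; removing nodes while
        // iterating a live DOMNodeList can skip elements.
        $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }
}

$html = '<html><body><header>Menu</header><p>Content</p><footer>Legal</footer></body></html>';

// Collect libxml warnings (e.g. about HTML5 tags) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

removeElements($doc, ['header', 'footer']);

// Serialize only the <body> element to inspect what is left.
$result = $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
echo $result, "\n";
```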
  • Get images source from specific div.

    Hi, I'm trying to get the source of images inside a div with id="demo". Here's what I did:

    $crawler->filter('#demo img')->each(function ($node) {
        $src = $node->attr('src');
        echo $src . '<br>';
    });
    

    It doesn't work. But if you try getting any other attribute, like the image height or class, it works fine. It only fails when you pass src to attr().

    opened by sam-deepweb 9
  • Fixes for #146

    Demonstrates the bug in #146. Basically, Guzzle 4.0 doesn't support boolean header values, so for the special case of Https, we convert the value to 'on'. Not sure if anything is needed at all.

    opened by larowlan 9
  • Recent "src/" update breaks Composer install?

    I'm trying to install the latest Goutte version via Composer and I keep getting an error that "Goutte\Client" can't be found. I noticed that there was a recent change that pulled the code out of the "src/" directory. When Composer generates the autoloader, it still includes this directory in the namespace mapping:

    'Goutte' => $vendorDir . '/fabpot/goutte/src/',

    If I manually remove it, I get this exception: https://gist.github.com/2704894

    Seems like something's broken....not sure what, though.

    opened by enygma 9
  • JS Submit

    Is there a way to run a JS function with Goutte?

    <input id="ingresar" onclick="submitForm()" name="ingresar" class="btn btn-success" style="width:390px;-webkit-border-radius:0 !important;-moz-border-radius:0 !important;border-radius:0 !important;text-align:center;font-size:18px;font-weight:bold;color:#fff;padding:20px 0;cursor:pointer;background-image:none;" value="Ingresar">

    The HTML changed, and now when I fill in the form I can't submit it.

    opened by CAUN94 2
  • The current node list is empty.

    I have the following problem with this code, which worked without problems for the last 5 months.

    $form = $crawler->selectButton('Ingresar')->form();
    $form->setValues(['rut' => 'admin', 'password' => 'Pascual4900']);
    $crawler = $client->submit($form);
    

    This is the HTML from the page:

    <form id="login-form" method="POST" action="/sessions/authenticate">
    	<div class="login-container">
    		<div class="input" style="cursor:text;" onclick="jQuery('input', this).focus();">
    			<h5 style="margin:0">Usuario</h5>
    			<input name="rut" type="text" style="font-size:18px;border:0;-webkit-box-shadow:0 0 0;-moz-box-shadow:0 0 0;box-shadow:0 0 0 inset;padding:10px 0 0 0;" placeholder="Escriba su usuario...">
    		</div>
    		<div class="input" style="cursor:text;" onclick="jQuery('input', this).focus();">
    			<h5 style="margin:0">Clave</h5>
    			<input name="password" type="password" style="font-size:18px;border:0;-webkit-box-shadow:0 0 0;-moz-box-shadow:0 0 0;box-shadow:0px 0px 0px inset;padding:10px 0 0 0;" placeholder="Escriba su clave...">
    		</div>
    		<input id="ingresar" onclick="submitForm()" name="ingresar" class="btn btn-success" style="width:390px;-webkit-border-radius:0 !important;-moz-border-radius:0 !important;border-radius:0 !important;text-align:center;font-size:18px;font-weight:bold;color:#fff;padding:20px 0;cursor:pointer;background-image:none;" value="Ingresar">
    	</div>
    </form>
    

    This returns the following error:

    InvalidArgumentException The current node list is empty.

    opened by CAUN94 4
  • Call to undefined function Symfony\Component\HttpClient\Internal\curl_multi_init()

    Hello, I don't have any problem running Goutte on my localhost, but when I transfer the files to a shared host, I encounter this error. Thank you in advance for your help.

    Fatal error: Uncaught Error: Call to undefined function Symfony\Component\HttpClient\Internal\curl_multi_init() in /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/Internal/CurlClientState.php:43 Stack trace: #0 /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/CurlHttpClient.php(74): Symfony\Component\HttpClient\Internal\CurlClientState->__construct(6, 50) #1 /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/HttpClient.php(54): Symfony\Component\HttpClient\CurlHttpClient->__construct(Array, 6, 50) #2 /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/browser-kit/HttpBrowser.php(37): Symfony\Component\HttpClient\HttpClient::create() #3 /home2/farmasco/domains/almas-net.com/public_html/test/test/index.php(16): Symfony\Component\BrowserKit\HttpBrowser->__construct() #4 /home2/farmasco/domains/almas-net.com/public_html/test/test/index.php(27): Start->card() #5 /home2/farmasco/domains/almas-net.com/public_html/test/test/index.php(98): Start->titel() #6 {main} thrown in /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/Internal/CurlClientState.php on line 43

    opened by abolfazllover 0
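
A quick way to narrow this down, assuming the shared host either lacks the curl extension or disables some of its functions, is to check which curl functions PHP can actually call. The helper name missingCurlFunctions and the list of function names checked are illustrative choices, not something Goutte or Symfony provides.

```php
<?php
// Hypothetical helper (not part of Goutte or Symfony): returns the names from
// $required that are not callable on this PHP installation.
function missingCurlFunctions(array $required = ['curl_init', 'curl_multi_init', 'curl_multi_exec']): array
{
    return array_values(array_filter($required, function (string $name): bool {
        return !function_exists($name);
    }));
}

$missing = missingCurlFunctions();

if ($missing === []) {
    echo "All checked curl functions are available.\n";
} else {
    echo 'Unavailable: ', implode(', ', $missing), "\n";

    // Shared hosts sometimes disable individual functions in php.ini.
    $disabled = ini_get('disable_functions');
    echo 'disable_functions: ', ($disabled === false || $disabled === '') ? '(empty)' : $disabled, "\n";
}
```

If curl_multi_init turns out to be unavailable, one option is to pass a Symfony\Component\HttpClient\NativeHttpClient instance to the Goutte Client constructor, since that client is based on PHP streams rather than curl.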
  • Error parsing get url request with year in url

    It seems that when the numbers 2020 or 24 are in the URL passed to request(), no data is returned; I believe the URL encoding is breaking the URL.

    After googling for hours, I tried calling

    $crawler->getQuery()->setEncodingType(false); $crawler->setEncodingType(false);

    with no luck.

    An example of a non working URL is: https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson+ruby+wave&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1

    and an example of a working URL is: https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1

    I cannot figure out for the life of me how to fix this. I have tried URL-encoding, decoding, and a million different versions of the URL, and the only difference between the two URLs is the keyword (the _nkw parameter).

    Thanks in advance

    opened by djpisbionic 3
  • syntax error, unexpected token ")"

    This is the error I get while creating a new instance of the Client. It works locally, but not in production. I use PHP 8.1 locally and 8.0.19 in production, so it should not be a syntax problem. When I checked production's Laravel logs, they said: [2022-06-22 14:22:19] local.ERROR: syntax error, unexpected token ")" {"userId":2,"exception":"[object] (ParseError(code: 0): syntax error, unexpected token ")" at /home/749128.cloudwaysapps.com/maxxhthfnk/public_html/vendor/symfony/http-client/CurlHttpClient.php:68)

    opened by BenOussama180 7
Releases(v3.1.0)
Blackfire Player is a powerful Web Crawling, Web Testing, and Web Scraper application. It provides a nice DSL to crawl HTTP services, assert responses, and extract data from HTML/XML/JSON responses.

Blackfire Player Blackfire Player is a powerful Web Crawling, Web Testing, and Web Scraper application. It provides a nice DSL to crawl HTTP services,

Blackfire 485 Dec 31, 2022
PHP Scraper - a highly opinionated web-interface for PHP

PHP Scraper An opinionated & limited way to scrape the web using PHP. The main goal is to get stuff done instead of getting distracted with xPath sele

Peter Thaleikis 327 Dec 30, 2022
The most integrated web scraper package for Laravel.

Laravel Scavenger The most integrated web scraper package for Laravel. Top Features Scavenger provides the following features and more out-the-box. Ea

Reliq Arts 134 Jan 4, 2023
Library for Rapid (Web) Crawler and Scraper Development

Library for Rapid (Web) Crawler and Scraper Development This package provides kind of a framework and a lot of ready to use, so-called steps, that you

crwlr.software 60 Nov 30, 2022
PHP scraper for ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

It can scrape ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

null 1 Mar 24, 2022
Extractor (scraper, crawler, parser) of products from Allegro

Extractor (scraper, crawler, parser) of products from Allegro

Daniel Yatsura 1 May 11, 2022
A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5. If you are st

Matthijs van den Bos 1.3k Dec 28, 2022
A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tes

Symfony 2.7k Dec 31, 2022
Roach is a complete web scraping toolkit for PHP

Roach A complete web scraping toolkit for PHP About Roach is a complete web scraping toolkit for PHP. It is heavily inspired (read: a shameless clo

Roach PHP 1.1k Jan 3, 2023
Get info from any web service or page

Embed PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scrapping the html, etc). It's compatible with any web

Oscar Otero 1.9k Jan 1, 2023
This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own video player.

XVideos PornHub RedTube API This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own

null 57 Dec 16, 2022
A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit.

s3n Search-Scan-Save-Notify A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit. It is based on PH

Aamer 11 Nov 8, 2022
Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution.

Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. Available for PHP 7.3, 7.4, 8.0.

null 68 Dec 27, 2022
Property page web scraper

Property page web scraper This tool was built to experiment with extracting features for property pages on websites like booking.com and Airbnb. Thi

Vaugen Wake 2 Feb 24, 2022
Simple and fast HTML parser

DiDOM README на русском DiDOM - simple and fast HTML parser. Contents Installation Quick start Creating new document Search for elements Verify if ele

null 2.1k Dec 30, 2022
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

crawlerdetect.io About CrawlerDetect CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. Current

Mark Beech 1.7k Dec 30, 2022
:spider: The progressive PHP crawler framework! 优雅的渐进式PHP采集框架。

QueryList QueryList is a simple, elegant, extensible PHP Web Scraper (crawler/spider) ,based on phpQuery. API Documentation 中文文档 Features Have the sam

Jaeger(黄杰) 2.5k Dec 27, 2022
PHP Discord Webcrawler to log all messages from a Discord Chat.

Disco the Ripper was created to rip all messages from a specific Discord channel into JSON via CLI and to help people investigate servers that have awkward channels before they get deleted.

Daniel Reis 46 Sep 21, 2022
PHP DOM Manipulation toolkit.

phpQuery The PHP DOM Manipulation toolkit. Motivation I'm working currently with PHP, and I've missed using something like jQuery in PHP to manipulate

João Eduardo Fornazari 1 Nov 26, 2021