Goutte, a simple PHP Web Scraper

Last update: Jan 1, 2023

Related tags

Scraping Goutte

Overview

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

Requirements

Goutte depends on PHP 7.1+.

Installation

Add fabpot/goutte as a require dependency in your composer.json file:

composer require fabpot/goutte

Usage

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\HttpBrowser):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

// Go to the symfony.com website
$crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

To use your own HTTP settings, you may create and pass an HttpClient instance to Goutte. For example, to add a 60 second request timeout:

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

$client = new Client(HttpClient::create(['timeout' => 60]));

Click on links:

// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);

Extract data:

// Get the latest post in this category and display the titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});

Submit forms:

$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'fabpot', 'password' => 'xxxxxx']);
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

More Information

Read the documentation of the BrowserKit, DomCrawler, and HttpClient Symfony Components for more information about what you can do with Goutte.

Pronunciation

Goutte is pronounced goot i.e. it rhymes with boot and not out.

Technical Information

Goutte is a thin wrapper around the following Symfony Components: BrowserKit, CssSelector, DomCrawler, and HttpClient.

License

Goutte is licensed under the MIT license.

Comments

Upgrades Guzzle to Guzzle 4

Two initiatives are working towards goals that will require Goutte in Drupal 8's composer.json. One is adding Behat to Drupal 8. The other is re-architecting Drupal's WebTestBase on top of Mink. Drupal 8 currently uses Guzzle 4, and we'd rather not rely on Guzzle 3 and Guzzle 4 at the same time. This updates Goutte to use Guzzle 4. It will require a new 2.0 branch, as it bumps the PHP requirement to 5.4. I'm happy to help maintain this version.

opened by larowlan 16

H: Complete form on ASP.NET WebForms website

Hello,

I'm trying to get data from here: https://wyobiz.wy.gov/Business/FilingSearch.aspx

I'm trying to check whether a business name is available. But this website uses ASP.NET Web Forms and has these hidden fields:

            '__VIEWSTATE' => '',
            '__VIEWSTATEGENERATOR' => '9E6EC73D',
            '__EVENTVALIDATION' => '',

Is it possible to submit this form somehow and get returned data?

Thank you.

My code:

        $crawler = $client->request('GET', 'https://wyobiz.wy.gov/Business/FilingSearch.aspx');
        $form = $crawler->selectButton('Search')->form();
        $formValues = $form->getValues();
        $crawler = $client->submit($form, array(
            '__VIEWSTATE' => $formValues['__VIEWSTATE'],
            '__VIEWSTATEGENERATOR' => $formValues['__VIEWSTATEGENERATOR'],
            '__EVENTVALIDATION' => $formValues['__EVENTVALIDATION'],
            'ctl00$MainContent$myScriptManager' => 'MainContent_myScriptManager',
            'ctl00$MainContent$txtFilingName' => 'Google',
            'ctl00$MainContent$searchOpt' => 'chkSearchStartWith',
            'ctl00$MainContent$txtFilingID' => null,
        ));
        echo $crawler->html();

Error:

The current node list is empty.

opened by kironet 15
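
ASP.NET Web Forms pages generally expect the hidden state fields to be posted back with the values the server rendered into the page. The sketch below uses only PHP's built-in DOM extension rather than Goutte, and shows one way to collect those values from fetched HTML and merge in the search field. The `collectHiddenFields` helper, the sample HTML, and the field values are hypothetical.

```php
<?php
// Hypothetical helper: returns name => value for every hidden <input> in $html.
function collectHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$doc->loadHTML($html);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    return $fields;
}

// Made-up page fragment standing in for the fetched search form.
$html = <<<'HTML'
<form method="post" action="FilingSearch.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="def456">
  <input type="text" name="ctl00$MainContent$txtFilingName" value="">
</form>
HTML;

// Post back the server-issued hidden values plus the field being searched on.
$payload = array_merge(collectHiddenFields($html), [
    'ctl00$MainContent$txtFilingName' => 'Google',
]);

print_r($payload);
```

With Goutte itself, the array returned by `$form->getValues()` already includes the hidden fields read from the page, so passing only the fields that change to `submit()` may be enough.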

How to upload file with Guzzle client?

I'm trying to do BDD testing on an upload method. I'm using Behat with Mink in a symfony2 project. I'm trying to use the Guzzle client to send a form with file.

$url = $this->minkParameters["base_url"] . '/' . ltrim($url, '/'); 
$file = new \Symfony\Component\HttpFoundation\File\UploadedFile($path, "video");
$fields = json_encode($table->getColumnsHash()[0]); //array("user" => "test")
$this->client->request("POST", $url, array('content-type' => 'multipart/form-data'), array($file), array(), $fields);

I receive this error:

Call to undefined method GuzzleHttp\Stream\Stream::addFile()

Where am I going wrong? Thank you.

opened by stuzzo 15

Allow to override Curl options passed to Guzzle request

Currently curl options are hardcoded in Client.php, but sometimes more options have to be set (increase timeout or max. redirects).

Code from Client.php:

$guzzleRequest->getCurlOptions()
    ->set(CURLOPT_FOLLOWLOCATION, false)
    ->set(CURLOPT_MAXREDIRS, 0)
    ->set(CURLOPT_TIMEOUT, 30);

opened by antonbabenko 14

Add support for other types of CookieJar, eg. FileCookieJar and SessionCookieJar

When using the standard CookieJar the cookies only last the duration of the PHP script. When the PHP script terminates, the cookies will have been lost. To maintain the cookies across script executions, I updated Goutte to support FileCookieJar and SessionCookieJar.

opened by hoducha 13

PHP 8 support

Goutte is one of the few Drupal 9.1 dependencies left that does not yet support PHP 8. It would be great to start running CI with PHP 8 to see what needs to be changed.

opened by goba 10

Parameters are being ignored when sending a GET request.

In order to scrape data that changes dynamically according to the GET parameters of a website, I need to create a GET request with parameters such as:

// a=a&b=b
$params = array(
    'a' => 'a',
    'b' => 'b',
);
$crawler = $client->request('GET', 'http://www.example.com/', $params);

Then the complete URL should be formed by the URI (http://www.example.com/) plus the request parameters (a=a&b=b) so that the request is sent to the correct URL.
opened by lucianodelucchi 10
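
One way to sidestep the question of how parameters are handled for GET requests is to build the full URL before calling `request()`. The sketch below uses PHP's built-in `http_build_query()`; the `withQuery` helper is a hypothetical name, not part of Goutte.

```php
<?php
// Hypothetical helper: appends $params to $url as a query string.
function withQuery(string $url, array $params): string
{
    if ($params === []) {
        return $url;
    }
    // Use '&' if the URL already carries a query string, '?' otherwise.
    $separator = (parse_url($url, PHP_URL_QUERY) === null) ? '?' : '&';

    return $url . $separator . http_build_query($params);
}

$url = withQuery('http://www.example.com/', ['a' => 'a', 'b' => 'b']);
echo $url, "\n"; // http://www.example.com/?a=a&b=b
```

The resulting string can then be passed as the second argument to `$client->request('GET', $url)`.
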
Removing nodes

Hi,

Is there any way using Goutte to remove nodes? I'm building a parser but I need to remove header and footer elements, along with others.

I've briefly looked at the source code for Symfony Crawler and I found the reduce function but I'm unsure on how to use it.

opened by duncanmcclean 9
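
The Crawler's `reduce()` method narrows the crawler's own node list and does not modify the underlying document. Since the Crawler wraps standard DOM nodes, unwanted elements can be detached with the DOM's `removeChild()`. The sketch below shows the idea with PHP's built-in DOM extension only, so that it runs without Goutte installed; the sample HTML is made up.

```php
<?php
// Made-up document standing in for a fetched page.
$html = '<html><body><header>Menu</header><main><p>Article</p></main><footer>Legal</footer></body></html>';

$doc = new DOMDocument();
// Suppress warnings: libxml's HTML parser may complain about HTML5 tags such as <header>.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Snapshot the matches before mutating the tree.
$unwanted = iterator_to_array($xpath->query('//header | //footer'));
foreach ($unwanted as $node) {
    $node->parentNode->removeChild($node);
}

echo trim($doc->getElementsByTagName('body')->item(0)->textContent), "\n"; // Article
```

With a Crawler, the same `removeChild()` call can be applied to each node yielded by iterating over `$crawler->filter('header, footer')`.
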
Get images source from specific div.

Hi, I'm trying to get the source of images that are in a div with id="demo". Here's what I did:

$crawler->filter('#demo img')->each(function ($node) {
    $src = $node->attr('src');
    echo $src . '<br>';
});

It doesn't work. But if you try getting any other attribute, like the image height or image class, it works fine. It only fails when you pass src to attr().

opened by sam-deepweb 9

Fixes for #146

Demonstrates the bug in #146. Basically, Guzzle 4.0 doesn't support boolean header values, so for the special case of Https we convert the value to 'on'. Not sure if anything is needed at all.

opened by larowlan 9

Recent "src/" update breaks Composer install?

I'm trying to install the latest Goutte version via Composer and I keep getting an error that "Goutte\Client" can't be found. I noticed that there was a recent change that pulled the code out of the "src/" directory. When Composer makes the autoloader, it still tries to include this in the namespacing:

'Goutte' => $vendorDir . '/fabpot/goutte/src/',

If I manually remove it, I get this exception: https://gist.github.com/2704894

Seems like something's broken....not sure what, though.

opened by enygma 9

JS Submit

Is there a way to run a JS function with Goutte?

<input id="ingresar" onclick="submitForm()" name="ingresar" class="btn btn-success" style="width:390px;-webkit-border-radius:0 !important;-moz-border-radius:0 !important;border-radius:0 !important;text-align:center;font-size:18px;font-weight:bold;color:#fff;padding:20px 0;cursor:pointer;background-image:none;" value="Ingresar">

The HTML changed, and now I can't submit the form: when I fill it in, I can't send it.

opened by CAUN94 2

The current node list is empty.

I have the following problem with this code, which worked without problems for the last 5 months.

$form = $crawler->selectButton('Ingresar')->form();
$form->setValues(['rut' => 'admin', 'password' => 'Pascual4900']);
$crawler = $client->submit($form);

This is the HTML of the page:

<form id="login-form" method="POST" action="/sessions/authenticate">
	<div class="login-container">
		<div class="input" style="cursor:text;" onclick="jQuery('input', this).focus();">
			<h5 style="margin:0">Usuario</h5>
			<input name="rut" type="text" style="font-size:18px;border:0;-webkit-box-shadow:0 0 0;-moz-box-shadow:0 0 0;box-shadow:0 0 0 inset;padding:10px 0 0 0;" placeholder="Escriba su usuario...">
		</div>
		<div class="input" style="cursor:text;" onclick="jQuery('input', this).focus();">
			<h5 style="margin:0">Clave</h5>
			<input name="password" type="password" style="font-size:18px;border:0;-webkit-box-shadow:0 0 0;-moz-box-shadow:0 0 0;box-shadow:0px 0px 0px inset;padding:10px 0 0 0;" placeholder="Escriba su clave...">
		</div>
		<input id="ingresar" onclick="submitForm()" name="ingresar" class="btn btn-success" style="width:390px;-webkit-border-radius:0 !important;-moz-border-radius:0 !important;border-radius:0 !important;text-align:center;font-size:18px;font-weight:bold;color:#fff;padding:20px 0;cursor:pointer;background-image:none;" value="Ingresar">
	</div>
</form>

This returns the following error:

InvalidArgumentException The current node list is empty.

opened by CAUN94 4

Call to undefined function Symfony\Component\HttpClient\Internal\curl_multi_init()

Call to undefined function Symfony\Component\HttpClient\Internal\curl_multi_init()

Hello, I don't have any problem running goutte on my local host, but when I transfer files on a shared host, I encounter this error. Thank you for helping me soon.

Fatal error: Uncaught Error: Call to undefined function Symfony\Component\HttpClient\Internal\curl_multi_init() in /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/Internal/CurlClientState.php:43
Stack trace:
#0 /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/CurlHttpClient.php(74): Symfony\Component\HttpClient\Internal\CurlClientState->__construct(6, 50)
#1 /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/HttpClient.php(54): Symfony\Component\HttpClient\CurlHttpClient->__construct(Array, 6, 50)
#2 /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/browser-kit/HttpBrowser.php(37): Symfony\Component\HttpClient\HttpClient::create()
#3 /home2/farmasco/domains/almas-net.com/public_html/test/test/index.php(16): Symfony\Component\BrowserKit\HttpBrowser->__construct()
#4 /home2/farmasco/domains/almas-net.com/public_html/test/test/index.php(27): Start->card()
#5 /home2/farmasco/domains/almas-net.com/public_html/test/test/index.php(98): Start->titel()
#6 {main}
thrown in /home2/farmasco/domains/almas-net.com/public_html/test/test/vendor/symfony/http-client/Internal/CurlClientState.php on line 43

opened by abolfazllover 0
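
An error like this suggests that the cURL functions are not callable on the host, for instance because the extension is missing or because individual functions are blocked. The diagnostic sketch below uses only built-in PHP functions and reports what the current environment provides; it is one possible starting point, not a confirmed diagnosis of this report.

```php
<?php
// Report whether the cURL functions that a cURL-based HTTP client needs are callable.
$curlLoaded = extension_loaded('curl');
$multiAvailable = function_exists('curl_multi_init');
// Hosts sometimes block individual functions through the disable_functions ini setting.
$disabled = array_filter(array_map('trim', explode(',', (string) ini_get('disable_functions'))));

printf("curl extension loaded: %s\n", $curlLoaded ? 'yes' : 'no');
printf("curl_multi_init callable: %s\n", $multiAvailable ? 'yes' : 'no');
printf("disabled functions: %s\n", $disabled ? implode(', ', $disabled) : '(none)');
```

If `curl_multi_init` is not callable, one option is to give Goutte an HTTP client that does not rely on cURL, for example `new Client(new \Symfony\Component\HttpClient\NativeHttpClient())`, assuming the installed symfony/http-client version provides that class.
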
Error parsing get url request with year in url

It seems when the numbers 2020 or 24 are in the url passed to request, it will not return any data, as I believe the url encoding is breaking the url.

After googling for hours, I have tried calling:

$crawler->getQuery()->setEncodingType(false);
$crawler->setEncodingType(false);

with no luck.

An example of a non working URL is: https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson+ruby+wave&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1

and an example of a working URL is: https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1

I cannot figure out for the life of me how to fix this. I have tried URL-encoding, decoding, and a million different versions of the URL, and the only difference between the two URLs is the keyword, the _nkw param.

Thanks in advance

opened by djpisbionic 3

syntax error, unexpected token ")"

This is the error I get when creating a new instance of the Client. It works locally, but not in production. I use PHP 8.1 locally and 8.0.19 in production, so it should not be a syntax problem. When I checked the production Laravel logs, they said:

[2022-06-22 14:22:19] local.ERROR: syntax error, unexpected token ")" {"userId":2,"exception":"[object] (ParseError(code: 0): syntax error, unexpected token ")" at /home/749128.cloudwaysapps.com/maxxhthfnk/public_html/vendor/symfony/http-client/CurlHttpClient.php:68)

opened by BenOussama180 7

Releases(v3.1.0)

v3.1.0 (Jun 25, 2015): goutte-v3.1.0.phar (312.62 KB)
v3.0.0 (Jun 25, 2015): goutte-v3.0.0.phar (312.20 KB)
v2.0.4 (Jun 25, 2015): goutte-v2.0.4.phar (385.88 KB)
v2.0.3 (Jun 25, 2015): goutte-v2.0.3.phar (385.64 KB)
v1.0.7 (Jun 25, 2015): goutte-v1.0.7.phar (719.35 KB)
v2.0.2 (Jun 25, 2015): goutte-v2.0.2.phar (586.49 KB)

Owner

GitHub

Blackfire Player is a powerful Web Crawling, Web Testing, and Web Scraper application. It provides a nice DSL to crawl HTTP services, assert responses, and extract data from HTML/XML/JSON responses.

Blackfire Player Blackfire Player is a powerful Web Crawling, Web Testing, and Web Scraper application. It provides a nice DSL to crawl HTTP services,

485 Dec 31, 2022

PHP Scraper - an highly opinionated web-interface for PHP

PHP Scraper An opinionated & limited way to scrape the web using PHP. The main goal is to get stuff done instead of getting distracted with xPath sele

327 Dec 30, 2022

The most integrated web scraper package for Laravel.

Laravel Scavenger The most integrated web scraper package for Laravel. Top Features Scavenger provides the following features and more out-the-box. Ea

134 Jan 4, 2023

Library for Rapid (Web) Crawler and Scraper Development

Library for Rapid (Web) Crawler and Scraper Development This package provides kind of a framework and a lot of ready to use, so-called steps, that you

60 Nov 30, 2022

PHP scraper for ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

It can scrape ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

1 Mar 24, 2022

Extractor (scraper, crawler, parser) of products from Allegro

1 May 11, 2022

A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5. If you are st

1.3k Dec 28, 2022

A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tes

2.7k Dec 31, 2022

Roach is a complete web scraping toolkit for PHP

Roach A complete web scraping toolkit for PHP About Roach is a complete web scraping toolkit for PHP. It is heavily inspired (read: a shameless clo

1.1k Jan 3, 2023

Get info from any web service or page

Embed PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scrapping the html, etc). It's compatible with any web

1.9k Jan 1, 2023

This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own video player.

XVideos PornHub RedTube API This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own

57 Dec 16, 2022

A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit.

s3n Search-Scan-Save-Notify A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit. It is based on PH

11 Nov 8, 2022

Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution.

Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. Available for PHP 7.3, 7.4, 8.0.

68 Dec 27, 2022

Property page web scrapper

Simple and fast HTML parser

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

:spider: The progressive PHP crawler framework! 优雅的渐进式PHP采集框架。

PHP Discord Webcrawler to log all messages from a Discord Chat.

PHP DOM Manipulation toolkit.