PHP Scraper - a highly opinionated web-interface for PHP

Peter Thaleikis

Last update: Dec 30, 2022

Related tags

Scraping php scraper php-library scraping web-scraper web-scraping scraping-websites php-crawler php-scraper php-spider

Overview

PHP Scraper

An opinionated & limited way to scrape the web using PHP. The main goal is to get stuff done instead of getting distracted with XPath selectors, preparing data structures, etc. Instead, you can just "go to a website" and get an array with all details relevant to your scraping project.

Under the hood, it uses Goutte and a few other packages. See composer.json.

Examples

Here are a few examples of how the library works. More examples are on the project website.

Get the Title of a Website

All scraping functionality can be accessed either as a function call or a property call. Using title scraping as an example, this would look like this:

$web = new \spekulatius\phpscraper();

$web->go('https://google.com');

// Returns "Google"
echo $web->title;

// Also returns "Google"
echo $web->title();
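
The property form and the method form can return the same value because PHP lets a class forward property reads to methods through the magic `__get` method. The following is a minimal sketch of that idea, not PHPScraper's actual implementation; the class `TinyScraper` and its hard-coded title are hypothetical stand-ins.

```php
<?php

// Hypothetical stand-in class; not part of PHPScraper.
class TinyScraper
{
    // A regular method that returns a value.
    public function title(): string
    {
        return 'Example Title';
    }

    // Called by PHP when an undefined or inaccessible property is read.
    // Forwards the read to a method of the same name, if one exists.
    public function __get(string $name)
    {
        if (method_exists($this, $name)) {
            return $this->$name();
        }

        throw new \LogicException("Unknown property: {$name}");
    }
}

$web = new TinyScraper();

echo $web->title(), PHP_EOL; // method call
echo $web->title, PHP_EOL;   // property read, forwarded by __get()
```

Both lines print the same string, since the property read ends up calling the method.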

Scrape the Images from a Website

Scraping the images including the attributes of the img-tags:

$web = new \spekulatius\phpscraper();

/**
 * Navigate to the test page.
 *
 * This page contains twice the image "cat.jpg".
 * Once with a relative path and once with an absolute path.
 */
$web->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html');

var_dump($web->imagesWithDetails);
/**
 * Contains:
 *
 * [
 *     'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *     'alt' => 'absolute path',
 *     'width' => null,
 *     'height' => null,
 * ],
 * [
 *     'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *     'alt' => 'relative path',
 *     'width' => null,
 *     'height' => null,
 * ]
 */

See the full documentation for more information and examples.

Comments

Bump jeremykendall/php-domain-parser to version 6

I'm trying to integrate this into an app that is already locked to psr/log:^3.

jeremykendall/php-domain-parser version 5 is locked to psr/log:^1

There are 2 tests that fail when bumping to the new version. I can fix and submit a PR.

opened by tacman 10

Versioning and Changelog

Hi

It's a great project. We use the library and it's great :+1: Thanks for all the work!

We are not sure what the versioning concept of this library looks like. We also could not find information about what has changed (a changelog); there are only tags: https://github.com/spekulatius/PHPScraper/tags. So it would be nice to read something about what the versioning strategy is and when breaking changes are expected.

Of course semver would be nice, with a 1.0.0 release, a CHANGELOG.md and an UPGRADE.md. My personal example would look like this https://github.com/nadar/quill-delta-parser/blob/master/CHANGELOG.md and this https://github.com/nadar/quill-delta-parser/blob/master/UPGRADE.md

Thanks and keep up the great work!

opened by nadar 8

[Proposal] Add HTTP proxy support

I'm working on a project that needs to change proxies constantly. Looking at the Goutte client, the underlying implementation already supports this.

Research

HttpBrowser, which Goutte is based on, supports a custom client implementing the HttpClientInterface (source code)

...
class HttpBrowser extends AbstractBrowser
{
    private $client;

    public function __construct(HttpClientInterface $client = null, History $history = null, CookieJar $cookieJar = null)
    ...
        $this->client = $client ?? HttpClient::create();
    ...

The HttpClient::create() supports proxy using the $defaultOptions parameter, that gets passed to the selected HttpClient (source code)

public static function create(array $defaultOptions = [], int $maxHostConnections = 6, int $maxPendingPushes = 50): HttpClientInterface

Implementation details

The idea is to expose this functionality through a setProxy function in the core class. The library will continue to dynamically select the httpClient accordingly.

I just made a POC in my fork with all necessary code to make this work (a6589da).

public function setProxy(string $proxy)
{
    $httpClient = HttpClient::create([
        'proxy' => $proxy,
    ]);

    $this->client = new Client($httpClient);

    return $this;
}

How to use

$web = new phpscraper;

$web->__call('setProxy', [
    'http://user:[email protected]:3128',
]);

If this feature gets approved, I will open a PR with it. If anything needs to be changed, let me know.

opened by nathabonfim59 6

Getting Deprecated: strpos() error in PHP 8.1.8

I am using this example https://phpscraper.de/examples/extract-keywords.html with PHP 8.1.8

Deprecated: strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated in /Users/khanakia/D1/www/php/scrap_php/vendor/spekulatius/phpscraper/src/phpscraper.php on line 858
PHP Deprecated:  strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated in /Users/khanakia/D1/www/php/scrap_php/vendor/spekulatius/phpscraper/src/phpscraper.php on line 859
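
Since PHP 8.1, passing null to a non-nullable parameter of a built-in function such as strpos() triggers this deprecation notice. One common way to avoid it is to coalesce null to an empty string before the call. The snippet below is a minimal sketch of that guard; `containsText` is a hypothetical helper name and this is not necessarily how the library fixed the issue.

```php
<?php

// Hypothetical helper; the name and signature are illustrative only.
function containsText(?string $haystack, string $needle): bool
{
    // Coalesce null to an empty string before calling strpos(),
    // so PHP 8.1+ does not emit the "Passing null" deprecation.
    return strpos($haystack ?? '', $needle) !== false;
}

var_dump(containsText(null, 'php'));
var_dump(containsText('php scraper', 'php'));
```

The first call returns false without a notice, because strpos() receives an empty string rather than null.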

opened by khanakia 5

Make public function to access Client (Goutte)

I wanted to know the response code of the url but could not get access.

Looking at Goutte under the hood, I think I could get it if I could do something like:

$web->getClient()->getInternalResponse()->getStatusCode()

So, I would like a function to get access to the $client, like:

public function getClient()
{
    return $this->client;
}

opened by amurrell 4

SSL connect error

Hi

I'm getting an

SSL connect error for "https://bbc.in/3cNMnkw".

when connecting to some websites. I'm writing a TweetDeck-style system that scrapes the site a tweet links to and picks up its details so that a nicer link can be generated. Any ideas on how to solve this? Most work fine; it's just a few that fail.

Also, I'm sometimes getting "This browser is no longer supported" coming through as the description from the page. I notice you are passing 'Mozilla/5.0 (compatible; PHP Scraper/0.x; +https://phpscraper.de)' as the user agent. Is it worth altering the user agent? From what I can see, I don't think I can pass it in as a parameter anywhere.

Thanks.

opened by tonybyng 4

TypeError

When I run the sample code:

$web = new \Spekulatius\PHPScraper\PHPScraper();
$web->go('https://www.google.com/');
echo $web->title;

It returns:

Spekulatius\PHPScraper\Core::setHttpClient(): Argument #1 ($httpClient) must be of type Symfony\Component\HttpClient\CurlHttpClient, Symfony\Component\HttpClient\NativeHttpClient given, called in C:\www\web-crawer\vendor\spekulatius\phpscraper\src\PHPScraper.php on line 108
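
The message says that a parameter is typed as one concrete client class while an instance of a different client class was passed. A general remedy for this kind of error is to type-hint against an interface that both implementations share. The following is a minimal sketch of that pattern; `ClientInterface`, `CurlClient`, `NativeClient` and `Holder` are hypothetical stand-ins rather than the Symfony or PHPScraper classes, and this is not necessarily how the library resolved the issue.

```php
<?php

// Hypothetical stand-ins for two interchangeable HTTP client implementations.
interface ClientInterface
{
    public function name(): string;
}

class CurlClient implements ClientInterface
{
    public function name(): string
    {
        return 'curl';
    }
}

class NativeClient implements ClientInterface
{
    public function name(): string
    {
        return 'native';
    }
}

class Holder
{
    private ClientInterface $client;

    // Typing the parameter as the interface accepts either implementation.
    // Typing it as CurlClient would throw a TypeError for NativeClient.
    public function setHttpClient(ClientInterface $client): self
    {
        $this->client = $client;

        return $this;
    }

    public function clientName(): string
    {
        return $this->client->name();
    }
}

$holder = new Holder();
echo $holder->setHttpClient(new NativeClient())->clientName(), PHP_EOL;
echo $holder->setHttpClient(new CurlClient())->clientName(), PHP_EOL;
```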

Environments

PHP: 8.1.13
PHPScraper: 1.0.1

opened by alanx15a2 3

deprecate magic properties / methods

In my branch I've removed the magic __get and __call methods, and moved what was core into the phpscraper, so now there is only one class.

After a while I got tired of find/replace, so I created a rector rule to change the properties to methods.

Is there a demo repository that I can use to test my branch? Tests are passing, except for the ones related to internal/external links, which I'll address in another issue.

opened by tacman 3

charset in headers method

Is this method used? Where is charset() defined?

    /**
     * Get the header collected as an array
     *
     * @return array
     */
    public function headers()
    {
        return [
            'charset' => $this->charset(),
            'contentType' => $this->contentType(),
            'viewport' => $this->viewport(),
            'canonical' => $this->canonical(),
            'csrfToken' => $this->csrfToken(),
        ];
    }

opened by tacman 3

SSL certificate problem

Symfony\Component\HttpClient\Exception\TransportException: SSL certificate problem: unable to get local issuer certificate for ***

I'm getting an error when trying to get data from a link that doesn't have SSL.

opened by datlechin 3

composer require fails on mac due to strange characters in filenames in the /tests directory

Problem

Running composer require spekulatius/phpscraper fails on macOS 10.15.5 with the following:

[RuntimeException]
 Failed to extract spekulatius/phpscraper: (50) '/usr/bin/unzip' -qq '/xxx/xxx/vendor/composer/tmp-123dcc14bbc3272649fdd489b0ecc9a3' -d '/xxx/xxx/vendor/composer/74030a62'

error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/tests/resources/assets/katze-+?-++-+?.jpg
        Illegal byte sequence
error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/tests/resources/assets/???.jpg
        Illegal byte sequence
error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/websites/test-pages/assets/katze-+?-++-+?.jpg
        Illegal byte sequence
error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/websites/test-pages/assets/???.jpg
        Illegal byte sequence

It seems like the unzip command within composer fails due to the strange characters.

opened by BrettGregson 3

What location is PHPScraper based on?

Hi @spekulatius,

Some users of my https://github.com/datlechin/flarum-link-preview, which uses PHPScraper, see the scraped content in another language. Is the website content based on where the website is hosted?

https://discuss.flarum.org/d/30011-link-preview/178

opened by datlechin 1

Idea: Directly exposing received headers

Thanks, so it actually depends on the header. So far headers haven't been processed much. Exposing them would be beneficial in general. What do you think?

Originally posted by @spekulatius in https://github.com/spekulatius/PHPScraper/pull/164#discussion_r1045676088

opened by spekulatius 1

Idea: Implement low-level util to access the web.

E.g.

// GET request
$web->get('https://...');

// POST request
$response = $web->post('https://...', [
    'param' => 'first param',
]);

// ...

This could be done either directly in PHPScraper or built upon another specialized lib such as Symfony HTTP. Exposing the functionality of the existing dependency sounds like a reasonable way to go, if the idea is of interest.

opened by spekulatius 1

Parsing structured data (microdata)

#16 proposes adding support for JSONLD.

JSONLD isn't the only option: structured data can also be provided in the microdata notation. The good news is that there is a project which parses microdata and converts it to the same data structure as JSONLD: https://github.com/yusufkandemir/microdata-parser

So it should be possible to use both and treat it just like an additional JSONLD block!

A first test:

$jsonlddata = \YusufKandemir\MicrodataParser\Microdata::fromHTML($web->client->getResponse()->getContent(), $web->currentUrl())->toArray();

Internally this project uses its own DOM document class derived from DOMDocument. It has a function to import a DOMDocument, but Symfony's response class doesn't allow access to the DOMDocument.

I did a small test, but couldn't figure out how to pass the DOMDocument without reparsing. My attempt, which didn't work:

$dom = new \DOMDocument('1.0', 'UTF-8');
$dom->importNode($web->filterFirst('//*')->getNode(0), true);
$jsonlddata = \YusufKandemir\MicrodataParser\Microdata::fromDOMDocument($dom);

What makes sense? Adding separate PHPScraper functions for JSONLD and microdata? Or mixing both automatically? (my opinion: mixing)

What should support for microdata look like? Adding the other project to PHPScraper? Extending the existing classes or porting the whole functionality to PHPScraper?

opened by eposjk 2

get http status code

How do I check if the scraped page is an error page? It would be helpful to have the http status for that, something like:

$web->go('https://httpstat.us/404');
echo $web->status; // prints 404

Just before posting, I found the solution myself:

$web->client->getResponse()->getStatusCode()

It would be great to have this added to the documentation. Or maybe add $web->status as shortcut?
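
Once the status code is available (for example through the getStatusCode() call above), deciding whether a page is an error page comes down to a range check. A minimal sketch, where `isErrorStatus` is a hypothetical helper name that is not part of PHPScraper:

```php
<?php

// Hypothetical helper; not part of PHPScraper.
function isErrorStatus(int $statusCode): bool
{
    // 4xx codes are client errors and 5xx codes are server errors.
    return $statusCode >= 400 && $statusCode <= 599;
}

var_dump(isErrorStatus(200));
var_dump(isErrorStatus(404));
var_dump(isErrorStatus(503));
```

Whether redirects (3xx) should also count as "not a normal page" depends on the scraping project, so the sketch leaves them out.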
opened by eposjk 7

Releases(1.0.0)

1.0.0(Nov 23, 2022)
It took a while for PHPScraper to reach v1, but now it's here! This first stable release covers:

More rounded parsing and consistent naming

Improved documentation website in multiple languages

Better Testing and Test coverage

Parsing of feeds: sitemap.xml, RSS feeds and static search indexes

Parsing of plaintext files such as CSV, XML and JSON

You can find all major commits in the changelog. Please check the upgrade instructions to get to v1.

Special thanks to all contributors: @walirazzaq, @nadar, @nathabonfim59, @datlechin, @tacman, @fumiya5863 and @vitormattos! :tada:

Owner

Peter Thaleikis

Software engineer focused on solutions using open source and simply filling in the gaps to fulfill the requirements.

GitHub https://phpscraper.de

Goutte, a simple PHP Web Scraper

Goutte, a simple PHP Web Scraper Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extrac

9.1k Jan 1, 2023

Goutte, a simple PHP Web Scraper

Goutte, a simple PHP Web Scraper Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extrac

9.1k Jan 4, 2023

The most integrated web scraper package for Laravel.

Laravel Scavenger The most integrated web scraper package for Laravel. Top Features Scavenger provides the following features and more out-the-box. Ea

134 Jan 4, 2023

Library for Rapid (Web) Crawler and Scraper Development

Library for Rapid (Web) Crawler and Scraper Development This package provides kind of a framework and a lot of ready to use, so-called steps, that you

60 Nov 30, 2022

PHP scraper for ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

It can scrape ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

1 Mar 24, 2022

Extractor (scraper, crawler, parser) of products from Allegro

1 May 11, 2022

On-Page SEO Crawler Tool with Interface

upzon I developed this project with PHP & MYSQL and python. If you have basic python and php knowledge, it is quite simple to use this program. I'm us

5 Oct 27, 2021

A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5. If you are st

1.3k Dec 28, 2022

A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tes

2.7k Dec 31, 2022

Roach is a complete web scraping toolkit for PHP

Roach A complete web scraping toolkit for PHP About Roach is a complete web scraping toolkit for PHP. It is heavily inspired (read: a shameless clo

1.1k Jan 3, 2023

Get info from any web service or page

Embed PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scrapping the html, etc). It's compatible with any web

1.9k Jan 1, 2023

This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own video player.

XVideos PornHub RedTube API This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own

57 Dec 16, 2022

A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit.

s3n Search-Scan-Save-Notify A program to scrape online web-content (APIs, RSS Feeds, or Websites) and notify if search term was hit. It is based on PH

11 Nov 8, 2022

Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution.

Crawlzone is a fast asynchronous internet crawling framework aiming to provide open source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. Available for PHP 7.3, 7.4, 8.0.

68 Dec 27, 2022


Property page web scrapper

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

:spider: The progressive PHP crawler framework! 优雅的渐进式PHP采集框架。

PHP Discord Webcrawler to log all messages from a Discord Chat.

PHP DOM Manipulation toolkit.