I am trying to scrape a page that has paginated links at the bottom. In the Roach docs I found that you can override the initialRequests() method to point the spider at other URLs to scrape.
This is working as expected:
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DomCrawler\Link;

class ExampleSpider extends BasicSpider
{
    public function parseOverview(Response $response): \Generator
    {
        $pageUrls = array_map(
            function (Link $link) {
                return $link->getUri();
            },
            $response
                ->filter('.pages-items li a')
                ->links(),
        );

        foreach ($pageUrls as $pageUrl) {
            // Since we’re not specifying the second parameter,
            // all paginated pages will get handled by the
            // spider’s `parse` method.
            yield $this->request('GET', $pageUrl);
        }
    }

    public function parse(Response $response): \Generator
    {
        $items = $response->filter('.product-item')->each(function (Crawler $product, $i) {
            $array = [];

            $productName = $product->filter('.product-item-link');
            $array['product_name'] = $productName->count() ? $productName->text() : null;

            $link = $product->filter('.product-item-link');
            $array['link'] = $link->count() ? $link->link()->getUri() : null;

            $imageUrl = $product->filter('.product-image-photo');
            $array['image_url'] = $imageUrl->count() ? $imageUrl->image()->getUri() : null;

            $salePrice = $product->filter('.price-final_price .price');
            $array['sale_price'] = $salePrice->count() ? $salePrice->text() : null;

            $regularPrice = $product->filter('.old-price span.price');
            $array['regular_price'] = $regularPrice->count() ? $regularPrice->text() : null;

            $attributeSize = $product->filter('.attribute.size');
            $array['attribute_size'] = $attributeSize->count() ? $attributeSize->text() : null;

            $savings = $product->filter('.sticker-wrapper');
            $array['savings'] = $savings->count() ? $savings->text() : null;

            return $array;
        });

        foreach ($items as $item) {
            if (!$item) {
                continue;
            }

            yield $this->item($item);
        }
    }

    /** @return Request[] */
    protected function initialRequests(): array
    {
        return [
            new Request(
                'GET',
                'https://www.example.com/5-pages', // Has 5 pages
                [$this, 'parseOverview']
            ),
            new Request(
                'GET',
                'https://www.example.com/1-page', // Has 1 page (no pagination)
                [$this, 'parseOverview']
            ),
        ];
    }
}
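For reference, I start the spider with Roach's standard runner, nothing custom on my end (collectSpider() would work just as well if I wanted the items back as an array):

use RoachPHP\Roach;

// Runs the spider and sends scraped items through the item pipeline.
Roach::startSpider(ExampleSpider::class);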
However, this only scrapes the pages that are gathered in the parseOverview() method. I would also like to use the $response object from the first page (https://www.example.com/5-pages) itself, and not only:
- https://www.example.com/5-pages?page=2
- https://www.example.com/5-pages?page=3
- https://www.example.com/5-pages?page=4
- https://www.example.com/5-pages?page=5
So I figured, since we already have the first page in the Response, I would try running $this->parse() on the $response object inside the parseOverview() method:
public function parseOverview(Response $response): \Generator
{
    // Here I try yielding the parse() method using the
    // response object from the first page.
    yield $this->parse($response);

    $pageUrls = array_map(
        function (Link $link) {
            return $link->getUri();
        },
        $response
            ->filter('.pages-items li a')
            ->links(),
    );

    foreach ($pageUrls as $pageUrl) {
        // Since we’re not specifying the second parameter,
        // all paginated pages will get handled by the
        // spider’s `parse` method.
        yield $this->request('GET', $pageUrl);
    }
}
However, when running the spider I get the following error: Call to undefined method Generator::value().
I also tried adding the first-page URL to the $pageUrls array, but then I get a DuplicatedRequest. That part actually makes sense, because I do not want to fire the request twice when we already have a working Response object.
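If I understand generators correctly, parse() itself returns a Generator, so yield $this->parse($response) hands Roach the Generator object instead of the results inside it, which would explain the undefined-method error. The only alternative I can think of is delegating with yield from, so that whatever parse() yields is forwarded one by one (untested sketch):

public function parseOverview(Response $response): \Generator
{
    // Delegate to parse() for the page we already have, forwarding
    // each yielded item/request instead of the Generator itself.
    yield from $this->parse($response);

    // ... then yield the pagination requests as before ...
}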
What do you recommend changing so that I also get the data from the first page?