Roach is a complete web scraping toolkit for PHP

Overview

🐴 Roach

A complete web scraping toolkit for PHP

About

Roach is a complete web scraping toolkit for PHP. It is heavily inspired by (read: a shameless clone of) the popular Scrapy package for Python.

Installation

Install the package via Composer:

composer require roach-php/core

Documentation

The full documentation can be found here.
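The link above points off-site, so as a quick taste, here is a minimal spider sketched from the API that recurs in the issues and release notes below (class name, URL, and selector are invented for illustration):

```php
<?php

use RoachPHP\Http\Response;
use RoachPHP\Roach;
use RoachPHP\Spider\BasicSpider;

class QuoteSpider extends BasicSpider
{
    public array $startUrls = ['https://example.com/quotes'];

    public function parse(Response $response): \Generator
    {
        // Filter the page with CSS selectors and yield one item per match.
        $quotes = $response->filter('.quote')->each(
            fn ($node) => $node->text(),
        );

        foreach ($quotes as $quote) {
            yield $this->item(['quote' => $quote]);
        }
    }
}

// Run the spider and collect everything it scraped.
$items = Roach::collectSpider(QuoteSpider::class);
```

`Roach::startSpider(QuoteSpider::class)` runs the same spider without collecting the items; item processors and extensions handle them instead.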

Credits

License

MIT

Comments
  • League Container and Symfony container conflicts

    League Container and Symfony container conflicts

    I am using roach-php in a Symfony 6 project. I am trying to inject the EntityManagerInterface into my ItemProcessorInterface class to save the object in the DB. But doing that seems to create some kind of conflict between the containers:

    Alias (Doctrine\ORM\EntityManagerInterface) is not being managed by the container or delegates
    in (League) Container.php, 188
    

    This also happens if I inject dependencies in the Spider class. Any workaround for this? Maybe telling Symfony to ignore these classes and using the League container instead? No idea how to do that, since the League container is instantiated in vendor/roach-php.

    opened by alejgarciarodriguez 7
  • Scraping multiple elements within a page and relative queries is not available

    Scraping multiple elements within a page and relative queries is not available

    Hi! Right now the package allows scraping a list of single pages.

    But what if we need to repeatedly select a list of items within a page and do an additional filter on each of them? For example:

    public function parse(Response $response): Generator
    {
      $data = $response->filter('.preview');
      
      foreach ($data as $item) {  
          $cover = $item->filter('img')->attr('src');
          $number = $item->filter('span:nth-of-type(1)')->text();
          $published = $item->filter('span.published')->text();
          
          yield $this->item(compact('cover', 'number', 'published'));
      } 
    }
    

    Here I want to select a list of items on every page by the general .preview filter, and then make an additional filter for each of $item to get the data I want.

    At the moment, as I understand it, the package does not support such functionality, although in scrapy it is possible and is a powerful tool for working with page data.
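    A possible workaround, assuming the Response filter API wraps Symfony's DomCrawler (whose each() runs a callback against every matched node, so inner filters are relative to that node rather than the whole document):

```php
use Symfony\Component\DomCrawler\Crawler;
use RoachPHP\Http\Response;

public function parse(Response $response): \Generator
{
    // each() maps over every '.preview' node; the inner filters only
    // see that node's subtree.
    $items = $response->filter('.preview')->each(function (Crawler $item) {
        return [
            'cover' => $item->filter('img')->attr('src'),
            'number' => $item->filter('span:nth-of-type(1)')->text(),
            'published' => $item->filter('span.published')->text(),
        ];
    });

    foreach ($items as $item) {
        yield $this->item($item);
    }
}
```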

    opened by wett1988 7
  • interactive shell vs real code

    interactive shell vs real code

    I was trying roach-php in a Laravel project. When I try a filter in the interactive shell, I get the data I want. But if I use the same filter in my spider file, I don't get the data.

    Looks like the interactive shell gets the remote data differently, because if I dd() the returned array in Laravel and look at the remote HTML data, there is less information available than in the interactive shell.

    Can I use config values to get the same results in the real code as in the interactive shell?

    opened by xciser77 5
  • Testing how a spider scrapes a given HTML file

    Testing how a spider scrapes a given HTML file

    Hello there,

    Just a question: is there a simple way to feature-test a spider by giving it some HTML and inspecting what it returns, e.g. making assertions against what would be returned by Roach::collectSpider?
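    One possible shape for such a test, assuming collectSpider accepts the same Overrides shown in other issues on this page and that the fixture HTML is served somewhere the spider can fetch it (the URL, spider class, and item shape are all invented):

```php
use RoachPHP\Roach;
use RoachPHP\Spider\Configuration\Overrides;

public function test_spider_parses_fixture(): void
{
    // Point the spider at a locally served copy of the fixture HTML
    // instead of the live site.
    $items = Roach::collectSpider(
        BlogSpider::class,
        overrides: new Overrides(startUrls: ['http://localhost:8000/fixtures/post.html']),
    );

    $this->assertCount(1, $items);
}
```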

    Many thanks

    Seb

    opened by seb-jones 5
  • Scraping and crawling with Laravel Dusk

    Scraping and crawling with Laravel Dusk

    Hello!

    Awesome idea crafting this project, I'm really looking forward to using it when scraping data.

    Some websites rely on Javascript heavily and require interactivity to reach certain pieces of information. Is there any way of using something like Laravel Dusk's interactivity features with Roach?
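    Not Dusk, but for the JavaScript-rendering part specifically: later issues and the 0.2.0 release notes below mention an ExecuteJavascriptMiddleware backed by Spatie's Browsershot. A sketch of enabling it on a spider:

```php
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    // Render each response through headless Chrome before parsing,
    // so JavaScript-generated markup is visible to the filters.
    public array $downloaderMiddleware = [
        ExecuteJavascriptMiddleware::class,
    ];
}
```

    This renders pages rather than interacting with them, so Dusk-style clicking and form-filling would still need something else.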

    opened by clarkewing 5
  • Xpath not working

    Xpath not working

    Hello,

    I was trying out your library but cannot get a basic XPath query working on Google (as an example).

        public function parse(Response $response): Generator
        {
            $html = $response->filterXpath('//div[contains(@id, "center_col")]')->each(function (Crawler $node) {
                return $node->text();
            });
            yield $this->item([
                'html' => $html,
            ]);
        }
    

    It returns an empty array, which is strange because the element does exist in the page.

    Any idea why it is not working, please?

    Thank you.

    opened by Benoit1980 4
  • [Feature Request] Composing Spiders

    [Feature Request] Composing Spiders

    Hey,

    First of all, thanks for this great package!

    In the docs, there is an example of how to parse a set of articles from an overview page:

    public function parse(Response $response): Generator
    {
        $links = $response->filter('header + div a')->links();
    
        foreach ($links as $link) {
            yield $this->request('GET', $link->getUri(), 'parseBlogPage');
        }
    }
    
    public function parseBlogPage(Response $response): Generator
    {
        $title = $response->filter('h1')->text();
        $publishDate = $response
            ->filter('time')
            ->attr('datetime');
        $excerpt = $response
            ->filter('.blog-content div > p:first-of-type')
            ->text();
    
        yield $this->item(compact('title', 'publishDate', 'excerpt'));
    }
    

    In a use case of mine, I would like to do something similar, but split the parsing of the overview page and of a specific blog page into two separate Spiders. In the Spider that finds the different articles, I would then like to delegate the parsing of a specific blog page to another Spider. For example, I'd like to do something like this:

    class BlogOverviewSpider extends BasicSpider
    {
        public function parse(Response $response): Generator
        {
            $pages = $response
                ->filter('main > div:first-child a')
                ->links();
    
            foreach ($pages as $page) {
                // Here the spider() method would use the parse result of a specific Spider class
            yield $this->spider(BlogPageSpider::class, overrides: new Overrides(startUrls: [$page->getUri()]));
            }
        }
    }
    
    class BlogPageSpider extends BasicSpider
    {
        public function parse(Response $response): Generator
        {
            yield $this->item([/* */]);
        }
    }
    

    Here's a simplified example that's a bit more realistic and that demonstrates its usefulness.

    Scraping metadata from different Git repositories
    class RepositoryOverviewSpider extends BasicSpider
    {
        public function parse(Response $response): Generator
        {
            $repositories = $response
                ->filter('main > div:first-child a')
                ->links();
    
            foreach ($repositories as $repository) {
            if ($this->isGithubRepository($repository->getUri())) {
                yield $this->spider(GithubRepositorySpider::class, overrides: new Overrides(startUrls: [$repository->getUri()]));
            } elseif ($this->isGitlabRepository($repository->getUri())) {
                yield $this->spider(GitlabRepositorySpider::class, overrides: new Overrides(startUrls: [$repository->getUri()]));
            } else {
                yield $this->spider(GenericRepositorySpider::class, overrides: new Overrides(startUrls: [$repository->getUri()]));
            }
            }
        }
    }
    

    Here, each repository Spider could define its own authentication scheme and its own specific parsing method.

    I could not find any way of using the result of another Spider in the docs. Most of the logic of starting a Spider seems to be locked behind a private API in the RoachPHP\Roach class.

    Maybe I've missed something and you can already compose Spiders in some way. If not, I think it could be a great feature.

    If you also see the merit in this, I could try taking a stab at implementing this myself.

    enhancement 
    opened by Daanra 4
  • ExecuteJavascriptMiddleware not waiting long enough - option request for wait until network idle

    ExecuteJavascriptMiddleware not waiting long enough - option request for wait until network idle

    I'm finding the execute-JavaScript middleware extremely useful, but sometimes it doesn't wait long enough and activity is still happening in the DOM.

    I ended up having to copy the middleware and chain ->waitUntilNetworkIdle() onto the Browsershot instance in the constructor in order to get the markup from the DOM; otherwise lots of critical data doesn't get rendered.

    I might have missed this as an option somewhere. There is likely a better way to expose this as an option rather than an entirely separate middleware file.

    <?php
    
    declare(strict_types=1);
    
    /**
     * Copyright (c) 2022 Kai Sassnowski
     *
     * For the full copyright and license information, please view
     * the LICENSE file that was distributed with this source code.
     *
     * @see https://github.com/roach-php/roach
     */
    
    namespace App\Roach\Middleware;
    
    use Psr\Log\LoggerInterface;
    use RoachPHP\Http\Response;
    use RoachPHP\Downloader\Middleware\ResponseMiddlewareInterface;
    use RoachPHP\Support\Configurable;
    use Spatie\Browsershot\Browsershot;
    use Throwable;
    
    final class ExecuteJavascriptNetworkIdleMiddleware implements ResponseMiddlewareInterface
    {
        use Configurable;
    
        /**
         * @var callable(string): Browsershot
         */
        private $getBrowsershot;
    
        /**
         * @param null|callable(string): Browsershot $getBrowsershot
         */
        public function __construct(
            private LoggerInterface $logger,
            ?callable $getBrowsershot = null,
        ) {
            $this->getBrowsershot = $getBrowsershot ?: static fn (string $uri): Browsershot => Browsershot::url($uri)->waitUntilNetworkIdle();
        }
    
        public function handleResponse(Response $response): Response
        {
            $browsershot = $this->configureBrowsershot(
                $response->getRequest()->getUri(),
            );
    
            try {
                $body = $browsershot->bodyHtml();
            } catch (Throwable $e) {
                $this->logger->info('[ExecuteJavascriptMiddleware] Error while executing javascript', [
                    'message' => $e->getMessage(),
                    'trace' => $e->getTraceAsString(),
                ]);
    
                return $response->drop('Error while executing javascript');
            }
    
            return $response->withBody($body);
        }
    
        /**
         * @psalm-suppress MixedArgument, MixedAssignment
         */
        private function configureBrowsershot(string $uri): Browsershot
        {
            $browsershot = ($this->getBrowsershot)($uri);
    
            if (!empty($this->option('chromiumArguments'))) {
                $browsershot->addChromiumArguments($this->option('chromiumArguments'));
            }
    
            if (null !== ($chromePath = $this->option('chromePath'))) {
                $browsershot->setChromePath($chromePath);
            }
    
            if (null !== ($binPath = $this->option('binPath'))) {
                $browsershot->setBinPath($binPath);
            }
    
            if (null !== ($nodeModulePath = $this->option('nodeModulePath'))) {
                $browsershot->setNodeModulePath($nodeModulePath);
            }
    
            if (null !== ($includePath = $this->option('includePath'))) {
                $browsershot->setIncludePath($includePath);
            }
    
            if (null !== ($nodeBinary = $this->option('nodeBinary'))) {
                $browsershot->setNodeBinary($nodeBinary);
            }
    
            if (null !== ($npmBinary = $this->option('npmBinary'))) {
                $browsershot->setNpmBinary($npmBinary);
            }
    
            return $browsershot;
        }
    
        private function defaultOptions(): array
        {
            return [
                'chromiumArguments' => [],
                'chromePath' => null,
                'binPath' => null,
                'nodeModulePath' => null,
                'includePath' => null,
                'nodeBinary' => null,
                'npmBinary' => null,
            ];
        }
    }
    

    Thank you!

    opened by chrismcintosh 3
  • [Laravel Sail] ExecuteJavascriptMiddleware not firing

    [Laravel Sail] ExecuteJavascriptMiddleware not firing

    I want to use this middleware but it is not firing. I had an issue because I am using Laravel Sail on an M1 MacBook, and installing Puppeteer failed due to Chromium not being arm64-ready:

    The chromium binary is not available for arm64.

    so I did the following.

    I installed spatie/browsershot and ran

    sail PUPPETEER_EXPERIMENTAL_CHROMIUM_MAC_ARM=1 npm i puppeteer

    and everything seemed to install correctly, but the ExecuteJavascriptMiddleware doesn't appear to be called, so I still get the:

    <noscript>You need to enable JavaScript to run this app.</noscript>\n

    version of the page returned.

    I put breakpoints in ExecuteJavascriptMiddleware but they never fire.

    Am I doing something wrong, or did I miss a step?

    I am using the Laravel adapter, so I thought the middleware was already injected into the container. Am I wrong?

    opened by andyscraven 3
  • make browsershot wait until network idle on ExecuteJavascriptMiddleware

    make browsershot wait until network idle on ExecuteJavascriptMiddleware

    Hi Kai,

    As requested here is the pull request for making Browsershot wait until network idle on the ExecuteJavascriptMiddleware.

    This is my first Open Source pull request - so thanks for your patience!

    Best, Chris

    opened by chrismcintosh 2
  • Trying to parse the first page of a paginated result (Call to undefined method Generator::value())

    Trying to parse the first page of a paginated result (Call to undefined method Generator::value())

    I am trying to scrape a page that has paginated links at the bottom. In the Roach docs I found that you can override initialRequests() to find other URLs to scrape.

    This is working as expected:

    class ExampleSpider extends BasicSpider
    {
        public function parseOverview(Response $response): \Generator
        {
            $pageUrls = array_map(
                function (Link $link) {
                    return $link->getUri();
                },
                $response
                    ->filter('.pages-items li a')
                    ->links(),
                );
    
            foreach ($pageUrls as $pageUrl) {
                // Since we’re not specifying the second parameter,
                // all article pages will get handled by the
                // spider’s `parse` method.
                yield $this->request('GET', $pageUrl);
            }
        }
    
        public function parse(Response $response): \Generator
        {
            $items = $response->filter('.product-item')->each(function (Crawler $product, $i) {
    
                $productName = $product->filter('.product-item-link');
                $array['product_name'] = $productName->count() ? $productName->text() : null;
    
                $link = $product->filter('.product-item-link');
                $array['link'] = $link->count() ? $link->link()->getUri() : null;
    
                $imageUrl = $product->filter('.product-image-photo');
                $array['image_url'] = $imageUrl->count() ? $imageUrl->image()->getUri() : null;
    
                $salePrice = $product->filter('.price-final_price .price');
                $array['sale_price'] = $salePrice->count() ? $salePrice->text() : null;
    
                $regularPrice = $product->filter('.old-price span.price');
                $array['regular_price'] = $regularPrice->count() ? $regularPrice->text() : null;
    
                $attributeSize = $product->filter('.attribute.size');
                $array['attribute_size'] = $attributeSize->count() ? $attributeSize->text() : null;
    
                $savings = $product->filter('.sticker-wrapper');
                $array['savings'] = $savings->count() ? $savings->text() : null;
    
                return $array;
            });
    
            foreach ($items as $item) {
                if (!$item) {
                    continue;
                }
                yield $this->item($item);
            }
        }
    
        /** @return Request[] */
        protected function initialRequests(): array
        {
            return [
                new Request(
                    'GET',
                    'https://www.example.com/5-pages', // Has 5 pages
                    [$this, 'parseOverview']
                ),
                new Request(
                    'GET',
                    'https://www.example.com/1-page', // Has 1 page (no pagination)
                    [$this, 'parseOverview']
                ),
            ];
        }
    }
    

    However, this only scrapes the pages that are gathered by the parseOverview() method. I would also like to use the $response object from the first page (https://www.example.com/5-pages) and not only:

    1. https://www.example.com/5-pages?page=2
    2. https://www.example.com/5-pages?page=3
    3. https://www.example.com/5-pages?page=4
    4. https://www.example.com/5-pages?page=5

    So I figured, as we already have the first page in the Response, I'll try running the $this->parse() method on the $response object in the parseOverview() method:

    public function parseOverview(Response $response): \Generator
        {
            yield $this->parse($response); // Here I try yielding the parse() method using the response object from the first page
    
            $pageUrls = array_map(
                function (Link $link) {
                    return $link->getUri();
                },
                $response
                    ->filter('.pages-items li a')
                    ->links(),
                );
    
            foreach ($pageUrls as $pageUrl) {
                // Since we’re not specifying the second parameter,
                // all article pages will get handled by the
                // spider’s `parse` method.
                yield $this->request('GET', $pageUrl);
            }
        }
    

    However, when running the Spider I get the following error: Call to undefined method Generator::value()

    I tried adding the first page URL to the array $pageUrls, but then I get a DuplicatedRequest. This is good, because I do not want to fire the request twice when we already have a working Response object.

    What do you recommend changing to make sure I also get the data of the first page?
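    A likely cause (my reading, not confirmed by the maintainer): $this->parse($response) returns a Generator, and yield-ing it emits the generator object itself as a single value, which the engine then treats as an item/request and calls ->value() on. PHP's yield from delegates to the inner generator instead, so `yield from $this->parse($response);` may be all that's needed. A minimal, Roach-free sketch of the difference:

```php
<?php

function inner(): Generator
{
    yield 'a';
    yield 'b';
}

function yieldGenerator(): Generator
{
    // `yield` emits the Generator object itself as one value.
    yield inner();
}

function delegate(): Generator
{
    // `yield from` delegates, emitting 'a' and 'b' individually.
    yield from inner();
}

var_dump(iterator_to_array(yieldGenerator())[0] instanceof Generator); // bool(true)
var_dump(iterator_to_array(delegate())); // ['a', 'b']
```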

    opened by matthiastjong 2
  • Adds the possibility to configure the user agent on Spatie\Browsershot for ExecuteJavascriptMiddleware

    Adds the possibility to configure the user agent on Spatie\Browsershot for ExecuteJavascriptMiddleware

    This PR adds configuration for the ExecuteJavascriptMiddleware to tell Browsershot which userAgent option to use.

    [
        ExecuteJavascriptMiddleware::class,
        ['userAgent' => 'custom'],
    ]
    

    Before doing this PR: this is my first time using the library and I was not sure how to do it, so I tried with the UserAgentMiddleware, but I saw that ExecuteJavascriptMiddleware was only using the URI of the request: https://github.com/roach-php/core/blob/827783c0a0500975e37b5fb4bb507609410af6b6/src/Downloader/Middleware/ExecuteJavascriptMiddleware.php#L45

    Maybe there is a better approach, but I found this solution simple. Maybe ExecuteJavascriptMiddleware could check whether a user agent exists on the request and set it on Browsershot.

    No hard feelings if this is refused!

    opened by corbeil 0
  • Overriding Not Working

    Overriding Not Working

    Hi, hope all is well. I'm trying to pass the URL dynamically using overrides, but it's not working and the response is null. Thank you

    Roach::startSpider(
        LoremIpsumSpider::class,
        new Overrides(startUrls: ['https://sinarahmannejad.com']),
    );

    $result = Roach::collectSpider(LoremIpsumSpider::class);
    
    opened by sinarahmany 0
  • Crawling entire site?

    Crawling entire site?

    Hello,

    Does roach-php support following links and crawling an entire site? I read the documentation on scraping versus crawling and understand the difference between the two... but throughout the documentation, both "scraper" and "crawler" are used to describe Roach. So my question is: does roach-php support crawling? If so, I can't seem to find it anywhere in the documentation.

    Thank you.

    opened by seongbae 0
  • spatie/robots-txt overwrites default Laravel robots.txt

    spatie/robots-txt overwrites default Laravel robots.txt

    I'm not sure if this issue is supposed to be written here or in the https://github.com/spatie/robots-txt package, but since I'm using this package and it depends on spatie/robots-txt, I will write it here. I just discovered that all the URLs on my website are not indexed and are blocked by robots.txt. After digging into it, the only thing that I found overwriting the default robots.txt file in Laravel is spatie/robots-txt. I'm not using spatie/robots-txt directly; I'm just using https://github.com/roach-php. Any help or confirmation of this issue would be appreciated.

    opened by Xoshbin 0
  • add (#65) ProxyMiddleware

    add (#65) ProxyMiddleware

    Currently, for a single proxy:

    	public array $downloaderMiddleware = [
    		[ProxyMiddleware::class, ['proxyList' => [
    			'http://xxx.xxx.xxx.156:3128',  // http://IP:PORT
    		]]]
    	];
    

    Then I want to add support for:

    • an array of proxies,
    • getting from a file,
    • getting from a database
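    A sketch of the file-based variant (file name and format are made up: one proxy URL per line); the same array could just as well come from a database query. Since PHP property defaults can't call functions, the list would need to be assigned in the spider's constructor or wherever the middleware config is built:

```php
<?php

// Demo setup: write a small proxy list file. In real use this file
// would already exist, maintained outside the spider.
file_put_contents('proxies.txt', "http://127.0.0.1:3128\n\nhttp://127.0.0.1:8080\n");

// Build the proxyList option from the newline-delimited file instead
// of hard-coding entries; blank lines are dropped.
$proxies = array_values(array_filter(array_map('trim', file('proxies.txt'))));

$downloaderMiddleware = [
    [ProxyMiddleware::class, ['proxyList' => $proxies]],
];
```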
    opened by Azmandios 0
Releases(1.1.1)
  • 1.1.1(Sep 9, 2022)

    What's Changed

    • make browsershot wait until network idle on ExecuteJavascriptMiddleware by @chrismcintosh in https://github.com/roach-php/core/pull/56

    New Contributors

    • @chrismcintosh made their first contribution in https://github.com/roach-php/core/pull/56

    Full Changelog: https://github.com/roach-php/core/compare/1.1.0...1.1.1

    Source code(tar.gz)
    Source code(zip)
  • 1.1.0(Jun 22, 2022)

    What's Changed

    • Custom item classes by @ksassnowski in https://github.com/roach-php/core/pull/47
    • Fix to align with symfony/console 6.1 by @ndeblauw in https://github.com/roach-php/core/pull/44

    New Contributors

    • @ndeblauw made their first contribution in https://github.com/roach-php/core/pull/44

    Full Changelog: https://github.com/roach-php/core/compare/1.0.0...1.1.0

    Source code(tar.gz)
    Source code(zip)
  • 1.0.0(Apr 19, 2022)

    :tada: Partaaay :tada:

    This release contains a few breaking changes. Please check out the upgrade guide on how to upgrade your application.

    What's Changed

    • Added Roach::collectSpider method to start a spider run and return all scraped items.
    • Added array $context parameter to Roach::startSpider and Roach::collectSpider to pass arbitrary context data to a spider when starting a run.
    • Added roach:run <spider> command to start a spider through the CLI.
    • Added Roach::fake() method to test that a run for a given spider was started
    • Requests dropped by downloader middleware are no longer affected by requestDelay (fixes #27)
    • Move spatie/browsershot from a require to suggest as it's only necessary if the ExecuteJavascriptMiddleware is used. Remove ext-exif as a dependency for the same reason.
    • Removed default command from CLI. To start the REPL, you now need to explicitly invoke the roach:shell <url> command, instead.
    • Add Configurable::withOptions method by @inxilpro in https://github.com/roach-php/core/pull/32
    • Make InteractsWithRequestsAndResponses available for testing by @Daanra in https://github.com/roach-php/core/pull/34

    New Contributors

    • @inxilpro made their first contribution in https://github.com/roach-php/core/pull/32
    • @Daanra made their first contribution in https://github.com/roach-php/core/pull/34

    Full Changelog: https://github.com/roach-php/core/compare/0.3.0...1.0.0

    Source code(tar.gz)
    Source code(zip)
  • 0.3.0(Jan 30, 2022)

    What's Changed

    • Add option to override spider configuration when starting a run by @ksassnowski in https://github.com/roach-php/core/pull/11
    • bump psysh version to 0.11.1 by @bangnokia in https://github.com/roach-php/core/pull/17

    New Contributors

    • @bangnokia made their first contribution in https://github.com/roach-php/core/pull/17

    Full Changelog: https://github.com/roach-php/core/compare/0.2.0...0.3.0

    Source code(tar.gz)
    Source code(zip)
  • 0.2.0(Dec 28, 2021)

    What's Changed

    • Add middleware to execute Javascript by @ksassnowski in https://github.com/roach-php/core/pull/7

    New Contributors

    • @ksassnowski made their first contribution in https://github.com/roach-php/core/pull/7

    Full Changelog: https://github.com/roach-php/core/compare/0.1.0...0.2.0

    Source code(tar.gz)
    Source code(zip)
  • 0.1.0(Dec 27, 2021)

Owner
Roach PHP
A complete web scraping toolkit for PHP. Inspired by Scrapy

Symfony bundle for Roach PHP

roach-php-bundle Symfony bundle for Roach PHP. Roach is a complete web scraping toolkit for PHP. It is a shameless clone, heavily inspired by the popular Scrapy package for Python.

Pisarev Alexey 7 Sep 28, 2022
PHP DOM Manipulation toolkit.

phpQuery The PHP DOM Manipulation toolkit.

João Eduardo Fornazari 1 Nov 26, 2021
Blackfire Player is a powerful Web Crawling, Web Testing, and Web Scraper application. It provides a nice DSL to crawl HTTP services, assert responses, and extract data from HTML/XML/JSON responses.

Blackfire Player Blackfire Player is a powerful Web Crawling, Web Testing, and Web Scraper application. It provides a nice DSL to crawl HTTP services, assert responses, and extract data from HTML/XML/JSON responses.

Blackfire 485 Dec 31, 2022
PHP Scraper - a highly opinionated web interface for PHP

PHP Scraper An opinionated & limited way to scrape the web using PHP. The main goal is to get stuff done instead of getting distracted with XPath selectors.

Peter Thaleikis 327 Dec 30, 2022
Goutte, a simple PHP Web Scraper

Goutte, a simple PHP Web Scraper Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extract data from HTML/XML responses.

null 9.1k Jan 1, 2023
A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5.

Matthijs van den Bos 1.3k Dec 28, 2022
A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tests using real browsers.

Symfony 2.7k Dec 31, 2022
Get info from any web service or page

Embed PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scraping the html, etc).

Oscar Otero 1.9k Jan 1, 2023
The most integrated web scraper package for Laravel.

Laravel Scavenger The most integrated web scraper package for Laravel. Scavenger provides the following features and more out of the box.

Reliq Arts 134 Jan 4, 2023
This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own video player.

XVideos PornHub RedTube API This script scrapes the HTML from different web pages to get the information from the video and you can use it in your own video player.

null 57 Dec 16, 2022
A program to scrape online web content (APIs, RSS feeds, or websites) and notify if a search term was hit.

s3n Search-Scan-Save-Notify A program to scrape online web content (APIs, RSS feeds, or websites) and notify if a search term was hit.

Aamer 11 Nov 8, 2022
Property page web scraper

Property page web scraper This tool was built to experiment with extracting features from property pages on websites like booking.com and Airbnb.

Vaugen Wake 2 Feb 24, 2022
Library for Rapid (Web) Crawler and Scraper Development

Library for Rapid (Web) Crawler and Scraper Development This package provides a kind of framework and a lot of ready-to-use, so-called steps that you can use to build your own crawlers and scrapers.

crwlr.software 60 Nov 30, 2022
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

crawlerdetect.io About CrawlerDetect CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header.

Mark Beech 1.7k Dec 30, 2022
:spider: The progressive PHP crawler framework! An elegant, progressive PHP scraping framework.

QueryList QueryList is a simple, elegant, extensible PHP Web Scraper (crawler/spider), based on phpQuery. API documentation is available in English and Chinese.

Jaeger(黄杰) 2.5k Dec 27, 2022
PHP Discord Webcrawler to log all messages from a Discord Chat.

Disco the Ripper was created to rip all messages from a specific Discord channel into JSON via CLI, and to help people investigate servers that have awkward channels before they get deleted.

Daniel Reis 46 Sep 21, 2022
This project is for the digikala.com scraping challenge of 2021 Black Friday using php/laravel/horizon

Objective This script is intended for finding the hidden treasure, a scraping challenge by Digikala for 2021 Black Friday.

ǃшɒʞɒH ǃǀɄ 1 Dec 22, 2021
Beanbun is a multi-process web crawler framework written in PHP, with good openness and high extensibility, based on Workerman

Beanbun is a multi-process web crawler framework written in PHP, with good openness and high extensibility, based on Workerman

Kidd Yu 1.2k Dec 19, 2022