
Overview

Crawlzone is a fast asynchronous internet crawling framework that aims to provide an open source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. Available for PHP 7.3, 7.4, and 8.0.

Installation

composer require crawlzone/crawlzone

Key Features

  • Asynchronous crawling with customizable concurrency.
  • Automatic throttling of the crawling speed based on the load of the website you are crawling.
  • If configured, automatically filters out requests forbidden by the robots.txt exclusion standard.
  • A straightforward middleware system that lets you append headers, extract data, filter requests, or plug in any custom functionality to process requests and responses.
  • Rich filtering capabilities.
  • Ability to set the crawling depth.
  • Easy to extend the core by hooking into the crawling process using events.
  • Shut down the crawler at any time and resume later without losing progress.

Architecture


Here is what happens for a single request when you run the client (a minimal event-listener sketch follows the list):

  1. The client queues the initial request (start_uri).
  2. The engine looks at the queue and checks if there are any requests.
  3. The engine gets the request from the queue and emits the BeforeRequestSent event. If the depth option is set in the config, the RequestDepth extension validates the depth of the request. If the robotstxt_obey option is set in the config, the RobotTxt extension checks whether the request complies with the rules. If the request doesn't comply, the engine emits the RequestFailed event and gets the next request from the queue.
  4. The engine uses the request middleware stack to pass the request through it.
  5. The engine sends an asynchronous request using the Guzzle HTTP Client.
  6. The engine emits the AfterRequestSent event and stores the request in the history to avoid crawling the same request again.
  7. When response headers are received, but the body has not yet begun to download, the engine emits the ResponseHeadersReceived event.
  8. The engine emits the TransferStatisticReceived event. If the autothrottle option is set in the config, then the AutoThrottle extension is executed.
  9. The engine uses the response middleware stack to pass the response through it.
  10. The engine emits the ResponseReceived event. Additionally, if the response status code is greater than or equal to 400, the engine emits the RequestFailed event.
  11. The ResponseReceived event triggers the ExtractAndQueueLinks extension, which extracts and queues the links. The process repeats until the queue is empty.
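
The events emitted along the way can be observed from user code through extensions (covered in more detail below). Here is a minimal, hedged sketch that counts requests as they pass through steps 3 and 11; it relies only on the event class names listed later in this document and on the extension mechanism shown in the Extension section:

use Crawlzone\Client;
use Crawlzone\Extension\Extension;
use Crawlzone\Event\BeforeRequestSent;
use Crawlzone\Event\ResponseReceived;

$client = new Client(['start_uri' => ['https://httpbin.org/']]);

$client->addExtension(new class extends Extension {
    private $sent = 0;
    private $received = 0;

    public function onBeforeRequestSent(BeforeRequestSent $event): void
    {
        // Step 3: the engine emits BeforeRequestSent before scheduling the request.
        $this->sent++;
    }

    public function onResponseReceived(ResponseReceived $event): void
    {
        // Step 11: the engine emits ResponseReceived once the response has passed the middleware stack.
        $this->received++;
        printf("Progress: %d request(s) sent, %d response(s) received\n", $this->sent, $this->received);
    }

    public static function getSubscribedEvents(): array
    {
        return [
            BeforeRequestSent::class => 'onBeforeRequestSent',
            ResponseReceived::class => 'onResponseReceived'
        ];
    }
});

$client->run();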

Quick Start

<?php

use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
use Crawlzone\Middleware\BaseMiddleware;
use Crawlzone\Client;
use Crawlzone\Middleware\ResponseMiddleware;

require_once __DIR__ . '/../vendor/autoload.php';

$config = [
    'start_uri' => ['https://httpbin.org/'],
    'concurrency' => 3,
    'filter' => [
        //A list of strings containing domains which will be considered for extracting the links.
        'allow_domains' => ['httpbin.org'],
        //A list of regular expressions that the URLs must match in order to be extracted.
        'allow' => ['/get','/ip','/anything']
    ]
];

$client = new Client($config);

$client->addResponseMiddleware(
    new class implements ResponseMiddleware {
        public function processResponse(ResponseInterface $response, RequestInterface $request): ResponseInterface
        {
            printf("Process Response: %s %s \n", $request->getUri(), $response->getStatusCode());

            return $response;
        }
    }
);

$client->run();

Middlewares

Middleware can be written to perform a variety of tasks, including authentication, filtering, adding headers, and logging. To create a middleware, simply implement Crawlzone\Middleware\RequestMiddleware or Crawlzone\Middleware\ResponseMiddleware and add it to the client:

...

$config = [
    'start_uri' => ['https://httpbin.org/ip']
];

$client = new Client($config);

$client->addRequestMiddleware(
    new class implements RequestMiddleware {
        public function processRequest(RequestInterface $request): RequestInterface
        {
            printf("Middleware 1 Request: %s \n", $request->getUri());
            return $request;
        }
    }
);

$client->addResponseMiddleware(
    new class implements ResponseMiddleware {
        public function processResponse(ResponseInterface $response, RequestInterface $request): ResponseInterface
        {
            printf("Middleware 2 Response: %s %s \n", $request->getUri(), $response->getStatusCode());
            return $response;
        }
    }
);

$client->run();

/*
Output:
Middleware 1 Request: https://httpbin.org/ip
Middleware 2 Response: https://httpbin.org/ip 200
*/

To skip a request, you can throw \Crawlzone\Exception\InvalidRequestException from any middleware. The scheduler will catch the exception, notify all subscribers, and ignore the request.
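
For example, here is a hedged sketch of a request middleware that skips any URI containing "/logout" (the exception is constructed with a plain message on the assumption that it behaves like a standard PHP exception):

use Psr\Http\Message\RequestInterface;
use Crawlzone\Middleware\RequestMiddleware;
use Crawlzone\Exception\InvalidRequestException;

$client->addRequestMiddleware(
    new class implements RequestMiddleware {
        public function processRequest(RequestInterface $request): RequestInterface
        {
            // Skip logout links; the scheduler catches the exception and ignores the request.
            if (false !== strpos((string) $request->getUri(), '/logout')) {
                // Assumption: the exception accepts an optional message like a standard \Exception.
                throw new InvalidRequestException('Skipping logout URI');
            }

            return $request;
        }
    }
);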

Processing server errors

You can use middlewares to handle 4xx or 5xx responses.

...
$config = [
    'start_uri' => ['https://httpbin.org/status/500','https://httpbin.org/status/404'],
    'concurrency' => 1,
];

$client = new Client($config);

$client->addResponseMiddleware(
    new class implements ResponseMiddleware {
        public function processResponse(ResponseInterface $response, RequestInterface $request): ResponseInterface
        {
            printf("Process Failure: %s %s \n", $request->getUri(), $response->getStatusCode());

            return $response;
        }
    }
);

$client->run();

Filtering

Use regular expressions to allow or deny specific links. You can also pass arrays of allowed or denied domains. Use the robotstxt_obey option to filter out requests forbidden by the robots.txt exclusion standard.

...
$config = [
    'start_uri' => ['http://site.local/'],
    'concurrency' => 1,
    'filter' => [
        'robotstxt_obey' => true,
        'allow' => ['/page\d+','/otherpage'],
        'deny' => ['/logout'],
        'allow_domains' => ['site.local'],
        'deny_domains' => ['othersite.local'],
    ]
];
$client = new Client($config);
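
To illustrate how the allow/deny patterns behave, here is a plain-PHP sketch of the regex semantics (this is not Crawlzone's internal matching code; it only shows how a bare pattern from the config matches a URL once wrapped in delimiters):

$patterns = ['/page\d+', '/otherpage'];
$url = 'http://site.local/page42';

foreach ($patterns as $pattern) {
    // Delimiters are added here for preg_match; the config takes the bare pattern.
    if (preg_match('~' . $pattern . '~', $url)) {
        printf("URL %s matched pattern %s\n", $url, $pattern);
        break;
    }
}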

Autothrottle

Autothrottle is enabled by default (set autothrottle.enabled to false to disable it). It automatically adjusts the scheduler to the optimum crawling speed, trying to be nicer to the sites you crawl.

Throttling algorithm

The AutoThrottle algorithm adjusts download delays based on the following rules (a small illustrative calculation follows the config example below):

  1. When a response is received, the target download delay is calculated as latency / N, where latency is the latency of the response and N is the concurrency.
  2. The delay for the next requests is set to the average of the previous delay and the current delay.
  3. Latencies of non-200 responses are not allowed to decrease the delay.
  4. The delay can't become less than min_delay or greater than max_delay.
...
$config = [
    'start_uri' => ['http://site.local/'],
    'concurrency' => 3,
    'autothrottle' => [
        'enabled' => true,
        'min_delay' => 0, // Sets minimum delay between the requests (default 0).
        'max_delay' => 60, // Sets maximum delay between the requests (default 60).
    ]
];

$client = new Client($config);
...
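
To make the throttling rules above concrete, here is a small illustrative calculation (a sketch of the math only, not the actual AutoThrottle extension code):

// Sketch: compute the next delay from the last response latency (seconds),
// the previous delay, the response status code, and the concurrency.
function nextDelay(float $latency, float $previousDelay, int $statusCode, int $concurrency,
                   float $minDelay = 0.0, float $maxDelay = 60.0): float
{
    // Rule 1: the target delay is latency / N, where N is the concurrency.
    $targetDelay = $latency / $concurrency;

    // Rule 2: the next delay is the average of the previous and the target delay.
    $delay = ($previousDelay + $targetDelay) / 2;

    // Rule 3: non-200 responses are not allowed to decrease the delay.
    if ($statusCode !== 200 && $delay < $previousDelay) {
        $delay = $previousDelay;
    }

    // Rule 4: keep the delay between min_delay and max_delay.
    return min(max($delay, $minDelay), $maxDelay);
}

// Example: 0.9s latency, previous delay 0.5s, HTTP 200, concurrency 3 => (0.5 + 0.9/3) / 2 = 0.4s
echo nextDelay(0.9, 0.5, 200, 3); // 0.4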

Extension

Essentially, extensions are event listeners based on the Symfony EventDispatcher component. To create an extension, simply extend Crawlzone\Extension\Extension and add it to the client. All extensions have access to the Crawlzone\Config\Config and Crawlzone\Session objects; the session holds the GuzzleHttp\Client, which is helpful if you want to make additional requests or reuse cookie headers for authentication.

...

use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
use Crawlzone\Client;
use Crawlzone\Event\BeforeEngineStarted;
use Crawlzone\Extension\Extension;
use Crawlzone\Middleware\ResponseMiddleware;

$config = [
    'start_uri' => ['http://site.local/admin/']
];

$client = new Client($config);

$loginUri = 'http://site.local/admin/';
$username = 'test';
$password = 'password';

$client->addExtension(new class($loginUri, $username, $password) extends Extension {
    private $loginUri;
    private $username;
    private $password;

    public function __construct(string $loginUri, string $username, string $password)
    {
        $this->loginUri = $loginUri;
        $this->username = $username;
        $this->password = $password;
    }

    public function authenticate(BeforeEngineStarted $event): void
    {
        $this->login($this->loginUri, $this->username, $this->password);
    }

    private function login(string $loginUri, string $username, string $password)
    {
        $formParams = ['username' => $username, 'password' => $password];
        $body = http_build_query($formParams, '', '&');
        $request = new Request('POST', $loginUri, ['content-type' => 'application/x-www-form-urlencoded'], $body);
        $this->getSession()->getHttpClient()->sendAsync($request)->wait();
    }

    public static function getSubscribedEvents(): array
    {
        return [
            BeforeEngineStarted::class => 'authenticate'
        ];
    }
});

$client->run();

List of supported events (all in the Crawlzone\Event namespace):

BeforeEngineStarted: Right before the engine starts crawling.
BeforeRequestSent: Before the request is scheduled to be sent.
AfterRequestSent: After the request is scheduled.
TransferStatisticReceived: When a handler has finished sending a request. Gives you access to the transfer statistics of a request and the lower-level transfer details.
ResponseHeadersReceived: When the HTTP headers of the response have been received but the body has not yet begun to download. Useful if you want to reject responses that are greater than a certain size, for example.
RequestFailed: When the request fails or an InvalidRequestException has been thrown from a middleware.
ResponseReceived: When the response is received.
AfterEngineStopped: After the engine has stopped crawling.

Command Line Tool

You can use a simple command-line tool to crawl your site quickly. First, create a configuration file:

./crawler init 

Then configure crawler.yml and run the crawler with the following command:

./crawler start --config=./crawler.yml 

To get more details about requests and responses, use the -vvv option:

./crawler start --config=./crawler.yml -vvv 

Configuration

<?php

use GuzzleHttp\Cookie\CookieJar;

$fullConfig = [
    // A list of URIs to crawl. Required parameter. 
    'start_uri' => ['http://test.com', 'http://test1.com'],
    
    // The number of concurrent requests. Default is 10.
    'concurrency' => 10,
    
    // The maximum depth that will be allowed to crawl (minimum 1; unlimited if not set). 
    'depth' => 1,
    
    // The path to a local file where the progress will be stored. Use "memory" to store the progress in memory (default behavior).
    // The crawler uses an SQLite database to store the progress.
    'save_progress_in' => '/path/to/my/sqlite.db',
    
    'filter' => [
        // If enabled, crawler will respect robots.txt policies. Default is false
        'robotstxt_obey' => false,
        
        // A list of regular expressions that the URLs must match in order to be extracted. If not given (or empty), it will match all links.
        'allow' => ['test','test1'],
        
        // A list of strings containing domains which will be considered for extracting the links.
        'allow_domains' => ['test.com','test1.com'],
        
        // A list of strings containing domains which won't be considered for extracting the links. It has precedence over the allow_domains parameter.
        'deny_domains' => ['test2.com','test3.com'],
        
        // A list of regular expressions that the URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter.
        'deny' => ['test2','test3'],
    ],
    // The crawler uses the Guzzle HTTP client, so most of the Guzzle request options are supported.
    // For more info go to http://docs.guzzlephp.org/en/stable/request-options.html
    'request_options' => [
        // Describes the SSL certificate verification behavior of a request.
        'verify' => false,
        
        // Specifies whether or not cookies are used in a request or what cookie jar to use or what cookies to send.
        'cookies' => CookieJar::fromArray(['name' => 'test', 'value' => 'test-value'],'localhost'),
        
        // Describes the redirect behavior of a request.
        'allow_redirects' => false,
        
        // Set to true to enable debug output with the handler used to send a request.
        'debug' => true,
        
        // Float describing the number of seconds to wait while trying to connect to a server. Use 0 to wait indefinitely (the default behavior).
        'connect_timeout' => 0,
        
        // Float describing the timeout of the request in seconds. Use 0 to wait indefinitely (the default behavior).
        'timeout' => 0,
        
        // Float describing the timeout to use when reading a streamed body. Defaults to the value of the default_socket_timeout PHP ini setting
        'read_timeout' => 60,
        
        // Specify whether or not Content-Encoding responses (gzip, deflate, etc.) are automatically decoded.
        'decode_content' => true,
        
        // Set to "v4" if you want the HTTP handlers to use only ipv4 protocol or "v6" for ipv6 protocol.
        'force_ip_resolve' => null,
        
        // Pass an array to specify different proxies for different protocols.
        'proxy' => [
            'http'  => 'tcp://localhost:8125', // Use this proxy with "http"
            'https' => 'tcp://localhost:9124', // Use this proxy with "https",
            'no' => ['.mit.edu', 'foo.com']    // Don't use a proxy with these
         ],
         
         // Set to true to stream a response rather than download it all up-front.
        'stream' => false,
        
        // Protocol version to use with the request.
        'version' => '1.1',
        
        // Set to a string or an array to specify the path to a file containing a PEM formatted client side certificate and password.
        'cert' => '/path/server.pem',
        
        // Specify the path to a file containing a private SSL key in PEM format.
        'ssl_key' => ['/path/key.pem', 'password']
    ],
    
    'autothrottle' => [
        // Enables autothrottle extension. Default is true.
        'enabled' => true,
        
        // Sets minimum delay between the requests.
        'min_delay' => 0,
        
        // Sets maximum delay between the requests.
        'max_delay' => 60
    ]
];

Thanks for Inspiration

https://scrapy.org/

http://docs.guzzlephp.org/

If you find this project helpful, please give it a star or leave some feedback. This will help me understand your needs and guide future library updates.
