Library for Rapid (Web) Crawler and Scraper Development

Overview

This package provides a kind of framework and a lot of ready-to-use, so-called steps that you can combine to build your own crawlers and scrapers.
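For a first impression, here is a minimal sketch of what a crawler built with this package can look like (the crawler class name, start URL and CSS selectors are purely illustrative; see the documentation and the release notes below for the actual API):

    use Crwlr\Crawler\HttpCrawler;
    use Crwlr\Crawler\Steps\Html;
    use Crwlr\Crawler\Steps\Loading\Http;
    use Crwlr\Crawler\UserAgents\BotUserAgent;
    use Crwlr\Crawler\UserAgents\UserAgentInterface;

    class MyCrawler extends HttpCrawler
    {
        protected function userAgent(): UserAgentInterface
        {
            return new BotUserAgent('MyBot');
        }
    }

    $crawler = new MyCrawler();

    $crawler->input('https://www.example.com');                    // hypothetical start URL
    $crawler->addStep(Http::get());                                // load the listing page
    $crawler->addStep(Html::getLinks('a.article'));                // find article links (made-up selector)
    $crawler->addStep(Http::get());                                // load each article
    $crawler->addStep(Html::first('article')->extract(['title' => 'h1'])); // extract data

    $crawler->runAndTraverse();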

Documentation

You can find the documentation at crwlr.software.

Contributing

If you consider contributing something to this package, read the contribution guide (CONTRIBUTING.md).

Comments
  • Add JsonFileStore feature

    I found your great project during #hacktoberfest and gave it a spin. I noticed that there is currently only an option for saving results as CSV and thought having the option for saving to a JSON file would be a nice addition to your project.

    Usage is fairly simple: instead of adding the CSV store to your crawler, just add the new store:

    $crawler->setStore(new JsonFileStore(YOUR_STORE_PATH));

    opened by Cyberschorsch 2
  • Question

    Hi @otsch

    Does this package support Chrome-like session storage to perform crawling as a logged-in user? I'm thinking about giving this package a try, but I'm not sure it fits all the necessary requirements for the crawler I need. At least I don't see any references to WebDriver or PhantomJS in the docs. If this is currently not supported, do you plan to support it in the future?

    opened by michael-rubel 2
  • Minor validation methods improvement

    The validateAndSanitize...() methods in the abstract Step class, when called with an array containing one single element, now automatically try to use that element as the input value.

    opened by otsch 0
  • Html/Xml data extraction multiple layers

    With the Html and Xml data extraction steps you can now add layers to the data that is being extracted, by just adding further Html/Xml data extraction steps as values in the mapping array that you pass as an argument to the extract() method.
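    A minimal sketch of what that can look like (all selectors and keys here are made up):

        use Crwlr\Crawler\Steps\Html;

        Html::each('#products .item')->extract([
            'title' => 'h2 a',
            // a further extraction step as mapping value adds another layer to the data
            'details' => Html::first('.details')->extract([
                'price' => '.price',
                'stock' => '.stock',
            ]),
        ]);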

    opened by otsch 0
  • Improve adding data to final Result objects

    Add new step methods addToResult() and addLaterToResult(). addToResult() is a single replacement for setResultKey() and addKeysToResult() (which are removed) that can be used for array and non-array output. addLaterToResult() is a new method that does not create a Result object immediately, but instead adds the output of the current step to all the Results that will later be created originating from the current output.

    New step method createsResult(), so you can differentiate whether a step creates a Result object or just keeps data to add to results later (new addLaterToResult() method). Primarily relevant for library-internal use.

    getResultKey() is also removed with setResultKey(). It's removed without replacement, as it doesn't really make sense any longer.
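    A rough sketch of how the new methods could be used (the step chain, selectors and keys are made up):

        use Crwlr\Crawler\Steps\Html;
        use Crwlr\Crawler\Steps\Loading\Http;

        // add selected keys of this step's output to the final Result object
        Html::first('article')
            ->extract(['title' => 'h1', 'date' => '.date'])
            ->addToResult(['title', 'date']);

        // keep this step's output and add it to all Results that will later be
        // created originating from that output
        Http::get()->addLaterToResult();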

    opened by otsch 0
  • New outputKey() and keepInputData() step methods

    New methods outputKey() and keepInputData() that can be used with any step. Using the outputKey() method, the step will convert non-array output to an array and use the key provided as an argument to this method as the array key for the output value. The keepInputData() method allows you to forward data from the step's input to the output. If the input is non-array, you can define a key using the method's argument. This is useful e.g. if you have data in the initial inputs that you also want to add to the final crawling results.

    Breaking change: Group steps can now only produce combined outputs, as previously done when the combineToSingleOutput() method was called. The method is removed. This is done because I think there is no real use case for a group step yielding separate outputs for all the different steps it contains. On the other hand, some exceptions were already being thrown when certain methods were called without calling combineToSingleOutput() first, because they can only work with combined output. So this change also removes the need for such exceptions.
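    For illustration, a sketch with made-up selectors and keys:

        use Crwlr\Crawler\Steps\Html;

        // non array output is converted to an array with 'url' as the key
        Html::getLink('a.next')->outputKey('url');

        // forward data from the step's input to its output; as the inputs here are
        // assumed to be plain (non array) values, a key is defined for them
        Html::first('article')->extract(['title' => 'h1'])->keepInputData('inputUrl');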

    opened by otsch 0
  • Change urlPathMatches filter rule

    The regex pattern argument no longer requires delimiters; internally a tilde is used as the delimiter, since it is pretty rare in URL paths compared to other common delimiters like /, %, # and +.
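    So a filter call now looks something like this (the pattern is just an example):

        use Crwlr\Crawler\Steps\Filters\Filter;
        use Crwlr\Crawler\Steps\Html;

        // no delimiters around the regex pattern any longer
        Html::getLinks()->where(Filter::urlPathMatches('^/blog/[0-9]+'));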

    opened by otsch 0
  • Improve response cache

    • New method retryCachedErrorResponses() in HttpLoader. When called, the loader will only use successful responses (status code < 400) from the cache and therefore retry already cached error responses.
    • New method writeOnlyCache() in HttpLoader to only write to, but not read from, the response cache. Can be used to renew cached responses (see the sketch below).
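    In a custom loader setup, usage could look roughly like this (assuming $loader is the crawler's HttpLoader instance with a response cache already attached):

        // retry responses that were cached with an error status code (>= 400)
        $loader->retryCachedErrorResponses();

        // only write to the response cache, don't read from it (e.g. to renew it)
        $loader->writeOnlyCache();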
    opened by otsch 0
  • Add default timeouts for the default guzzle client

    The default values are 10 seconds for connect_timeout and 60 seconds for timeout. Also added the option to provide config options for the default Guzzle client created inside the HttpLoader.

    Further, removed the httpClient() method in the HttpCrawler class. If you want to provide your own HTTP client, implement a custom loader method passing your client to the HttpLoader instead.
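    If you need different timeouts, a custom loader method could look roughly like the following. This is only a sketch: the Guzzle config values are examples and the exact loader() signature may differ from what's shown here.

        use Crwlr\Crawler\HttpCrawler;
        use Crwlr\Crawler\Loader\Http\HttpLoader;
        use Crwlr\Crawler\Loader\LoaderInterface;
        use Crwlr\Crawler\UserAgents\UserAgentInterface;
        use GuzzleHttp\Client;
        use Psr\Log\LoggerInterface;

        class MyCrawler extends HttpCrawler
        {
            protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
            {
                // hand the HttpLoader your own Guzzle client with custom timeouts
                $client = new Client(['connect_timeout' => 5, 'timeout' => 30]);

                return new HttpLoader($userAgent, $client, $logger);
            }

            // userAgent() method omitted for brevity
        }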

    opened by otsch 0
  • New functionality to paginate

    There is the new Paginate child class of the Http step class (easy access via Http::get()->paginate()). It takes an instance of the PaginatorInterface and uses it to iterate through pagination links. There is one implementation of that interface, the SimpleWebsitePaginator. The Http::get()->paginate() method uses it by default, when called just with a CSS selector to get pagination links. Paginators receive all loaded pages and implement the logic to find pagination links. The paginator class is also called before sending a request, with the request object that is about to be sent as an argument (prepareRequest()). This way, it should even be doable to implement more complex pagination functionality, for example when pagination is built using POST requests with query strings in the request body.
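    Basic usage could look like this (the CSS selector for the pagination links is made up):

        use Crwlr\Crawler\Steps\Loading\Http;

        // uses the SimpleWebsitePaginator by default when called with a CSS selector
        Http::get()->paginate('.pagination a');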

    opened by otsch 0
  • Control what to do in case of error responses

    New methods stopOnErrorResponse() and yieldErrorResponses() that can be used with Http steps. By calling stopOnErrorResponse(), the step will throw a LoadingException when a response has a 4xx or 5xx status code. By calling yieldErrorResponses(), even error responses will be yielded and passed on to the next steps.

    The latter was actually the default behavior until now, but I think most people would either want to just ignore error responses or want the whole crawler to fail/stop, so I changed this.
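    In step definitions that looks roughly like:

        use Crwlr\Crawler\Steps\Loading\Http;

        // throw a LoadingException (and stop) on 4xx/5xx responses
        Http::get()->stopOnErrorResponse();

        // or: also yield error responses and pass them on to the next steps
        Http::get()->yieldErrorResponses();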

    opened by otsch 0
Releases (v0.6.0)
  • v0.6.0 (Oct 3, 2022)

    Added

    • New step Http::crawl() (class HttpCrawl extending the normal Http step class) for conventional crawling. It loads all pages of a website (same host or domain) by following links. There are also a lot of options, like depth, filtering by paths, and so on (see the sketch after this list).
    • New steps Sitemap::getSitemapsFromRobotsTxt() (GetSitemapsFromRobotsTxt) and Sitemap::getUrlsFromSitemap() (GetUrlsFromSitemap) to get sitemap (URLs) from a robots.txt file and to get all the URLs from those sitemaps.
    • New step Html::metaData() to get data from meta tags (and title tag) in HTML documents.
    • New step Html::schemaOrg() (SchemaOrg) to get schema.org structured data in JSON-LD format from HTML documents.
    • The abstract DomQuery class (parent of the CssSelector and XPathQuery classes) now has some methods to narrow the selected matches further: first(), last(), nth(n), even(), odd().
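    Put together, the new steps could be used roughly like this (the crawler class and start URL are made up):

        use Crwlr\Crawler\Steps\Html;
        use Crwlr\Crawler\Steps\Loading\Http;
        use Crwlr\Crawler\Steps\Sitemap;

        $crawler = new MyCrawler();                   // your HttpCrawler subclass

        $crawler->input('https://www.example.com');   // hypothetical start URL
        $crawler->addStep(Http::crawl());             // load all pages of the site by following links
        $crawler->addStep(Html::schemaOrg());         // get schema.org JSON-LD data from each page

        // alternatively, discover URLs via the sitemaps referenced in robots.txt:
        // $crawler->addStep(Sitemap::getSitemapsFromRobotsTxt());
        // $crawler->addStep(Sitemap::getUrlsFromSitemap());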

    Changed

    • BREAKING: Removed PoliteHttpLoader and traits WaitPolitely and CheckRobotsTxt. Converted the traits to classes Throttler and RobotsTxtHandler which are dependencies of the HttpLoader. The HttpLoader internally gets default instances of those classes. The RobotsTxtHandler will respect robots.txt rules by default if you use a BotUserAgent and it won't if you use a normal UserAgent. You can access the loader's RobotsTxtHandler via HttpLoader::robotsTxt(). You can pass your own instance of the Throttler to the loader and also access it via HttpLoader::throttle() to change settings.
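    So whether robots.txt rules are respected now mainly depends on the user agent defined in your crawler class; a sketch with a made-up bot name:

        use Crwlr\Crawler\UserAgents\BotUserAgent;
        use Crwlr\Crawler\UserAgents\UserAgentInterface;

        // in your HttpCrawler subclass: with a BotUserAgent, the RobotsTxtHandler
        // respects robots.txt rules by default
        protected function userAgent(): UserAgentInterface
        {
            return new BotUserAgent('MyBot');
        }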

    Fixed

    • Getting absolute links via the GetLink and GetLinks steps, and the toAbsoluteUrl() method of the CssSelector and XPathQuery classes, now also looks for <base> tags in HTML when resolving the URLs.
    • The SimpleCsvFileStore can now also save results with nested data (but only up to the second level). It just concatenates the values, separated by a |.
  • v0.5.0 (Sep 3, 2022)

    Added

    • You can now call the new useHeadlessBrowser() method on the HttpLoader class to use a headless Chrome browser to load pages. This is enough to get the HTML after executing JavaScript in the browser. For more sophisticated tasks, a separate Loader and/or separate Steps should rather be created (see the sketch after this list).
    • With the maxOutputs() method of the abstract Step class you can now limit how many outputs a certain step should yield at max. That's for example helpful during development, when you want to run the crawler only with a small subset of the data/requests it will actually have to process when you eventually remove the limits. When a step has reached its limit, it won't even call the invoke() method any longer until the step is reset after a run.
    • With the new outputHook() method of the abstract Crawler class you can set a closure that'll receive all the outputs from all the steps. Should be only for debugging reasons.
    • The extract() method of the Html and Xml (children of Dom) steps now also works with a single selector instead of an array with a mapping. Sometimes you'll want to just get a simple string output e.g. for a next step, instead of an array with mapped extracted data.
    • In addition to uniqueOutputs() there is now also uniqueInputs(). It works exactly the same as uniqueOutputs(), filtering duplicate input values instead. Optionally also by a key when expected input is an array or an object.
    • In order to also get absolute links when using the extract() method of Dom steps, the abstract DomQuery class now has a toAbsoluteUrl() method. The Dom step will automatically provide the DomQuery instance with the base URL, provided that the input was an instance of the RespondedRequest class, and resolve the selected value against that base URL.
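    A few of these in action (only a sketch; selectors, keys and the output limit are made up, and $loader is assumed to be your HttpLoader instance):

        use Crwlr\Crawler\Steps\Html;
        use Crwlr\Crawler\Steps\Loading\Http;

        // load pages in a headless Chrome browser
        $loader->useHeadlessBrowser();

        // only yield at most 10 outputs from this step while developing
        Http::get()->maxOutputs(10);

        // extract() with a single selector yields a simple string instead of an array
        Html::first('article')->extract('h1');

        // filter duplicate inputs, optionally by a key when inputs are arrays or objects
        Http::get()->uniqueInputs();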

    Changed

    • Remove some not so important log messages.
    • Improve behavior of group step's combineToSingleOutput(). When steps yield multiple outputs, don't combine all yielded outputs to one. Instead, combine the first output from the first step with the first output from the second step, and so on.
    • When results are not explicitly composed, but the outputs of the last step are arrays with string keys, the crawler now sets those keys on the Result object instead of putting the whole array under one unnamed key.

    Fixed

    • The static methods Html::getLink() and Html::getLinks() now also work without argument, like the GetLink and GetLinks classes.
    • When a DomQuery (CSS selector or XPath query) doesn't match anything, its apply() method now returns null (instead of an empty string). When the Html(/Xml)::extract() method is used with a single, non-matching selector/query, nothing is yielded. When it's used with an array with a mapping, it yields an array with null values. If the selector for one of the methods Html(/Xml)::each(), Html(/Xml)::first() or Html(/Xml)::last() doesn't match anything, that no longer causes an error; it just won't yield anything.
    • Removed the (unnecessary) second argument from the Loop::withInput() method because when keepLoopingWithoutOutput() is called and withInput() is called after that call, it resets the behavior.
    • Fixed an issue when the date format for the expires date in a cookie doesn't have dashes, i.e. d M Y instead of d-M-Y.
  • v0.4.1 (May 10, 2022)

  • v0.4.0 (May 6, 2022)

    Added

    • The BaseStep class now has where() and orWhere() methods to filter step outputs. You can set multiple filters that will be applied to all outputs. When setting a filter using orWhere(), it's linked to the previously added filter with "OR". Outputs not matching any of the filters are not yielded. The available filters can be accessed through static methods on the new Filter class. Currently available filters are comparison filters (equal, greater/less than, ...), a few string filters (contains, starts/ends with) and URL filters (scheme, domain, host, ...); see the example after this list.
    • The GetLink and GetLinks steps now have the methods onSameDomain(), notOnSameDomain(), onDomain(), onSameHost(), notOnSameHost() and onHost() to restrict which links to find.
    • Automatically add the crawler's logger to the Store so you can also log messages from there. This can be breaking as the StoreInterface now also requires the addLogger method. The new abstract Store class already implements it, so you can just extend it.
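    For example (the filters, keys and selectors are illustrative):

        use Crwlr\Crawler\Steps\Filters\Filter;
        use Crwlr\Crawler\Steps\Html;

        // only keep outputs matching the filters
        Html::each('.product')
            ->extract(['name' => 'h2', 'price' => '.price'])
            ->where('price', Filter::lessThan(100))
            ->orWhere('name', Filter::stringContains('sale'));

        // only follow links pointing to the same host as the current page
        Html::getLinks()->onSameHost();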

    Changed

    • The Csv step can now also be used without defining a column mapping. In that case it will use the values from the first line (so this makes sense when there are column headlines) as output array keys.
  • v0.3.0 (Apr 26, 2022)

    Added

    • By calling monitorMemoryUsage() you can tell the Crawler to add log messages with the current memory usage after every step invocation. You can also set a limit in bytes for when to start monitoring; below the limit it won't log memory usage.
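    Usage is a single call on the crawler instance:

        // log memory usage after every step invocation
        $crawler->monitorMemoryUsage();

        // or, assuming the byte limit is the method's argument:
        // only start logging above ~512 MB
        // $crawler->monitorMemoryUsage(512 * 1000000);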

    Fixed

    • Previously the use of Generators actually didn't make a lot of sense, because the outputs of one step were only iterated and passed on to the next step after the current step had been invoked with all its inputs. That made steps with a lot of inputs bottlenecks and caused bigger memory consumption. So, the crawler was changed to immediately pass on the outputs of one step to the next step, if there is one.
  • v0.2.0 (Apr 25, 2022)

    Added

    • uniqueOutputs() method on Steps to get only unique output values. If outputs are arrays or objects, you can provide a key that will be used as the identifier to check for uniqueness. Otherwise, the arrays or objects will be serialized for comparison, which will probably be slower (see the snippet after this list).
    • runAndTraverse() method on the Crawler, so you don't need to manually traverse the Generator if you don't need the results where you're calling the crawler.
    • Implement the behaviour for when a Group step should add something to the Result using setResultKey or addKeysToResult, which was still missing. For groups this will only work when using combineToSingleOutput.
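    Both in short:

        use Crwlr\Crawler\Steps\Html;

        // only pass on unique outputs (for array/object outputs you could pass a key,
        // e.g. uniqueOutputs('url'))
        Html::getLinks()->uniqueOutputs();

        // run the crawler without needing the results where you call it
        $crawler->runAndTraverse();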
  • v0.1.0 (Apr 18, 2022)

    Initial Version containing

    • Crawler class being the main unit that executes all the steps that you'll add to it, handling input and output of the steps.
    • HttpCrawler class using the PoliteHttpLoader (a version of the HttpLoader sticking to robots.txt rules), using any PSR-18 HTTP client under the hood and having its own implementation of a cookie jar.
    • Some ready-to-use steps for HTTP, HTML, XML, JSON and CSV.
    • Loops and Groups.
    • Crawler has a PSR-3 LoggerInterface and passes it on to all the steps. The included steps log some messages about what they're doing. Package includes a simple CliLogger.
    • The Crawler requires a user agent, and the included BotUserAgent class provides an easy interface for bot user agent strings.
    • Stores to save the final results can be added to the Crawler. A simple CSV file store is shipped with the package.
Owner
crwlr.software
PHP Packages for Rapid Crawler and Scraper Development
crwlr.software