The most integrated web scraper package for Laravel.

Overview

Laravel Scavenger

Top Features

Scavenger provides the following features, and more, out of the box.

  • Ease of use
    • Scavenger is easy to configure. Simply publish the config file and set your targets.
  • Scrape data from multiple sources at once.
  • Convert scraped data into usable Laravel model objects.
    • e.g. You may scrape an article, have it converted into an object of your choice, and save it to your database, where it is immediately available to your viewers.
  • Perform one or more operations on each property of any scraped entity.
    • e.g. You may call a paraphrase service from a model or package of your choice on data attributes before saving them to your database.
  • Data integrity constraints
    • Scavenger uses a hashing algorithm of your choice to maintain data integrity. This hash is used to ensure that one scrap (source article) is not converted to multiple output objects (model duplicates).
  • Console Command
    • Once Scavenger is configured, a single artisan command launches the seeker. Because it runs as a console command rather than inside a web request, it is more efficient and timeouts are less likely to occur.
    • Artisan command: php artisan scavenger:seek
  • Schedule ready
    • Scavenger can easily be set to scrape on a schedule, which makes it straightforward to build a somewhat autonomous website.
  • SERP
    • Scavenger can be used to flexibly scrape Search Engine Result Pages.
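
The "Schedule ready" feature above can be sketched with Laravel's built-in task scheduler. This is a minimal sketch, assuming an application that still has an `app/Console/Kernel.php`; the hourly frequency and the `withoutOverlapping()` guard are illustrative choices, not package defaults.

```php
<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    /**
     * Register scheduled commands.
     */
    protected function schedule(Schedule $schedule): void
    {
        // Run the seeker once an hour (assumed frequency) and skip a run
        // if the previous one has not finished yet.
        $schedule->command('scavenger:seek')
            ->hourly()
            ->withoutOverlapping();
    }
}
```

The scheduler itself still needs a single cron entry that calls `php artisan schedule:run` every minute, as with any Laravel scheduled task.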

Installation

  1. Install via composer; in console:

    composer require reliqarts/laravel-scavenger
    

    or require in composer.json:

    {
        "require": {
            "reliqarts/laravel-scavenger": "^3.1"
        }
    }

    then run composer update in your terminal to pull it in.

  2. (Optional) Publish package resources and configuration:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider"

You may opt to publish only configuration by using the scavenger-config tag:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-config"

or only the migrations via the scavenger-migrations tag:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-migrations"

Configuration

Scavenger is highly configurable. Settings are kept in the config file, so they persist between runs.

Structure

Below is an example of a typical config file structure, with explanatory comments.

<?php

return [
    // debug mode?
    'debug' => false,

    // whether log file should be written
    'log' => true,

    // How much detail is expected in output, 1 being the lowest, 3 being highest.
    'verbosity' => 1,

    // Set the database config
    'database' => [
        // Scraps table
        'scraps_table' => env('SCAVENGER_SCRAPS_TABLE', 'scavenger_scraps'),
    ],

    // Daemon config - used to build daemon user
    'daemon' => [
        // Model to use for Daemon identification and login
        'model' => 'App\\User',

        // Model property to check for daemon ID
        'id_prop' => 'email',

        // Daemon ID
        'id' => '[email protected]',

        // Any additional information required to create a user:
        // NB. this is only used when creating a daemon user, there is no "safe" way
        // to change the daemon's password once he has been created.
        'info' => [
            'name' => 'Scavenger Daemon',
            'password' => 'pass',
        ],
    ],

    // guzzle settings
    'guzzle_settings' => [
        'timeout' => 60,
    ],

    // hashing algorithm to use
    'hash_algorithm' => 'sha512',

    // storage
    'storage' => [
        // This directory will live inside your application's log directory.
        'log_dir' => env('SCAVENGER_LOG_DIR', 'scavenger'),
    ],

    // different model entities and mapping information
    'targets' => [
        // NB. the "rooms" target shown below is for example purposes only. It has all possible keys explicitly.
        'rooms' => [
            'example' => true,
            'serp' => false,
            'model' => 'App\\Room',
            'source' => 'http://myroomslistingsite.1demo/section/rooms',
            'search' => [
                // keywords
                'keywords' => ['professional'],
                // form markup
                'form' => [
                    // search form selector (important)
                    'selector' => '#form',
                    // input element name for search term/keyword
                    'keyword_input_name' => 'keyword',
                    'submit_button' => [
                        // text on submit button (optional)
                        'text' => null,
                        // submit element id, use if button doesn't have text (optional)
                        'id' => null,
                    ],
                ],
            ],
            'pager' => [
                // link (a tag) selector
                'selector' => 'div.content #page a.pagingnav',
            ],
            // max. number of pages to scrape (0 is unlimited)
            'pages' => 0,
            // content markup: actual data to be scraped
            'markup' => [
                'title' => 'div.content section > table tr h3',
                // inside: content to be found upon clicking title link
                '__inside' => [
                    'title' => '#ad-title > h1 > a',
                    'body' => 'article .adcontent > p[align="LEFT"]:last-of-type',
                    // focus: focus detail on the following section
                    '__focus' => 'section section > .content #ad-detail > article',
                ],
                // wrapper/item/result: wrapping selector for each item on single page.
                // If inside special key is set this key becomes invalid (i.e. inside takes preference)
                '__result' => null,
            ],
            // split single attributes into multiple based on regex
            'dissect' => [
                'body' => [
                    'email' => '(([eE]mail)*:*\s*\w+\@(\s*\w)*\.(net|com))',
                    'phone' => '((([cC]all|[tT]el|[Pp][Hh](one)*)[:\d\-,\sDL\/]*\d)|(\d{3}\-?\d{4}))',
                    'beds' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]edroom|b\/r|[Bb]ed)s?)',
                    'baths' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]athroom|bth|[Bb]ath)s?)',
                    // retain:  whether details should be left in source attribute after extraction
                    '__retain' => true,
                ],
            ],
            // modify attributes by calling functions
            'preprocess' => [
                // takes a callable
                // optional third parameter of array if callable method needs an instance
                // e.g. ['App\\Item', 'foo', true] or 'bar'
                'title' => null,
            ],
            // remap entity attributes to model properties (optional)
            'remap' => [
                'title' => null,
                'body' => null,
            ],
            // scraps containing any of these words will be rejected (optional)
            'bad_words' => [
                'office',
            ],
        ],

        // Google SERP example:
        'google' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\GoogleResult',
            'source' => 'https://www.google.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form[name="f"]',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 2,
            'pager' => [
                'selector' => '#foot > table > tr > td.b:last-child a',
            ],
            'markup' => [
                '__result' => 'div.g',
                'title' => 'h3 > a',
                'description' => '.st',
                // the 'link' and 'position' attributes make use of some of Scavenger's available properties
                'link' => '__link',
                'position' => '__position',
            ],
        ],

        // Bing SERP example:
        'bing' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\BingResult',
            'source' => 'https://www.bing.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form#sb_form',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 3,
            'pager' => [
                'selector' => '.sb_pagN',
            ],
            'markup' => [
                '__result' => '.b_algo',
                'title' => 'h2 a',
                'description' => '.b_caption p',
                'link' => '__link',
                'position' => '__position',
            ],
        ],
    ],
];
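
To illustrate what the `dissect` entry in the config above asks for, here is a minimal sketch in plain PHP. `dissectAttribute()` is a hypothetical helper, not part of Scavenger, and it assumes the patterns are wrapped in `/` delimiters and that `__retain` defaults to `true`; the package's actual behavior may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Scavenger): pulls sub-attributes out of one
 * text attribute using the regular expressions from a `dissect` entry.
 *
 * @param array<string, string|bool> $rules attribute => regex, plus optional '__retain'
 * @return array<string, string> extracted attributes plus the resulting 'source' text
 */
function dissectAttribute(string $text, array $rules): array
{
    // Assumption: matched fragments stay in the source unless '__retain' is false.
    $retain = (bool) ($rules['__retain'] ?? true);
    unset($rules['__retain']);

    $result = [];
    foreach ($rules as $attribute => $regex) {
        // Assumption: patterns are wrapped in "/" delimiters before matching.
        if (preg_match('/' . $regex . '/', $text, $matches) === 1) {
            $result[$attribute] = trim($matches[0]);
            if (!$retain) {
                // Drop the matched fragment from the source text.
                $text = str_replace($matches[0], '', $text);
            }
        }
    }
    $result['source'] = $text;

    return $result;
}

// Two of the patterns from the "rooms" example target.
$rules = [
    'beds' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]edroom|b\/r|[Bb]ed)s?)',
    'baths' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]athroom|bth|[Bb]ath)s?)',
    '__retain' => true,
];

$parts = dissectAttribute('Spacious 2 bedroom apartment, 1 bath', $rules);
print_r($parts);
```

With `__retain` set to `true` the original text is left untouched and the extracted values are added alongside it; setting it to `false` in this sketch removes each matched fragment from the source text.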

Target Breakdown

The targets array contains the entities to be scraped, each keyed by a unique target identifier. Each target is structured as follows.

  • model: Laravel DB model to create from target.
  • source: Source URL to scrape.
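  • search: Search settings. Use if a search is to be performed before target data is shown. (optional)
    • keywords: Array of keywords to search for.
    • form: Search form markup.
      • selector: CSS selector for the search form.
      • keyword_input_name: Name of the input element that takes the search keyword.
      • submit_button: Text on the form's submit button, or its id if the button has no text. (optional)
  • pager: Pagination settings.
    • selector: CSS selector for the next-page link (a tag).
  • pages: Maximum number of pages to scrape; 0 means unlimited.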
  • markup: Array of attributes to scrape from main list. [attributeName => CSS selector]
    • __inside: Sub markup for detail page. Markup for page which shows when article title is clicked/opened. (optional)
  • dissect: Split compound attributes into smaller attributes via REGEX. (optional)
  • preprocess: Array of attributes which need to be preprocessed. [attributeName => callable] (optional)
  • remap: Array of attributes which need to be renamed in order to be saved as target objects. [attributeName => newName] (optional)
  • bad_words: Any scraps found containing these words will be discarded. (optional)
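
The optional `preprocess`, `remap`, and `bad_words` settings can be pictured with the following plain-PHP sketch. It is a conceptual illustration only: `prepareScrap()` is a hypothetical function, not Scavenger's implementation, and the case-insensitive bad-word check is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Conceptual sketch (not Scavenger's internals): applies the optional
 * `bad_words`, `preprocess`, and `remap` target settings to one scrap.
 *
 * @param array<string, string>   $scrap      attributeName => scraped value
 * @param array<string, callable> $preprocess attributeName => callable
 * @param array<string, string>   $remap      attributeName => newName
 * @param string[]                $badWords   words that cause rejection
 * @return array<string, string>|null null when the scrap is rejected
 */
function prepareScrap(array $scrap, array $preprocess, array $remap, array $badWords): ?array
{
    // Reject the scrap if any attribute contains a bad word
    // (case-insensitive matching is an assumption).
    foreach ($scrap as $value) {
        foreach ($badWords as $word) {
            if (stripos($value, $word) !== false) {
                return null;
            }
        }
    }

    $prepared = [];
    foreach ($scrap as $attribute => $value) {
        // Run the configured callable, if any, on the attribute value.
        if (isset($preprocess[$attribute])) {
            $value = call_user_func($preprocess[$attribute], $value);
        }
        // Store the value under its remapped name, or the original name.
        $prepared[$remap[$attribute] ?? $attribute] = $value;
    }

    return $prepared;
}

$room = prepareScrap(
    ['title' => '  Cozy studio ', 'body' => 'Near the park.'],
    ['title' => 'trim'],   // preprocess: attributeName => callable
    ['title' => 'name'],   // remap: attributeName => newName
    ['office']             // bad_words
);

$rejected = prepareScrap(['title' => 'Office space', 'body' => ''], [], [], ['office']);
```

Here `$room` ends up with a trimmed value stored under `name` instead of `title`, while `$rejected` is `null` because its title contains a bad word.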

Glossary of Terms

The following words may appear in context above.

  • Daemon: User instance to be used by the scavenger service.
  • Scrap: Scraped data before being converted to the target object.
  • Target: Configured source-model mapping for a single entity.
  • Target Object: Eloquent model object to be generated from scrap.
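
The relationship between a scrap and its hash, as used for the data integrity constraint listed under Top Features, can be sketched in plain PHP. This is a minimal sketch under stated assumptions: `scrapHash()` is a hypothetical helper, hashing the JSON-encoded attributes is an illustrative choice, and an in-memory array stands in for the scraps table.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: derives a stable fingerprint for a scrap.
 * Keys are sorted so attribute order does not change the hash.
 */
function scrapHash(array $scrap, string $algorithm = 'sha512'): string
{
    ksort($scrap);

    return hash($algorithm, json_encode($scrap, JSON_THROW_ON_ERROR));
}

$seen = []; // hash => true; stands in for a lookup against the scraps table
$kept = []; // scraps that would go on to become target objects

$scraps = [
    ['title' => 'Room A', 'body' => 'Sunny room'],
    ['body' => 'Sunny room', 'title' => 'Room A'], // same data, different key order
    ['title' => 'Room B', 'body' => 'Quiet room'],
];

foreach ($scraps as $scrap) {
    $hash = scrapHash($scrap);
    if (isset($seen[$hash])) {
        continue; // duplicate scrap: do not create a second target object
    }
    $seen[$hash] = true;
    $kept[] = $scrap;
}
```

The default `'sha512'` mirrors the `hash_algorithm` value in the example config; any algorithm name accepted by PHP's `hash()` could be passed instead.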

Acknowledgements

This library is heavily inspired by and dependent on the Guzzle library, although several concepts may have been adjusted.
