s3n

Search-Scan-Save-Notify

A PHP program that scrapes online web content (APIs, RSS feeds, or websites) and sends a notification when a search term is matched.

  • For security reasons, all errors are reported by email 📧 and are not displayed in the GUI. Email setup and the other configuration steps are described below.

The goal was to scrape news sites and custom RSS feeds and send a notification when a search criterion is matched or when the site changes. Translation uses a free API key. I used IBM's Watson, but you may choose Google or another provider.

Requirements

  • PHP server
  • PHP cURL extension
  • PHP mailer
  • Cron job for automation

What does the program do?

  • Fetch a hardcoded URL (using PHP cURL) and store the contents in a temp file in the comp (comparison) directory.
  • After a specified interval (set by the cron job), fetch the same URL again and compare the new response with the one stored in the comp directory.
  • If the two responses are the same, do nothing.
  • If they differ, search the new response for the terms listed in search.php.
  • If no search term matches, do nothing. Otherwise, replace the temp file in the comp directory, store a copy in the dumps directory (the file name is the date and time of the cURL request), and send an email notification to the hardcoded address (this can be replaced by a Telegram or Slack notification).
  • Repeat the process.
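
The cycle above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this repository: the function names (fetchUrl, matchTerms, processResponse), the file paths, and the $notify callback are all assumptions made for the example.

```php
<?php
// Hypothetical sketch of one fetch / compare / search / save / notify cycle.
// All names and paths here are illustrative assumptions, not the project's code.

/** Fetch a URL with cURL; returns the body, or null on failure. */
function fetchUrl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

/** Return the search terms found in $body (case-insensitive for ASCII). */
function matchTerms(string $body, array $terms): array
{
    $hits = [];
    foreach ($terms as $term) {
        if ($term !== '' && stripos($body, $term) !== false) {
            $hits[] = $term;
        }
    }
    return $hits;
}

/**
 * Process one freshly fetched response.
 * Returns the matched terms (empty if nothing changed or no term matched).
 */
function processResponse(string $body, array $terms, string $compFile, string $dumpDir, callable $notify): array
{
    $previous = is_file($compFile) ? file_get_contents($compFile) : null;
    if ($previous === $body) {
        return [];                       // same as the stored copy: do nothing
    }
    $hits = matchTerms($body, $terms);
    if ($hits === []) {
        return [];                       // changed, but no term matched: do nothing
    }
    file_put_contents($compFile, $body); // replace the comparison copy
    file_put_contents($dumpDir . '/' . date('Y-m-d_H-i-s') . '.html', $body); // timestamped dump
    $notify($hits);                      // e.g. send an email, Telegram, or Slack message
    return $hits;
}
```

A cron-driven script would call fetchUrl(), skip the run if it returns null, and otherwise pass the body to processResponse() together with the term list and a callback that sends the mail. Note that stripos() is only case-insensitive for ASCII; for terms in other alphabets, mb_stripos() from the mbstring extension may be a better fit.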

What else can the program do?

  • Translate the site using IBM's Watson API. The contents of the site (say, in Russian) are sent to Watson's API, and the response (say, in English) is searched for the search terms; any match is then reported by email.
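
As a rough illustration of the translation step, the sketch below builds a request, sends it with cURL, and pulls the translated text out of the response. The endpoint path (/v3/translate), the version date, the apikey basic-auth scheme, and the JSON shapes are assumptions based on the Watson Language Translator v3 API as I understand it; check them against your own service instance. The function names are hypothetical and are not taken from watsonlate.php.

```php
<?php
// Sketch only. The endpoint path, version date, auth scheme, and JSON shapes
// are ASSUMPTIONS about Watson Language Translator v3; verify before use.

/** Build the JSON request body for one translation call. */
function buildTranslateRequest(string $text, string $modelId): string
{
    return json_encode(['text' => [$text], 'model_id' => $modelId], JSON_UNESCAPED_UNICODE);
}

/** Extract the translated strings from a raw JSON response. */
function extractTranslations(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['translations']) || !is_array($data['translations'])) {
        return [];
    }
    return array_column($data['translations'], 'translation');
}

/** Send $text for translation; returns the translated text, or null on failure. */
function translate(string $text, string $serviceUrl, string $apiKey, string $modelId = 'ru-en'): ?string
{
    $ch = curl_init($serviceUrl . '/v3/translate?version=2018-05-01');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_USERPWD        => 'apikey:' . $apiKey,   // assumed basic-auth scheme
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_POSTFIELDS     => buildTranslateRequest($text, $modelId),
        CURLOPT_TIMEOUT        => 30,
    ]);
    $response = curl_exec($ch);
    if (!is_string($response)) {
        return null;
    }
    $parts = extractTranslations($response);
    return $parts === [] ? null : implode("\n", $parts);
}
```

The translated string returned by translate() can then be searched for the terms exactly like an untranslated response. Keeping the request builder and the response parser separate from the network call makes both easy to check without spending any of the free translation quota.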

Example

For example, you can point the program at xss.is (a Russian-language hacking forum).

Changes are required before you begin

  • watsonlate.php (IBM Watson Translator): set $turl (line 2), the translation model, such as ru-en (line 3), and the auth token (line 12). The token requires signing up for the Watson service. A free trial account can translate 20K words per month, so use it judiciously; run the cron job at most twice a day.
  • search.php contains the list of terms you want to be notified about. Add each new term on a new line. Matching is case-insensitive (this is handled in line 20 of principal.php). The $qq variable is shown in the subject line of the email.
  • principal.php is the main file for a particular website. Create one copy per website. Change line 6 to your target domain and the email address in line 11 to your own.
  • index.php is the file the cron job should request to start the whole process described above. Lines 6 to 9 are the important ones. The first and last lines only calculate the round-trip time of the entire program for debugging purposes.
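
To make the role of index.php more concrete, here is a minimal sketch of an entry point that runs each per-site job and measures the round-trip time. The runJobs function and the job list are hypothetical; the real index.php may be organised differently.

```php
<?php
// Hypothetical index.php-style entry point; names and the job list are illustrative.

/**
 * Run each per-site job in order and return the total elapsed time in seconds.
 * Each job is a callable standing in for one copy of principal.php.
 */
function runJobs(array $jobs): float
{
    $start = microtime(true);          // start the round-trip timer
    foreach ($jobs as $job) {
        $job();
    }
    return microtime(true) - $start;   // elapsed time, for debugging only
}

$elapsed = runJobs([
    function () { /* e.g. require 'principal.php'; */ },
]);
printf("Round trip: %.3f s\n", $elapsed);
```

Wrapping each site in a callable keeps the timing code in one place, so adding another website only means adding another entry to the list.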

Finally, enjoy!

  • If you have a bug to report or want to suggest a more efficient approach for any of the modules, send me a message via the contact form on https://aamershah.com.
  • I removed sensitive information before uploading this code. I hope I did not remove anything important.

Owner

Aamer