A program that scrapes online web content (APIs, RSS feeds, or websites) and notifies you when a search term is found.

Aamer

Last update: Nov 8, 2022

s3n

Search-Scan-Save-Notify

s3n is a PHP program that scrapes online web content (APIs, RSS feeds, or websites) and sends a notification when a search term is found.

For security reasons, all errors are reported by email 📧 and are not displayed in the GUI. To configure email and the other requirements, keep reading.

The goal was to scrape news sites and custom RSS feeds and send a notification when a search criterion is met or when the site changes. The program uses free API keys for translation. I used IBM's Watson; you may choose Google or another service.

Requirements

PHP server
PHP cURL extension
PHP mailer
A cron job to automate the runs

What does the program do?

Fetch a hardcoded URL (using PHP cURL) and store the contents in a temp file in the comp (comparison) directory.
After a specified interval (set with a cron job), fetch the same URL again and compare the new response with the one stored in the comp directory.
If the two responses are the same, do nothing.
If they differ, search the new response for the terms listed in search.php.
If no search term matches, do nothing. Otherwise, replace the temp file in the comp directory, store a copy in the dumps directory (named after the date and time of the cURL request), and send an email notification (which can be replaced by a Telegram or Slack notification) to the hardcoded address.
Repeat the process.
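
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in principal.php: the function names (fetchUrl, matchedTerms, processResponse), the .html dump extension, and the notify callback are all assumptions made for the example.

```php
<?php
// Hypothetical sketch of one compare-search-save-notify cycle.
// None of these names come from the project itself.

/** Fetch a URL with cURL and return the body, or null on failure. */
function fetchUrl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

/** Return the search terms that occur in $content (ASCII case-insensitive). */
function matchedTerms(string $content, array $terms): array
{
    return array_values(array_filter(
        $terms,
        fn(string $term): bool => $term !== '' && stripos($content, $term) !== false
    ));
}

/**
 * Process one fetched response. Returns the matched terms,
 * or an empty array when nothing needs to be notified.
 */
function processResponse(
    string $body,
    array $terms,
    string $compFile,
    string $dumpDir,
    callable $notify
): array {
    if (!is_file($compFile)) {
        file_put_contents($compFile, $body); // first run: store the baseline
        return [];
    }
    if (file_get_contents($compFile) === $body) {
        return []; // response unchanged
    }
    $hits = matchedTerms($body, $terms);
    if ($hits === []) {
        return []; // changed, but no search term matched
    }
    file_put_contents($compFile, $body); // replace the comparison copy
    $dumpFile = $dumpDir . '/' . date('Y-m-d_H-i-s') . '.html';
    file_put_contents($dumpFile, $body); // keep a timestamped dump
    $notify($hits, $dumpFile);           // e.g. send an email
    return $hits;
}
```

In a real run, the result of fetchUrl() would be passed to processResponse(), and the notify callback would send the email (or a Telegram or Slack message). Keeping the notifier as a callback is a design choice for this sketch so that the channel can be swapped without touching the comparison logic.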

What else can the program do?

Translate the site using IBM's Watson API. The contents of the site (say, in Russian) are sent to Watson's API, and the response (say, in English) is searched for the search terms; any match is then reported by email.
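
As a rough illustration of the translation step, the sketch below assumes the REST shape of IBM's Language Translator v3 service (a POST to /v3/translate with a JSON body and API-key basic auth). The function names, the version date, and the way credentials are passed are assumptions for this example and may not match watsonlate.php; check the current IBM documentation and service availability before relying on it.

```php
<?php
// Hypothetical helpers; watsonlate.php in the project may work differently.

/** Build the JSON request body for a translation request. */
function buildTranslatePayload(string $text, string $modelId): string
{
    return json_encode(
        ['text' => [$text], 'model_id' => $modelId],
        JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );
}

/** Extract the translated strings from a JSON response body. */
function extractTranslations(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['translations']) || !is_array($data['translations'])) {
        return []; // not the response shape assumed here
    }
    return array_column($data['translations'], 'translation');
}

/** Send $text to the service and return the translation, or null on failure. */
function translate(string $serviceUrl, string $apiKey, string $text, string $modelId = 'ru-en'): ?string
{
    // Assumed endpoint and version parameter.
    $ch = curl_init($serviceUrl . '/v3/translate?version=2018-05-01');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildTranslatePayload($text, $modelId),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_USERPWD        => 'apikey:' . $apiKey, // assumed auth scheme
        CURLOPT_TIMEOUT        => 30,
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        return null;
    }
    $translations = extractTranslations($response);
    return $translations === [] ? null : implode("\n", $translations);
}
```

The translated text returned by translate() would then be searched for the terms in search.php instead of the original page contents.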

Example

You can integrate it with xss.is (a Russian-language hacking platform).

Changes required before you begin

watsonlate.php (IBM Watson translator) requires $turl (line 2), a language translation model such as ru-en (line 3), and an auth token (line 12). The token requires signing up for the Watson service. A free trial account can translate 20K words per month, so use it judiciously: run the cron job at most twice a day.
search.php contains the list of terms you want to be notified about. Add each new term on its own line. Matching is case-insensitive (this is handled on line 20 of the principal file). The $qq variable is displayed in the subject line of the email.
principal.php is the main file for a particular website. Create one copy per website. Change line 6 to your target domain and the email address on line 11 to your own.
index.php is the file a cron job should request to start the whole process described above. Lines 6 to 9 are the important ones; the first and last lines only measure the round-trip time of the entire program for debugging.
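
The role of index.php can be pictured with the sketch below: run each per-site job in turn and measure the total round-trip time. The runJobs function, the job list, and the error callback are hypothetical and exist only for this example; the real index.php may simply include the principal files directly.

```php
<?php
// Hypothetical entry-point sketch; names are not taken from the project.

/**
 * Run each job in order and return the elapsed time in seconds.
 * A failing job is reported through $onError and does not stop the others.
 */
function runJobs(array $jobs, callable $onError): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    foreach ($jobs as $name => $job) {
        try {
            $job();
        } catch (Throwable $e) {
            $onError((string) $name, $e); // e.g. email the error
        }
    }
    return (hrtime(true) - $start) / 1e9;
}
```

Each job would typically be a closure that includes one copy of principal.php, and the error callback would email the failure so that nothing is shown in the GUI.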

Finally, enjoy!

If you have a bug to report or want to suggest a more efficient approach for any of the modules, send me a message via the contact form on https://aamershah.com.
I removed sensitive information before uploading this code. I hope I didn't remove anything important along with it.
