Doogle is a search engine and web crawler which can search indexed websites and images

Overview

Doogle

Doogle is a search engine and web crawler. It crawls websites and indexes their pages and images together with keywords, so that they can be searched later.

It is written primarily in object-oriented PHP, with the intent of better understanding OOP and how web crawlers work.

DoogleHomepage-Preview

Features

  • Search sites
    • Displays title, URL and description
  • Search images
    • Hover over images to preview description (alt tag)
    • Masonry layout for searched images
    • Image preview using Fancybox
    • Image search page responds dynamically
  • Clean homepage
  • Filters broken image results
  • Organises search results by clicks/visits
  • Pagination system at the bottom of the search page
  • Shows 'results found' for search term

Setup and Usage

Server Setup

v1.0.0-beta.1 is supported and tested in PHP 7.4, 8.0 and 8.1.

XAMPP is the simplest way to get started, as Doogle needs a web server, PHP and a MySQL server. Please refer to XAMPP for the configuration of each.

On XAMPP, phpMyAdmin serves as the GUI for setting up the MySQL database.

Once logged in to phpMyAdmin, open the SQL tab and paste the contents of 'doogle-tables-no-data.sql' into the query field.

Image1-PHPMyAdmin

PHP Dependencies

mysql
pdo_mysql

SQL User Creation

Replace PASSWORD_HERE with a strong random password.

mysql> CREATE USER IF NOT EXISTS 'doogle'@'localhost' IDENTIFIED BY 'PASSWORD_HERE';

SQL User Permissions

The SQL user 'doogle' must have SELECT, INSERT and UPDATE privileges:

mysql> GRANT SELECT, INSERT, UPDATE ON `doogle`.* TO 'doogle'@'localhost';
  • INSERT is used for crawling
  • SELECT is required for the search engine to return queries
  • UPDATE is required to amend the clicks and broken results (see ./ajax/)

Connecting PHP to MySQL Server

In the file config.php the following must be entered correctly for your database configuration:

$dbname = "doogle";
$dbhost = "localhost";
$dbuser = "doogle";
$dbpass = "";

The file 'doogle-tables-no-data.sql' creates the database under the name 'doogle'.
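
As a rough illustration of how these four values might be used, the sketch below builds a PDO DSN and opens a connection. The helper names buildDsn() and openDoogleConnection() and the utf8mb4 charset are assumptions made for this example; they are not taken from Doogle's config.php.

```php
<?php
// Values mirror the config.php settings listed above.
$dbname = "doogle";
$dbhost = "localhost";
$dbuser = "doogle";
$dbpass = "";

/** Build a MySQL DSN for PDO. The charset is an assumption, not a Doogle setting. */
function buildDsn(string $host, string $name): string
{
    return "mysql:host={$host};dbname={$name};charset=utf8mb4";
}

/** Hypothetical helper: open the connection; PDO throws PDOException on failure. */
function openDoogleConnection(string $dsn, string $user, string $pass): PDO
{
    return new PDO($dsn, $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}
```

Calling openDoogleConnection(buildDsn($dbhost, $dbname), $dbuser, $dbpass) requires the pdo_mysql extension and a running MySQL server.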

Crawling Websites to Populate Images and Sites tables

Form-based crawl

In your browser, go to where the file is hosted, for example http://localhost/crawl-formSubmit.php

Paste the URL into the input field and press the Crawl button.

Manual crawl

At the bottom of crawl-manual.php the variable $startUrl is where to paste the URL of the website to be crawled:

$startUrl = "https://thehackernews.com/";

Then, in your browser, go to where the file is hosted, for example http://localhost/crawl-manual.php

Explanation

The crawling process will take some time; how long depends entirely on the size of the website being crawled. The page will continue to load (without output) until the crawl.php script finishes.
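
A crawler typically spends its time fetching each page and parsing it for further links to follow. As a rough illustration of the parsing step only, the sketch below collects the href values from one HTML string using PHP's DOMDocument. extractLinks() is a hypothetical helper written for this example, and Doogle's own crawler may work differently.

```php
<?php
/** Hypothetical helper: return the href of every <a> element in an HTML string. */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Skip empty links and in-page fragments such as "#top".
        if ($href !== '' && $href[0] !== '#') {
            $links[] = $href;
        }
    }
    return $links;
}
```

Relative links such as "/about" would still need to be resolved against the page URL before they can be crawled.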

Check the tables images and sites in the database to ensure they are being populated.

Image2-PHPMyAdmin

Once the tables are populated visit the Doogle homepage and search! See preview images.

Programming Logic

Pagination

Logic of pagination system

The pagination system is implemented inside search.php.

image demonstrating pagination

In the example above, currentPage=11. The number of pages to show is always 10.

Results Per Page

Site search will return 20 results per page and image search will return 30 results per page.

The results per page can be changed inside search.php on lines 83 and 88 respectively, as indicated by the $pageSize variables:

Search-resultsPerPage
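
A page size is usually combined with the page number to decide which rows to fetch. The sketch below illustrates that arithmetic with the page sizes quoted above; pageSizeFor() and resultOffset() are hypothetical helpers, not functions from search.php.

```php
<?php
/** Page sizes quoted in the text: 20 for site search, 30 for image search. */
function pageSizeFor(string $type): int
{
    return $type === 'images' ? 30 : 20;
}

/** Zero-based offset of the first result on a 1-based page number. */
function resultOffset(int $page, int $pageSize): int
{
    return max(0, ($page - 1) * $pageSize);
}
```

For example, page 3 of an image search would start at offset resultOffset(3, pageSizeFor('images')).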

Handling an edge case

An edge case occurs when the current page is close to the last page, so that fewer than 10 further pages are available.

For example, 331 results at 20 results per page give 17 pages. Without handling this case, the pagination UI would allow scrolling through pages which don't exist, each of which would return an empty result.

To handle the edge case, the following logic is implemented around the while-loop:

if($currentPage + $pagesLeft > $numPages + 1)
    $currentPage = $numPages + 1 - $pagesLeft;

while($pagesLeft != 0 && $currentPage <= $numPages) 
{ ... }
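
The check above can be tried out in isolation. The sketch below wraps it in a hypothetical paginationWindow() function that returns the page numbers to display. The variable names follow the snippet above, while the function itself, the $pagesToShow parameter and the step that centres the window on the requested page are assumptions. It also assumes $pageSize is greater than zero.

```php
<?php
/** Hypothetical helper: list the page numbers shown in the pagination bar. */
function paginationWindow(int $page, int $numResults, int $pageSize, int $pagesToShow = 10): array
{
    $numPages = (int) ceil($numResults / $pageSize);
    $pagesLeft = min($pagesToShow, $numPages);

    // Assumed: centre the window on the requested page, never before page 1.
    $currentPage = max(1, $page - intdiv($pagesToShow, 2));

    // Edge case from the text: pull the window back so it ends on the last page.
    if ($currentPage + $pagesLeft > $numPages + 1) {
        $currentPage = $numPages + 1 - $pagesLeft;
    }

    $pages = [];
    while ($pagesLeft != 0 && $currentPage <= $numPages) {
        $pages[] = $currentPage;
        $currentPage++;
        $pagesLeft--;
    }
    return $pages;
}
```

With 331 results and 20 per page, requesting page 17 yields pages 8 to 17 under these assumptions, so no link points past the last page.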

Image Search

Image Captions

To make image searches more informative, the image's 'alt' attribute is included in the search, as shown in ./classes/ImageResultsProvider.php on line 34:

ImageResultsProvider-query
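
The real matching happens in the SQL query. As a database-free illustration of the idea, the sketch below checks a search term against an image's alt text. The imageMatches() helper, the array keys and the assumption that the title is matched as well are all made up for this example.

```php
<?php
/**
 * Hypothetical helper: true when the term occurs in the image's title or alt text.
 * Matching the title as well as the alt text is an assumption.
 */
function imageMatches(array $image, string $term): bool
{
    if ($term === '') {
        return false; // an empty search term matches nothing
    }
    // stripos() is a case-insensitive substring search.
    return stripos($image['title'], $term) !== false
        || stripos($image['alt'], $term) !== false;
}
```

An image titled "Logo" with the alt text "BBC News logo" would be found by a search for "bbc" only because the alt text is taken into account.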

Loading Images with JavaScript

In the 'images' table, there is a column 'broken' which flags images that return an error.

A pure server-side solution cannot tell whether an image fails to load in the browser, so images are loaded dynamically with JavaScript and AJAX is used to report the broken ones, as shown in ./assets/js/script.js:

script js-loadImage-broken

Masonry

Image searches use Masonry, a cascading grid layout library.

Masonry, used here with jQuery, arranges the images in a responsive grid layout. The image below shows an example layout:

Masonry-item-layout

Site Search - Trimming Results

As shown in the preview images, a site search in Doogle returns the title, URL and description for each result.

However, to make some results easier to read, a trimming process is performed. Inside ./classes/SiteResultsProvider.php the function trimField() is called:

SiteResultsProvider-trim1

SiteResultsProvider-trim2

Titles are trimmed at 55 characters and descriptions are trimmed at 230 characters.
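
A trimming function along these lines can be very small. The sketch below is an assumption about how such a function might look, not a copy of the one in SiteResultsProvider.php. The name trimFieldSketch() and the appended "..." are illustrative, while the limits 55 and 230 come from the text above.

```php
<?php
/**
 * Hypothetical sketch: cut a string at $characterLimit and mark the cut with dots.
 * strlen()/substr() count bytes, so multi-byte text may be cut mid-character.
 */
function trimFieldSketch(string $string, int $characterLimit): string
{
    $dots = strlen($string) > $characterLimit ? "..." : "";
    return substr($string, 0, $characterLimit) . $dots;
}

// Limits quoted in the text above.
$titleLimit = 55;
$descriptionLimit = 230;
```

A title would then be passed through trimFieldSketch($title, $titleLimit) before being printed.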

Telemetry

Both the 'images' and 'sites' tables in the database have a 'clicks' column, which holds a counter for each row.

The 'clicks' field is increased each time a site is visited or image is previewed.

When performing a search, the results are returned in descending order of clicks. This behaviour comes from the $query inside the getResultsHtml() function of ./classes/SiteResultsProvider.php (see line 43):

SiteResultsProvider-getResultsHtml
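
In Doogle the ordering is done by the database query. The effect can be illustrated without a database by sorting an array of results in plain PHP, as in the sketch below. sortByClicks() and the array keys are hypothetical and exist only for this example.

```php
<?php
/** Hypothetical helper: return the results with the most-clicked entry first. */
function sortByClicks(array $results): array
{
    // Comparing $b to $a (rather than $a to $b) gives descending order.
    usort($results, fn(array $a, array $b): int => $b['clicks'] <=> $a['clicks']);
    return $results;
}

$results = [
    ['url' => 'https://example.com/a', 'clicks' => 2],
    ['url' => 'https://example.com/b', 'clicks' => 9],
    ['url' => 'https://example.com/c', 'clicks' => 5],
];
$ordered = sortByClicks($results);
```

After the call, $ordered lists the entry with 9 clicks first and the entry with 2 clicks last.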

User-Agent

The user-agent string used during crawling is set inside ./classes/DomDocumentParser.php, as indicated on line 9:

DomDocumentParser-bot
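
One common way to send a custom user-agent from PHP is a stream context, sketched below. The helper name buildCrawlerContext() and the string 'doogleBot/0.1' are assumptions for illustration. The actual value is the one on line 9 of DomDocumentParser.php.

```php
<?php
/** Hypothetical helper: create an HTTP stream context that sends the given user-agent. */
function buildCrawlerContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as the User-Agent request header
        ],
    ]);
}

// Assumed user-agent string; replace with the value used by the crawler.
$context = buildCrawlerContext('doogleBot/0.1');
```

The context can then be passed as the third argument of file_get_contents($url, false, $context) when fetching a page.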

Preview Images

Doogle Homepage

Image3-DoogleHomepage-Edge

Doogle Search - Sites

Image4-DoogleSearch-PoC

Doogle Search - Images

Image5-DoogleSearch-PoC-images

Image Preview

Image preview is done using Fancybox.

The title, image URL and site URL are available on the bottom left corner.

Image9-DoogleSearch-imagePreview

Pagination System

Naturally, certain search terms, such as 'bbc', may return many results.

Doogle displays only 20 sites per page. At the bottom of the page, links to the next 10 pages are shown.

Results Shown

Image6-DoogleSearch-pagination-ResultsShown

Bottom of Page

Image7-DoogleSearch-pagination-Bottom

Bottom of Page 13

Image8-DoogleSearch-pagination-scrollingThrough

doogleBot Crawl Form

An HTML form to submit a URL for crawling

Image10-doogleBot-Crawler-formpng

Preview Video

Doogle Search demo - YouTube

Comments
  • Indexing problem with Polish characters

    Email body below.

    Hi, I use your Doogle and I have a problem with Polish characters: partially indexed pages show Polish characters normally, but some pages do not exist. Do you know how to solve it? Thank you and best regards.

    opened by safesploit 8
  • Checkup ? Question (Sitemap Crawl Functionality | Crawling Description Question)

    Hey, I wanted to ask you some questions.

    I wanted to use your search engine to index a website using sitemap.xml (index and crawl the whole content of the website). This way it would be easier to point the engine at the pages it needs to search, and much easier to find the content you are looking for.

    I followed your README file, but each time Doogle crawls a website I find that it only saves the page title and the website description. For example, with The Hacker News website, when I index it and search for a keyword, the result is almost the same (the description), but the URL is present and the title is not.

    For example, when I search for Malware, the result is:

    title: Malware Strains Targeting Python and JavaScript Developers
    description: The Hacker News is the most trusted and popular cybersecurity publication for information security professionals seeking breaking news, actionable insights
    URL: https://thehackernews.com/2022/12/malware-strains-targeting-python-and.html

    As you can see, the description is the main website description instead of the blog page's.

    I am not sure if I am missing something.

    opened by RedWilly 3
  • Question?

    Does this search engine follow the robots.txt file, or can it ignore it and index the site either way?

    I am trying to index my Blogspot domain, but because of the robots.txt file I can't do that. Is there any way to index the site while ignoring robots.txt?

    opened by RedWilly 2
  • What PHP versions are you writing this in?

    I have seen some other versions, but they quit working with the newer versions of PHP. So, I would like to know if you are at least using PHP 7 or above.

    opened by findborg 2
  • Bug: Crawling non-ASCII characters (URL)

    When crawling the Japanese Wikipedia page ja.wikipedia.org/wiki/メインページ, the following URL is indexed: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

    opened by safesploit 0
  • Bug: Search box not displaying on iOS Safari

    Searchbox related to .mainSection .searchContainer .searchBox is not displaying on iOS 16 Safari.

    Believed to be an issue related to border: none; box-shadow: 0px 2px 2px 0px rgba(0,0,0,0.16), 0px 0px 0px 1px rgba(0,0,0,0.08); not rendering correctly in mobile Safari.

    Possible remedy is to attempt WebKit translation or opt for compatible CSS under @media only screen and (max-width: 700px).

    Notes

    Update as a patch.

    opened by safesploit 0
Releases(v1.1.2-beta)
Owner
Zepher Ashe
GPG Software Signing Key ID: 36DE9A59CD869EE2