🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

crawlerdetect.io

About CrawlerDetect

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and the http_from header. It can currently detect thousands of bots/spiders/crawlers.

Installation

composer require jaybizzle/crawler-detect

Usage

use Jaybizzle\CrawlerDetect\CrawlerDetect;

$CrawlerDetect = new CrawlerDetect;

// Check the user agent of the current 'visitor'
if($CrawlerDetect->isCrawler()) {
    // true if crawler user agent detected
}

// Pass a user agent as a string
if($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')) {
    // true if crawler user agent detected
}

// Output the name of the bot that matched (if any)
echo $CrawlerDetect->getMatches();

Contributing

If you find a bot/spider/crawler user agent that CrawlerDetect fails to detect, please submit a pull request with the regex pattern added to the $data array in Fixtures/Crawlers.php and add the failing user agent to tests/crawlers.txt.

Failing that, just create an issue with the user agent you have found, and we'll take it from there :)
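
Before opening a pull request, it can help to confirm that a candidate pattern matches the failing user agent. The following is a minimal sketch, not part of CrawlerDetect: the helper name `patternMatchesUserAgent` is hypothetical, and it assumes patterns are regex fragments (with any `/` already escaped) that are matched case-insensitively.

```php
<?php
// Hypothetical helper for trying out a candidate pattern locally.
// Assumption: the pattern is a regex fragment, matched case-insensitively,
// and any "/" inside it is already escaped.
function patternMatchesUserAgent(string $pattern, string $userAgent): bool
{
    return preg_match('/' . $pattern . '/i', $userAgent) === 1;
}

// Sample values taken from the usage example in this README.
$pattern = 'Sosospider';
$userAgent = 'Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)';

echo patternMatchesUserAgent($pattern, $userAgent) ? "match\n" : "no match\n";
```

If the pattern matches here, the project's own test suite remains the authoritative check once the user agent is added to tests/crawlers.txt.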

Laravel Package

If you would like to use this with Laravel, please see Laravel-Crawler-Detect

Symfony Bundle

To use this library with Symfony 2/3/4, check out the CrawlerDetectBundle.

YII2 Extension

To use this library with the YII2 framework, check out yii2-crawler-detect.

ES6 Library

To use this library with NodeJS or any ES6-based application, check out es6-crawler-detect.

Python Library

To use this library in a Python project, check out crawlerdetect.

.NET Library

To use this library in a .NET Standard (including .NET Core) based project, check out NetCrawlerDetect.

Ruby Gem

To use this library with Ruby on Rails or any Ruby-based application, check out crawler_detect gem.

Go Module

To use this library with Go, check out the crawlerdetect module.

Parts of this class are based on the brilliant MobileDetect

Comments
  • Facebook Ads Robot

    Hi friends,

    First, sorry for not following the right process to add a new crawler. I'm a bit of a newbie in this world.

    I just found that Facebook is using a new structure to crawl ad links.

    Here are the user agents I collected:

    Mozilla\/5.0 (Linux; Android 6.0.1; SM-J500M Build\/MMB29M; wv) AppleWebKit\/537.36 (KHTML, like Gecko) Version\/4.0 Chrome\/62.0.3202.84 Mobile Safari\/537.36 [FB_IAB\/FB4A;FBAV\/152.0.0.42.136;]
    
    Mozilla\/5.0 (Linux; Android 6.0.1; SM-J700M Build\/MMB29K; wv) AppleWebKit\/537.36 (KHTML, like Gecko) Version\/4.0 Chrome\/62.0.3202.84 Mobile Safari\/537.36 [FB_IAB\/FB4A;FBAV\/152.0.0.42.136;]
    
    Mozilla\/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit\/603.3.8 (KHTML, like Gecko) Mobile\/14G60 [FBAN\/FBIOS;FBAV\/150.0.0.32.132;FBBV\/80278251;FBDV\/iPhone6,2;FBMD\/iPhone;FBSN\/iOS;FBSV\/10.3.3;FBSS\/2;FBCR\/O2;FBID\/phone;FBLC\/en_GB;FBOP\/5;FBRV\/0]
    
    Mozilla\/5.0 (iPhone; CPU iPhone OS 11_1_2 like Mac OS X) AppleWebKit\/604.3.5 (KHTML, like Gecko) Mobile\/15B202 [FBAN\/FBIOS;FBAV\/151.0.0.61.202;FBBV\/82156572;FBDV\/iPhone6,2;FBMD\/iPhone;FBSN\/iOS;FBSV\/11.1.2;FBSS\/2;FBCR\/TIM;FBID\/phone;FBLC\/pt_BR;FBOP\/5;FBRV\/83160404]
    

    I have attached an XLS file with all the user agents I collected.

    facebook-ads-crawlers.xlsx

    Can someone help me to add it to the crawler list? =)

    opened by andreladocruz 15
  • Googlebot is not detected

    Hi,

    I'm still parsing my logs, and hits from Googlebot are not detected. The trick is that it sends an ordinary browser User-Agent - Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/28.0.1500.71 Safari\/537.36 - so detection would have to read the HTTP From header, whose value is googlebot(at)googlebot.com. It is a big modification, but would it be considered?

    opened by romaricdrigon 13
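
The README states that CrawlerDetect reads the http_from header as well as the user agent. The idea raised in this issue can be sketched as follows. This is an illustration only, not the library's implementation: the function name `detectFromHeaders` and the one-entry pattern list are hypothetical.

```php
<?php
// Hypothetical sketch: test both the User-Agent and the From header
// against a list of regex fragments (matched case-insensitively).
function detectFromHeaders(array $server, array $patterns): bool
{
    // Keys follow PHP's $_SERVER naming for request headers.
    $candidates = [
        $server['HTTP_USER_AGENT'] ?? '',
        $server['HTTP_FROM'] ?? '',
    ];

    $regex = '/(' . implode('|', $patterns) . ')/i';

    foreach ($candidates as $value) {
        if ($value !== '' && preg_match($regex, $value) === 1) {
            return true;
        }
    }

    return false;
}

// Values taken from the issue above; the pattern list is an assumption.
$server = [
    'HTTP_USER_AGENT' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'HTTP_FROM' => 'googlebot(at)googlebot.com',
];

var_dump(detectFromHeaders($server, ['googlebot']));
```

In a real request, `$_SERVER` would be passed in place of the hand-built array.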
  • FEATURE Injectable components to support custom detection

    This allows, for instance, a custom app to whitelist / blacklist custom agents.

    For example, this is user code from an app that needs to manually whitelist WeChat.

        /**
         * Check if a request is a crawler, with a wechat exemption
         *
         * @param HTTPRequest $request
         * @return bool
         */
        protected function checkIsCrawler(HTTPRequest $request): bool
        {
            // Add wechat exemption to crawler detection
            $exclusions = new Exclusions();
            $exclusionList = array_merge(
                $exclusions->getAll(),
                ['MicroMessenger\/']
            );
            $exclusions->setAll($exclusionList);
    
            // Set custom exclusions to new detector
            $detect = new CrawlerDetect();
            $detect->setExclusions($exclusions);
    
            // Detect bots
            $userAgent = $request->getHeader('User-Agent');
            return $detect->isCrawler($userAgent);
        }
    

    Setters return $this to support chaining. E.g. $crawler->setUaHttpHeaders($headers)->setHeaders($appHeaders)

    opened by tractorcow 10
  • Googlebot has been detected.

    Hi Guys,

    Thanks for the amazing script, it works amazingly well for us.

    But today we saw that Googlebot with IP 66.249.73.157 was detected and blocked! Looking the IP up, it seems to be a genuine Googlebot address.

    Is there any way to exclude the well-known crawlers? :) Thanks again

    opened by Radwan-10 8
  • OPPO A33 Build/LMY47V

    Hi!

    This seems to be a bot, see also https://www.johnlarge.co.uk/blocking-aggressive-chinese-crawlers-scrapers-bots/

    Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3
    Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0
    Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36
    
    opened by AAKempf 7
  • The ability to extend crawlers list?

    Hi, really appreciate your work, guys, the package is great.

    But why is there no way to manually extend the default crawlers list? For example, I need to add "Sendsay.Ru/1.0; https://Sendsay.Ru/; [email protected]" (it's a niche Russian crawler), but I have no option other than to open an issue here. It would be really great if we could have some kind of configuration, or at least have Crawlers injected into the CrawlerDetect class via the constructor, so we could extend it and inject our custom Crawlers implementation without rewriting the whole constructor.

    Thanks)

    opened by godstanis 7
  • Remove single ; from exclusions list

    Because of issue #427 I think we shouldn't remove single characters in the exclusions list.

    The exclusions were partly to improve performance. We see much greater returns (by using these exclusions) when running over multiple user agents, so on a single request this should make very little to no difference.

    opened by MaxGiting 6
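
The role of the exclusions can be illustrated with a short sketch: common browser tokens are stripped from the user agent first, and the crawler patterns only run against whatever is left. This is a simplified illustration under stated assumptions, not the library's code: the function name `isCrawlerSketch` is hypothetical, and both lists are tiny hand-written stand-ins for the real ones.

```php
<?php
// Hypothetical, heavily shortened stand-ins for the real lists.
$exclusions = [
    'Mozilla\/[\d.]+',
    'AppleWebKit\/[\d.]+',
    'Chrome\/[\d.]+',
    'Safari\/[\d.]+',
    '\(KHTML, like Gecko\)',
    '\(Windows NT [\d.]+; Win64; x64\)',
];
$crawlers = ['bot', 'spider', 'crawl'];

function isCrawlerSketch(string $userAgent, array $exclusions, array $crawlers): bool
{
    // Step 1: strip well-known browser tokens from the user agent.
    $exclusionRegex = '/(' . implode('|', $exclusions) . ')/i';
    $remaining = trim((string) preg_replace($exclusionRegex, '', $userAgent));

    // Step 2: nothing left means an ordinary browser, so skip the crawler regex.
    if ($remaining === '') {
        return false;
    }

    // Step 3: run the crawler patterns against the shortened string only.
    return preg_match('/(' . implode('|', $crawlers) . ')/i', $remaining) === 1;
}

$browser = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';
$spider = 'Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)';

var_dump(isCrawlerSketch($browser, $exclusions, $crawlers));
var_dump(isCrawlerSketch($spider, $exclusions, $crawlers));
```

The early return in step 2 is where the performance benefit described above would come from when many ordinary browser user agents are processed.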
  • MicroMessenger bot

    UserAgent

    Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN
    
    opened by MikeVL 6
  • Problems with twitter and facebook bots

    Hello,

    We use crawler-detect to detect social network bots, and we've noticed that some bot user agents are not detected. Here they are. Twitter:

    • Mozilla/5.0 (compatible; TrendsmapResolver/0.1)
    • Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)
    • Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
    • Mozilla/5.0
    • Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

    Facebook:

    • Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.0 Mobile/14G60 Safari/602.1
    • Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1
    • Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1
    • Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1
    • Mozilla/5.0 (iPhone; CPU iPhone OS 12_0_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1

    I've put them in tests/crawlers.txt, but we cannot differentiate these user agents from ordinary user agents (except for the first one), so I have only added TrendsmapResolver to Fixtures/Crawlers.php.

    Could you please let me know how to recognize that they're bots?

    Yours sincerely,

    Mathilde

    opened by himiro 6
  • Incorrect agent result

    Mozilla/5.0 (Linux; Android 6.0.1; LG-K100 Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/50.0.2661.86 Mobile Safari/537.36 YandexSearch/5.75

    opened by gDemous 6
  • Remove 'okhttp' user agent as it is being used by home-assistant android app

    https://github.com/JayBizzle/Crawler-Detect/blob/master/raw/Crawlers.txt#L800

    bunkerized-nginx uses these lists, and it blocks access to home-assistant when trying to log in as:

    2021/02/16 12:55:07 [warn] 24951#24951: *35 [lua] access_by_lua(main-lua.conf:166):80: [BLOCK] User-Agent okhttp/4.9.0 is blacklisted, client: 10.0.2.100, server: www.example.com, request: "POST /auth/token HTTP/2.0", host: "www.example.com"
    

    Additional information

    https://github.com/home-assistant/android/search?q=okhttp&type=code

    opened by e-minguez 5
  • The ability to extend crawlers list for monitoring-like systems?

    Hello guys.

    First of all, thank you for your package. It is really helpful for projects I've been doing.

    I'm referencing existing issue #309, but from a slightly different point of view. My goal is not simply to add a few exotic bots to the list. I aim to exclude the custom-configured monitoring systems that we use, like Zabbix, Munin, etc. They perform health checks using HTTP requests with a specific UA suffix, which we configure ourselves. It can look like ${PROJECT_NAME}Monitoring, ${DOMAIN}Robot, ${NODE_NAME}Zabbix, etc.

    And at this point, we're stuck, because:

    • the code of this package doesn't imply extensibility of the crawlers list
    • adding our-specific UA suffixes to the crawlers list will not make any benefit to the community

    So my question: is there any chance that you'll reconsider your position on code extensibility, or do you perhaps have a workaround?

    For now, the only way I see to reach this goal (short of forking the project), and a slightly clumsy one, is to write custom classes that extend Crawlers and CrawlerDetect, so that CustomCrawlers contains extra lines for the custom monitoring systems and CustomCrawlerDetect uses it.

    opened by s-chizhik 3
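
The subclassing workaround described in this issue can be sketched roughly as below. To keep the example self-contained, `BaseCrawlers` is a simplified stand-in written for this sketch, not the package's real class, and `CustomCrawlers`, `matchesAny`, and the monitoring suffixes are hypothetical names.

```php
<?php
// Simplified stand-in for the package's crawler list (NOT the real class).
class BaseCrawlers
{
    protected $data = ['Googlebot', 'bingbot'];

    public function getAll(): array
    {
        return $this->data;
    }
}

// Custom list: keeps the inherited patterns and appends project-specific ones.
class CustomCrawlers extends BaseCrawlers
{
    // Hypothetical UA suffixes configured for in-house monitoring.
    protected $extra = ['ExampleProjectMonitoring', 'Zabbix'];

    public function getAll(): array
    {
        return array_merge(parent::getAll(), $this->extra);
    }
}

// Hypothetical matcher standing in for the detector that would consume the list.
function matchesAny(string $userAgent, BaseCrawlers $list): bool
{
    return preg_match('/(' . implode('|', $list->getAll()) . ')/i', $userAgent) === 1;
}

$ua = 'Mozilla/5.0 ExampleProjectMonitoring';

var_dump(matchesAny($ua, new BaseCrawlers()));
var_dump(matchesAny($ua, new CustomCrawlers()));
```

Overriding `getAll()` rather than replacing `$data` keeps the inherited patterns intact, so updates to the base list would still apply.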
  • Drop support for EOL PHP versions

    This is not a firm decision, just exploring the possibility of only supporting officially supported PHP versions - https://www.php.net/supported-versions.php

    opened by JayBizzle 2
  • Crawler from urlscan.io is undetected

    Hi,

    I've installed Crawler-Detect and it works fine, but urlscan.io can still crawl. So far I have not managed to obtain its fingerprints, but I wanted to give you a warning; maybe you will.

    Cordially

    help wanted 
    opened by callsecfrance 0
  • Bots with love from Russia

    Lines extracted from a few high-traffic websites in Russia. Each suggested line was checked and gave the «User agent is NOT a bot» result.

    Sample sources.

    @JayBizzle please approve the list, and after your «OK» I'll send a PR.

    Confirmed

    • [x] HTTP Tester/1.0 https://apps.apple.com/us/app/httpbot/id1232603544

    • [x] SimonWebHelper/1.0 CFNetwork/1179.0.1 Darwin/20.0.0 https://www.dejal.com/simon

    • [x] saelmon https://selectel.ru/services/additional/monitoring

    • [x] Zadarma API https://zadarma.com/ru/support/api

    • [x] Mozilla/5.0 (compatible; Artax) https://github.com/amphp/artax

    • [x] iOS/14.0 (18A5319i) dataaccessd/1.0 «...process responsible for background syncing of Exchange, iCloud, CalDAV, and other calendar data...»

    • [x] Atlassian Webhook HTTP Client Manually verified with a private Atlassian JIRA instance. I can't find any public mention in docs, blogs, etc.

    • [x] macOS/11.0 (20A5323l) CalendarAgent/949

    • [x] Mac+OS+X/10.15.6 (19G46c) CalendarAgent/930.5.1

    • [x] TLS tester from https://testssl.sh/dev/

    • [x] Orpho.Ru - v. 1.5.5

    • [x] Corax - [email protected]

    • [x] Mozilla/5.0 (compatible; Domains Project/1.1.0; +https://domainsproject.org)

    Suspicious

    • [ ] AppleCoreMedia/1.0.0.15E216 (iPhone; U; CPU OS 11_3 like Mac OS X; ru_ru)

    • [ ] U7IA53O8WRQLP5HTN36O 520895322210219123 178783837217128874 128152866426197781 170646164760074160 897891597049094717 10798953746823634 865980663587406199 83960315279034295 50879782656773070 425389085694464584 520832197366222982 566888253833316094 497618096526395537 425659151411788951 494971665505918016 870530337619880183 349139421128410017 627636913921908195 757052309832051516 923961866520554027 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

    • [ ] P5IFYN4 466155777352526040 187582494976457628 205329424338575162 290476900380796124 777992341037289367 321980071372304533 644559670148391318 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

    • [ ] Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; MRSPUTNIK 2, 4, 0, 386; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; BRI/2; AskTbFXTV5/5.14.1.20007)

    • [ ] Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Pivim Multibar; MRSPUTNIK 2, 4, 0, 171; MRA 5.7 (build 03796))

    opened by antonydevanchi 3
Releases(v1.2.112)
Owner
Mark Beech
Father of two, Airplane Geek, Beer Lover, Yorkshire to the core, Rugby League fan, Tech Nerd!
Goutte, a simple PHP Web Scraper

Goutte, a simple PHP Web Scraper Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extrac

null 9.1k Jan 1, 2023
A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5. If you are st

Matthijs van den Bos 1.3k Dec 28, 2022
A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tes

Symfony 2.7k Dec 31, 2022
:spider: The progressive PHP crawler framework! 优雅的渐进式PHP采集框架。

QueryList QueryList is a simple, elegant, extensible PHP Web Scraper (crawler/spider) ,based on phpQuery. API Documentation 中文文档 Features Have the sam

Jaeger(黄杰) 2.5k Dec 27, 2022
PHP Discord Webcrawler to log all messages from a Discord Chat.

Disco the Ripper was created to rip all messages from a specific Discord channel into JSON via the CLI and to help people investigate servers that have awkward channels before they get deleted.

Daniel Reis 46 Sep 21, 2022
PHP DOM Manipulation toolkit.

phpQuery The PHP DOM Manipulation toolkit. Motivation I'm working currently with PHP, and I've missed using something like jQuery in PHP to manipulate

João Eduardo Fornazari 1 Nov 26, 2021
This project is for the digikala.com scraping challenge of Black Friday 2021, using php/laravel/horizon

Objective This script is intended for finding the hidden treasure, A scraping challenge by digikala for 2021 black Friday Prerequisites Php mysql redi

ǃшɒʞɒH ǃǀɄ 1 Dec 22, 2021
Beanbun 是用 PHP 编写的多进程网络爬虫框架,具有良好的开放性、高可扩展性,基于 Workerman

Beanbun 是用 PHP 编写的多进程网络爬虫框架,具有良好的开放性、高可扩展性,基于 Workerman

Kidd Yu 1.2k Dec 19, 2022
Symfony bundle for Roach PHP

roach-php-bundle Symfony bundle for Roach PHP. Roach is a complete web scraping toolkit for PHP. It is a shameless clone heavily inspired by the popul

Pisarev Alexey 7 Sep 28, 2022
PHP scraper for ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

It can scrape ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

null 1 Mar 24, 2022
PHP library to Scrape website into entity easily

Scraper Scraper can handle multiple request type and transform them into object in order to create some API. Installation composer require rem42/scrap

null 5 Dec 18, 2021
Roach is a complete web scraping toolkit for PHP

Roach A complete web scraping toolkit for PHP About Roach is a complete web scraping toolkit for PHP. It is heavily inspired (read: a shameless clo

Roach PHP 1.1k Jan 3, 2023
Mobile_Detect is a lightweight PHP class for detecting mobile devices (including tablets). It uses the User-Agent string combined with specific HTTP headers to detect the mobile environment.

Motto: "Every business should have a detection script to detect mobile readers." About Mobile Detect is a lightweight PHP class for detecting mobile d

Şerban Ghiţă 10.2k Jan 4, 2023
Helps detect the user's browser and platform at the PHP level via the user agent

cbschuld/browser.php Helps detect the user's browser and platform at the PHP level via the user agent Installation You can add this library as a local

Chris Schuld 574 Dec 16, 2022
PHP class for parsing user agent strings (HTTP_USER_AGENT).

PHP class for parsing user agent strings (HTTP_USER_AGENT). Includes mobile checks, bots and banned bots checks, browser types/versions and more. Based on browscap (via phpbrowscap), Mobile_Detect and ua-parser. Created for high traffic websites and fast batch processing.

Mikolaj Misiurewicz 44 Jul 26, 2022
👮 A PHP desktop/mobile user agent parser with support for Laravel, based on Mobiledetect

Agent A PHP desktop/mobile user agent parser with support for Laravel, based on Mobile Detect with desktop support and additional functionality. Insta

Jens Segers 4.2k Jan 5, 2023
Lightning Fast, Minimalist PHP User Agent String Parser.

Lightning Fast, Minimalist PHP User Agent String Parser.

Jesse Donat 523 Dec 21, 2022
The Universal Device Detection library will parse any User Agent and detect the browser, operating system, device used (desktop, tablet, mobile, tv, cars, console, etc.), brand and model.

DeviceDetector Code Status Description The Universal Device Detection library that parses User Agents and detects devices (desktop, tablet, mobile, tv

Matomo Analytics 2.4k Jan 5, 2023