[DEPRECATED] Library for extraction of domain parts e.g. TLD. Domain parser that uses Public Suffix List

Overview

DEPRECATED

Consider using https://github.com/jeremykendall/php-domain-parser as a maintained alternative.

TLDExtract

TLDExtract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL; in other words, it is a domain parser. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last two elements only goes a long way if you are thinking of simple domains such as .com. Consider parsing http://forums.bbc.co.uk, for example: the naive splitting method above will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
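
To make the failure concrete, here is a minimal sketch of the naive approach. The function name naive_split is hypothetical and is not part of TLDExtract; it only illustrates why splitting on dots breaks for multi-label suffixes.

```php
<?php
// Naive approach (hypothetical helper, not part of TLDExtract):
// split on '.' and treat the last two labels as domain and TLD.
function naive_split(string $host): array
{
    $labels = explode('.', $host);
    $tld = array_pop($labels);      // last label
    $domain = array_pop($labels);   // second-to-last label
    return [
        'subdomain' => implode('.', $labels),
        'domain'    => $domain,
        'tld'       => $tld,
    ];
}

// Works for simple cases: domain "cnn", tld "com".
print_r(naive_split('forums.news.cnn.com'));

// Breaks for multi-label suffixes: domain "co", tld "uk".
print_r(naive_split('forums.bbc.co.uk'));
```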

TLDExtract on the other hand knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
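
The idea of a suffix-list lookup can be sketched as follows. This is not the library's implementation: split_host is a hypothetical function, the three-entry suffix list is a stand-in for the real Public Suffix List, and wildcard and exception rules are ignored.

```php
<?php
// Hypothetical sketch of a suffix-list lookup, not TLDExtract's actual code.
// $suffixes is a tiny stand-in for the Public Suffix List.
function split_host(string $host, array $suffixes): array
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);

    // Try the longest candidate suffix first, always leaving
    // at least one label to serve as the registered name.
    for ($i = 1; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            return [
                'subdomain' => implode('.', array_slice($labels, 0, $i - 1)),
                'hostname'  => $labels[$i - 1],
                'suffix'    => $candidate,
            ];
        }
    }

    // No known suffix matched: return the input as the hostname.
    return ['subdomain' => '', 'hostname' => $host, 'suffix' => ''];
}

$sampleSuffixes = ['com', 'uk', 'co.uk'];
print_r(split_host('forums.bbc.co.uk', $sampleSuffixes));
print_r(split_host('forums.news.cnn.com', $sampleSuffixes));
```

Checking longer candidates first is what lets 'co.uk' win over 'uk' for forums.bbc.co.uk.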

$result = tld_extract('http://forums.news.cnn.com/');
var_dump($result);

object(LayerShifter\TLDExtract\Result)#34 (3) {
  ["subdomain":"LayerShifter\TLDExtract\Result":private]=>
  string(11) "forums.news"
  ["hostname":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "cnn"
  ["suffix":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "com"
}

Result implements the ArrayAccess interface, so you can read its fields like array elements.

var_dump($result['subdomain']);
string(11) "forums.news"
var_dump($result['hostname']);
string(3) "cnn"
var_dump($result['suffix']);
string(3) "com"

You can also convert the result to JSON.

var_dump($result->toJson());
string(59) "{"subdomain":"forums.news","hostname":"cnn","suffix":"com"}"
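
For readers unfamiliar with ArrayAccess, the array-style access and JSON conversion shown above can be sketched with a small stand-alone class. ParsedHost is a hypothetical name and this is not the library's Result class; it only assumes read-only fields held in an array.

```php
<?php
// Illustrative stand-in, not LayerShifter\TLDExtract\Result.
final class ParsedHost implements ArrayAccess
{
    private $fields;

    public function __construct(array $fields)
    {
        $this->fields = $fields;
    }

    public function offsetExists($offset): bool
    {
        return array_key_exists($offset, $this->fields);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        if (!array_key_exists($offset, $this->fields)) {
            throw new OutOfRangeException(sprintf('Unknown field "%s"', $offset));
        }
        return $this->fields[$offset];
    }

    // Writes are rejected so the parsed result stays immutable.
    public function offsetSet($offset, $value): void
    {
        throw new LogicException('Result is immutable');
    }

    public function offsetUnset($offset): void
    {
        throw new LogicException('Result is immutable');
    }

    public function toJson(): string
    {
        return json_encode($this->fields);
    }
}

$parsed = new ParsedHost([
    'subdomain' => 'forums.news',
    'hostname'  => 'cnn',
    'suffix'    => 'com',
]);

var_dump($parsed['hostname']);
var_dump($parsed->toJson());
```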

This package is compliant with PSR-1, PSR-2, PSR-4. If you notice compliance oversights, please send a patch via pull request.

Does TLDExtract make requests to Public Suffix List website?

No. TLDExtract uses a database from TLDDatabase that is generated from the Public Suffix List and updated regularly. It does not make any HTTP requests to parse or validate a domain.

Requirements

The following versions of PHP are supported.

  • PHP 5.5
  • PHP 5.6
  • PHP 7.0
  • PHP 7.1
  • PHP 7.2
  • PHP 7.3
  • HHVM

Install

Via Composer

$ composer require layershifter/tld-extract

Additional result methods

Class LayerShifter\TLDExtract\Result provides some additional methods:

$extract = new LayerShifter\TLDExtract\Extract();

# For domain 'shop.github.com'

$result = $extract->parse('shop.github.com');
$result->getFullHost(); // will return (string) 'shop.github.com'
$result->getRegistrableDomain(); // will return (string) 'github.com'
$result->isValidDomain(); // will return (bool) true
$result->isIp(); // will return (bool) false

# For IP '192.168.0.1'

$result = $extract->parse('192.168.0.1');
$result->getFullHost(); // will return (string) '192.168.0.1'
$result->getRegistrableDomain(); // will return null
$result->isValidDomain(); // will return (bool) false
$result->isIp(); // will return (bool) true

Custom database

By default the package uses the database from the TLDDatabase package, but you can override this behaviour by passing the path to your own database file:

new LayerShifter\TLDExtract\Extract(__DIR__ . '/cache/mydatabase.php');

For more details, including how to keep the database updated, see TLDDatabase.

Implement own result

By default, parsing returns an object of the LayerShifter\TLDExtract\Result class, but sometimes you need your own methods or additional functionality.

You can create your own class that implements LayerShifter\TLDExtract\ResultInterface and use it as the parse result.

class CustomResult implements LayerShifter\TLDExtract\ResultInterface {}

new LayerShifter\TLDExtract\Extract(null, CustomResult::class);

Parsing modes

The package has three parsing modes:

  • allow ICANN suffixes (domains delegated by ICANN or part of the IANA root zone database);
  • allow private domains (amendments submitted to the Public Suffix List by the domain holder, as an expression of how they operate their domain security policy);
  • allow custom suffixes (suffixes that are not in the list but may still be usable, for example: example, mycompany, etc.).

To stay compatible with the ideas behind the Public Suffix List, the package runs in all of these modes by default, but you can easily change this behavior:

use LayerShifter\TLDExtract\Extract;

new Extract(null, null, Extract::MODE_ALLOW_ICANN);
new Extract(null, null, Extract::MODE_ALLOW_PRIVATE);
new Extract(null, null, Extract::MODE_ALLOW_NOT_EXISTING_SUFFIXES);
new Extract(null, null, Extract::MODE_ALLOW_ICANN | Extract::MODE_ALLOW_PRIVATE);
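
The last line above combines two modes with a bitwise OR. As a rough sketch of how such flags behave, assuming each mode is a distinct power of two (the constant names and values below are hypothetical and are not the library's own):

```php
<?php
// Hypothetical flag values for illustration only;
// the library defines its own Extract::MODE_* constants.
const ALLOW_ICANN = 1;
const ALLOW_PRIVATE = 2;
const ALLOW_CUSTOM = 4;

// A flag is enabled when all of its bits are set in the mode.
function mode_allows(int $mode, int $flag): bool
{
    return ($mode & $flag) === $flag;
}

$mode = ALLOW_ICANN | ALLOW_PRIVATE;

var_dump(mode_allows($mode, ALLOW_ICANN));   // bool(true)
var_dump(mode_allows($mode, ALLOW_PRIVATE)); // bool(true)
var_dump(mode_allows($mode, ALLOW_CUSTOM));  // bool(false)
```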

Change log

Please see CHANGELOG for more information on what has changed recently.

Testing

$ composer test

Contributing

Please see CONTRIBUTING and CONDUCT for details.

License

This library is released under the Apache 2.0 license. Please see License File for more information.

Comments
  • underscore in hostname

    Hi,

    i've found a weird behavior in domain extraction;

    $extract = new Extract();
    $result = $extract->parse('dkim._domainkey.phea.fr');
    print_r($result->toArray());
    
    $result = $extract->parse('dkim.domainkey.phea.fr');
    print_r($result->toArray());
    

    result

    Array
    (
        [subdomain] => dkim._domainkey.phea
        [hostname] => fr
        [suffix] => 
    )
    Array
    (
        [subdomain] => dkim.domainkey
        [hostname] => phea
        [suffix] => fr
    )
    

    The problem comes from the _ character.

    This regex fixes the problem:

        const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';
    
    bug 
    opened by Erwane 5
  • Get rid of the intl extension requirement

    I use this library inside a Symfony project where intl is not required because Symfony provides a polyfill for it.

    Currently it's not possible to use this library without intl installed (Composer will error). I would recommend using the Symfony intl polyfill https://github.com/symfony/polyfill-intl-icu instead of requiring ext-intl.

    "symfony/polyfill-intl-icu": "^1.0",
    
    opened by alexander-schranz 4
  • Cache directory is not writable

    Thanks for this extension.

    I get this error: Cannot put TLD list to cache file /xxxxx/vendor/layershifter/tld-extract/src/../cache/.tld_set, check writes rights on directory of file

    So I would have to change the permissions of the cache directory inside vendor, which is not a good solution.

    I suggest modifying the code to check whether the directory is writable with is_writable() and, if it is not, to try changing the permissions with chmod(). An even better solution would be to create the cache directory automatically with mkdir('cache', 0777);.

    enhancement 
    opened by NabiKAZ 4
  • Uncaught OutOfRangeException

    Hi, I'm getting the error:

    Stack trace:
    Message: Uncaught OutOfRangeException: Unknown field "errors" in xxx/vendor/layershifter/tld-extract/src/Result.php:220
    File: xxx/vendor/layershifter/tld-extract/src/Result.php
    Line: 220
    
    thrown
    #0 LayerShifter\TLDExtract\Result->__get('errors')
    

    My code is straight from the example:

    $extract = new LayerShifter\TLDExtract\Extract();
    $result = $extract->parse('shop.github.com');
    $result->getRegistrableDomain();
    

    Composer install details:

    Using version ^2.0 for layershifter/tld-extract
    ./composer.json has been updated
    Loading composer repositories with package information
    Updating dependencies (including require-dev)
    Package operations: 5 installs, 0 updates, 0 removals
      - Installing symfony/polyfill-php72 (v1.10.0): Downloading (100%)         
      - Installing symfony/polyfill-intl-idn (v1.10.0): Downloading (100%)         
      - Installing layershifter/tld-support (1.1.1): Downloading (100%)         
      - Installing layershifter/tld-database (1.0.65): Downloading (100%)         
      - Installing layershifter/tld-extract (2.0.1): Downloading (100%)   
    

    Am I missing something or doing something wrong?

    Many thanks,

    jkns

    opened by jkns 3
  • Should getSubdomains return an empty array if no subdomains found?

    As part of some validation requirements, I wanted to loop over subdomain labels, and since an array was the expected return of the getSubdomains() function, I assumed it would return an empty array if none were found.

    Turns out it returns a null, meaning it broke any attempts to use it as the expected return type. I ended up working around this using a null coalescence like the following:

    $subdomainLabels = $hostnameExtract->getSubdomains() ?? array();
    

    So, should this function return an empty array by default? I'd personally prefer if it did, since it keeps the return type consistent - but others might have differing opinions.

    enhancement 
    opened by Rohaq 3
  • Extract registerable domain returns empty

    I have a loop that extracts registerable domains. Most of them run fine, but I still get a lot of empty responses. What could be the reason that it does not find a registerable domain for the following examples:

    mail-sor-f69.google.com (you would expect google.com)
    static.vnpt.vn (you would expect vnpt.vn)
    mx1.sub5.homie.mail.dreamhost.com (you would expect dreamhost.com)

    And there are many, many more.

    opened by ghost 3
  • IPv6 addresses are not recognized

    It seems that IPv6 addresses are not recognized properly:

    >>> $w = new LayerShifter\TLDExtract\Extract()
    => LayerShifter\TLDExtract\Extract {#197}
    >>> $res = $w->parse('2bf2:eaa0::7:314:5474')
    => LayerShifter\TLDExtract\Result {#205}
    >>> $res->isIp()
    => false
    >>> $res->isValidDomain()
    => false
    
    question 
    opened by supriyo-biswas 3
  • Doesn't correctly handle URLs with a "?" just after the domain

    For "http://example.com?foo=bar", getRegistrableDomain() returns "example.com?foo=bar" rather than "example.com".

    My temporary quick fix in my code is to replace all "?" characters with "/" before passing the URL to TLDExtract.

    bug 
    opened by orrd 3
  • Make Extract::suffixExists protected

    Currently the suffixExists method on the Extract class is private. It would be useful if it were protected, so you could override the default behavior and provide an alternate data source.

    opened by Gman98ish 2
  • test..com is valid domain

    $result = $extract->parse('test..com');
    $result->isValidDomain();  // gives true
    

    The registrable domain is .com in this case... I think this domain name should be invalid.

    bug 
    opened by buzanits 2
  • Error with .co.il domain

    Hi,

    My code is as follows:

    $line = 'http://www.upfile.co.il/xxxxxxxx.html';
    $extract = new LayerShifter\TLDExtract\Extract();
    $domain = $extract->parse($line);
    $domain->getRegistrableDomain();

    I get www.upfile.co.il instead of upfile.co.il.

    Any idea what is going wrong?

    invalid 
    opened by ghost 2
  • TLDExtract not properly parsing hostname

    I'm running some domain names through TLDExtract and came across a domain not being properly parsed.

    The URL is called blogspot.com

    $url = 'blogspot.com';
    $domain = tld_extract($url);
    var_dump($domain);
    
    Returns: 
    object(LayerShifter\TLDExtract\Result)[9]
      private 'subdomain' => null
      private 'hostname' => string 'blogspot.com' (length=12)
      private 'suffix' => null
    

    Weirdly the URL 'flogspot.com' works fine and returns:

    object(LayerShifter\TLDExtract\Result)[9]
      private 'subdomain' => null
      private 'hostname' => string 'flogspot' (length=8)
      private 'suffix' => string 'com' (length=3)
    

    The URL logspot.com also works and returns:

    object(LayerShifter\TLDExtract\Result)[9]
      private 'subdomain' => null
      private 'hostname' => string 'logspot' (length=7)
      private 'suffix' => string 'com' (length=3)
    

    Any idea why the TLD in 'blogspot.com' is not being added to the suffix? Is this a bug?

    opened by leem32 1
  • RFC

    Some reading material, in case improvements or fixes are needed:

    https://en.wikipedia.org/wiki/Domain_name introduces https://tools.ietf.org/html/rfc1034 and https://tools.ietf.org/html/rfc1035

    Domain names appear in URLs: https://en.wikipedia.org/wiki/Url introduces https://url.spec.whatwg.org/

    This should allow #45 to be closed.

    opened by HumanG33k 0
  • "@" symbol should not be allowed in domain name

    $extract = new Extract();
    $result = $extract->parse("test@[email protected]")->isValidDomain();
    

    This results in true, when it should be false ("@" is not allowed in domain names)

    opened by skeets23 0
  • Parser bug when subdomain has "-"

    The parser fails for the following:

    // If the subdomain has "-"
    $url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';
    
    // Extract domain parts
    $extract = new \LayerShifter\TLDExtract\Extract();
    $domainParser = $extract->parse($url);
    
    parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
    $domainParser->getSubdomain(); // null 
    
    opened by NizarBlond 1
  •  TLD for domain test.ru is not recognized

    I have tested with hundreds of domains, but for the domain test.ru I get the following result:

    Result {#1711 ▼
      -subdomain: null
      -hostname: "test.ru"
      -suffix: null
    }
    
    bug 
    opened by LinkingYou 2
  • Incorrect Parsing of Malicious Payload

    Hello. There is a bug in the library where it mishandles the backslash character.

    For example, https://1337.karimrahal.com.bla.com would give bla.com as the registered domain,

    whereas the browser would lead to the karimrahal.com domain.

    bug 
    opened by ohm-s 0
Releases(2.0.1)
Owner
Oleksandr Fediashov