[DEPRECATED] Library for extraction of domain parts e.g. TLD. Domain parser that uses Public Suffix List

Oleksandr Fediashov

Last update: Oct 18, 2022

Related tags

URL php php-library tldextract tld subdomain domain-parser public-suffix-list

Overview

DEPRECATED

Consider using https://github.com/jeremykendall/php-domain-parser as a maintained alternative.

TLDExtract

TLDExtract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL; in other words, it is a domain parser. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last two elements only works for simple domains such as .com. Consider parsing http://forums.bbc.co.uk: the naive splitting method will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
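To make the failure concrete, here is a minimal sketch of the naive approach. The function name naive_split() is hypothetical and is not part of TLDExtract; it only illustrates why splitting on dots breaks for multi-label suffixes.

```php
<?php
// Naive approach: treat the last label as the TLD and the one before it as the domain.
// naive_split() is a hypothetical helper for illustration only.
function naive_split(string $host): array
{
    $labels = explode('.', $host);
    $tld = array_pop($labels);
    $domain = array_pop($labels);

    return [
        'subdomain' => implode('.', $labels),
        'domain'    => $domain,
        'tld'       => $tld,
    ];
}

print_r(naive_split('www.google.com'));   // domain "google", tld "com": correct
print_r(naive_split('forums.bbc.co.uk')); // domain "co", tld "uk": wrong
```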

TLDExtract on the other hand knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

$result = tld_extract('http://forums.news.cnn.com/');
var_dump($result);

object(LayerShifter\TLDExtract\Result)#34 (3) {
  ["subdomain":"LayerShifter\TLDExtract\Result":private]=>
  string(11) "forums.news"
  ["hostname":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "cnn"
  ["suffix":"LayerShifter\TLDExtract\Result":private]=>
  string(3) "com"
}
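The suffix lookup described above can be sketched without the library. This is not TLDExtract's actual implementation: SAMPLE_SUFFIXES is a hypothetical, hand-picked subset of the Public Suffix List, split_host() is an illustrative name, and the wildcard and exception rules of the real list are ignored.

```php
<?php
// Hypothetical subset of suffixes; the real list has thousands of entries.
const SAMPLE_SUFFIXES = ['com', 'uk', 'co.uk'];

// Split a host by finding the longest known suffix (assumed approach, not the library's code).
function split_host(string $host, array $suffixes = SAMPLE_SUFFIXES): array
{
    $labels = explode('.', strtolower($host));

    // Candidates are tried from longest to shortest, so "co.uk" wins over "uk".
    for ($i = 1; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            $subdomain = implode('.', array_slice($labels, 0, $i - 1));

            return [
                'subdomain' => $subdomain === '' ? null : $subdomain,
                'hostname'  => $labels[$i - 1],
                'suffix'    => $candidate,
            ];
        }
    }

    // No known suffix: keep the whole host as the hostname.
    return ['subdomain' => null, 'hostname' => $host, 'suffix' => null];
}

print_r(split_host('forums.bbc.co.uk'));
print_r(split_host('forums.news.cnn.com'));
```

With this sample list, 'forums.bbc.co.uk' splits into 'forums', 'bbc' and 'co.uk', which is the result the naive method misses.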

Result implements the ArrayAccess interface, so you can simply access its fields like array elements.

var_dump($result['subdomain']);
string(11) "forums.news"
var_dump($result['hostname']);
string(3) "cnn"
var_dump($result['suffix']);
string(3) "com"
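For readers unfamiliar with ArrayAccess, the following sketch shows how a read-only result object can expose its fields through array syntax. SketchResult is a hypothetical stand-in, not the library's Result class; the exceptions thrown on write are an assumption, and the code assumes PHP 8.0+ because of the mixed type.

```php
<?php
// Hypothetical stand-in for a result object; not the library's actual class.
final class SketchResult implements ArrayAccess
{
    /** @var array<string, ?string> */
    private array $fields;

    public function __construct(?string $subdomain, ?string $hostname, ?string $suffix)
    {
        $this->fields = compact('subdomain', 'hostname', 'suffix');
    }

    public function offsetExists(mixed $offset): bool
    {
        return array_key_exists($offset, $this->fields);
    }

    public function offsetGet(mixed $offset): mixed
    {
        if (!$this->offsetExists($offset)) {
            throw new OutOfRangeException("Unknown field \"$offset\"");
        }

        return $this->fields[$offset];
    }

    // Writes are rejected in this sketch so the object stays immutable (an assumption).
    public function offsetSet(mixed $offset, mixed $value): void
    {
        throw new LogicException('Result is immutable');
    }

    public function offsetUnset(mixed $offset): void
    {
        throw new LogicException('Result is immutable');
    }
}

$result = new SketchResult('forums.news', 'cnn', 'com');
echo $result['hostname'], PHP_EOL; // cnn
```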

You can also convert the result to JSON.

var_dump($result->toJson());
string(59) "{"subdomain":"forums.news","hostname":"cnn","suffix":"com"}"

This package is compliant with PSR-1, PSR-2, PSR-4. If you notice compliance oversights, please send a patch via pull request.

Does TLDExtract make requests to Public Suffix List website?

No. TLDExtract uses a database from TLDDatabase that is generated from the Public Suffix List and updated regularly. It does not make any HTTP requests to parse or validate a domain.

Requirements

The following versions of PHP are supported.

PHP 5.5
PHP 5.6
PHP 7.0
PHP 7.1
PHP 7.2
PHP 7.3
HHVM

Install

Via Composer

$ composer require layershifter/tld-extract

Additional result methods

The LayerShifter\TLDExtract\Result class provides some additional helper methods:

$extract = new LayerShifter\TLDExtract\Extract();

# For domain 'shop.github.com'

$result = $extract->parse('shop.github.com');
$result->getFullHost(); // will return (string) 'shop.github.com'
$result->getRegistrableDomain(); // will return (string) 'github.com'
$result->isValidDomain(); // will return (bool) true
$result->isIp(); // will return (bool) false

# For IP '192.168.0.1'

$result = $extract->parse('192.168.0.1');
$result->getFullHost(); // will return (string) '192.168.0.1'
$result->getRegistrableDomain(); // will return null
$result->isValidDomain(); // will return (bool) false
$result->isIp(); // will return (bool) true
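The outputs above suggest that the registrable domain is the hostname label joined with the suffix, and that it is null when no suffix was found (as for an IP address). The sketch below encodes that reading; it is an assumption drawn from the examples, not the library's source, and both function names are hypothetical.

```php
<?php
// Assumed rule: registrable domain = hostname + "." + suffix, or null if either part is missing.
function registrable_domain(?string $hostname, ?string $suffix): ?string
{
    if ($hostname === null || $suffix === null) {
        return null;
    }

    return $hostname . '.' . $suffix;
}

// Assumed rule: the full host is every non-empty part joined with dots.
function full_host(?string $subdomain, ?string $hostname, ?string $suffix): string
{
    $parts = array_filter(
        [$subdomain, $hostname, $suffix],
        static function ($part) {
            return $part !== null && $part !== '';
        }
    );

    return implode('.', $parts);
}

var_dump(registrable_domain('github', 'com'));    // string(10) "github.com"
var_dump(registrable_domain('192.168.0.1', null)); // NULL
var_dump(full_host('shop', 'github', 'com'));     // string(15) "shop.github.com"
```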

Custom database

By default the package uses the database from the TLDDatabase package, but you can easily override this behaviour:

new LayerShifter\TLDExtract\Extract(__DIR__ . '/cache/mydatabase.php');

For more details, and for how to keep the database updated, see TLDDatabase.

Implement own result

By default, parsing returns an object of the LayerShifter\TLDExtract\Result class, but sometimes you need your own methods or additional functionality.

You can create your own class that implements LayerShifter\TLDExtract\ResultInterface and use it as the parse result.

class CustomResult implements LayerShifter\TLDExtract\ResultInterface {}

new LayerShifter\TLDExtract\Extract(null, CustomResult::class);

Parsing modes

The package has three parsing modes:

allow ICANN suffixes (domains delegated by ICANN or listed in the IANA root zone database);
allow private suffixes (amendments submitted to the Public Suffix List by the domain holder, as an expression of how they operate their domain security policy);
allow custom suffixes (suffixes that are not in the list but may still be usable, for example: example, mycompany, etc.).

To stay compatible with the ideas behind the Public Suffix List, the package enables all of these modes by default, but you can easily change this behavior:

use LayerShifter\TLDExtract\Extract;

new Extract(null, null, Extract::MODE_ALLOW_ICANN);
new Extract(null, null, Extract::MODE_ALLOW_PRIVATE);
new Extract(null, null, Extract::MODE_ALLOW_NOT_EXISTING_SUFFIXES);
new Extract(null, null, Extract::MODE_ALLOW_ICANN | Extract::MODE_ALLOW_PRIVATE);
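The last line combines two modes with the bitwise OR operator, which implies the modes are bit flags. The sketch below shows how such flags combine and how a single flag can be tested. The constant names and values here are hypothetical; the library defines its own constants and their values may differ.

```php
<?php
// Hypothetical flag values for illustration; each flag occupies its own bit.
const ALLOW_ICANN = 1;        // 0b001
const ALLOW_PRIVATE = 2;      // 0b010
const ALLOW_NOT_EXISTING = 4; // 0b100

// Returns true when every bit of $flag is set in $mode.
function mode_allows(int $mode, int $flag): bool
{
    return ($mode & $flag) === $flag;
}

$mode = ALLOW_ICANN | ALLOW_PRIVATE;

var_dump(mode_allows($mode, ALLOW_ICANN));        // bool(true)
var_dump(mode_allows($mode, ALLOW_PRIVATE));      // bool(true)
var_dump(mode_allows($mode, ALLOW_NOT_EXISTING)); // bool(false)
```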

Change log

Please see CHANGELOG for more information on what has changed recently.

Testing

$ composer test

Contributing

Please see CONTRIBUTING and CONDUCT for details.

License

This library is released under the Apache 2.0 license. Please see License File for more information.

Comments

underscore in hostname

Hi,

I've found a weird behavior in domain extraction:

$extract = new Extract();
$result = $extract->parse('dkim._domainkey.phea.fr');
print_r($result->toArray());

$result = $extract->parse('dkim.domainkey.phea.fr');
print_r($result->toArray());

result

Array
(
    [subdomain] => dkim._domainkey.phea
    [hostname] => fr
    [suffix] => 
)
Array
(
    [subdomain] => dkim.domainkey
    [hostname] => phea
    [suffix] => fr
)

The problem comes from the _ character.

This regex fixes the problem:

    const HOSTNAME_PATTERN = '#^((?!-)[a-z0-9_-]{0,62}[a-z0-9]\.)+[a-z]{2,63}|[xn\-\-a-z0-9]]{6,63}$#';

bug

opened by Erwane 5

Get rid of the intl extension requirement
I use this library inside a Symfony project where intl is not required because Symfony provides a polyfill for it.

Currently it's not possible to use this library without intl installed (Composer will error). I would recommend using the Symfony intl polyfill https://github.com/symfony/polyfill-intl-icu instead of requiring ext-intl.

"symfony/polyfill-intl-icu": "^1.0",
opened by alexander-schranz 4
Cache directory is not writable

Thanks for this extension.

I get this error: Cannot put TLD list to cache file /xxxxx/vendor/layershifter/tld-extract/src/../cache/.tld_set, check writes rights on directory of file

So I would have to change the permissions of the cache directory inside vendor, which is not a good approach.

I suggest modifying the code to check whether the directory is writable with is_writable() and, if it is not, to try changing the permissions with chmod(). An even better solution would be to create the cache directory automatically with mkdir('cache', 0777);.
enhancement

opened by NabiKAZ 4

Uncaught OutOfRangeException

Hi, I'm getting the error:

Stack trace:
Message: Uncaught OutOfRangeException: Unknown field "errors" in xxx/vendor/layershifter/tld-extract/src/Result.php:220
File: xxx/vendor/layershifter/tld-extract/src/Result.php
Line: 220

thrown
#0 LayerShifter\TLDExtract\Result->__get('errors')

My code is straight from the example:

$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('shop.github.com');
$result->getRegistrableDomain();

Composer install details:

Using version ^2.0 for layershifter/tld-extract
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 5 installs, 0 updates, 0 removals
  - Installing symfony/polyfill-php72 (v1.10.0): Downloading (100%)         
  - Installing symfony/polyfill-intl-idn (v1.10.0): Downloading (100%)         
  - Installing layershifter/tld-support (1.1.1): Downloading (100%)         
  - Installing layershifter/tld-database (1.0.65): Downloading (100%)         
  - Installing layershifter/tld-extract (2.0.1): Downloading (100%)

Am I missing something or doing something wrong?

Many thanks,

jkns

opened by jkns 3

Should getSubdomains return an empty array if no subdomains found?
As part of some validation requirements, I wanted to loop over subdomain labels, and since an array was the expected return of the getSubdomains() function, I assumed it would return an empty array if none were found.

Turns out it returns a null, meaning it broke any attempts to use it as the expected return type. I ended up working around this using a null coalescence like the following:

$subdomainLabels = $hostnameExtract->getSubdomains() ?? array();

So, should this function return an empty array by default? I'd personally prefer if it did, since it keeps the return type consistent - but others might have differing opinions.
enhancement
opened by Rohaq 3
Extract registerable domain returns empty

I have a loop that extracts registrable domains. Most of them run fine, but I still get a lot of empty responses. What could be the reason that it does not find a registrable domain for the following examples:

mail-sor-f69.google.com (you would expect google.com)
static.vnpt.vn (you would expect vnpt.vn)
mx1.sub5.homie.mail.dreamhost.com (you would expect dreamhost.com)

and there are many, many, many more.

opened by ghost 3

IPv6 addresses are not recognized

It seems that IPv6 addresses are not recognized properly:

>>> $w = new LayerShifter\TLDExtract\Extract()
=> LayerShifter\TLDExtract\Extract {#197}
>>> $res = $w->parse('2bf2:eaa0::7:314:5474')
=> LayerShifter\TLDExtract\Result {#205}
>>> $res->isIp()
=> false
>>> $res->isValidDomain()
=> false

question

opened by supriyo-biswas 3
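As a possible workaround for the IPv6 report above, callers can check for IP literals themselves before handing input to the parser. This is a sketch using PHP's built-in filter_var(); ip_version() is a hypothetical helper and is not part of TLDExtract.

```php
<?php
// Returns 4 or 6 for an IP literal, or null when $host is not an IP address.
// ip_version() is a hypothetical helper built on PHP's filter extension.
function ip_version(string $host): ?int
{
    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return 4;
    }
    if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
        return 6;
    }

    return null;
}

var_dump(ip_version('192.168.0.1'));           // int(4)
var_dump(ip_version('2bf2:eaa0::7:314:5474')); // int(6)
var_dump(ip_version('shop.github.com'));       // NULL
```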

Doesn't correctly handle URLs with a "?" just after the domain

For "http://example.com?foo=bar", getRegistrableDomain() returns "example.com?foo=bar" rather than "example.com".

My temporary quick fix in my code is to replace all "?" characters with "/" before passing the URL to TLDExtract.
bug

opened by orrd 3
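An alternative to replacing "?" characters is to extract the host with PHP's parse_url() first and pass only the host on for parsing. The sketch below assumes the input is an absolute URL with a scheme; host_from_url() is a hypothetical helper, not part of TLDExtract.

```php
<?php
// Extract only the host component of an absolute URL.
// host_from_url() is a hypothetical helper for illustration.
function host_from_url(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() yields null for a missing component and false for seriously malformed URLs.
    return is_string($host) ? $host : null;
}

var_dump(host_from_url('http://example.com?foo=bar')); // string(11) "example.com"
var_dump(host_from_url('example.com'));                // NULL, no scheme so no host component
```

Note that a bare host without a scheme is treated as a path by parse_url(), so such input would need separate handling.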
Make Extract::suffixExists protected

Currently the suffixExists method on the Extract class is private, it would be useful if it was protected, so you could override default behavior and provide an alternate data source

opened by Gman98ish 2
test..com is valid domain
$result = $extract->parse('test..com');
$result->isValidDomain(); // gives true

The registrable domain is .com in this case... I think this domain name should be invalid.
bug
opened by buzanits 2
Error with .co.il domain

Hi,

My code is as follows:

$line = 'http://www.upfile.co.il/xxxxxxxx.html';
$extract = new LayerShifter\TLDExtract\Extract();
$domain = $extract->parse($line);
$domain->getRegistrableDomain();

I get www.upfile.co.il instead of upfile.co.il.

Any idea what is going wrong?
invalid

opened by ghost 2

TLDExtract not properly parsing hostname

I'm running some domain names through TLDExtract and came across a domain not being properly parsed.

The URL is called blogspot.com

$url = 'blogspot.com';
$domain = tld_extract($url);
var_dump($domain);

Returns: 
object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'blogspot.com' (length=12)
  private 'suffix' => null

Weirdly the URL 'flogspot.com' works fine and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'flogspot' (length=8)
  private 'suffix' => string 'com' (length=3)

The URL logspot.com also works and returns:

object(LayerShifter\TLDExtract\Result)[9]
  private 'subdomain' => null
  private 'hostname' => string 'logspot' (length=7)
  private 'suffix' => string 'com' (length=3)

Any idea why the TLD in 'blogspot.com' is not being added to the suffix? Is this a bug?

opened by leem32 1

RFC

Some reading that may help with improvements or fixes, in case it's needed:

https://en.wikipedia.org/wiki/Domain_name, which introduces https://tools.ietf.org/html/rfc1034 and https://tools.ietf.org/html/rfc1035

Domain names appear in URLs: https://en.wikipedia.org/wiki/Url, which introduces https://url.spec.whatwg.org/

This should make it possible to close #45.

opened by HumanG33k 0
"@" symbol should not be allowed in domain name
$extract = new Extract();
$result = $extract->parse("test@test@google.com")->isValidDomain();

This results in true, when it should be false ("@" is not allowed in domain names)
opened by skeets23 0
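Until such input is rejected by the parser, a caller could pre-validate hostnames with PHP's built-in filter. The sketch below assumes PHP 7.0+ (for FILTER_VALIDATE_DOMAIN); is_plausible_hostname() is a hypothetical helper, not part of TLDExtract.

```php
<?php
// Syntactic hostname check: labels may contain only letters, digits and hyphens.
// is_plausible_hostname() is a hypothetical helper built on PHP's filter extension.
function is_plausible_hostname(string $host): bool
{
    return filter_var($host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME) !== false;
}

var_dump(is_plausible_hostname('google.com'));           // bool(true)
var_dump(is_plausible_hostname('test@test@google.com')); // bool(false)
var_dump(is_plausible_hostname('test..com'));            // bool(false)
```

This check is purely syntactic and stricter than DNS itself: it also rejects labels that start with an underscore, such as _domainkey, so it is not suitable for every record type.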

Parser bug when subdomain has "-"

The parser fails for the following:

// If the subdomain has "-"
$url = 'https://s3-ap-southeast-2.amazonaws.com/blabla/blabla/wp-content/uploads/media/2019/03/16860571424_31c94205de_b.jpg';

// Extract domain parts
$extract = new \LayerShifter\TLDExtract\Extract();
$domainParser = $extract->parse($url);

parse_url($url, PHP_URL_HOST); // s3-ap-southeast-2.amazonaws.com
$domainParser->getSubdomain(); // null

opened by NizarBlond 1

TLD for domain test.ru is not recognized
I have tested with hundreds of domains, but for the domain test.ru I get the following result:

Result {#1711
  -subdomain: null
  -hostname: "test.ru"
  -suffix: null
}
bug
opened by LinkingYou 2
Incorrect Parsing of Malicious Payload

Hello. There is a bug in the library where it mishandles the backslash character.

For example, https://1337.karimrahal.com.bla.com would give bla.com as the registered domain,

whereas the browser would lead to the karimrahal.com domain.
bug

opened by ohm-s 0

Releases(2.0.1)

2.0.1(Feb 11, 2019)
Docs

Update supported php versions (#40)

Fix getSubdomains function description example (#41)

Improvements

PHP 7.3 support

Use symfony/polyfill-intl-idn instead of true/punycode (#39)

2.0.0(Sep 28, 2018)

Breaking changes

`getSubdomains()` now always returns array (#27, #33)

$extract = new Extract();
$result = $extract->parse('github.com');

$result->getSubdomains(); // before `null`
$result->getSubdomains(); // after `[]`

1.2.7(Sep 28, 2018)
Fixes

Make Extract::suffixExists protected (#34)

1.2.6(Sep 25, 2018)
Fixes

Underscore at end of hostname label causes parsing to fail (#32)

1.2.5(Jun 19, 2018)
Fixes

underscore in hostname bug (#25)

1.2.4(Apr 14, 2018)
Fixes

test..com is valid domain (#22)

isValidDomain returns true for domains that are too long (#23)

1.2.3(Nov 18, 2017)
Improvements

PHP 7.2 support

Fixes

use INTL_IDNA_VARIANT_UTS46 for idn_* functions (#20)

1.2.2(Oct 17, 2017)
Fixes

fix typo in class constant (#18)

1.2.1(Apr 17, 2017)
Fixes

incorrect parsing domains with number sign (#16)

1.2.0(Nov 17, 2016)
Features

removed dependency on intl extension;

PHP 7.1 support.

1.1.1(Aug 3, 2016)
Fixes

Issue #5 with handling query part of URL

1.1.0(Jun 29, 2016)
New release, key features:

tld_extract() function for simple usage;

setExtractionMode() method on Extract class for setting extract options.

1.0.0(Jun 20, 2016)
IDN support;

Database in separate weekly updatable package;

Full test coverage.

0.2.0(Jan 10, 2016)

Fixes suggested by static analyzer
Result can be returned by custom class
New methods on class Result
Code review
0.1.3(Nov 24, 2015)

Method for manual updating of cache file
0.1.2(Nov 3, 2015)

Bug fix
0.1.1(Oct 26, 2015)

Adds properties to Result()
0.1(Oct 24, 2015)

First release

Owner

Oleksandr Fediashov

