Patchwork UTF-8 for PHP: Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP

Overview

Patchwork UTF-8 gives PHP developers extensive, portable and performant handling of UTF-8 and grapheme clusters.

It provides both:

  • a portability layer for mbstring, iconv, and the intl Normalizer and grapheme_* functions,
  • a UTF-8 grapheme-cluster-aware replica of native string functions.

It can also serve as a documentation source referencing the practical problems that arise when handling UTF-8 in PHP: Unicode concepts, related algorithms, bugs in PHP core, workarounds, etc.

Version 1.2 adds best-fit mappings for UTF-8 to Code Page approximations. It also adds Unicode filesystem access under Windows, preferably using wfio, with a COM-based fallback otherwise.

Portability

Unicode handling in PHP is best performed using a combination of mbstring, iconv, intl and pcre with the u flag enabled. But when an application is expected to run on many servers, you should be aware that these 4 extensions are not always enabled.

Patchwork UTF-8 provides pure PHP implementations for 3 of those 4 extensions. pcre compiled with Unicode support is required, but it is widely available. The following set of portability fallbacks allows an application to run on a server even if one or more of those extensions are not enabled:

  • utf8_encode, utf8_decode,
  • mbstring: mb_check_encoding, mb_convert_case, mb_convert_encoding, mb_decode_mimeheader, mb_detect_encoding, mb_detect_order, mb_encode_mimeheader, mb_encoding_aliases, mb_get_info, mb_http_input, mb_http_output, mb_internal_encoding, mb_language, mb_list_encodings, mb_output_handler, mb_strlen, mb_strpos, mb_strrpos, mb_strtolower, mb_strtoupper, mb_stripos, mb_stristr, mb_strrchr, mb_strrichr, mb_strripos, mb_strstr, mb_strwidth, mb_substitute_character, mb_substr, mb_substr_count,
  • iconv: iconv, iconv_mime_decode, iconv_mime_decode_headers, iconv_get_encoding, iconv_set_encoding, iconv_mime_encode, ob_iconv_handler, iconv_strlen, iconv_strpos, iconv_strrpos, iconv_substr,
  • intl: Normalizer, grapheme_extract, grapheme_stripos, grapheme_stristr, grapheme_strlen, grapheme_strpos, grapheme_strripos, grapheme_strrpos, grapheme_strstr, grapheme_substr, normalizer_is_normalized, normalizer_normalize.
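
To see which of these extensions a given server actually provides, a small report like the following can help. It is only a sketch using standard PHP functions; the function name unicode_extension_report() is hypothetical and not part of this library.

```php
<?php
// Hypothetical helper (not part of Patchwork UTF-8): reports which of the
// four extensions discussed above are usable on the current server.
function unicode_extension_report(): array
{
    $report = [];
    foreach (['mbstring', 'iconv', 'intl'] as $ext) {
        $report[$ext] = extension_loaded($ext);
    }

    // PCRE itself is always present, but Unicode support is a compile-time
    // option: without it, a pattern using the "u" flag does not match.
    $report['pcre_unicode'] = @preg_match('/\pL/u', "\xC3\xA9") === 1;

    return $report;
}

foreach (unicode_extension_report() as $name => $available) {
    echo str_pad($name, 14), $available ? 'available' : 'missing', PHP_EOL;
}
```

Any entry reported as missing, other than pcre_unicode, is one that the portability layer is meant to cover.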

Patchwork\Utf8

Grapheme clusters should always be considered when working with generic Unicode strings. The Patchwork\Utf8 class implements the quasi-complete set of native string functions that need UTF-8 grapheme cluster awareness. Function names, arguments and behavior carefully replicate native PHP string functions.
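
The difference between bytes, code points and grapheme clusters is easiest to see on a decomposed character. The sketch below uses only strlen() and pcre with the u flag; it illustrates the concept and is not taken from the library's code.

```php
<?php
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes CC 81).
$s = "e\xCC\x81";

$bytes      = strlen($s);                   // counts bytes
$codePoints = preg_match_all('/./su', $s);  // counts code points
$graphemes  = preg_match_all('/\X/u', $s);  // counts grapheme clusters

printf("bytes=%d code points=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// prints "bytes=3 code points=2 graphemes=1"
```

A user sees one character here, which is why length and substring operations on generic Unicode text should count grapheme clusters.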

Some more functions are also provided to help handle UTF-8 strings:

  • filter(): normalizes to UTF-8 NFC, converting from CP-1252 when needed,
  • isUtf8(): checks if a string contains well-formed UTF-8 data,
  • toAscii(): generic UTF-8 to ASCII transliteration,
  • strtocasefold(): Unicode transformation for caseless matching,
  • strtonatfold(): generic case-sensitive transformation for collation matching,
  • strwidth(): computes the width of a string when printed on a terminal,
  • wrapPath(): Unicode filesystem access under Windows and other OSes.
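
As an illustration of what a well-formedness check involves, one common technique relies on pcre validating its subject when the u flag is set. The helper below is a hypothetical sketch of that technique; it should not be read as the actual implementation of isUtf8().

```php
<?php
// Hypothetical helper, not part of the library's API.
function is_well_formed_utf8(string $s): bool
{
    // With the "u" flag, PCRE validates the subject as UTF-8 before matching.
    // preg_match() returns false on malformed input and 1 when the empty
    // pattern matches.
    return @preg_match('//u', $s) === 1;
}

var_dump(is_well_formed_utf8("caf\xC3\xA9")); // bool(true): valid two-byte "é"
var_dump(is_well_formed_utf8("caf\xE9"));     // bool(false): lone CP-1252 byte
```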

Mirrored string functions are: strlen, substr, strpos, stripos, strrpos, strripos, strstr, stristr, strrchr, strrichr, strtolower, strtoupper, wordwrap, chr, count_chars, ltrim, ord, rtrim, trim, str_ireplace, str_pad, str_shuffle, str_split, str_word_count, strcmp, strnatcmp, strcasecmp, strnatcasecmp, strncasecmp, strncmp, strcspn, strpbrk, strrev, strspn, strtr, substr_compare, substr_count, substr_replace, ucfirst, lcfirst, ucwords, number_format, utf8_encode, utf8_decode, json_decode, filter_input, filter_input_array.
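
To show why a native function needs a mirrored version, here is a sketch comparing strrev() with a grapheme-aware reversal. The helper grapheme_reverse() is hypothetical and assumes well-formed UTF-8 input; the library's own strrev() may be implemented differently.

```php
<?php
// Hypothetical helper illustrating the idea; not the library's implementation.
function grapheme_reverse(string $s): string
{
    // \X matches one extended grapheme cluster when the "u" flag is set.
    // An empty or malformed string yields no matches and is returned as-is.
    if (!preg_match_all('/\X/u', $s, $matches)) {
        return $s;
    }

    return implode('', array_reverse($matches[0]));
}

$s = "abe\xCC\x81"; // "ab" + "e" + U+0301 COMBINING ACUTE ACCENT

var_dump(strrev($s) === "e\xCC\x81ba");           // bool(false): bytes were reversed
var_dump(grapheme_reverse($s) === "e\xCC\x81ba"); // bool(true): clusters kept intact
```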

Notably missing (but hard to replicate) are printf-family functions.

The implementation favors performance over exhaustive edge-case handling. It generally works on normalized UTF-8 strings and provides filters to obtain them.

Because the Turkish locale requires special care, a Patchwork\TurkishUtf8 class is provided for working with this locale. It clones all the features of Patchwork\Utf8 but knows about the Turkish specifics.
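
The best-known Turkish specific is the dotted and dotless i: uppercase "i" is "İ" and uppercase "ı" is "I". The sketch below shows the problem with a hypothetical helper, turkish_upper(); it assumes mbstring (or a fallback for it) is available and is not the code used by Patchwork\TurkishUtf8.

```php
<?php
// Hypothetical helper: Turkish-aware uppercasing.
function turkish_upper(string $s): string
{
    // Map the two Turkish-specific letters first, then let the
    // locale-independent conversion handle everything else.
    $s = strtr($s, ['i' => 'İ', 'ı' => 'I']);

    return mb_strtoupper($s, 'UTF-8');
}

echo mb_strtoupper('istanbul', 'UTF-8'), PHP_EOL; // ISTANBUL
echo turkish_upper('istanbul'), PHP_EOL;          // İSTANBUL
```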

Usage

The recommended way to install Patchwork UTF-8 is through Composer. Create a composer.json file with the following content and run the php composer.phar install command:

{
    "require": {
        "patchwork/utf8": "~1.2"
    }
}

Then, early in your bootstrap sequence, you have to configure your environment:

\Patchwork\Utf8\Bootup::initAll(); // Enables the portability layer and configures PHP for UTF-8
\Patchwork\Utf8\Bootup::filterRequestUri(); // Redirects to a UTF-8 encoded URL if it's not already the case
\Patchwork\Utf8\Bootup::filterRequestInputs(); // Normalizes HTTP inputs to UTF-8 NFC

Run phpunit to see the code in action.

Make sure that you are confident about using UTF-8 by reading Character Sets / Character Encoding Issues and Handling UTF-8 with PHP, or PHP et UTF-8 for French readers.

You should also get familiar with the concept of Unicode Normalization and Grapheme Clusters.

Do not blindly replace all use of PHP's string functions. Most of the time you will not need to, and you will be introducing a significant performance overhead to your application.

Screen your input on the outer perimeter so that only well-formed UTF-8 passes through. When dealing with badly formed UTF-8, you should not try to fix it (see Unicode Security Considerations). Instead, consider it as CP-1252 and use Patchwork\Utf8::utf8_encode() to get a UTF-8 string. Also remember to choose one Unicode normalization form and stick to it. NFC is now the de facto standard. Patchwork\Utf8::filter() implements this behavior: it converts from CP-1252 and normalizes to NFC.
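
The policy described above can be sketched with native functions as follows. The helper to_utf8_nfc() is hypothetical and only illustrates the steps; it assumes mbstring is available, applies normalization only when the intl Normalizer class exists, and is not the implementation of Patchwork\Utf8::filter().

```php
<?php
// Hypothetical helper, written with native functions for illustration.
function to_utf8_nfc(string $s): string
{
    // Malformed UTF-8 is not repaired byte by byte: the whole string is
    // reinterpreted as CP-1252 and converted.
    if (@preg_match('//u', $s) !== 1) {
        $s = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }

    // Normalizer belongs to intl; skip normalization when it is unavailable.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
        if (is_string($normalized)) {
            $s = $normalized;
        }
    }

    return $s;
}

var_dump(to_utf8_nfc("caf\xE9") === "caf\xC3\xA9");     // bool(true): converted from CP-1252
var_dump(to_utf8_nfc("caf\xC3\xA9") === "caf\xC3\xA9"); // bool(true): already UTF-8 NFC
```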

This library is not compatible with mbstring.func_overload and will not work if that php.ini setting is enabled.
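
A bootstrap script can guard against this setting before anything else runs. This is a minimal sketch using only ini_get(); whether to throw, log or merely warn is left to the application.

```php
<?php
// mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.
// ini_get() returns false for unknown settings, which casts to 0.
$overload = (int) ini_get('mbstring.func_overload');

if ($overload !== 0) {
    throw new RuntimeException(
        'mbstring.func_overload must be disabled (current value: '.$overload.').'
    );
}
```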

Licensing

Patchwork\Utf8 is free software; you can redistribute it and/or modify it under the terms of either of the following licenses, at your option:

  • Apache License v2.0, or
  • GNU General Public License v2.0.

Unicode handling requires tedious work to implement and maintain in the long run. As such, contributions such as unit tests, bug reports, comments or patches licensed under both licenses are very welcome.

I hope many projects will adopt this code and together help solve Unicode handling for PHP.

Owner
Nicolas Grekas
Submitting features and bug(fixes) @symfony, to make it always faster, easier to use and better designed, uncompromising. @ESPCI_Alumni engineer.