Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP

Overview

Patchwork UTF-8 for PHP


Patchwork UTF-8 gives PHP developers extensive, portable and performant handling of UTF-8 and grapheme clusters.

It provides both:

  • a portability layer for mbstring, iconv, and the intl Normalizer and grapheme_* functions,
  • a UTF-8 grapheme-cluster-aware replica of native string functions.

It can also serve as a documentation source referencing the practical problems that arise when handling UTF-8 in PHP: Unicode concepts, related algorithms, bugs in PHP core, workarounds, etc.

Version 1.2 adds best-fit mappings for UTF-8 to Code Page approximations. It also adds Unicode filesystem access under Windows, preferably using wfio, with a COM-based fallback otherwise.

Portability

Unicode handling in PHP is best performed using a combination of mbstring, iconv, intl and pcre with the u flag enabled. But when an application is expected to run on many servers, be aware that these four extensions are not always enabled.

Patchwork UTF-8 provides pure PHP implementations for 3 of those 4 extensions. pcre compiled with unicode support is required but is widely available. The following set of portability-fallbacks allows an application to run on a server even if one or more of those extensions are not enabled:

  • utf8_encode, utf8_decode,
  • mbstring: mb_check_encoding, mb_convert_case, mb_convert_encoding, mb_decode_mimeheader, mb_detect_encoding, mb_detect_order, mb_encode_mimeheader, mb_encoding_aliases, mb_get_info, mb_http_input, mb_http_output, mb_internal_encoding, mb_language, mb_list_encodings, mb_output_handler, mb_strlen, mb_strpos, mb_strrpos, mb_strtolower, mb_strtoupper, mb_stripos, mb_stristr, mb_strrchr, mb_strrichr, mb_strripos, mb_strstr, mb_strwidth, mb_substitute_character, mb_substr, mb_substr_count,
  • iconv: iconv, iconv_mime_decode, iconv_mime_decode_headers, iconv_get_encoding, iconv_set_encoding, iconv_mime_encode, ob_iconv_handler, iconv_strlen, iconv_strpos, iconv_strrpos, iconv_substr,
  • intl: Normalizer, grapheme_extract, grapheme_stripos, grapheme_stristr, grapheme_strlen, grapheme_strpos, grapheme_strripos, grapheme_strrpos, grapheme_strstr, grapheme_substr, normalizer_is_normalized, normalizer_normalize.
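
To see which of these fallbacks would actually come into play on a given server, a minimal sketch like the following can help. The function name unicode_support_report() is hypothetical and not part of the library; it relies only on extension_loaded() and a PCRE probe.

```php
<?php
// Hypothetical helper (not part of Patchwork UTF-8): reports which of the
// Unicode-related extensions are natively available on this server.
function unicode_support_report()
{
    $report = array();
    foreach (array('mbstring', 'iconv', 'intl') as $ext) {
        $report[$ext] = extension_loaded($ext);
    }
    // pcre has no pure PHP fallback, so probe for the "u" flag directly:
    // preg_match() returns 1 on a match and false if the flag is unsupported.
    $report['pcre_unicode'] = (1 === @preg_match('/^.$/u', "\xC3\xA9"));

    return $report;
}

foreach (unicode_support_report() as $name => $available) {
    echo str_pad($name, 14), $available ? 'available' : 'missing', "\n";
}
```

Any extension reported as missing is one whose functions the portability layer would have to supply; a missing pcre_unicode entry means the library's requirement is not met.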

Patchwork\Utf8

Grapheme clusters should always be considered when working with generic Unicode strings. The Patchwork\Utf8 class implements a nearly complete set of the native string functions that need UTF-8 grapheme cluster awareness. Function names, arguments and behavior carefully replicate native PHP string functions.
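
As an illustration of why grapheme clusters matter, the sketch below measures the same string in bytes, code points and grapheme clusters. It uses only native strlen() and pcre with the u flag, not the library itself, and the sample string is an assumption chosen for the example.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
$s = "e\xCC\x81";

$bytes = strlen($s);                            // native strlen() counts bytes
$codePoints = preg_match_all('/./us', $s, $m);  // "." with the u flag matches one code point
$graphemes = preg_match_all('/\X/u', $s, $m);   // \X matches one grapheme cluster

printf("bytes=%d code points=%d grapheme clusters=%d\n", $bytes, $codePoints, $graphemes);
// prints "bytes=3 code points=2 grapheme clusters=1"
```

A function that cuts or counts by bytes or by code points can split such a sequence in the middle, which is what a grapheme-aware replacement avoids.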

Some more functions are also provided to help handle UTF-8 strings:

  • filter(): normalizes to UTF-8 NFC, converting from CP-1252 when needed,
  • isUtf8(): checks if a string contains well formed UTF-8 data,
  • toAscii(): generic UTF-8 to ASCII transliteration,
  • strtocasefold(): Unicode transformation for caseless matching,
  • strtonatfold(): generic case-sensitive transformation for collation matching,
  • strwidth(): computes the width of a string when printed on a terminal,
  • wrapPath(): Unicode filesystem access under Windows and other OSes.

Mirrored string functions are: strlen, substr, strpos, stripos, strrpos, strripos, strstr, stristr, strrchr, strrichr, strtolower, strtoupper, wordwrap, chr, count_chars, ltrim, ord, rtrim, trim, str_ireplace, str_pad, str_shuffle, str_split, str_word_count, strcmp, strnatcmp, strcasecmp, strnatcasecmp, strncasecmp, strncmp, strcspn, strpbrk, strrev, strspn, strtr, substr_compare, substr_count, substr_replace, ucfirst, lcfirst, ucwords, number_format, utf8_encode, utf8_decode, json_decode, filter_input, filter_input_array.

Notably missing (but hard to replicate) are printf-family functions.

The implementation favors performance over exhaustive handling of edge cases. It generally expects normalized UTF-8 strings and provides filters to obtain them.

As the Turkish locale requires special care, a Patchwork\TurkishUtf8 class is provided for working with this locale. It clones all the features of Patchwork\Utf8 but knows about the Turkish specifics.

Usage

The recommended way to install Patchwork UTF-8 is through Composer. Create a composer.json file and run the php composer.phar install command to install it:

{
    "require": {
        "patchwork/utf8": "~1.2"
    }
}

Then, early in your bootstrap sequence, you have to configure your environment:

\Patchwork\Utf8\Bootup::initAll(); // Enables the portability layer and configures PHP for UTF-8
\Patchwork\Utf8\Bootup::filterRequestUri(); // Redirects to a UTF-8 encoded URL if it's not already the case
\Patchwork\Utf8\Bootup::filterRequestInputs(); // Normalizes HTTP inputs to UTF-8 NFC

Run phpunit to see the code in action.

Make sure that you are confident about using UTF-8 by reading Character Sets / Character Encoding Issues and Handling UTF-8 with PHP, or PHP et UTF-8 for French readers.

You should also get familiar with the concept of Unicode Normalization and Grapheme Clusters.

Do not blindly replace all use of PHP's string functions. Most of the time you will not need to, and you will be introducing a significant performance overhead to your application.
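
One reason native functions are often sufficient: in well-formed UTF-8, the bytes of one character never appear inside the encoding of another, so byte-oriented search and split operations remain safe as long as the result is not cut at an arbitrary byte offset. The following is only an illustrative sketch with made-up sample data, not a list of which functions the library considers safe.

```php
<?php
// "naïve,café,日本" written as explicit UTF-8 bytes.
$csv = "na\xC3\xAFve,caf\xC3\xA9,\xE6\x97\xA5\xE6\x9C\xAC";

// Splitting on an ASCII delimiter never cuts a multi-byte sequence.
$fields = explode(',', $csv);

// strpos() returns a byte offset, which is exactly what substr() expects,
// so the two can be combined without any UTF-8 awareness.
$needle = "caf\xC3\xA9";
$offset = strpos($csv, $needle);
$found = substr($csv, $offset, strlen($needle));

var_dump(count($fields), $found === $needle);
```

The grapheme-aware versions become necessary when an offset or length is meant in user-perceived characters, for instance when truncating text to a display length.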

Screen your input at the outer perimeter so that only well-formed UTF-8 passes through. When dealing with badly formed UTF-8, you should not try to fix it (see Unicode Security Considerations). Instead, treat it as CP-1252 and use Patchwork\Utf8::utf8_encode() to get a UTF-8 string. Also remember to choose one Unicode normalization form and stick to it. NFC is now the de facto standard. Patchwork\Utf8::filter() implements this behavior: it converts from CP-1252 and normalizes to NFC.
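
The policy described above can be sketched with native functions as follows. This is not the library's implementation: to_utf8_nfc() is a hypothetical name, and the sketch assumes that mb_convert_encoding() is available (natively or through the portability layer) and normalizes only when a Normalizer class exists.

```php
<?php
// Hypothetical helper illustrating the policy above; not the library's code.
function to_utf8_nfc($s)
{
    // An empty pattern with the "u" flag matches only well-formed UTF-8.
    if (1 !== @preg_match('//u', $s)) {
        // Malformed input: reinterpret the bytes as CP-1252 instead of repairing them.
        $s = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }

    // Normalize to NFC when a Normalizer class (intl or a shim) is available.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
        if (false !== $normalized) {
            $s = $normalized;
        }
    }

    return $s;
}

// "caf" followed by the single CP-1252 byte for "é" becomes valid UTF-8.
var_dump(to_utf8_nfc("caf\xE9") === "caf\xC3\xA9");
```

In an application that already loads the library, calling Patchwork\Utf8::filter() on incoming data is the documented way to get this behavior.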

This library is incompatible with mbstring.func_overload and will not work if that php.ini setting is enabled.

Licensing

Patchwork\Utf8 is free software; you can redistribute it and/or modify it under the terms of the (at your option):

Unicode handling requires tedious work to implement and maintain in the long run. As such, contributions such as unit tests, bug reports, comments or patches licensed under both licenses are very welcome.

I hope many projects will adopt this code and together help solve Unicode handling for PHP.

Owner
Nicolas Grekas