A PHP component to convert HTML into a plain text format

Overview

html2text

html2text is a very simple script that uses DOM methods to convert HTML into a format similar to what would be rendered by a browser - perfect for places where you need a quick text representation. For example:

<html>
<title>Ignored Title</title>
<body>
  <h1>Hello, World!</h1>

  <p>This is some e-mail content.
  Even though it has whitespace and newlines, the e-mail converter
  will handle it correctly.

  <p>Even mismatched tags.</p>

  <div>A div</div>
  <div>Another div</div>
  <div>A div<div>within a div</div></div>

  <a href="http://foo.com">A link</a>

</body>
</html>

Will be converted into:

Hello, World!

This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.

Even mismatched tags.

A div
Another div
A div
within a div

[A link](http://foo.com)

See the original blog post or the related StackOverflow answer.

Installing

You can use Composer to add the package to your project:

{
  "require": {
    "soundasleep/html2text": "~1.1"
  }
}

And then use it quite simply:

$text = \Soundasleep\Html2Text::convert($html);

You can also include the supplied html2text.php and use $text = convert_html_to_text($html); instead.

Options

Option        | Default | Description
ignore_errors | false   | Set to true to ignore any XML parsing errors.
drop_links    | false   | Set to true to render links as just My Link, rather than as [My Link](http://foo.com).

Pass along options as a second argument to convert, for example:

$options = array(
  'ignore_errors' => true,
  // other options go here
);
$text = \Soundasleep\Html2Text::convert($html, $options);

Tests

Some very basic tests are provided in the tests/ directory. Run them with composer install && vendor/bin/phpunit.

Troubleshooting

Class 'DOMDocument' not found

You need to install the PHP XML extension that matches your PHP version. For example, on Debian or Ubuntu with PHP 7.1: apt-get install php7.1-xml

License

html2text is licensed under MIT, making it suitable for both Eclipse and GPL projects.

Other versions

Also see html2text_ruby, a Ruby implementation.

Comments
  • Looking for maintainer

    Hi everyone! I'm no longer a PHP dev and I've run out of capacity to maintain this project, so I'm looking for some maintainers going forward. Alternatively I can archive the project as read-only.

    Ideal criteria:

    • You have at least one project on GitHub
    • You have experience releasing components to Composer

    Other than that I'm happy for maintainers to take this project into whatever direction it needs to go! :)

    For the future of this project I'd suggest some of the most critical tasks are

    • [x] Move CI from travis-ci to Github Actions
    • [x] Update to work under PHP 8 e.g. #87
    opened by soundasleep 12
  • Optimize/improve newline/whitespace handling

    1. Processing really large MS Office-derived HTMLs is really really slow because iterating/modifying DOMNodeLists is really slow. Fixed by integrating with the existing tree walk rather than in-place tree modification
    2. Various newline fixes to bring output more in line with browser rendering. For example, <p>Hello<br></p> is rendered as one trailing newline in browsers, but 2 in the old html2text version.
    3. Armor leading whitespace inside pre blocks to preserve it through the final trimming
    opened by bartbutler 9
  • Remove default value for parameter followed by required parameter

    Parameter with default value followed by required parameter is deprecated in PHP 8. Just removing the default values should not lead to a change in functionality. Ref

    opened by Stadly 7
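The deprecation mentioned above can be illustrated with a minimal sketch. The function name join_words is hypothetical and is not taken from the library; the old signature is shown only as a comment so the file runs without notices.

```php
<?php
// Deprecated since PHP 8.0: a parameter with a default value placed
// before a required one. Kept as a comment so this file runs cleanly.
// function join_words($separator = ' ', array $words) { ... }

// Fixed signature: the default is dropped. Callers already had to pass
// every argument up to the last required one, so call sites are unchanged.
function join_words(string $separator, array $words): string
{
    return implode($separator, $words);
}

echo join_words(' ', ['Hello', 'World']), PHP_EOL;
```

Because the leading default could never actually be used by a caller, removing it changes the declared signature but not how existing calls behave.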
  • add blockquote support

    Pretty self-explanatory. We use html2text to generate plaintext versions of HTML emails, among other things. Block quote tags were previously ignored, but they should not be.

    opened by bartbutler 7
  • Bug with non-breaking spaces in 0.3.0?

    Hi,

    I've found a very weird case where html2text returned broken output after upgrading to 0.3.0, and I was able to narrow it down to this line:

    $html = str_replace("\xa0", " ", $html);

    I examined the contents of $html in a hex editor immediately before and after the line and found this diff. The original source snippet reads "für Ihre ", with the spaces being nbsp's, and is UTF-8 encoded.

    Before: 66 c3 bc 72 c2 a0 49 68 72 65 c2 a0 After: 66 c3 83 c2 bc 72 c3 82 20 49 68 72 65 c3 82 20

    So apparently the nbsp's (c2 a0) have been transformed into c3 82 20, which looks like a regular space (20) but with some gibberish in front of it. Also, the multi-byte character 'ü' (c3 bc) is now c3 83 c2 bc, which is also nonsensical.

    I've downgraded to 0.2.3 and all is fine now, but I'd like to let you know in case you'd like to look into this.

    opened by ulrichsg 6
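The report above is an instance of byte-wise string functions meeting multi-byte UTF-8 text. The sketch below does not reproduce the exact bytes from the report: the sequence c3 83 c2 bc is what results when the bytes c3 bc are read as Latin-1 and re-encoded as UTF-8, which suggests an extra re-encoding step that the issue does not show. The sketch only illustrates, with plain PHP string functions, why replacing the single byte \xa0 in a UTF-8 string is unsafe and how replacing the full c2 a0 sequence avoids the problem. Variable names are illustrative.

```php
<?php
// "für Ihre " with UTF-8 non-breaking spaces (c2 a0), as in the report.
$html = "f\xc3\xbcr\xc2\xa0Ihre\xc2\xa0";

// Byte-wise replacement touches only the second byte of each nbsp,
// leaving a stray 0xc2 lead byte behind (invalid UTF-8).
$broken = str_replace("\xa0", " ", $html);

// Replacing the complete two-byte sequence keeps the string valid.
$safe = str_replace("\xc2\xa0", " ", $html);

// preg_match with the /u flag does not return 1 for invalid UTF-8 subjects.
$isValidUtf8 = static fn (string $s): bool => preg_match('//u', $s) === 1;

echo bin2hex($broken), ' valid=', var_export($isValidUtf8($broken), true), PHP_EOL;
echo bin2hex($safe), ' valid=', var_export($isValidUtf8($safe), true), PHP_EOL;
```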
  • Changes to static functions

    Hi there, I was using this great script in a function and came looking to see if it had been updated. There seem to be quite a lot of changes. However, the changes to static functions seem to break my scenario.

    I am a hobbyist / old-school PHP guy, so any guidance or pointers would be greatly appreciated. The old html2text.php file as a standalone effort was brilliant and worked very well.

    I am getting the following errors:

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag time invalid in Entity, line: 307 in /----->/src/Html2Text.php on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: ID makeComment already defined in Entity, line: 391 in /----->/src/Html2Text.php on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag section invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity, line: 560 in /----->/src/Html2Text.php on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 563 in /----->/src/Html2Text.php on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 576 in /----->/src/Html2Text.php on line 40 on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 589 in /----->/src/Html2Text.php on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 601 in /----->/src/Html2Text.php on line 40

    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag article invalid in Entity, line: 613 in /----->/src/Html2Text.php on line 40

    opened by feelsickened 6
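The warnings above come from libxml, which PHP's DOMDocument uses for parsing. The README documents an ignore_errors option for the converter. Separately, PHP itself offers libxml_use_internal_errors() to collect such errors instead of emitting warnings. The sketch below shows only that general PHP mechanism on a bare DOMDocument; whether the library uses it internally is not stated here, and which errors are reported depends on the installed libxml version.

```php
<?php
$html = '<section><header>Title</header><article>Body</article></section>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Inspect what the parser complained about, then reset the error buffer.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

echo $doc->documentElement->textContent, PHP_EOL;
```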
  • CI Generator

    This pull request includes configuration for running lint, code style checks, and tests within your CI environment. Feel free to commit any additional changes to the shift-53969 branch.

    Before merging, you need to:

    • Review all pull request comments for additional changes
    • Ensure your CI build is running successfully
    opened by edgrosvenor 5
  • 💀 Dead project?

    Hi,

    We've been waiting for an upgrade of this package for compatibility with PHP 8 (#88) for a few months, and even though a fix has been proposed in #86 and #87, no response has been received.

    In fact, the last commit & the last release date back to Feb 2019, and so does the last comment activity from @soundasleep I could pinpoint from a quick search in the issue tracker, even though you look active on GitHub recently.

    Should we consider this project abandoned and fork it, @soundasleep? Or do you need help from fellow maintainers? I'm happy to take over the project, but I don't want to if you may be willing to pursue it at some point.

    I hope you take no offense, it's open-source and it's OK if you cannot/don't want to maintain the project anymore. But please let us know! Thank you.

    opened by BenMorel 5
  • Using < character as input to html2text

    Dear Jevon and Team, we appreciate your effort in maintaining this library. We just started using this library and noticed a small issue that you may have already addressed. Our input HTML text contains a valid '<' character as part of the content (not as part of an HTML tag). The library's DOM parser seems to be stripping that out. Is there a way we can escape that character and send it as input to your library?

    opened by vsoundar 5
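One general approach to the question above is to escape plain-text content before it is embedded in HTML, so that the parser sees &lt; rather than the start of a tag. The sketch below uses PHP's built-in htmlspecialchars(); the sample content and variable names are illustrative, and the resulting $html is what would then be handed to the converter.

```php
<?php
// Plain-text content that legitimately contains '<', '&' and '>'.
$content = 'Use this when x < 10 & y > 3';

// Escape HTML special characters so the parser treats them as text.
$escaped = htmlspecialchars($content, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Embed the escaped text in the markup that is passed to the converter.
$html = '<p>' . $escaped . '</p>';

echo $html, PHP_EOL;
```

This only helps when the caller builds the HTML. Markup that arrives already containing a bare '<' is ambiguous, and the parser has to guess.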
  • Ignore scripting tags, speed improvement

    This PR does two things:

    1. Ignore scripting tags like <?php
    2. Improve iteration speed, especially for really large chunks of HTML text. Apparently iterating over DOMDocuments is very slow unless you do it in a specific way: https://stackoverflow.com/questions/13927221/how-to-improve-performance-iterating-a-domdocument
    opened by bartbutler 5
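The PR text does not spell out what "a specific way" means. One traversal style often recommended for large documents is to follow firstChild/nextSibling pointers instead of indexing into a DOMNodeList. The sketch below assumes that is the kind of change meant; collect_text is a hypothetical helper, not a function from the library.

```php
<?php
/**
 * Concatenate all text beneath $node by following firstChild/nextSibling
 * pointers rather than indexing into $node->childNodes.
 */
function collect_text(DOMNode $node): string
{
    $text = '';
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        // Text nodes contribute their value; other nodes are walked recursively.
        $text .= $child instanceof DOMText ? $child->nodeValue : collect_text($child);
    }
    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Hello <b>World</b></p>');
echo collect_text($doc), PHP_EOL;
```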
  • How to convert a html file to text via php command

    This tool is great. I followed the README and have the Composer environment ready in a Docker container.

    $ cat composer.json
    
    {
      "require": {
        "soundasleep/html2text": "~0.2"
      }
    }
    
    $ docker pull composer/composer
    $ docker run -v $(pwd):/app composer/composer install
    Loading composer repositories with package information
    Installing dependencies (including require-dev)
      - Installing soundasleep/html2text (0.2.3)
        Downloading: 100%
    
    Writing lock file
    Generating autoload files
    
    $ ls -l 
    -rw-r--r--  1 bill  staff        59 26 Nov 17:44 composer.json
    drwxr-xr-x  5 bill  staff       170 26 Nov 17:50 vendor
    -rw-r--r--  1 bill  staff      2322 26 Nov 17:50 composer.lock
    

    So it seems I have installed the dependency properly. What else do I need to do to convert the HTML file to a text file?

    Something like:

    $ cat convert.php
    <?php
    require '/var/www/html/vendor/autoload.php';
    $text = Html2Text\Html2Text::convert($html);
    ?>
    
    $ php convert.php test.html test.txt
    
    Warning: DOMDocument::loadHTML(): Empty string supplied as input in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 43
    
    Fatal error: Uncaught exception 'Html2Text\Html2TextException' with message 'Could not load HTML - badly formed?' in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php:44
    Stack trace:
    #0 /Users/bill/pdf/convert.php(3): Html2Text\Html2Text::convert(NULL)
    #1 {main}
      thrown in /Users/bill/pdf/vendor/soundasleep/html2text/src/Html2Text.php on line 44
    

    So how do I pass the input file (test.html) and get the output file (test.txt) with the php command?

    opened by ozbillwang 5
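In the script above, $html is never assigned, which matches the convert(NULL) frame in the stack trace. A minimal sketch of one way to wire up the command-line arguments follows. convert_file is a hypothetical helper; the vendor/autoload.php location and the Html2Text\Html2Text class name are assumptions taken from the question (the 0.2.x releases), whereas the README above uses \Soundasleep\Html2Text for later releases.

```php
<?php
/**
 * Read HTML from $inputPath, convert it with $converter,
 * and write the result to $outputPath.
 */
function convert_file(string $inputPath, string $outputPath, callable $converter): void
{
    $html = file_get_contents($inputPath);
    if ($html === false) {
        throw new RuntimeException("Could not read $inputPath");
    }
    if (file_put_contents($outputPath, $converter($html)) === false) {
        throw new RuntimeException("Could not write $outputPath");
    }
}

// Command-line entry point: php convert.php test.html test.txt
if (PHP_SAPI === 'cli' && isset($argv[1], $argv[2])) {
    // Assumed Composer autoloader location, next to this script.
    require __DIR__ . '/vendor/autoload.php';
    // Class name as used in the question (0.2.x); adjust for newer releases.
    convert_file($argv[1], $argv[2], ['Html2Text\Html2Text', 'convert']);
}
```

Passing the converter as a callable keeps the file handling separate from the library call, so the helper can be tried out with any string-to-string function.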
  • error for correct url with multiple get params

    >>> $html = '<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>'
    => "<a href="https://www.google.com?utm_source=croix&utm_campaign=croix">ok</a>"
    
    >>> \Soundasleep\Html2Text::convert($html);
    PHP Warning:  DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /Users/maxime/Repos/benevolt-app/vendor/soundasleep/html2text/src/Html2Text.php on line 171
    => "[ok](https://www.google.com?utm_source=croix&utm_campaign=croix)"
    
    >>> $html = '<a href="https://www.google.com?utm_source=croix">ok</a>'
    => "<a href="https://www.google.com?utm_source=croix">ok</a>"
    
    >>> \Soundasleep\Html2Text::convert($html);
    => "[ok](https://www.google.com?utm_source=croix)"
    
    opened by maximepvrt 0
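The "htmlParseEntityRef: expecting ';'" warning is libxml's reaction to a bare & inside the attribute value, which it tries to read as the start of an entity. When the HTML is generated by the caller, one way to avoid the warning is to emit &amp; as the separator. The sketch below uses PHP's built-in http_build_query() for that; for HTML that cannot be changed, the ignore_errors option from the README is the documented alternative.

```php
<?php
$params = ['utm_source' => 'croix', 'utm_campaign' => 'croix'];

// The third argument sets the separator; '&amp;' yields a query string
// that is safe to place inside an HTML attribute.
$url = 'https://www.google.com?' . http_build_query($params, '', '&amp;');

$html = '<a href="' . $url . '">ok</a>';

echo $html, PHP_EOL;
```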
  • Links without text should be discarded

    Hi there!

    $html = "<a href='http://a.com'></a><a href='http://b.com'></a>";
    dd(\Soundasleep\Html2Text::convert($html));
    

    Produces http://a.comhttp://b.com, which produces incorrect HTML if placed through a markdown parser or auto link parser. I think the output should be one of the following, preferring the ones first mentioned

    1. [](http://a.com)[](http://b.com)
    2. Totally empty
    3. http://a.com http://b.com, additional space after each link
    opened by bilogic 0
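Until the library's behaviour for empty links is settled, a caller could remove such anchors before conversion, which corresponds to option 2 in the list above. The sketch below is a rough regex-based workaround, not the library's behaviour; strip_empty_links is a hypothetical helper and it does not handle anchors that contain only nested empty markup.

```php
<?php
/**
 * Remove anchors that contain nothing but whitespace.
 * Rough workaround: nested markup such as <a><span></span></a> is not handled.
 */
function strip_empty_links(string $html): string
{
    // \b ensures only <a ...> is matched, not tags like <abbr>.
    return preg_replace('~<a\b[^>]*>\s*</a>~i', '', $html) ?? $html;
}

$html = "<a href='http://a.com'></a><a href='http://b.com'>B</a>";

echo strip_empty_links($html), PHP_EOL;
```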
  • Incorrect operation of the drop_links option

    Hi, when I call Html2Text::convert("<a href='https://google.ru'></a>", ['drop_links' => true]); I get the href instead of an empty result.

    I think the result should be empty because I use the option 'drop_links' => true.

    opened by sniftaliyev 0
  • Accept DOMDocument input

    Parsing can be slow and sometimes we need the intermediate DOMDocument as well (to process links in code outside Html2Text, for instance).

    This PR restructures convert() so that it can be easily called as two separate functions while returning the DOMDocument to the caller. is_office_document has also been moved into the options array with sensible default behavior.

    There should be no side-effects of this refactor for users.

    opened by bartbutler 0
Releases(2.1.0)
  • 2.1.0(Jan 6, 2023)

  • 2.0.0(Jan 29, 2022)

    This release makes the package compatible with PHP 8.0 and 8.1.

    This change constitutes a major release because in order to do this, we changed the signature of a public method (removed some default values), which breaks backward compatibility.

Owner
Jevon Wright