Convert HTML to Markdown with PHP

Overview

HTML To Markdown for PHP


Library which converts HTML to Markdown for your sanity and convenience.

Requires: PHP 7.2+

Lead Developer: @colinodell

Original Author: @nickcernis

Why convert HTML to Markdown?

"What alchemy is this?" you mutter. "I can see why you'd convert Markdown to HTML," you continue, already labouring the question somewhat, "but why go the other way?"

Typically you would convert HTML to Markdown if:

  1. You have an existing HTML document that needs to be edited by people with good taste.
  2. You want to store new content in HTML format but edit it as Markdown.
  3. You want to convert HTML email to plain text email.
  4. You know a guy who's been converting HTML to Markdown for years, and now he can speak Elvish. You'd quite like to be able to speak Elvish.
  5. You just really like Markdown.

How to use it

Require the library by issuing this command:

composer require league/html-to-markdown

Add require 'vendor/autoload.php'; to the top of your script.

Next, create a new HtmlConverter instance, passing in your valid HTML code to its convert() function:

use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter();

$html = "<h3>Quick, to the Batpoles!</h3>";
$markdown = $converter->convert($html);

The $markdown variable now contains the Markdown version of your HTML as a string:

echo $markdown; // ==> ### Quick, to the Batpoles!

The included demo directory contains an HTML->Markdown conversion form to try out.
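
Putting the steps above together, a complete script might look like the sketch below. It assumes the package was installed with Composer into a vendor/ directory next to the script. The try/catch reflects the 5.0.0 release note that convert() may throw a \RuntimeException, and calling the converter as a function follows the 4.2.0 release note; the error message wording is purely illustrative.

```php
<?php

require 'vendor/autoload.php'; // assumes Composer's default vendor/ directory

use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter();
$html = '<h3>Quick, to the Batpoles!</h3>';

try {
    $markdown = $converter->convert($html);
} catch (\RuntimeException $e) {
    // Since 5.0.0, unexpected DOMDocument-related errors surface as exceptions
    echo 'Conversion failed: ', $e->getMessage(), PHP_EOL;
    exit(1);
}

echo $markdown, PHP_EOL; // ### Quick, to the Batpoles!

// Since 4.2.0 the converter object can also be invoked like a function
$same = $converter($html);
```

Catching the exception at the call site keeps a single malformed document from aborting a larger batch job.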

Conversion options

By default, HTML To Markdown preserves HTML tags without Markdown equivalents, like <span> and <div>.

To strip HTML tags that don't have a Markdown equivalent while preserving the content inside them, set strip_tags to true, like this:

$converter = new HtmlConverter(array('strip_tags' => true));

$html = '<span>Turnips!</span>';
$markdown = $converter->convert($html); // $markdown now contains "Turnips!"

Or more explicitly, like this:

$converter = new HtmlConverter();
$converter->getConfig()->setOption('strip_tags', true);

$html = '<span>Turnips!</span>';
$markdown = $converter->convert($html); // $markdown now contains "Turnips!"

Note that only the tags themselves are stripped, not the content they hold.

To strip tags and their content, pass a space-separated list of tags in remove_nodes, like this:

$converter = new HtmlConverter(array('remove_nodes' => 'span div'));

$html = '<span>Turnips!</span><div>Monkeys!</div>';
$markdown = $converter->convert($html); // $markdown now contains ""

By default, all comments are stripped from the content. To preserve them, use the preserve_comments option, like this:

$converter = new HtmlConverter(array('preserve_comments' => true));

$html = '<span>Turnips!</span><!-- Monkeys! -->';
$markdown = $converter->convert($html); // $markdown now contains "Turnips!<!-- Monkeys! -->"

To preserve only specific comments, set preserve_comments with an array of strings, like this:

$converter = new HtmlConverter(array('preserve_comments' => array('Eggs!')));

$html = '<span>Turnips!</span><!-- Monkeys! --><!-- Eggs! -->';
$markdown = $converter->convert($html); // $markdown now contains "Turnips!<!-- Eggs! -->"

By default, placeholder links are preserved. To strip the placeholder links, use the strip_placeholder_links option, like this:

$converter = new HtmlConverter(array('strip_placeholder_links' => true));

$html = '<a>Github</a>';
$markdown = $converter->convert($html); // $markdown now contains "Github"
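
The options above can also be combined in a single array. The following is a minimal sketch, assuming a Composer install in vendor/; the input reuses the elements from the earlier examples, and the exact whitespace of the result is not asserted here.

```php
<?php

require 'vendor/autoload.php'; // assumes Composer's default vendor/ directory

use League\HTMLToMarkdown\HtmlConverter;

// Several options can share one array; each key is described above
$converter = new HtmlConverter(array(
    'strip_tags'              => true,  // drop <span> but keep its text
    'remove_nodes'            => 'div', // drop <div> together with its content
    'strip_placeholder_links' => true,  // drop <a> tags that have no href
));

$html = '<span>Turnips!</span><div>Monkeys!</div><a>Github</a>';
$markdown = $converter->convert($html);

// The removed <div> content should no longer appear in the result
var_dump(strpos($markdown, 'Monkeys!') === false);
```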

Style options

By default, bold tags are converted using the asterisk syntax. Italic tags were originally converted using the underscore syntax, but per the 4.8.0 release notes the italic_style option now defaults to '*'. Change these by using the bold_style and italic_style options.

$converter = new HtmlConverter();
$converter->getConfig()->setOption('italic_style', '*');
$converter->getConfig()->setOption('bold_style', '__');

$html = '<em>Italic</em> and a <strong>bold</strong>';
$markdown = $converter->convert($html); // $markdown now contains "*Italic* and a __bold__"

Line break options

By default, br tags are converted to two spaces followed by a newline character as per traditional Markdown. Set hard_break to true to omit the two spaces, as per GitHub Flavored Markdown (GFM).

$converter = new HtmlConverter();
$html = '<p>test<br>line break</p>';

$converter->getConfig()->setOption('hard_break', true);
$markdown = $converter->convert($html); // $markdown now contains "test\nline break"

$converter->getConfig()->setOption('hard_break', false); // default
$markdown = $converter->convert($html); // $markdown now contains "test  \nline break"

Autolinking options

By default, a tags are converted to the easiest possible link syntax, i.e. if no text or title is available, then the <url> syntax will be used rather than the full [url](url) syntax. Set use_autolinks to false to change this behavior to always use the full link syntax.

$converter = new HtmlConverter();
$html = '<p><a href="https://thephpleague.com">https://thephpleague.com</a></p>';

$converter->getConfig()->setOption('use_autolinks', true); // default
$markdown = $converter->convert($html); // $markdown now contains "<https://thephpleague.com>"

$converter->getConfig()->setOption('use_autolinks', false);
$markdown = $converter->convert($html); // $markdown now contains "[https://thephpleague.com](https://thephpleague.com)"

Passing custom Environment object

You can pass a custom Environment object to control, for example, which converters are used.

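use League\HTMLToMarkdown\Converter\HeaderConverter;
use League\HTMLToMarkdown\Environment;
use League\HTMLToMarkdown\HtmlConverter;

$environment = new Environment(array(
    // your configuration here
));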
$environment->addConverter(new HeaderConverter()); // optionally - add converter manually

$converter = new HtmlConverter($environment);

$html = '<h3>Header</h3>
<img src="" />
';
$markdown = $converter->convert($html); // $markdown now contains "### Header" and "<img src="" />"

Table support

Support for Markdown tables is not enabled by default because tables are not part of the original Markdown syntax. To use tables, add the converter explicitly:

use League\HTMLToMarkdown\HtmlConverter;
use League\HTMLToMarkdown\Converter\TableConverter;

$converter = new HtmlConverter();
$converter->getEnvironment()->addConverter(new TableConverter());

$html = "<table><tr><th>A</th></tr><tr><td>a</td></tr></table>";
$markdown = $converter->convert($html);

Limitations

  • Markdown Extra, MultiMarkdown and other variants aren't supported – just Markdown.

Style notes

  • Setext (underlined) headers are the default for H1 and H2. If you prefer the ATX style for H1 and H2 (# Header 1 and ## Header 2), set header_style to 'atx' in the options array when you instantiate the object:

    $converter = new HtmlConverter(array('header_style'=>'atx'));

    Headers of H3 priority and lower always use atx style.

  • Links and images are referenced inline. Footnote references (where image src and anchor href attributes are listed in the footnotes) are not used.

  • Blockquotes aren't line wrapped – it makes the converted Markdown easier to edit.
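
To make the difference between the two header styles concrete, here is a small self-contained sketch. The renderHeading() helper is hypothetical and is not part of the library; it only mimics the rule described above, where Setext applies to H1 and H2 and everything else uses ATX.

```php
<?php

/**
 * Hypothetical helper (not part of the library) showing the two heading styles.
 */
function renderHeading(string $text, int $level, string $style = 'setext'): string
{
    // Setext only exists for levels 1 and 2; deeper levels always fall back to ATX
    if ($style === 'setext' && $level <= 2) {
        // Underline length is cosmetic; strlen() counts bytes, which is fine for ASCII
        $underline = str_repeat($level === 1 ? '=' : '-', strlen($text));

        return $text . "\n" . $underline;
    }

    return str_repeat('#', $level) . ' ' . $text;
}

echo renderHeading('Header 1', 1), "\n\n";        // "Header 1" underlined with "========"
echo renderHeading('Header 1', 1, 'atx'), "\n\n"; // # Header 1
echo renderHeading('Header 3', 3), "\n";          // ### Header 3
```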

Dependencies

HTML To Markdown requires PHP's xml, libxml, and dom extensions, all of which are enabled by default on most distributions.

Errors such as "Fatal error: Class 'DOMDocument' not found" on distributions such as CentOS that disable PHP's xml extension can be resolved by installing php-xml.
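
A quick way to check for this before installing is a short script like the one below. It is a sketch that assumes the extensions are reported by PHP under the names dom, libxml, and xml; adjust the list if your build names them differently.

```php
<?php

// Extension names as reported by PHP; treat this list as an assumption
$required = array('dom', 'libxml', 'xml');

// Keep only the extensions that are not currently loaded
$missing = array_values(array_filter($required, function (string $ext): bool {
    return !extension_loaded($ext);
}));

if ($missing !== array()) {
    echo 'Missing PHP extensions: ', implode(', ', $missing), PHP_EOL;
} else {
    echo 'All required extensions are loaded', PHP_EOL;
}
```

Running it with the same PHP binary that serves your application matters, since CLI and web server configurations can load different extensions.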

Contributors

Many thanks to all contributors so far. Further improvements and feature suggestions are very welcome.

How it works

HTML To Markdown creates a DOMDocument from the supplied HTML, walks through the tree, and converts each node to a text node containing the equivalent Markdown, starting from the most deeply nested node and working outwards towards the root node.
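
The bottom-up walk can be illustrated with a simplified sketch that uses only PHP's built-in DOM classes. This is not the library's code: the toMarkdown() function is hypothetical, handles only three tags, and returns strings rather than replacing nodes in the tree.

```php
<?php

/**
 * Simplified illustration of the bottom-up walk; NOT the library's code.
 * Only <h3>, <strong> and <em> are handled; other tags keep just their content.
 */
function toMarkdown(DOMNode $node): string
{
    // Text nodes are the leaves: return their text unchanged
    if ($node instanceof DOMText) {
        return $node->textContent;
    }

    // Convert all children first, so the deepest nodes are handled first
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= toMarkdown($child);
    }

    // Then wrap the already-converted content according to the current tag
    switch ($node->nodeName) {
        case 'h3':
            return '### ' . $inner . "\n\n";
        case 'strong':
            return '**' . $inner . '**';
        case 'em':
            return '*' . $inner . '*';
        default:
            return $inner;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<h3>Quick, to the <em>Batpoles</em>!</h3>');

$result = trim(toMarkdown($doc->documentElement));
echo $result, PHP_EOL; // ### Quick, to the *Batpoles*!
```

Because children are converted before their parent, the <em> element is already Markdown text by the time the <h3> wrapper is applied.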

To-do

  • Support for nested lists and lists inside blockquotes.
  • Offer an option to preserve tags as HTML if they contain attributes that can't be represented with Markdown (e.g. style).

Trying to convert Markdown to HTML?

Use one of these great libraries:

No guarantees about the Elvish, though.

Comments
  • Escape setext heading underlines

    Version(s) affected

    5.1.0

    Description

    According to https://spec.commonmark.org/0.30/#setext-heading-underline a line containing any number of =s makes the line above it a heading 1.

    How to reproduce

    HTML:

    <p>Foo<br>=<br>Bar</p>
    

    Output:

    Foo  
    =  
    Bar
    

    Expected output:

    Foo  
    \=  
    Bar
    
    bug commonmark-compatibility character-escaping 
    opened by olli7 0
  • Line breaks inside tag

    Version(s) affected

    5.0.2

    Description

    Line breaks inside tags produce incorrect markdown

    How to reproduce

    HTML:

    <b>Hello<br><br>World</b>
    

    Output:

    **Hello  
      
    World**
    

    Expected output:

    **Hello**
      
    **World**
    
    bug up-for-grabs commonmark-compatibility 
    opened by multiwebinc 3
  •  <pre class=">

    Version(s) affected

    5.0.2

    Description

    How to reproduce

    HTML:

    <pre class="language-"><code>GET /announcements
     </code></pre>
    

    Output after conversion:

    ```
    <pre class="language-">```
    GET /announcements
    
    ```
    ```
    
    bug commonmark-compatibility 
    opened by kbitlive 1
Releases(5.1.0)
  • 5.1.0(Mar 2, 2022)

  • 5.0.2(Nov 6, 2021)

  • 5.0.1(Sep 17, 2021)

  • 5.0.0(Mar 29, 2021)

    Added

    • Added support for tables (#203)
      • This feature is disabled by default - see README for how to enable it
    • Added new strip_placeholder_links option to strip <a> tags without href attributes (#196)
    • Added new methods to ElementInterface:
      • hasParent()
      • getNextSibling()
      • getPreviousSibling()
      • getListItemLevel()
    • Added several parameter and return types across all classes
    • Added new PreConverterInterface to allow converters to perform any necessary pre-parsing

    Changed

    • Supported PHP versions increased to PHP 7.2 - 8.0
    • HtmlConverter::convert() may now throw a \RuntimeException when unexpected DOMDocument-related errors occur

    Fixed

    • Fixed complex nested lists containing headings and paragraphs (#198)
    • Fixed consecutive emphasis producing incorrect markdown (#202)
  • 4.10.0(Jul 1, 2020)

  • 4.9.1(Dec 28, 2019)

  • 4.9.0(Nov 2, 2019)

  • 4.8.3(Oct 31, 2019)

  • 4.8.2(Aug 2, 2019)

    Fixed

    • Fixed headers not being placed onto a new line in some cases (#172)
    • Fixed handling of links containing spaces (#175)

    Removed

    • Removed support for HHVM
  • 4.8.1(Dec 24, 2018)

    Added

    • Added support for PHP 7.3 :tada:

    Fixed

    • Fixed paragraphs following tables (#165, #166)
    • Fixed incorrect list item escaping (#168, #169)
  • 4.8.0(Sep 18, 2018)

    Added

    • Added support for email auto-linking
    • Added a new interface (HtmlConverterInterface) for the main HtmlConverter class
    • Added additional test cases (#14)

    Changed

    • The italic_style option now defaults to '*' so that in-word emphasis is handled properly (#75)

    Fixed

    • Fixed several issues of <code> and <pre> tags not converting to blocks or inlines properly (#26, #70, #102, #140, #161, #162)
    • Fixed in-word emphasis using underscores as delimiter (#75)
    • Fixed character escaping inside of <div> elements
    • Fixed header edge cases

    Deprecated

    • The bold_style and italic_style options have been deprecated (#75)
  • 4.7.0(May 19, 2018)

    Added

    • Added setOptions() function for chainable calling (#149)
    • Added new list_item_style_alternate option for converting every-other list with a different character (#155)

    Fixed

    • Fixed insufficient newlines after code blocks (#144, #148)
    • Fixed trailing spaces not being preserved in link anchors (#157)
    • Fixed list-like lines not being escaped inside of list items (#159)
  • 4.6.2(Jan 7, 2018)

  • 4.6.1(Jan 1, 2018)

  • 4.6.0(Oct 24, 2017)

  • 4.5.0(Oct 9, 2017)

  • 4.4.1(Mar 16, 2017)

  • 4.4.0(Dec 28, 2016)

    Added

    • Added hard_break configuration option (#112, #115)
    • The HtmlConverter can now be instantiated with an Environment (#118)

    Fixed

    • Fixed handling of paragraphs in list item elements (#47, #110)
    • Fixed phantom spaces when newlines follow br elements (#116, #117)
    • Fixed link converter not sanitizing inner spaces properly (#119, #120)
  • 4.3.1(Oct 27, 2016)

    Changed

    • Revised the sanitization implementation (#109)

    Fixed

    • Fixed tag-like content not being escaped (#67, #109)
    • Fixed thematic break-like content not being escaped (#65, #109)
    • Fixed codefence-like content not being escaped (#64, #109)
  • 4.3.0(Oct 26, 2016)

    Added

    • Added full support for PHP 7.0 and 7.1

    Changed

    • Changed <pre> and <pre><code> conversions to use backticks instead of indentation (#102)

    Fixed

    • Fixed issue where specified code language was not preserved (#70, #102)
    • Fixed issue where <code> tags nested in <pre> were not converted properly (#70, #102)
    • Fixed header-like content not being escaped (#76, #105)
    • Fixed blockquote-like content not being escaped (#77, #103)
    • Fixed ordered list-like content not being escaped (#73, #106)
    • Fixed unordered list-like content not being escaped (#71, #107)
  • 4.2.2(Sep 27, 2016)

  • 4.2.1(May 18, 2016)

    Fixed

    • Fixed path to autoload.php when used as a library (#98)
    • Fixed edge case for tags containing only whitespace (#99)

    Removed

    • Removed double HTML entity decoding, as this is not desirable (#60)
  • 4.2.0(Feb 1, 2016)

    Added

    • Added the ability to invoke HtmlConverter objects as functions (#85)

    Fixed

    • Fixed improper handling of nested list items (#19 and #84)
    • Fixed preceding or trailing spaces within emphasis tags (#83)
  • 4.1.1(Nov 20, 2015)

  • 4.1.0(Oct 29, 2015)

  • 4.0.1(Sep 1, 2015)

    Fixed

    • Added escaping to avoid * and _ in a text being rendered as emphasis (#48)

    Removed

    • Removed the demo (#51)
    • .styleci.yml and CONTRIBUTING.md are no longer included in distributions (#50)
  • 4.0.0(Jul 25, 2015)

    This release changes the visibility of several methods/properties. #42 and #43 brought to light that some visibilities were not ideally set, so this release fixes that. Moving forwards this should reduce the chance of introducing BC-breaking changes.

    Added

    • Added new HtmlConverter::getEnvironment() method to expose the Environment (#42, #43)

    Changed

    • Changed Environment::addConverter() from protected to public, enabling custom converters to be added (#42, #43)
    • Changed HtmlConverter::createDOMDocument() from protected to private
    • Changed Element::nextCached from protected to private
    • Made the Environment class final
  • 3.1.1(Jul 23, 2015)

  • 3.1.0(Jul 20, 2015)

    Added

    • Added new equals method to Element to check for equality

    Changed

    • Use Linux line endings consistently instead of platform-specific line endings (#36)

    Fixed

    • Cleaned up code style
  • 3.0.0(Jul 14, 2015)

    Changed

    • Changed namespace to League\HTMLToMarkdown
    • Changed packagist name to league/html-to-markdown
    • Re-organized code into several separate classes
    • <a> tags with identical href and inner text are now rendered using angle bracket syntax (#31)
    • <div> elements are now treated as block-level elements (#33)
Owner
The League of Extraordinary Packages
A group of developers who have banded together to build solid, well tested PHP packages using modern coding standards.