PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.


PdfParser

Website: https://www.pdfparser.org

Test the API on our demo page.

This project is supported by Actualys.

Features

Features included:

  • Load/parse objects and headers
  • Extract metadata (author, description, ...)
  • Extract text from ordered pages
  • Support of compressed PDFs
  • Support of Mac OS Roman charset encoding
  • Handling of hexadecimal and octal encoding in text sections
  • PSR-0 compliant (autoloader)
  • PSR-1 compliant (code styling)

Currently, secured documents are not supported.

This library is under active maintenance. Its author is not actively developing it at the moment, but we welcome any pull request that adds or extends functionality!

Documentation

Read the documentation on the website.

The original PDF Reference files can be downloaded from this URL: http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

For developers: Please read DEVELOPER.md for more information about local development of the PDFParser library.

Installation

Using Composer

  • Obtain Composer
  • Run composer require smalot/pdfparser

Use alternate file loader

If you can't use Composer, you can include alt_autoload.php-dist in your project. It loads all required files at once. Afterwards, you can use the PdfParser classes.

License

This library is under the LGPLv3 license.

Comments
  • Fix encoding for encoding dictionary without Type item.

    In the PDF file, the font's internal encoding points to a dictionary without a Type item. The encoding was therefore treated not as an encoding but as a plain PDFObject, and as a result the text was decoded incorrectly (because both the BaseEncoding item and the Differences array item were ignored).

    This PR fixes it.

    An unnecessary Unicode string decode test for a file with WinAnsiEncoding text encoding has also been deleted.

    The by-reference $unicode parameters in the Font class have also been removed, because they seem to serve no purpose (not sure).

    This PR may also fix some of the many other open issues related to incorrectly encoded result text.

    enhancement needs work fix encoding issues 
    opened by likemusic 34
  • Using PdfParser without Composer

    Alternative Autoloader built in

    Since v0.18.2 you don't need to do the following steps to use PDFParser without Composer. Please check https://github.com/smalot/pdfparser#install for further information on our alternative autoloader.


    :exclamation: Outdated

    Last checked in 2020

    Updated file: vendor-autoload.zip - See https://github.com/smalot/pdfparser/issues/117#issuecomment-673408008

    Composer generates ../vendor/autoload.php, and we include that file in our scripts to access PdfParser. If we wish to freeze our install and manage it without Composer, the file can be created by hand with the following content:

    <?php
    /**
     * this file acts as vendor/autoload.php
     */
    
    /*
    Using PDFParser without Composer
    Folder structure
    ================
    webroot
      pdfdemos
        INV001.pdf # test PDF file to extract text from for demo
        test.php # our operational demo file
      vendor
        autoload.php
        smalot
          pdfparser # unpack from git master https://github.com/smalot/pdfparser/archive/master.zip release is 0.9.25 dated 2015-09-15
            docs # optional
            samples # optional
            src
              Smalot
                PdfParser
    */
    
    $prerequisites = array();
    
    /**
     * TODO: ADAPT THIS PATH TO pdfparser
     */ 
    $pdfparser = '/host/path/to/pdfparser';
    
    $prerequisites['pdfparser'] = array (
        $pdfparser.'/Config.php',
        $pdfparser.'/Parser.php',
        $pdfparser.'/Document.php',
        $pdfparser.'/Header.php',
        $pdfparser.'/PDFObject.php',
        $pdfparser.'/Element.php',
        $pdfparser.'/Encoding.php',
        $pdfparser.'/Font.php',
        $pdfparser.'/Page.php',
        $pdfparser.'/Pages.php',
        $pdfparser.'/Element/ElementArray.php',
        $pdfparser.'/Element/ElementBoolean.php',
        $pdfparser.'/Element/ElementString.php',
        $pdfparser.'/Element/ElementDate.php',
        $pdfparser.'/Element/ElementHexa.php',
        $pdfparser.'/Element/ElementMissing.php',
        $pdfparser.'/Element/ElementName.php',
        $pdfparser.'/Element/ElementNull.php',
        $pdfparser.'/Element/ElementNumeric.php',
        $pdfparser.'/Element/ElementStruct.php',
        $pdfparser.'/Element/ElementXRef.php',
        $pdfparser.'/Encoding/StandardEncoding.php',
        $pdfparser.'/Encoding/ISOLatin1Encoding.php',
        $pdfparser.'/Encoding/ISOLatin9Encoding.php',
        $pdfparser.'/Encoding/MacRomanEncoding.php',
        $pdfparser.'/Encoding/WinAnsiEncoding.php',
        $pdfparser.'/Font/FontCIDFontType0.php',
        $pdfparser.'/Font/FontCIDFontType2.php',
        $pdfparser.'/Font/FontTrueType.php',
        $pdfparser.'/Font/FontType0.php',
        $pdfparser.'/Font/FontType1.php',
        $pdfparser.'/RawData/FilterHelper.php',
        $pdfparser.'/RawData/RawDataParser.php',
        $pdfparser.'/XObject/Form.php',
        $pdfparser.'/XObject/Image.php'
    );
    
    foreach($prerequisites as $project => $includes) {
        foreach($includes as $mapping => $file) {
          require_once $file;
        }
    }
    
    /*
    // Information for comparison with composer
    use Datamatrix;
    use PDF417;
    use QRcode;
    use TCPDF;
    use TCPDF2DBarcode;
    use TCPDFBarcode;
    use TCPDF_COLORS;
    use TCPDF_FILTERS;
    use TCPDF_FONTS;
    use TCPDF_FONT_DATA;
    use TCPDF_IMAGES;
    use TCPDF_IMPORT;
    use TCPDF_PARSER;
    use TCPDF_STATIC;
    */
    

    We can now create a test.php in the deployment folder (pdfdemos here) with:

    <?php
    // Load the hand-written autoloader shown above.
    require_once __DIR__ . '/../vendor/autoload.php';
    
    $file   = __DIR__ . '/INV001.pdf';
    $parser = new \Smalot\PdfParser\Parser();
    
    $document = $parser->parseFile($file);
    $pages    = $document->getPages();
    // Extract the text of the first page, if the document has one.
    $content  = isset($pages[0]) ? $pages[0]->getText() : '';
    // Escape the text so that characters such as "<" display correctly in HTML.
    echo '<pre>' . htmlspecialchars($content) . '</pre>';
    

    EDIT 1 by k00ni: added updated PHP code from @ndmax. Also removed tecnickcom/tcpdf (not needed anymore) and added code highlighting.

    documentation 
    opened by apmuthu 32
  • Object list not found. Possible secured file.

    When trying to parse my PDF, I get the following error:

    Object list not found. Possible secured file.
    

    What does this mean?

    If I use pdftotext from the command line, I'm able to output the text of my test-file just fine.

    opened by johannesjo 30
  • Inserting white spaces between letters

    When I try to extract text from this file (http://billybala.brgweb.com.br/tmp/1435983113.pdf), it inserts white space between letters:

    The code:

       // $pdf holds the path of the PDF file to parse.
       $fulltext = 'Full text: ';
       $output = '';
       $parser = new \Smalot\PdfParser\Parser();
       $pdfsource = $parser->parseFile($pdf);
       $pages  = $pdfsource->getPages();
       $pagecount = count($pages);
       $output .= "Total pages: $pagecount<br>";
       // Loop over each page to extract text.
       foreach ($pages as $page) {
           $fulltext .= utf8_decode($page->getText());
       }
       echo $fulltext;
    

    The output is:

    Full text: C 3 9 9 0 0 9 N T O U D e a r E d i t o r : W i t h g r o w i n g p o p u l a t i o n a n d e v e r m o r e a d v a n c e d t e c h n o l o g i e s , n a t u r a l r e s o u r c e s p r o v i d e d b y l a n d c a n n o l o n g e r f u l f i l l t h e i n c r e a s i n g d e m a n d s o f t h e h u m a n p o p u l a t i o n . I n a n a t t e m p t t o e n s u r e t h e i r m a r i n e r i g h t s a n d i n t e r e s t s , m a n y c o u n t r i e s h a v e t a k e n t h e s t e p t o e s t a b l i s h c o m p e t e n t m a r i n e a u t h o r i t i e s . T h e s e a u t h o r i t i e s a r e a s s i g n e d t h e m i s s i o n t o i n t e g r a t e o c e a n p o l i c i e s , a n d h a v e t h e r e s p o n s i b i l i t y t o o v e r s e e v a r i o u s m a r i n e a f f a i r s . T h i s p a p e r g i v e s i n s i g h t i n t o t h e s c o p e o f m a r i n e a f f a i r s , a n d s u m m a r i z e s t h e p r e s e n t s t a t e o f m a r i n e a u t h o r i t y e s t a b l i s h m e n t i n a n u m b e r o f o c e a n s t a t e s i n c l u d i n g t h e U n i t e d S t a t e s , C a n a d a , C h i n a , J a p a n a n d K o r e a . I t g o e s o n t o d i s c u s s t h e h i s t o r y a n d p r o c e s s o f e s t a b l i s h i n g c o m p e t e n t m a r i n e a u t h o r i t i e s i n T a i w a n . T h e T a i w a n e s e G o v e r n m e n t h a s c o n f i r m e d t h e e s t a b l i s h m e n t o f T h e T a s k F o r c e f o r M a r i t i m e A f f a i r s ( T h e T a s k F o r c e ) . I t i s r e s p o n s i b l e f o r t h e i n t e g r a t i o n o f v a r i o u s m a r i n e a u t h o r i t i e s i n T a i w a n a n d t h e n e w l y e s t a b l i s h e d c o m p e t e n t a u t h o r i t y i s s c h e d u l e d t o c o m e i n t o o p e r a t i o n i n J a n u a r y 2 0 1 2 . 
T h e T a s k F o r c e w i l l s e r v e a s a u s e f u l s o u r c e o f r e f e r e n c e f o r m a n y s c h o l a r s w o r k i n g i n t h e m a r i n e a n d o c e a n s c i e n c e d i s c i p l i n e s . F u r t h e r m o r e , t h e i n s i g h t f u l a n a l y s i s p r e s e n t e d i n t h i s p a p e r w i l l e n a b l e y o u r r e a d e r s t o a c q u i r e a b e t t e r u n d e r s t a n d i n g o f t h e s c o p e o f m a r i n e a f f a i r s , t h e s t a t u s o f c o m p e t e n t m a r i n e a u t h o r i t i e s i n c e r t a i n c o u n t r i e s , a n d t h e h i s t o r y a n d p r o c e s s i n t h e e s t a b l i s h m e n t o f s u c h a u t h o r i t i e s i n T a i w a n . F o r t h e r e a s o n s s t a t e d , I f e e l t h i s p a p e r p r o v i d e s h i g h l y v a l u a b l e i n f o r m a t i o n s u i t a b l e f o r p u b l i c a t i o n i n y o u r j o u r n a l . S h o u l d t h e e d i t o r h a v e a n y s u g g e s t i o n o r c o m m e n t p l e a s e d o n o t h e s i t a t e t o c o n t a c t u s , w e s h a l l r e s p o n d i m m e d i a t e l y .

    bug help wanted 
    opened by ricardobrg 27
  • getDataTm() seems to be broken completely (wrong coordinates and wrongly decoded text content)

    I created a very simple PDF for testing, because I had issues getting correct x and y coordinates when using getDataTm() (unfortunately, I can't provide any other PDFs for privacy reasons...)

    => xy.pdf

        $parser = new \Smalot\PdfParser\Parser();
        $pdf    = $parser->parseFile('xy.pdf');
        
        $pages = $pdf->getPages();
    
        foreach ($pages as $index => $page) {
            if (null === $page) {
                continue;
            }
            
            var_dump($page->getDataTm());
        }
    

    For this PDF, the text is not read correctly when using getDataTm() (interestingly, it works when using getText() instead). Where the text should be, there are what look like hex-encoded strings:

    array (size=13)
      0 => 
        array (size=2)
          0 => 
            array (size=6)
              0 => string '1' (length=1)
              1 => string '0' (length=1)
              2 => string '0' (length=1)
              3 => string '1' (length=1)
              4 => string '33.173' (length=6)
              5 => string '665.721' (length=7)
          1 => string '00270050005000430042005300010001' (length=32)
      1 => 
        array (size=2)
          0 => 
            array (size=6)
              0 => string '1' (length=1)
              1 => string '0' (length=1)
              2 => string '0' (length=1)
              3 => string '1' (length=1)
              4 => string '213.173' (length=7)
              5 => string '665.721' (length=7)
          1 => string '00230042005B004300420053' (length=24)
      2 => [...]
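
    Both strings in the dump have a length that is a multiple of four. One plausible reading (an assumption, not something this report confirms) is that each group of four hex digits is one two-byte character code. A minimal sketch that splits such a string into its codes; `splitHexCodes` is a hypothetical helper, not part of PdfParser:

```php
<?php
/**
 * Split a hex string into 4-digit groups and return them as integers.
 * Hypothetical helper for inspecting getDataTm() output; not a library function.
 *
 * @return int[]
 */
function splitHexCodes(string $hex): array
{
    // Accept only complete groups of four hex digits.
    if (preg_match('/^(?:[0-9A-Fa-f]{4})+$/', $hex) !== 1) {
        throw new InvalidArgumentException('Expected a hex string whose length is a multiple of 4.');
    }

    // Each 4-digit group is read as one big-endian 16-bit code.
    return array_map('hexdec', str_split($hex, 4));
}

// First text entry from the dump above.
$codes = splitHexCodes('00270050005000430042005300010001');
echo implode(' ', $codes), PHP_EOL;
```

    These codes are not necessarily Unicode code points. Turning them into readable text would need the font's character mapping, which is not part of this report.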
    
    bug 
    opened by Connum 25
  • mb_convert_encoding(): Illegal character encoding specified

    If I parse a PDF, I get the following error: "mb_convert_encoding(): Illegal character encoding specified" in "/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php Line: 500". The list of supported character encodings doesn't contain any entry for "Mac" or anything similar: http://php.net/manual/de/mbstring.supported-encodings.php.
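
    One way to avoid this kind of error in calling code is to check the encoding name against `mb_list_encodings()` before converting. The sketch below is only an illustration of that guard: `convertIfSupported` is a hypothetical helper, and returning the input unchanged for unknown encodings is an assumption, not how the library handles the case.

```php
<?php
/**
 * Convert $text to UTF-8 only if mbstring knows the source encoding.
 * Hypothetical helper; the name and the fallback behaviour are assumptions.
 */
function convertIfSupported(string $text, string $fromEncoding): string
{
    // mb_list_encodings() returns the canonical encoding names mbstring supports.
    $supported = array_map('strtolower', mb_list_encodings());

    if (!in_array(strtolower($fromEncoding), $supported, true)) {
        // Unknown encoding: return the input untouched instead of triggering an error.
        return $text;
    }

    return mb_convert_encoding($text, 'UTF-8', $fromEncoding);
}

echo convertIfSupported("caf\xE9", 'ISO-8859-1'), PHP_EOL; // known encoding, converted
echo convertIfSupported('abc', 'NoSuchEncoding'), PHP_EOL; // unknown encoding, returned as-is
```

    Which names appear in that list depends on the PHP build, so the check is done at runtime rather than hard-coded.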

    missing or incomplete functionality help wanted 
    opened by Ablont 24
  • PDF content has extra spaces inserted into it

    My company uses wkhtmltopdf to generate PDF files. Our automated testing uses pdfparser to verify that the generated content matches what we expect. I'm in the process of updating from v0.11 to v0.18.2. We got the expected output in the older version, but the newer version is inserting extra spaces into the content.

    Expected content: Snippet 1 EN

    Actual content: S nip pet 1 E N

    Note that I have stepped through the code and verified that they are in fact spaces and not other bad unicode characters, and found the code that is inserting the spaces.

    I'm not sure of the actual terminology here, since I'm unfamiliar with the PDF format, but I'll do my best to describe what's going on.

    The actual "content" of the PDF file is as follows:

    /F6 16 Tf 1 0 0 -1 0 0 Tm
    8 -15 Td <0001> Tj
    10.1562500 0 Td <0002> Tj
    10.1406250 0 Td <0003> Tj
    4.43750000 0 Td <0004> Tj
    10.1562500 0 Td <0004> Tj
    10.1562500 0 Td <0005> Tj
    9.84375000 0 Td <0006> Tj
    6.26562500 0 Td <0007> Tj
    5.07812500 0 Td <0008> Tj
    10.1718750 0 Td <0007> Tj
    5.07812500 0 Td <0009> Tj
    10.1093750 0 Td <000A> Tj 
    

    From what I understand, Td is an x, y offset, and Tj is an index into an encoding table, as follows:

    0001 = 'S'
    0002 = 'n'
    0003 = 'i'
    0004 = 'p'
    0005 = 'e'
    0006 = 't'
    0007 = ' '
    0008 = '1'
    0009 = 'E'
    000A = 'N'
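
    To check that reading, the Tj operands from the content stream can be looked up in the table by hand. A minimal sketch using only the values listed in this report; the variable names are illustrative:

```php
<?php
// Glyph codes in the order of the Tj operators in the content stream above.
$codes = ['0001', '0002', '0003', '0004', '0004', '0005', '0006', '0007', '0008', '0007', '0009', '000A'];

// Encoding table as listed in the report.
$table = [
    '0001' => 'S', '0002' => 'n', '0003' => 'i', '0004' => 'p', '0005' => 'e',
    '0006' => 't', '0007' => ' ', '0008' => '1', '0009' => 'E', '000A' => 'N',
];

// Concatenate the mapped characters without adding any separators.
$text = '';
foreach ($codes as $code) {
    $text .= $table[$code];
}

echo $text, PHP_EOL;
```

    The lookup alone yields the expected content, so any extra spaces have to come from somewhere other than the glyph codes.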
    

    As you can see, the content is correct and does not contain extra spaces. The issue seems to come from the Td command. Some have wider x offsets, causing the following code in pdfparser to insert extra spaces (starting at PDFObject:289, "// horizontal offset"):

                        // move text current point
                        case 'Td':
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $y = array_pop($args);
                            $x = array_pop($args);
                            if (((float) $x <= 0) ||
                                (false !== $current_position_td['y'] && (float) $y < (float) ($current_position_td['y']))
                            ) {
                                // vertical offset
                                $text .= "\n";
                            } elseif (false !== $current_position_td['x'] && (float) $x > (float) (
                                    $current_position_td['x']
                                )
                            ) {
                                // horizontal offset
                                $text .= ' ';
                            }
                            $current_position_td = ['x' => $x, 'y' => $y];
                            break;
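
    The effect of that branch can be replayed outside the library. The sketch below is a simplified stand-in, not the library's code: `replayTd` is a hypothetical function that applies only the quoted comparison logic to the Td operands and glyphs from this report, assuming each Tj simply appends its glyph.

```php
<?php
/**
 * Replay the quoted Td branch over a list of [x, y, glyph] operations.
 * Simplified sketch: only the separator logic is reproduced.
 */
function replayTd(array $operations): string
{
    $text = '';
    $current = ['x' => false, 'y' => false];

    foreach ($operations as [$x, $y, $glyph]) {
        if ($x <= 0 || (false !== $current['y'] && $y < $current['y'])) {
            $text .= "\n"; // treated as a vertical offset
        } elseif (false !== $current['x'] && $x > $current['x']) {
            $text .= ' ';  // treated as a horizontal offset
        }
        $current = ['x' => $x, 'y' => $y];
        $text .= $glyph;   // the Tj that follows the Td
    }

    return $text;
}

// Td operands and decoded glyphs, in stream order.
$operations = [
    [8.0, -15.0, 'S'], [10.15625, 0.0, 'n'], [10.140625, 0.0, 'i'], [4.4375, 0.0, 'p'],
    [10.15625, 0.0, 'p'], [10.15625, 0.0, 'e'], [9.84375, 0.0, 't'], [6.265625, 0.0, ' '],
    [5.078125, 0.0, '1'], [10.171875, 0.0, ' '], [5.078125, 0.0, 'E'], [10.109375, 0.0, 'N'],
];

echo replayTd($operations), PHP_EOL;
```

    A space is added whenever an x operand is larger than the previous one, which gives "S nip pet 1  E N". With the doubled space collapsed, that is the actual content reported above.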
    

    I have tested the exact file in other parsers (gimp, okular, chrome, firefox, edge, word) and all render correctly.

    I have attached the file for reference.

    36360.00.d28.pdf

    bug help wanted 
    opened by LordMonoxide 23
  • consider scaling by fontSize(Tf, Tfs) and text matrix (Tm)

    closes #532

    Previously, the coordinates in getDataTm() were calculated without taking into account the scaling applied by the font size and the text matrix.

    In a PDF, that would look like this, for example:

    • /R9 11.04 Tf (second operand is font size)
    • 0.999402 0 0 1 70.8 698.96 Tm (first and fourth operand set horizontal and vertical scaling)

    When the position is changed using the text positioning operators Td or TD, the new coordinates are now calculated with the set scaling.
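
    For orientation, the PDF specification defines Td as a translation that is multiplied onto the current text line matrix, so the offset is scaled by the matrix entries. The sketch below shows only that rule. `applyTd` is a hypothetical helper, the Td operands are made up, and how this pull request additionally factors in the font size is not shown here.

```php
<?php
/**
 * Apply a Td offset (tx, ty) to a text line matrix [a, b, c, d, e, f].
 * Follows the PDF specification's rule for Td; illustrative only.
 *
 * @param float[] $tm
 * @return float[]
 */
function applyTd(array $tm, float $tx, float $ty): array
{
    [$a, $b, $c, $d, $e, $f] = $tm;

    // Only the translation components change; the offset is scaled by the matrix.
    return [$a, $b, $c, $d, $tx * $a + $ty * $c + $e, $tx * $b + $ty * $d + $f];
}

// Text matrix from the example above; the Td operands (10, -12) are made up for illustration.
$tm = applyTd([0.999402, 0.0, 0.0, 1.0, 70.8, 698.96], 10.0, -12.0);
printf("x = %.5f, y = %.2f\n", $tm[4], $tm[5]);
```

    With this matrix, a horizontal offset of 10 moves x by 9.99402 rather than 10, which is the kind of difference an unscaled calculation would miss.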

    One line in the test files was changed to reflect the correct scaling.

    This pull request also fixes a sign error in the calculation of the y-coordinate in TD.

    enhancement fix 
    opened by oliver681 19
  • respect space width when using "Move text position" stream operator

    fixes at least #201

    maybe #72 too, but I cannot download the PDF anymore

    closes https://github.com/smalot/pdfparser/pull/245

    maybe closes https://github.com/smalot/pdfparser/pull/256

    enhancement needs work fix 
    opened by PaulBehrendtVentoro 18
  • Page getDataTm always return empty array

    Hi,

    I'm using version 2.2.1 and trying to spot location of a string in some pages.

    The PDF is generated by my own use of TCPDI.php

    The getText method works like a charm, but getDataTm always returns an empty array ([]).

    Here is the page details:

    {
        "Type": "Page",
        "Parent": { "Type": "Pages", "Count": "65" },
        "LastModified": "2022-08-31T13:57:58+02:00",
        "Resources": [],
        "MediaBox": [ 0, 0, 595, 842 ],
        "CropBox": [ 0, 0, 595, 842 ],
        "BleedBox": [ 0, 0, 595, 842 ],
        "TrimBox": [ 0, 0, 595, 842 ],
        "ArtBox": [ 0, 0, 595, 842 ],
        "Contents": { "Filter": "FlateDecode", "Length": "97" },
        "Rotate": "0",
        "Group": { "Type": "Group", "S": "Transparency", "CS": "DeviceRGB" },
        "PZ": "1"
    },

    help wanted PDF required to demonstrate issue 
    opened by PauArjona 17
  • Poll: Is this library ready for 1.0?

    There was a discussion in #318 about whether to merge a change that might require adaptations by users of this library. PDFParser follows Semantic Versioning, which has implications for how to handle that situation. One option is to bump the major version from 0.x to 1.0. I will outline the arguments I have heard so far and want to invite all of you to vote. Feel free to comment; I will add new arguments to the lists.

    Thank you for helping here

    :+1: - Yes, jump to 1.0

    :-1: - No, keep 0.x for now

    Arguments against 1.0

    1. Some level of review is required to determine whether all parts of the library (API) are as we want them to be. The focus has to be on the current feature set and the behavior behind it. Is there anything fishy that has to be taken care of before 1.0?
      • Based on Github stats there are almost 800 projects directly or indirectly depending on PDFParser. Some may also use it in production environments that generate money. PDFParser is an Open Source library with no obligations, but in my opinion we can't just change something and leave developers out in the cold.
      • If we just bump the version, we acknowledge that the current state is fine. But if we encounter major problems in the near future, we might be forced to bump again.
    2. Currently we receive many fixes and some features, for instance from @PaulBehrendtVentoro, @Connum and @izabala.
      • That is great! But because of that, it is important to know whether API changes are planned as part of future contributions.
      • As long as we are at 0.x, changes in API (or behavior) won't be merged. But once a (basic) check like the one in (1.) has been conducted, we can collect API changes and put them together into a new 1.x release.

    Arguments for 1.0

    1. Semantic Versioning allows the bump from 0.x to 1.0 at any point. Developers "know" that this might happen and that their code might be affected.
    2. People who use the Composer constraint "^0.x" or otherwise stay at 0.x are not affected and shouldn't experience any problems.
    3. Make changes in API or behavior optional and allow developers to enable them via a parameter.

    Implications here

    • [ ] prepare a new release
    • [ ] add an UPGRADE file to inform developers of required code changes

    BTW: The vote may guide our decision on whether to keep #314.

    help wanted 
    opened by k00ni 17
Releases(v2.3.0)
  • v2.3.0(Dec 22, 2022)

    What's Changed

    • #561 Added optional getText() argument to return limited number of document pages if set by @alesrebec in https://github.com/smalot/pdfparser/pull/562
      • look here for an example
    • consider scaling by fontSize(Tf, Tfs) and text matrix (Tm) by @oliver681 in https://github.com/smalot/pdfparser/pull/559
    • Extend Github workflow file to run tests in Windows environment by @k00ni in https://github.com/smalot/pdfparser/pull/566

    New Contributors

    • @alesrebec made their first contribution in https://github.com/smalot/pdfparser/pull/562
    • @oliver681 made their first contribution in https://github.com/smalot/pdfparser/pull/559

    Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.2...v2.3.0

  • v2.2.2(Dec 6, 2022)

    What's Changed

    • fix: allow to parse null or empty header element by @DogLoc in https://github.com/smalot/pdfparser/pull/560
    • behind the scenes: updated development tools and check PHP 8.1/8.2 support by @k00ni (https://github.com/smalot/pdfparser/pull/553, https://github.com/smalot/pdfparser/pull/563)

    Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.1...v2.2.2

  • v2.2.1(May 3, 2022)

    What's Changed

    • Handle null Header and falsy header elements by @dsuurlant in https://github.com/smalot/pdfparser/pull/525

    New Contributors

    • @dsuurlant made their first contribution in https://github.com/smalot/pdfparser/pull/525

    Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.0...v2.2.1

  • v2.2.0(Apr 12, 2022)

    What's Changed

    • Rework documentation by @rubenvanerk in https://github.com/smalot/pdfparser/pull/513
    • fixes #520 (missing r in composer command in README.md) by @k00ni in https://github.com/smalot/pdfparser/pull/521
    • Added font info to dataTm by @shtayerc in https://github.com/smalot/pdfparser/pull/516
    • Added calculateTextWidth function to Font by @shtayerc in https://github.com/smalot/pdfparser/pull/517
    • Add issue template by @rubenvanerk in https://github.com/smalot/pdfparser/pull/524

    New Contributors

    • @shtayerc made their first contribution in https://github.com/smalot/pdfparser/pull/516

    Full Changelog: https://github.com/smalot/pdfparser/compare/v2.1.0...v2.2.0

  • v2.1.0(Feb 3, 2022)

    What's Changed

    • Fix encoding for encoding dictionary without Type item. by @likemusic in https://github.com/smalot/pdfparser/pull/500
    • Added decodeMemoryLimit to Config to avoid memory leaks. by @b3n-l in https://github.com/smalot/pdfparser/pull/476
    • added short example how to parse base64 encoded PDFs by @granjero in https://github.com/smalot/pdfparser/pull/493
    • Make horizontal offset configurable by @rubenvanerk in https://github.com/smalot/pdfparser/pull/505
    • Link docs to wiki instead of pdfparser.org by @rubenvanerk in https://github.com/smalot/pdfparser/pull/506
    • Add return types to tests methods. Fix todos in phpDocs. Add method's descriptions for Font class. by @likemusic in https://github.com/smalot/pdfparser/pull/509

    Full Changelog: https://github.com/smalot/pdfparser/compare/v2.0.1...v2.1.0

  • v2.0.1(Nov 23, 2021)

    Bugfix release

    For PHP 7 users: In 2.0.0 we used a function which is PHP 8 only. It was fixed in #486.

    • Font.php: Optimization of the uchr function by @mariuszkrzaczkowski in https://github.com/smalot/pdfparser/pull/467
    • Fix Scrutinizer-integration: mark PageTest::testGetTextPullRequest457 as "memory-heavy" by @k00ni in https://github.com/smalot/pdfparser/pull/481
    • Fixes #478 (/Index problem) by @yasheena in https://github.com/smalot/pdfparser/pull/479

    Full Changelog: https://github.com/smalot/pdfparser/compare/v2.0.0...v2.0.1

  • v2.0.0(Nov 16, 2021)

    Breaking Changes

    ❗All function parameters as well as return types of functions are typed now. That means that if you pass values of the wrong type, you may receive type errors. Most of the typing was done internally and should not affect you. If you use internal functions, please check your code before going into production.

    We initially decided to release 1.2.0 but finally jumped to 2.0.0 to include the backward-compatibility break in a major release instead (see https://github.com/smalot/pdfparser/issues/480)

    Highlights

    • massive code refactoring (thanks to @jee7, #440)
    • workaround to enable FPDFs (thanks to @izabala, #453)
    • Added cache for Documents object cache dictionary, which also results in better performance in some cases (thanks to @jee7, #434)
    • prevent endless loops during Page->getText() in some cases (thanks to @Nickmanbear, #457)
    • Fixes invalid return type on unknown glyphs (thanks to @PrinsFrank, #459)
    • Fix TypeError on Document::getFirstFont when no fonts are available (thanks to @PrinsFrank, #461)
    • Fix TypeError on default font when no fonts available (#466, thanks to @PrinsFrank)
    • Fix for extractRawData, extractDecodedRawData, getDataTm and getDataXY not working with PDF files produced by FPDI/FPDF (#454, thanks to @izabala)
    • Test backend was improved by @j0k3r (#460)
  • v1.2.0-RC2(Oct 18, 2021)

    ❗Not production ready - We reworked our code base and added typed parameters as well as return values. If you find anything, please drop us a comment. Further information can be found at https://github.com/smalot/pdfparser/issues/468. Thank you in advance!

    Changes since v1.2.0-RC1

    • Fix TypeError on default font when no fonts available (#466, thanks to @PrinsFrank)
    • Fix for extractRawData, extractDecodedRawData, getDataTm and getDataXY not working with PDF files produced by FPDI/FPDF (#454, thanks to @izabala)

    Further information about changes and fixes in 1.2.0 can be found here: https://github.com/smalot/pdfparser/releases/tag/v1.2.0-RC1

  • v1.2.0-RC1(Oct 15, 2021)

    Bug fix and performance release

    ❗Not production ready - We reworked our code base and added typed parameters as well as return values. If you find anything, please drop us a comment. Further information can be found at https://github.com/smalot/pdfparser/issues/468. Thank you in advance!

    Highlights:

    • massive code refactoring (thanks to @jee7, #440)
    • workaround to enable FPDFs (thanks to @izabala, #453)
    • Added cache for Documents object cache dictionary, which also results in better performance in some cases (thanks to @jee7, #434)
    • prevent endless loops during Page->getText() in some cases (thanks to @Nickmanbear, #457)
    • Fixes invalid return type on unknown glyphs (thanks to @PrinsFrank, #459)
    • Fix TypeError on Document::getFirstFont when no fonts are available (thanks to @PrinsFrank, #461)

    @j0k3r improved our test backend.

  • v1.1.0(Aug 16, 2021)

    Maintenance and small performance boost

    PDFs with images can be parsed with less resource consumption (like memory) from now on. @Connum added a feature with #441 to ignore image data. It must be enabled manually though. You can do it easily:

    use Smalot\PdfParser\Config;
    use Smalot\PdfParser\Parser;
    
    $config = new Config();
    $config->setRetainImageContent(false);
    $parser = new Parser([], $config);
    // $parser->parseFile (...)
    

    Besides that, we fixed a problem with Scrutinizer (part of our test infrastructure).

  • v1.0.2(Jun 21, 2021)

    Bugfix release

    • Don't throw an exception if there is no base encoding defined (as of PDF 1.5 Reference Table 5.11) - #433, thanks @LucianoHanna
  • v1.0.1(Jun 8, 2021)

    Bugfix release

    • Fixed decode octal regex (#421, thanks @gdiasb12)
    • Fixed remaining places which use Config class and threw exceptions (#420, #424, thanks @TivoSoho)
  • v1.0.0(Apr 28, 2021)

    Highlights

    • Removed support for PHP 5.6 and 7.0, requires at least PHP 7.1 or newer❗
    • extended Config.php with white space characters: it allows developers to override regex for white space recognition (#411, thanks @LucianoHanna)
    • Fixed some test-infrastructure related issues (#412, #413, #414)
  • v0.19.0(Apr 14, 2021)

    Bugfix and feature release

    Features:

    • Add support for PDF 1.5 Xref stream (#400, thanks @smalot)
    • Add support for Reversed Chars instruction in BMC blocks (#402, thanks @smalot)

    Fixes:

    • Encoding::__toString complies with PHP specification from now on (#407, thanks @igor-krein and others from #85)
    • fix Call to a member function getFontSpaceLimit() on null (#406, thanks @xfolder)
    • Consider all PDF white-space characters in object header (#405, thanks @LucianoHanna)
  • v0.18.2(Feb 25, 2021)

    Maintenance release

    • Bugfix for #391 (Uncaught Error: Call to undefined method Smalot\PdfParser\Header::__toString() in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php) (thanks @fsmoak)
    • Addition of an alternative autoloader for non-Composer installations (#388). Based on the work of @apmuthu and others from #117.
  • v0.18.1(Jan 12, 2021)

    Bug fix release

    Fixes an infinite loop (and memory leak) if xref table is corrupted. For more information see #377 and #372. Thanks @partulaj!

  • v0.18.0(Dec 30, 2020)

    :fireworks: Happy new year release! :firecracker:

    A few bug fixes and improvements.

    Fixes:

    • Implemented missing __toString method in Encoding.php (thanks @tomlutzenberger, #378).
    • In Header.php make sure init is only called if $element is of type Element (thanks @lukgru, #380).

    Improvements:

    • Improved performance in ElementName.php (thanks @mardc21, #369)
    • Added a config object to adapt default values like font space limit (thanks @k00ni, #375). Further values may be ported in future versions.
    • Switch from Travis to Github Actions (thanks @j0k3r, #376)
  • v0.17.1(Oct 30, 2020)

    Hot fix release for a problem in PdfParser\Encoding\PostScriptGlyphs.php, for instance:

    Notice: Undefined offset: 67 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091

    Related issues: #359, #360

  • v0.17.0(Oct 12, 2020)

    Bug fix release with a few improvements and a new composer dependency.

    Highlights:

    • added symfony/polyfill-mbstring to improve PHP 8 support (#337)
    • reverted 4f4fd10 while preserving the fix for #260, which fixes #319, #322 and #334 (#342)
    • revived #257: Properly decode ANSI encodings (#349)
    • allow for line breaks when splitting xrefs for id and position, fixes #19 (#345)
    • Document::getPages() should only ever return elements of type 'Page' (#350)
    • rely on getTextArray() in getDataTm() to extract the texts (#340)
    • fix missing BT command before each section (could result in wrong coordinates) and its resetting of Tm (#341)
  • v0.16.2(Aug 31, 2020)

    Bugfix release.

    Fixes

    • Fix missing catalog bug (+ some code refactoring) #312, thanks @PaulBehrendtVentoro
    • Handle corrupted PDF #328
    • Fix error when fonts aren't available #324, thanks @wivaku
  • 0.16.1(Jun 29, 2020)

    Bugfix release.

    Fixes

    • array access on integer for PHP 7.4 - #310, #267 - thanks @PaulBehrendtVentoro
    • mb_convert_encoding(): Illegal character encoding specified - #313, #229 - thanks @daneren2005
  • v0.16.0(Jun 19, 2020)

    This release contains a lot of refinements and some fixes.

    New features

    • get text for a given set of coordinates
      • to do that, use Page::getTextXY
      • related pull request: #297, thanks @izabala

    Changes

    • Composer dependencies:
      • removed tecnickcom/tcpdf - see #299
      • removed atoum/atoum
      • added phpunit/phpunit
    • we ported all Atoum tests to PHPUnit - see #300
    • added further tools (like Scrutinizer, PHPStan) to improve maintenance for us and help PDFParser hackers
    • allow tests to run on PHP 8
  • v0.15.1(May 29, 2020)

    🛠 It's a small maintenance update.

    We raised some dependencies so that people aren't running the library with badly outdated dependencies. For example, we raised tecnickcom/tcpdf to ^6.2.22 to ensure people aren't running the version containing a security issue (see https://packagist.org/packages/tecnickcom/tcpdf/advisories?version=2463879).

    The library wasn't tested on PHP versions below 5.6, so the minimum PHP version is now 5.6. There is a new test build which checks that the library runs correctly on the lowest available dependencies.

    We also introduced PHP-CS-Fixer (mainly for development) to enforce a consistent coding style.

    Last but not least, there are new maintainers of the lib along with @smalot:

    • @amooij
    • @k00ni
    • @j0k3r

    Merged PRs:

    • Define lowest deps #290
    • Add FriendsOfPHP/PHP-CS-Fixer to "require-dev" to enforce coding styles #292
  • v0.15.0(Apr 21, 2020)

  • v0.14.0(Jan 23, 2019)

  • v0.13.3(Jan 11, 2019)

  • v0.13.2(Jun 23, 2018)

  • v0.13.1(Jun 22, 2018)

  • v0.13.0(Jun 22, 2018)

  • v0.12.0(Mar 16, 2018)

Owner
Sebastien MALOT
Organic is beautiful!
Gravity PDF is a GPLv2-licensed WordPress plugin that allows you to automatically generate, email and download PDF documents using Gravity Forms.

Gravity PDF Gravity PDF is a GPLv2-licensed WordPress plugin that allows you to automatically generate, email and download PDF documents using the pop

Gravity PDF 90 Nov 14, 2022
Magento 2 Invoice PDF Generator - helps you to customize the pdf templates for Magento 2

Magento 2 Invoice PDF Generator - helps you to customize the pdf templates for Magento 2. If you have an enabled template and a default template for the store you need your template the system will print the pdf template.

EAdesign 64 Oct 18, 2021
Offers tools for creating pdf files.

baldeweg/pdf-bundle Offers tools for creating pdf files. Getting Started composer req baldeweg/pdf-bundle Activate the bundle in your config/bundles.p

André Baldeweg 0 Oct 13, 2022
Generate pdf file with printable labels

printable_labels_pdf Generate pdf file with printable labels with PHP code. CREATE A PDF FILE WITH LABELS EASELY: You can get a pdf file with labels f

Rafael Martin Soto 5 Sep 22, 2022
PHP library generating PDF files from UTF-8 encoded HTML

mPDF is a PHP library which generates PDF files from UTF-8 encoded HTML. It is based on FPDF and HTML2FPDF (see CREDITS), with a number of enhancement

null 3.8k Jan 2, 2023
PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. Wrapper for wkhtmltopdf/wkhtmltoimage

Snappy Snappy is a PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. It uses the excellent webkit-based wkhtmltopd

KNP Labs 4.1k Dec 30, 2022
Official clone of PHP library to generate PDF documents and barcodes

TCPDF PHP PDF Library Please consider supporting this project by making a donation via PayPal category Library author Nicola Asuni [email protected] co

Tecnick.com LTD 3.6k Jan 6, 2023
TCPDF - PHP PDF Library - https://tcpdf.org

tc-lib-pdf PHP PDF Library UNDER DEVELOPMENT (NOT READY) UPDATE: CURRENTLY ALL THE DEPENDENCY LIBRARIES ARE ALMOST COMPLETE BUT THE CORE LIBRARY STILL

Tecnick.com LTD 1.3k Dec 30, 2022
Pdf and graphic files generator library written in php

Information Examples Sample documents are in the "examples" directory. "index.php" file is the web interface to browse examples, "cli.php" is a consol

Piotr Śliwa 335 Nov 26, 2022
PHP library allowing PDF generation or snapshot from an URL or an HTML page. Wrapper for Kozea/WeasyPrint

PhpWeasyPrint PhpWeasyPrint is a PHP library allowing PDF generation from an URL or an HTML page. It's a wrapper for WeasyPrint, a smart solution help

Pontedilana 23 Oct 28, 2022
Official clone of PHP library to generate PDF documents and barcodes

TCPDF PHP PDF Library Please consider supporting this project by making a donation via PayPal category Library author Nicola Asuni [email protected] co

Tecnick.com LTD 3.6k Dec 26, 2022
PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page.

Snappy Snappy is a PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. It uses the excellent webkit-based wkhtmltopd

KNP Labs 4.1k Dec 30, 2022
HTML to PDF converter for PHP

Dompdf Dompdf is an HTML to PDF converter At its heart, dompdf is (mostly) a CSS 2.1 compliant HTML layout and rendering engine written in PHP. It is

null 9.3k Jan 1, 2023
A PHP tool that helps you write eBooks in markdown and convert to PDF.

Artwork by Eric L. Barnes and Caneco from Laravel News ❤️ . This PHP tool helps you write eBooks in markdown. Run ibis build and an eBook will be gene

Mohamed Said 1.6k Jan 2, 2023
Generate simple PDF invoices with PHP

InvoiScript Generate simple PDF invoices with PHP. Installation Run: composer require mzur/invoiscript Usage Example use Mzur\InvoiScript\Invoice; re

Martin Zurowietz 16 Aug 24, 2022
FPDI is a collection of PHP classes facilitating developers to read pages from existing PDF documents and use them as templates in FPDF.

FPDI - Free PDF Document Importer ❗ This document refers to FPDI 2. Version 1 is deprecated and development is discontinued. ❗ FPDI is a collection of

Setasign 821 Jan 4, 2023
Convert HTML to PDF using Webkit (QtWebKit)

wkhtmltopdf and wkhtmltoimage wkhtmltopdf and wkhtmltoimage are command line tools to render HTML into PDF and various image formats using the QT Webk

wkhtmltopdf 13k Jan 4, 2023
Convert html to an image, pdf or string

Convert a webpage to an image or pdf using headless Chrome The package can convert a webpage to an image or pdf. The conversion is done behind the sce

Spatie 4.1k Jan 1, 2023
Laravel Snappy PDF

Snappy PDF/Image Wrapper for Laravel 5 and Lumen 5.1 This package is a ServiceProvider for Snappy: https://github.com/KnpLabs/snappy. Wkhtmltopdf Inst

Barry vd. Heuvel 2.3k Jan 2, 2023