PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

Sebastien MALOT

Last update: Jan 2, 2023

Overview

PdfParser

Website : https://www.pdfparser.org

Test the API on our demo page.

This project is supported by Actualys.

Features

Features included :

Load/parse objects and headers
Extract metadata (author, description, ...)
Extract text from ordered pages
Support for compressed PDFs
Support for the Mac OS Roman charset encoding
Handling of hexadecimal and octal encoding in text sections
PSR-0 compliant (autoloader)
PSR-1 compliant (code styling)

Currently, secured documents are not supported.

This library is under active maintenance. The author is not actively developing it at the moment, but we welcome any pull request that adds or extends functionality!

Documentation

Read the documentation on the website.

The original PDF Reference files can be downloaded from this URL: http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

For developers: Please read DEVELOPER.md for more information about local development of the PDFParser library.

Installation

Using Composer

Obtain Composer
Run composer require smalot/pdfparser

Use alternate file loader

If you can't use Composer, you can include alt_autoload.php-dist in your project. It loads all required files at once. Afterwards you can use the PDFParser classes.
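Whichever way the classes are loaded, a first extraction might look like the following sketch. The summarizePdf helper is a hypothetical name, not part of the library; it only assumes that the parser object offers parseFile() and that the resulting document offers getDetails() and getText(), as \Smalot\PdfParser\Parser and its document object do.

```php
<?php
/**
 * Hypothetical helper: returns the metadata and the full text of a PDF.
 *
 * $parser is expected to be a \Smalot\PdfParser\Parser; it is typed as a
 * plain object here so the sketch also runs with a stand-in object.
 */
function summarizePdf(object $parser, string $path): array
{
    // parseFile() builds a document object from the file on disk.
    $document = $parser->parseFile($path);

    return [
        'details' => $document->getDetails(), // metadata such as author or title
        'text'    => $document->getText(),    // text of the pages, in page order
    ];
}

// With the library installed (the file name is an assumption):
// require 'vendor/autoload.php';
// $summary = summarizePdf(new \Smalot\PdfParser\Parser(), 'document.pdf');
// echo $summary['text'];
```

Passing the parser in as an argument keeps the helper easy to try out with a fake object before a real PDF is involved.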

License

This library is under the LGPLv3 license.

Comments

Fix encoding for encoding dictionary without Type item.

In the PDF file, the font's internal encoding points to a dictionary without a Type item. The encoding is therefore treated not as an encoding but as a plain PDFObject, and as a result the text is decoded incorrectly (because both the BaseEncoding item and the Differences array are ignored).

This PR fixes it.

It also deletes an unnecessary Unicode string decode test for a file with WinAnsiEncoding text encoding.

It also removes the by-reference passing of the $unicode parameters in the Font class, because it seems to serve no purpose (not sure).

This PR can potentially fix some of the many other open issues related to incorrectly encoded result text.
enhancement needs work fix encoding issues

opened by likemusic 34

Using PdfParser without Composer

Alternative Autoloader built in

Since v0.18.2 you don't need to do the following steps to use PDFParser without Composer. Please check https://github.com/smalot/pdfparser#install for further information on our alternative autoloader.

❗ Outdated

Last checked in 2020

Updated file: vendor-autoload.zip - See https://github.com/smalot/pdfparser/issues/117#issuecomment-673408008

The ../vendor/autoload.php file is generated when we use Composer, and we include it in our scripts to access PdfParser. If we wish to freeze our install and manage it without Composer, we can create this file ourselves with the following content:

<?php
/**
 * this file acts as vendor/autoload.php
 */

/*
Using PDFParser without Composer
Folder structure
================
webroot
  pdfdemos
    INV001.pdf # test PDF file to extract text from for demo
    test.php # our operational demo file
  vendor
    autoload.php
    smalot
      pdfparser # unpack from git master https://github.com/smalot/pdfparser/archive/master.zip release is 0.9.25 dated 2015-09-15
        docs # optional
        samples # optional
        src
          Smalot
            PdfParser
*/

$prerequisites = array();

/**
 * TODO: ADAPT THIS PATH TO pdfparser
 */ 
$pdfparser = '/host/path/to/pdfparser';

$prerequisites['pdfparser'] = array (
    $pdfparser.'/Config.php',
    $pdfparser.'/Parser.php',
    $pdfparser.'/Document.php',
    $pdfparser.'/Header.php',
    $pdfparser.'/PDFObject.php',
    $pdfparser.'/Element.php',
    $pdfparser.'/Encoding.php',
    $pdfparser.'/Font.php',
    $pdfparser.'/Page.php',
    $pdfparser.'/Pages.php',
    $pdfparser.'/Element/ElementArray.php',
    $pdfparser.'/Element/ElementBoolean.php',
    $pdfparser.'/Element/ElementString.php',
    $pdfparser.'/Element/ElementDate.php',
    $pdfparser.'/Element/ElementHexa.php',
    $pdfparser.'/Element/ElementMissing.php',
    $pdfparser.'/Element/ElementName.php',
    $pdfparser.'/Element/ElementNull.php',
    $pdfparser.'/Element/ElementNumeric.php',
    $pdfparser.'/Element/ElementStruct.php',
    $pdfparser.'/Element/ElementXRef.php',
    $pdfparser.'/Encoding/StandardEncoding.php',
    $pdfparser.'/Encoding/ISOLatin1Encoding.php',
    $pdfparser.'/Encoding/ISOLatin9Encoding.php',
    $pdfparser.'/Encoding/MacRomanEncoding.php',
    $pdfparser.'/Encoding/WinAnsiEncoding.php',
    $pdfparser.'/Font/FontCIDFontType0.php',
    $pdfparser.'/Font/FontCIDFontType2.php',
    $pdfparser.'/Font/FontTrueType.php',
    $pdfparser.'/Font/FontType0.php',
    $pdfparser.'/Font/FontType1.php',
    $pdfparser.'/RawData/FilterHelper.php',
    $pdfparser.'/RawData/RawDataParser.php',
    $pdfparser.'/XObject/Form.php',
    $pdfparser.'/XObject/Image.php'
);

foreach($prerequisites as $project => $includes) {
    foreach($includes as $mapping => $file) {
      require_once $file;
    }
}

/*
// Information for comparison with composer
use Datamatrix;
use PDF417;
use QRcode;
use TCPDF;
use TCPDF2DBarcode;
use TCPDFBarcode;
use TCPDF_COLORS;
use TCPDF_FILTERS;
use TCPDF_FONTS;
use TCPDF_FONT_DATA;
use TCPDF_IMAGES;
use TCPDF_IMPORT;
use TCPDF_PARSER;
use TCPDF_STATIC;
*/

We can now create a test.php in the deployment folder (pdfdemos here) with:

<?php
include "../vendor/autoload.php";

$directory = getcwd();
$file = 'INV001.pdf';
$fullfile = $directory . '/' . $file;
$content = '';
$out = '';
$parser = new \Smalot\PdfParser\Parser();

$document = $parser->parseFile($fullfile);
$pages    = $document->getPages();
$page     = $pages[0];
$content  = $page->getText();
$out      = $content;
echo '<pre>' . $out . '</pre>';
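Instead of listing every file by hand, the same effect can be sketched with PHP's spl_autoload_register(), which loads a class file the first time the class is used. The helper name pdfparserClassToPath and the base directory are assumptions for illustration; the mapping assumes the layout from the folder structure in the listing above, where the Smalot\PdfParser namespace corresponds to the src/Smalot/PdfParser directory.

```php
<?php
/**
 * Hypothetical helper: maps a class name in the Smalot\PdfParser namespace
 * to a file path below $baseDir. Returns null for any other class.
 */
function pdfparserClassToPath(string $class, string $baseDir): ?string
{
    $prefix = 'Smalot\\PdfParser\\';
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null; // not one of ours, let other autoloaders handle it
    }

    // Smalot\PdfParser\Element\ElementArray -> Element/ElementArray.php
    $relative = substr($class, strlen($prefix));

    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
}

// TODO: adapt this placeholder path to the PdfParser source directory.
$pdfparser = '/host/path/to/pdfparser';

spl_autoload_register(function (string $class) use ($pdfparser): void {
    $path = pdfparserClassToPath($class, $pdfparser);
    if ($path !== null && is_file($path)) {
        require_once $path;
    }
});
```

This avoids maintaining the file list when classes are added or renamed, at the cost of relying on the namespace-to-directory convention holding for every class.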

EDIT 1 by k00ni: added updated PHP code from @ndmax. Also removed tecnickcom/tcpdf (not needed anymore) and added code highlighting.

documentation

opened by apmuthu 32

Object list not found. Possible secured file.
When trying to encode my PDF, I get the following error:

Object list not found. Possible secured file.

What does this mean?

If I use pdftotext from the command line, I'm able to output the text of my test-file just fine.
opened by johannesjo 30
Inserting white spaces between letters
When I try to extract text from this file (http://billybala.brgweb.com.br/tmp/1435983113.pdf) it inserts white space between letters:

The code:

    $fulltext = 'Full text: ';
    $parser = new \Smalot\PdfParser\Parser();
    $pdfsource = $parser->parseFile($pdf);
    $pages = $pdfsource->getPages();
    $pagecount = count($pages);
    $output .= "Total pages: $pagecount<br>";
    // Loop over each page to extract text.
    foreach ($pages as $page) {
        $fulltext .= utf8_decode($page->getText());
    }
    echo $fulltext;

The output is:

Full text: C 3 9 9 0 0 9 N T O U D e a r E d i t o r : W i t h g r o w i n g p o p u l a t i o n a n d e v e r m o r e a d v a n c e d t e c h n o l o g i e s , n a t u r a l r e s o u r c e s p r o v i d e d b y l a n d c a n n o l o n g e r f u l f i l l t h e i n c r e a s i n g d e m a n d s o f t h e h u m a n p o p u l a t i o n . I n a n a t t e m p t t o e n s u r e t h e i r m a r i n e r i g h t s a n d i n t e r e s t s , m a n y c o u n t r i e s h a v e t a k e n t h e s t e p t o e s t a b l i s h c o m p e t e n t m a r i n e a u t h o r i t i e s . T h e s e a u t h o r i t i e s a r e a s s i g n e d t h e m i s s i o n t o i n t e g r a t e o c e a n p o l i c i e s , a n d h a v e t h e r e s p o n s i b i l i t y t o o v e r s e e v a r i o u s m a r i n e a f f a i r s . T h i s p a p e r g i v e s i n s i g h t i n t o t h e s c o p e o f m a r i n e a f f a i r s , a n d s u m m a r i z e s t h e p r e s e n t s t a t e o f m a r i n e a u t h o r i t y e s t a b l i s h m e n t i n a n u m b e r o f o c e a n s t a t e s i n c l u d i n g t h e U n i t e d S t a t e s , C a n a d a , C h i n a , J a p a n a n d K o r e a . I t g o e s o n t o d i s c u s s t h e h i s t o r y a n d p r o c e s s o f e s t a b l i s h i n g c o m p e t e n t m a r i n e a u t h o r i t i e s i n T a i w a n . T h e T a i w a n e s e G o v e r n m e n t h a s c o n f i r m e d t h e e s t a b l i s h m e n t o f T h e T a s k F o r c e f o r M a r i t i m e A f f a i r s ( T h e T a s k F o r c e ) . I t i s r e s p o n s i b l e f o r t h e i n t e g r a t i o n o f v a r i o u s m a r i n e a u t h o r i t i e s i n T a i w a n a n d t h e n e w l y e s t a b l i s h e d c o m p e t e n t a u t h o r i t y i s s c h e d u l e d t o c o m e i n t o o p e r a t i o n i n J a n u a r y 2 0 1 2 . 
T h e T a s k F o r c e w i l l s e r v e a s a u s e f u l s o u r c e o f r e f e r e n c e f o r m a n y s c h o l a r s w o r k i n g i n t h e m a r i n e a n d o c e a n s c i e n c e d i s c i p l i n e s . F u r t h e r m o r e , t h e i n s i g h t f u l a n a l y s i s p r e s e n t e d i n t h i s p a p e r w i l l e n a b l e y o u r r e a d e r s t o a c q u i r e a b e t t e r u n d e r s t a n d i n g o f t h e s c o p e o f m a r i n e a f f a i r s , t h e s t a t u s o f c o m p e t e n t m a r i n e a u t h o r i t i e s i n c e r t a i n c o u n t r i e s , a n d t h e h i s t o r y a n d p r o c e s s i n t h e e s t a b l i s h m e n t o f s u c h a u t h o r i t i e s i n T a i w a n . F o r t h e r e a s o n s s t a t e d , I f e e l t h i s p a p e r p r o v i d e s h i g h l y v a l u a b l e i n f o r m a t i o n s u i t a b l e f o r p u b l i c a t i o n i n y o u r j o u r n a l . S h o u l d t h e e d i t o r h a v e a n y s u g g e s t i o n o r c o m m e n t p l e a s e d o n o t h e s i t a t e t o c o n t a c t u s , w e s h a l l r e s p o n d i m m e d i a t e l y .
bug help wanted
opened by ricardobrg 27

getDataTm() seems to be broken completely (wrong coordinates and wrongly decoded text content)

I created a very simple PDF for testing, because I had issues getting correct x and y coordinates when using getDataTm() (unfortunately, I can't provide any other PDFs for privacy reasons...)

=> xy.pdf

    $parser = new \Smalot\PdfParser\Parser();
    $pdf    = $parser->parseFile('xy.pdf');
    
    $pages = $pdf->getPages();

    foreach ($pages as $index => $page) {
        if (null === $page) {
            continue;
        }
        
        var_dump($page->getDataTm());
    }

For this PDF, the text is not read correctly when using getDataTm() (interestingly, it works when using getText() instead). Where the text should be, there are what look like hex-encoded strings:

array (size=13)
  0 => 
    array (size=2)
      0 => 
        array (size=6)
          0 => string '1' (length=1)
          1 => string '0' (length=1)
          2 => string '0' (length=1)
          3 => string '1' (length=1)
          4 => string '33.173' (length=6)
          5 => string '665.721' (length=7)
      1 => string '00270050005000430042005300010001' (length=32)
  1 => 
    array (size=2)
      0 => 
        array (size=6)
          0 => string '1' (length=1)
          1 => string '0' (length=1)
          2 => string '0' (length=1)
          3 => string '1' (length=1)
          4 => string '213.173' (length=7)
          5 => string '665.721' (length=7)
      1 => string '00230042005B004300420053' (length=24)
  2 => [...]
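
Each of these strings appears to be a run of two-byte character codes written as four hex digits, which is how text shown with a composite font is commonly stored. Turning such codes into readable text needs the font's code-to-character table; the sketch below only illustrates that lookup step. The function names and the table are hypothetical, chosen for illustration, and this is not the library's own decoding routine.

```php
<?php
/** Splits a hex string such as "00230042" into integer codes [0x23, 0x42]. */
function splitHexCodes(string $hex): array
{
    return array_map('hexdec', str_split($hex, 4));
}

/** Looks up each code in $table; unknown codes become "?". */
function decodeWithTable(array $codes, array $table): string
{
    $text = '';
    foreach ($codes as $code) {
        $text .= $table[$code] ?? '?';
    }
    return $text;
}

// Second string from the dump above.
$codes = splitHexCodes('00230042005B004300420053');

// Made-up table for illustration only; a real one comes from the font.
$table = [0x23 => 'B', 0x42 => 'a', 0x5B => 'z', 0x43 => 'b', 0x53 => 'r'];

echo decodeWithTable($codes, $table), "\n";
```

If getDataTm() returns the raw codes while getText() applies such a table, that would explain why only one of the two produces readable text, though the issue text alone does not confirm this.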

bug

opened by Connum 25

mb_convert_encoding(): Illegal character encoding specified

If I parse a PDF, I get the following error: "mb_convert_encoding(): Illegal character encoding specified" in "/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php Line: 500". The list of supported character encodings doesn't contain any entry for "Mac" or anything similar: http://php.net/manual/de/mbstring.supported-encodings.php.
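
One defensive pattern, sketched here and not taken from the library, is to check an encoding name against mb_list_encodings() before calling mb_convert_encoding(). The function names are hypothetical, and falling back to the unconverted text is just one possible policy.

```php
<?php
/** True if mbstring knows $encoding by its primary name (aliases are not checked). */
function isSupportedEncoding(string $encoding): bool
{
    foreach (mb_list_encodings() as $known) {
        if (strcasecmp($known, $encoding) === 0) {
            return true;
        }
    }
    return false;
}

/** Converts $text to UTF-8, or returns it unchanged if $from is unknown. */
function toUtf8OrUnchanged(string $text, string $from): string
{
    if (!isSupportedEncoding($from)) {
        return $text; // avoids the "Illegal character encoding" error
    }
    return mb_convert_encoding($text, 'UTF-8', $from);
}

var_dump(isSupportedEncoding('UTF-8'));            // bool(true)
var_dump(isSupportedEncoding('No-Such-Encoding')); // bool(false)
```

Returning the text unchanged keeps parsing alive but can leave mis-decoded characters in the output, so a real fix would still need a proper mapping for the missing encoding.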
missing or incomplete functionality help wanted

opened by Ablont 24

PDF content has extra spaces inserted into it

My company uses wkhtmltopdf to generate PDF files. Our automated testing uses pdfparser to verify that the generated content matches what we expect. I'm in the process of updating from v0.11 to v0.18.2. We got the expected output in the older version, but the newer version is inserting extra spaces into the content.

Expected content: Snippet 1 EN
Actual content: S nip pet 1 E N

Note that I have stepped through the code and verified that they are in fact spaces and not other bad unicode characters, and found the code that is inserting the spaces.

I'm not sure of the actual terminology here, since I'm unfamiliar with the PDF format, but I'll do my best to describe what's going on.

The actual "content" of the PDF file is as follows:

/F6 16 Tf 1 0 0 -1 0 0 Tm
8 -15 Td <0001> Tj
10.1562500 0 Td <0002> Tj
10.1406250 0 Td <0003> Tj
4.43750000 0 Td <0004> Tj
10.1562500 0 Td <0004> Tj
10.1562500 0 Td <0005> Tj
9.84375000 0 Td <0006> Tj
6.26562500 0 Td <0007> Tj
5.07812500 0 Td <0008> Tj
10.1718750 0 Td <0007> Tj
5.07812500 0 Td <0009> Tj
10.1093750 0 Td <000A> Tj

From what I understand, Td is an x, y offset, and Tj is an index into an encoding table, as follows:

0001 = 'S'
0002 = 'n'
0003 = 'i'
0004 = 'p'
0005 = 'e'
0006 = 't'
0007 = ' '
0008 = '1'
0009 = 'E'
000A = 'N'

As you can see, the content is correct and does not contain extra spaces. The issue seems to come from the Td command. Some have wider x offsets, causing the following code in pdfparser to insert extra spaces (starting at PDFObject:289, "// horizontal offset"):

                    // move text current point
                    case 'Td':
                        $args = preg_split('/\s/s', $command[self::COMMAND]);
                        $y = array_pop($args);
                        $x = array_pop($args);
                        if (((float) $x <= 0) ||
                            (false !== $current_position_td['y'] && (float) $y < (float) ($current_position_td['y']))
                        ) {
                            // vertical offset
                            $text .= "\n";
                        } elseif (false !== $current_position_td['x'] && (float) $x > (float) (
                                $current_position_td['x']
                            )
                        ) {
                            // horizontal offset
                            $text .= ' ';
                        }
                        $current_position_td = ['x' => $x, 'y' => $y];
                        break;
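
To see how this branch produces the reported output, the sketch below replays only the Td comparison on the operands from the content stream above. It is a simplified stand-in, not the library's code path: replayTdHeuristic is a hypothetical name, and each entry pairs a Td offset with the character that the following Tj maps to in the table above. Note that Td operands are offsets relative to the start of the current line, so comparing one offset with the previous one says little about the real gap between glyphs.

```php
<?php
/**
 * Replays the Td branch: a newline when x <= 0 or y decreases, a space
 * when x is larger than the previous x. $ops holds [x, y, glyph] triples.
 */
function replayTdHeuristic(array $ops): string
{
    $text = '';
    $previous = ['x' => false, 'y' => false];

    foreach ($ops as [$x, $y, $glyph]) {
        if ($x <= 0 || (false !== $previous['y'] && $y < $previous['y'])) {
            $text .= "\n"; // vertical offset
        } elseif (false !== $previous['x'] && $x > $previous['x']) {
            $text .= ' ';  // horizontal offset
        }
        $previous = ['x' => $x, 'y' => $y];
        $text .= $glyph;   // the character shown by the following Tj
    }

    return $text;
}

$ops = [
    [8.0, -15.0, 'S'],     [10.15625, 0.0, 'n'], [10.140625, 0.0, 'i'],
    [4.4375, 0.0, 'p'],    [10.15625, 0.0, 'p'], [10.15625, 0.0, 'e'],
    [9.84375, 0.0, 't'],   [6.265625, 0.0, ' '], [5.078125, 0.0, '1'],
    [10.171875, 0.0, ' '], [5.078125, 0.0, 'E'], [10.109375, 0.0, 'N'],
];

// Prints "S nip pet 1  E N": a space wherever an offset exceeds the previous one.
echo replayTdHeuristic($ops), "\n";
```

The extra spaces land exactly after the narrow glyphs (i, space, 1), whose small advance makes the next offset look like a jump.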

I have tested the exact file in other parsers (gimp, okular, chrome, firefox, edge, word) and all render correctly.

I have attached the file for reference.

36360.00.d28.pdf

bug help wanted

opened by LordMonoxide 23

consider scaling by fontSize(Tf, Tfs) and text matrix (Tm)
closes #532

Previously, the coordinates in getDataTm() were calculated without taking into account the scaling applied by the font size and the text matrix.

In a PDF, that would look like this, for example:

/R9 11.04 Tf (second operand is font size)

0.999402 0 0 1 70.8 698.96 Tm (first and fourth operand set horizontal and vertical scaling)

When the position is changed using the text positioning operators Td or TD, the new coordinates are now calculated with the set scaling.

One line in the test files was changed to reflect the correct scaling.

This pull request also fixes a sign error in the calculation of the y-coordinate in TD.
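
As a worked illustration of why the text matrix matters, the sketch below applies a Td offset to a text matrix using the general PDF rule: the offset is multiplied into the matrix, so it is scaled by the matrix entries. This is the textbook rule, not necessarily the exact arithmetic of this pull request, and it assumes the line matrix equals the matrix just set by Tm. The font size is left out because the description does not say how it enters the calculation. applyTd is a hypothetical name and the offset (10, -12) is made up.

```php
<?php
/**
 * Applies a Td offset ($tx, $ty) to a text matrix [a, b, c, d, e, f].
 * Only the translation part (e, f) changes.
 */
function applyTd(array $tm, float $tx, float $ty): array
{
    [$a, $b, $c, $d, $e, $f] = $tm;

    return [$a, $b, $c, $d, $tx * $a + $ty * $c + $e, $tx * $b + $ty * $d + $f];
}

// Matrix from the Tm example above; the Td offset is an assumption.
$tm = applyTd([0.999402, 0, 0, 1, 70.8, 698.96], 10, -12);

// Prints "x = 80.79402, y = 686.96".
printf("x = %.5f, y = %.2f\n", $tm[4], $tm[5]);
```

Without the scaling, the x coordinate would come out as 80.8, a small error here that grows with larger offsets or stronger scaling.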
enhancement fix
opened by oliver681 19
respect space width when using "Move text position" stream operator

fixes at least #201

maybe #72 too, but I cannot download the PDF anymore

closes https://github.com/smalot/pdfparser/pull/245
maybe closes https://github.com/smalot/pdfparser/pull/256
enhancement needs work fix

opened by PaulBehrendtVentoro 18
Page getDataTm always returns an empty array

Hi,

I'm using version 2.2.1 and trying to find the location of a string in some pages.

The PDF is generated by my own use of TCPDI.php

The getText method works like a charm, but getDataTm always returns an empty array ([]).

Here is the page details:

{
  "Type": "Page",
  "Parent": { "Type": "Pages", "Count": "65" },
  "LastModified": "2022-08-31T13:57:58+02:00",
  "Resources": [],
  "MediaBox": [ 0, 0, 595, 842 ],
  "CropBox": [ 0, 0, 595, 842 ],
  "BleedBox": [ 0, 0, 595, 842 ],
  "TrimBox": [ 0, 0, 595, 842 ],
  "ArtBox": [ 0, 0, 595, 842 ],
  "Contents": { "Filter": "FlateDecode", "Length": "97" },
  "Rotate": "0",
  "Group": { "Type": "Group", "S": "Transparency", "CS": "DeviceRGB" },
  "PZ": "1"
},
help wanted PDF required to demonstrate issue

opened by PauArjona 17
Poll: Is this library ready for 1.0?
There was a discussion in #318 about whether to merge a change that might require adaptations by users of this library. PDFParser follows Semantic Versioning, which has some implications for how to handle that situation. One is to bump the major version from 0.x to 1.0. I will outline the arguments I have heard so far and want to invite all of you to vote. Feel free to comment; I will add new arguments to the lists.

Thank you for helping here

👍 - Yes, jump to 1.0
👎 - No, keep 0.x for now

Arguments against 1.0

A check to some extent is required to determine if all parts of the library (API) are as we want them to be. The focus has to be on the current feature set and behavior behind. Is there something which is fishy and has to be taken care of before 1.0?

Based on Github stats there are almost 800 projects directly or indirectly depending on PDFParser. Some may also use this in production environments generating money. PDFParser is an Open Source library with no obligations, but in my opinion we can't just change something and leave developers out in the cold.

If we just bump the version we acknowledge that the current state is fine. But if we encounter major problems in the near future we might be forced to bump again..

Currently we receive many fixes and some features, for instance from @PaulBehrendtVentoro, @Connum and @izabala.

That is great! But because of that, it is important to know whether API changes are planned as part of future contributions.

As long as we have 0.x, changes in API (or behavior) won't be merged, but in case a (basic) check like in (1.) was conducted, we can collect API changes and put them together into a new 1.x release.

Arguments for 1.0

Semantic Versioning allows the bump from 0.x to 1.0 at any point. Developers "know" that this might happen and that their code might be affected.

People who use the Composer constraint "^0.x" or stay at 0.x are not affected and shouldn't experience any problems.

Make changes in API or behavior optional and allow developers to enable them via parameter.

Implications here

[ ] prepare a new release

[ ] add an UPGRADE file to inform developers of required code changes

BTW.: The vote may lead our decision to keep #314 or not.
help wanted
opened by k00ni 17

Releases(v2.3.0)

v2.3.0(Dec 22, 2022)
What's Changed

#561 Added optional getText() argument to return limited number of document pages if set by @alesrebec in https://github.com/smalot/pdfparser/pull/562

look here for an example

consider scaling by fontSize(Tf, Tfs) and text matrix (Tm) by @oliver681 in https://github.com/smalot/pdfparser/pull/559

Extend Github workflow file to run tests in Windows environment by @k00ni in https://github.com/smalot/pdfparser/pull/566

New Contributors

@alesrebec made their first contribution in https://github.com/smalot/pdfparser/pull/562

@oliver681 made their first contribution in https://github.com/smalot/pdfparser/pull/559

Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.2...v2.3.0
v2.2.2(Dec 6, 2022)
What's Changed

fix: allow to parse null or empty header element by @DogLoc in https://github.com/smalot/pdfparser/pull/560

behind the scenes: updated development tools and check PHP 8.1/8.2 support by @k00ni (https://github.com/smalot/pdfparser/pull/553, https://github.com/smalot/pdfparser/pull/563)

Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.1...v2.2.2
v2.2.1(May 3, 2022)
What's Changed

Handle null Header and falsy header elements by @dsuurlant in https://github.com/smalot/pdfparser/pull/525

New Contributors

@dsuurlant made their first contribution in https://github.com/smalot/pdfparser/pull/525

Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.0...v2.2.1
v2.2.0(Apr 12, 2022)
What's Changed

Rework documentation by @rubenvanerk in https://github.com/smalot/pdfparser/pull/513

fixes #520 (missing r in composer command in README.md) by @k00ni in https://github.com/smalot/pdfparser/pull/521

Added font info to dataTm by @shtayerc in https://github.com/smalot/pdfparser/pull/516

Added calculateTextWidth function to Font by @shtayerc in https://github.com/smalot/pdfparser/pull/517

Add issue template by @rubenvanerk in https://github.com/smalot/pdfparser/pull/524

New Contributors

@shtayerc made their first contribution in https://github.com/smalot/pdfparser/pull/516

Full Changelog: https://github.com/smalot/pdfparser/compare/v2.1.0...v2.2.0
v2.1.0(Feb 3, 2022)
What's Changed

Fix encoding for encoding dictionary without Type item. by @likemusic in https://github.com/smalot/pdfparser/pull/500

Added decodeMemoryLimit to Config to avoid memory leaks. by @b3n-l in https://github.com/smalot/pdfparser/pull/476

added short example how to parse base64 encoded PDFs by @granjero in https://github.com/smalot/pdfparser/pull/493

Make horizontal offset configurable by @rubenvanerk in https://github.com/smalot/pdfparser/pull/505

Link docs to wiki instead of pdfparser.org by @rubenvanerk in https://github.com/smalot/pdfparser/pull/506

Add return types to tests methods. Fix todos in phpDocs. Add method's descriptions for Font class. by @likemusic in https://github.com/smalot/pdfparser/pull/509

Full Changelog: https://github.com/smalot/pdfparser/compare/v2.0.1...v2.1.0
v2.0.1(Nov 23, 2021)
Bugfix release

For PHP 7 users: In 2.0.0 we used a function which is PHP 8 only. It was fixed in #486.

Font.php: Optimization of the uchr function by @mariuszkrzaczkowski in https://github.com/smalot/pdfparser/pull/467

Fix Scrutinizer-integration: mark PageTest::testGetTextPullRequest457 as "memory-heavy" by @k00ni in https://github.com/smalot/pdfparser/pull/481

Fixes #478 (/Index problem) by @yasheena in https://github.com/smalot/pdfparser/pull/479

Full Changelog: https://github.com/smalot/pdfparser/compare/v2.0.0...v2.0.1
v2.0.0(Nov 16, 2021)
Breaking Changes

❗All function parameters and return types are typed now. That means that if you pass values which do not fit, you may receive TypeErrors. Most of it was done internally and should not bother you. In case you use internal functions, please check your code before going into production.

We initially decided to release 1.2.0 but finally jumped to 2.0.0 to ship the BC breaks in a major release instead (see https://github.com/smalot/pdfparser/issues/480)
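
As a generic PHP illustration of what typed signatures change (pageLabel is a made-up function, not part of PDFParser): a value that cannot be coerced to the declared type raises a TypeError instead of passing through silently.

```php
<?php
/** Made-up function with a typed parameter and a typed return value. */
function pageLabel(int $pageNumber): string
{
    return 'Page ' . $pageNumber;
}

echo pageLabel(3), "\n"; // Page 3

try {
    // A non-numeric string cannot be converted to int.
    echo pageLabel('three'), "\n";
} catch (TypeError $e) {
    echo 'TypeError caught', "\n";
}
```

Code that only calls the documented public methods with the expected argument types should not run into this.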

Highlights

massive code refactoring (thanks to @jee7, #440)

workaround to enable FPDFs (thanks to @izabala, #453)

Added cache for Documents object cache dictionary, which also results in better performance in some cases (thanks to @jee7, #434)

prevent endless loops during Page->getText() in some cases (thanks to @Nickmanbear, #457)

Fixes invalid return type on unknown glyphs (thanks to @PrinsFrank, #459)

Fix TypeError on Document::getFirstFont when no fonts are available (thanks to @PrinsFrank, #461)

Fix TypeError on default font when no fonts are available (#466, thanks to @PrinsFrank)

Fix extractRawData, extractDecodedRawData, getDataTm and getDataXY not working with a PDF file produced by FPDI/FPDF (#454, thanks to @izabala)

Test backend was improved by @j0k3r (#460)

v1.2.0-RC2(Oct 18, 2021)
❗Not production ready - We reworked our code base and added typed parameters as well as return values. If you find anything, please drop us a comment. Further information can be found at https://github.com/smalot/pdfparser/issues/468. Thank you in advance!❗

Changes since v1.2.0-RC1

Fix TypeError on default font when no fonts are available (#466, thanks to @PrinsFrank)

Fix extractRawData, extractDecodedRawData, getDataTm and getDataXY not working with a PDF file produced by FPDI/FPDF (#454, thanks to @izabala)

Further information about changes and fixes in 1.2.0 can be found here: https://github.com/smalot/pdfparser/releases/tag/v1.2.0-RC1
v1.2.0-RC1(Oct 15, 2021)
Bug fix and performance release

❗Not production ready - We reworked our code base and added typed parameters as well as return values. If you find anything, please drop us a comment. Further information can be found at https://github.com/smalot/pdfparser/issues/468. Thank you in advance!❗

Highlights:

massive code refactoring (thanks to @jee7, #440)

workaround to enable FPDFs (thanks to @izabala, #453)

Added cache for Documents object cache dictionary, which also results in better performance in some cases (thanks to @jee7, #434)

prevent endless loops during Page->getText() in some cases (thanks to @Nickmanbear, #457)

Fixes invalid return type on unknown glyphs (thanks to @PrinsFrank, #459)

Fix TypeError on Document::getFirstFont when no fonts are available (thanks to @PrinsFrank, #461)

@j0k3r improved our test backend.
v1.1.0(Aug 16, 2021)
Maintenance and small performance boost

PDFs with images can be parsed with less resource consumption (like memory) from now on. @Connum added a feature with #441 to ignore image data. It must be enabled manually though. You can do it easily:

    use Smalot\PdfParser\Config;
    use Smalot\PdfParser\Parser;

    $config = new Config();
    $config->setRetainImageContent(false);
    $parser = new Parser([], $config);
    // $parser->parseFile (...)

Besides that, we fixed a problem with Scrutinizer (part of our test infrastructure).
v1.0.2(Jun 21, 2021)
Bugfix release

Don't throw an exception if there is no base encoding defined (as of PDF 1.5 Reference Table 5.11) - #433, thanks @LucianoHanna

v1.0.1(Jun 8, 2021)
Bugfix release

Fixed decode octal regex (#421, thanks @gdiasb12)

Fixed remaining places which use Config class and threw exceptions (#420, #424, thanks @TivoSoho)

v1.0.0(Apr 28, 2021)
Highlights

Removed support for PHP 5.6 and 7.0, requires at least PHP 7.1 or newer❗

extended Config.php with white space characters: it allows developers to override regex for white space recognition (#411, thanks @LucianoHanna)

Fixed some test-infrastructure related issues (#412, #413, #414)

v0.19.0(Apr 14, 2021)
Bugfix and feature release

Features:

Add support for PDF 1.5 Xref stream (#400, thanks @smalot)

Add support for Reversed Chars instruction in BMC blocs (#402, thanks @smalot)

Fixes:

Encoding::__toString complies with PHP specification from now on (#407, thanks @igor-krein and others from #85)

fix Call to a member function getFontSpaceLimit() on null (#406, thanks @xfolder)

Consider all PDF white-space characters in object header (#405, thanks @LucianoHanna)

v0.18.2(Feb 25, 2021)
Maintenance release

Bugfix for #391 (Uncaught Error: Call to undefined method Smalot\PdfParser\Header::__toString() in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php) (thanks @fsmoak)

Addition of an alternative autoloader for non-Composer installations (#388). Based on the work of @apmuthu and others from #117.

v0.18.1(Jan 12, 2021)

Bug fix release

Fixes an infinite loop (and memory leak) if xref table is corrupted. For more information see #377 and #372. Thanks @partulaj!
v0.18.0(Dec 30, 2020)
🎆 Happy new year release! 🧨

A few bug fixes and improvements.

Fixes:

Implemented missing __toString method in Encoding.php (thanks @tomlutzenberger, #378).

In Header.php make sure init is only called if $element is of type Element (thanks @lukgru, #380).

Improvements:

Improved performance in ElementName.php (thanks @mardc21, #369)

Added a config object to adapt default values like font space limit (thanks @k00ni, #375). Further values may be ported in future versions.

Switch from Travis to Github Actions (thanks @j0k3r, #376)

v0.17.1(Oct 30, 2020)

Hot fix release for a problem in PdfParser\Encoding\PostScriptGlyphs.php, for instance:

Notice: Undefined offset: 67 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091

Related issues: #359, #360
v0.17.0(Oct 12, 2020)
Bug fix release with a few improvements and a new composer dependency.

Highlights:

added symfony/polyfill-mbstring to improve PHP 8 support (#337)

reverted 4f4fd10 and preserving fix for #260, fixing #319, #322 and #334 (#342)

revived #257: Properly decode ANSI encodings (#349)

allow for line breaks when splitting xrefs for id and position, fixes #19 (#345)

Document::getPages() should only ever return elements of type 'Page' (#350)

rely on getTextArray() in getDataTm() to extract the texts (#340)

fix missing BT command before each section (could result in wrong coordinates) and its resetting of Tm (#341)

v0.16.2(Aug 31, 2020)
Bugfix release.

Fixes

Fix missing catalog bug (+ some code refactoring) #312, thanks @PaulBehrendtVentoro

Handle corrupted PDF #328

Fix error when fonts aren't available #324, thanks @wivaku

0.16.1(Jun 29, 2020)
Bugfix release.

Fixes

array access on integer for php7.4 - #310, #267 - thanks @PaulBehrendtVentoro

mb_convert_encoding(): Illegal character encoding specified - #313, #229 - thanks @daneren2005

v0.16.0(Jun 19, 2020)
This release contains a lot of refinements and some fixes.

New features

get text for a given set of coordinates

to do that use Page::getTextXY - function details

related pull request: #297, thanks @izabala

Changes

Composer dependencies:

removed tecnickcom/tcpdf - see #299

removed atoum/atoum

added phpunit/phpunit

we ported all Atoum tests to PHPUnit - see #300

added further tools (like Scrutinizer, PHPStan) to improve maintenance for us and help PDFParser hackers

allow tests to run on PHP 8

v0.15.1(May 29, 2020)
🛠 It's a small maintenance update.

We raised some dependencies to ensure people aren't running the library with badly outdated dependencies. For example, we raised tecnickcom/tcpdf to ^6.2.22 to ensure people aren't running the version containing a security issue (see https://packagist.org/packages/tecnickcom/tcpdf/advisories?version=2463879).

The library wasn't tested for PHP versions < 5.6, so we set the minimum PHP version to 5.6. There is a new test build that checks that the library runs fine on the lowest dependencies available.

We also introduced PHP-CS-Fixer (mainly for development) to ensure the coding style is OK.

Last but not least, there are new maintainers of the lib along with @smalot:

@amooij

@k00ni

@j0k3r

Merged PRs:

Define lowest deps #290

Add FriendsOfPHP/PHP-CS-Fixer to "require-dev" to enforce coding styles #292

v0.15.0(Apr 21, 2020)

v0.14.0(Jan 23, 2019)

v0.13.3(Jan 11, 2019)

v0.13.2(Jun 23, 2018)

v0.13.1(Jun 22, 2018)

v0.13.0(Jun 22, 2018)

v0.12.0(Mar 16, 2018)

Owner

Sebastien MALOT

Organic is beautiful!


Gravity PDF is a GPLv2-licensed WordPress plugin that allows you to automatically generate, email and download PDF documents using Gravity Forms.

Gravity PDF Gravity PDF is a GPLv2-licensed WordPress plugin that allows you to automatically generate, email and download PDF documents using the pop

90 Nov 14, 2022

Magento 2 Invoice PDF Generator - helps you to customize the pdf templates for Magento 2

Magento 2 Invoice PDF Generator - helps you to customize the pdf templates for Magento 2. If you have an enabled template and a default template for the store you need your template the system will print the pdf template.

64 Oct 18, 2021

Offers tools for creating pdf files.

baldeweg/pdf-bundle Offers tools for creating pdf files. Getting Started composer req baldeweg/pdf-bundle Activate the bundle in your config/bundles.p

0 Oct 13, 2022

Generate pdf file with printable labels

printable_labels_pdf Generate pdf file with printable labels with PHP code. CREATE A PDF FILE WITH LABELS EASELY: You can get a pdf file with labels f

5 Sep 22, 2022

PHP library generating PDF files from UTF-8 encoded HTML

mPDF is a PHP library which generates PDF files from UTF-8 encoded HTML. It is based on FPDF and HTML2FPDF (see CREDITS), with a number of enhancement

3.8k Jan 2, 2023

PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. Wrapper for wkhtmltopdf/wkhtmltoimage

Snappy Snappy is a PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. It uses the excellent webkit-based wkhtmltopd

4.1k Dec 30, 2022

Official clone of PHP library to generate PDF documents and barcodes

TCPDF PHP PDF Library Please consider supporting this project by making a donation via PayPal category Library author Nicola Asuni info@tecnick.com co

3.6k Jan 6, 2023

TCPDF - PHP PDF Library - https://tcpdf.org

tc-lib-pdf PHP PDF Library UNDER DEVELOPMENT (NOT READY) UPDATE: CURRENTLY ALL THE DEPENDENCY LIBRARIES ARE ALMOST COMPLETE BUT THE CORE LIBRARY STILL

1.3k Dec 30, 2022

Pdf and graphic files generator library written in php

Information Examples Sample documents are in the "examples" directory. "index.php" file is the web interface to browse examples, "cli.php" is a consol

335 Nov 26, 2022

PHP library allowing PDF generation or snapshot from an URL or an HTML page. Wrapper for Kozea/WeasyPrint

PhpWeasyPrint PhpWeasyPrint is a PHP library allowing PDF generation from an URL or an HTML page. It's a wrapper for WeasyPrint, a smart solution help

23 Oct 28, 2022

Official clone of PHP library to generate PDF documents and barcodes

TCPDF PHP PDF Library Please consider supporting this project by making a donation via PayPal category Library author Nicola Asuni info@tecnick.com co

3.6k Dec 26, 2022

PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page.

Snappy Snappy is a PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. It uses the excellent webkit-based wkhtmltopd

4.1k Dec 30, 2022

HTML to PDF converter for PHP

Dompdf Dompdf is an HTML to PDF converter At its heart, dompdf is (mostly) a CSS 2.1 compliant HTML layout and rendering engine written in PHP. It is

9.3k Jan 1, 2023

A PHP tool that helps you write eBooks in markdown and convert to PDF.

Artwork by Eric L. Barnes and Caneco from Laravel News ❤️ . This PHP tool helps you write eBooks in markdown. Run ibis build and an eBook will be gene

1.6k Jan 2, 2023

Generate simple PDF invoices with PHP

InvoiScript Generate simple PDF invoices with PHP. Installation Run: composer require mzur/invoiscript Usage Example use Mzur\InvoiScript\Invoice; re

16 Aug 24, 2022

FPDI is a collection of PHP classes facilitating developers to read pages from existing PDF documents and use them as templates in FPDF.

FPDI - Free PDF Document Importer ❗ This document refers to FPDI 2. Version 1 is deprecated and development is discontinued. ❗ FPDI is a collection of

821 Jan 4, 2023
