Extract text from a pdf

This package provides a class to extract text from a pdf.

use Spatie\PdfToText\Pdf;

echo Pdf::getText('book.pdf'); //returns the text from the pdf

Spatie is a webdesign agency based in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Support us

We invest a lot of resources into creating best in class open source packages. You can support us by buying one of our paid products.

We highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Requirements

Behind the scenes this package leverages pdftotext. You can verify whether the binary is installed on your system by issuing this command:

which pdftotext

If it is installed it will return the path to the binary.

To install the binary you can use this command on Ubuntu or Debian:

apt-get install poppler-utils

On a Mac you can install the binary using Homebrew:

brew install poppler

If you're on RedHat, CentOS, Rocky Linux or Fedora use this:

yum install poppler-utils

Installation

You can install the package via composer:

composer require spatie/pdf-to-text

Usage

Extracting text from a pdf is easy.

$text = (new Pdf())
    ->setPdf('book.pdf')
    ->text();

Or easier:

echo Pdf::getText('book.pdf');
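
The issue reports further down this page show the package throwing Spatie\PdfToText\Exceptions\PdfNotFound when the PDF cannot be found and Spatie\PdfToText\Exceptions\CouldNotExtractText when the pdftotext command fails. A minimal sketch of guarding a call against both, assuming the package is installed and Composer's autoloader has been loaded. The helper name extractTextOrNull and the use of error_log are illustrative choices, not part of the package:

```php
<?php

use Spatie\PdfToText\Exceptions\CouldNotExtractText;
use Spatie\PdfToText\Exceptions\PdfNotFound;
use Spatie\PdfToText\Pdf;

// Hypothetical helper (not part of the package): returns the extracted
// text, or null when the PDF is missing or pdftotext fails.
function extractTextOrNull(string $path): ?string
{
    try {
        return Pdf::getText($path);
    } catch (PdfNotFound $e) {
        // The package could not find the PDF at the given path.
        error_log('PDF not found: ' . $e->getMessage());
    } catch (CouldNotExtractText $e) {
        // The pdftotext command failed (for example "Permission denied").
        error_log('Extraction failed: ' . $e->getMessage());
    }

    return null;
}
```

Returning null keeps the caller simple; rethrowing a domain-specific exception would be an equally reasonable choice.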

By default the package will assume that the pdftotext command is located at /usr/bin/pdftotext. If it is located elsewhere, pass its path to the constructor:

$text = (new Pdf('/custom/path/to/pdftotext'))
    ->setPdf('book.pdf')
    ->text();

or as the second parameter to the getText static method:

echo Pdf::getText('book.pdf', '/custom/path/to/pdftotext');
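
When the location of pdftotext differs between machines (the issue reports on this page mention /usr/local/bin and a mingw64 directory), the path could be resolved at runtime instead of being hard-coded. A minimal sketch, assuming a hypothetical helper findBinary that is not part of the package; it scans the directories listed in PATH for an executable file:

```php
<?php

// Hypothetical helper (not part of the package): search the directories
// in $pathEnv (defaults to the PATH environment variable) for an
// executable file called $name. Returns its full path, or null.
function findBinary(string $name, ?string $pathEnv = null): ?string
{
    $pathEnv = $pathEnv ?? (getenv('PATH') ?: '');

    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }

        $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;

        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}
```

The result could then be passed along, for example Pdf::getText('book.pdf', findBinary('pdftotext') ?? '/usr/bin/pdftotext'). On Windows the file name would need to include the .exe extension.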

Sometimes you may want to use pdftotext options. To do so you can set them up using the setOptions method.

$text = (new Pdf())
    ->setPdf('table.pdf')
    ->setOptions(['layout', 'r 96'])
    ->text()
;

or as the third parameter to the getText static method:

echo Pdf::getText('book.pdf', null, ['layout', 'opw myP1$$Word']);

Please note that successive calls to setOptions() will overwrite options passed in during previous calls.

If you need to make multiple calls to add options (for example if you need to pass in default options when creating the Pdf object from a container, and then add context-specific options elsewhere), you can use the addOptions() method:

$text = (new Pdf())
    ->setPdf('table.pdf')
    ->setOptions(['layout', 'r 96'])
    ->addOptions(['f 1'])
    ->text()
;
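
The option strings above are written without the leading hyphen that pdftotext itself expects on the command line (-layout, -r 96, -f 1). The sketch below illustrates one way such strings could be turned into command-line arguments, and how appending options differs from overwriting them. It is an illustration only, not the package's implementation; the function name toArguments is hypothetical:

```php
<?php

// Hypothetical helper for illustration only (not the package's code):
// turn option strings such as 'layout' or 'r 96' into pdftotext arguments.
function toArguments(array $options): array
{
    $arguments = [];

    foreach ($options as $option) {
        $option = trim($option);

        if ($option === '') {
            continue; // ignore empty entries
        }

        // Prefix the flag with a hyphen unless it already has one.
        if ($option[0] !== '-') {
            $option = '-' . $option;
        }

        // Split "flag value" into two arguments; the value may contain spaces.
        foreach (explode(' ', $option, 2) as $part) {
            $arguments[] = $part;
        }
    }

    return $arguments;
}

// Mirrors the example above: setOptions() supplies the first list,
// addOptions() appends to it instead of replacing it.
$options = array_merge(['layout', 'r 96'], ['f 1']);

echo implode(' ', toArguments($options)), PHP_EOL; // prints "-layout -r 96 -f 1"
```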

Change log

Please see CHANGELOG for more information about what has changed recently.

Testing

 composer test

Contributing

Please see CONTRIBUTING for details.

Security

If you've found a bug regarding security please mail [email protected] instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.

Comments
  • Getting error Fatal error: Uncaught Spatie\PdfToText\Exceptions\CouldNotExtractText: The command "'/usr/bin/pdftotext' 'INVP0201_301345038.pdf' '-'" failed.

    Fatal error: Uncaught Spatie\PdfToText\Exceptions\CouldNotExtractText: The command "'/usr/bin/pdftotext' 'INVP0201_301345038.pdf' '-'" failed.

    Exit Code: 126 (Invoked command cannot execute)
    Working directory: /var/www/html/reader/test
    Output:
    ================
    Error Output:
    ================
    sh: 1: exec: /usr/bin/pdftotext: Permission denied in /var/www/html/reader/test/vendor/spatie/pdf-to-text/src/Pdf.php:73
    Stack trace:
    #0 /var/www/html/reader/test/vendor/spatie/pdf-to-text/src/Pdf.php(84): Spatie\PdfToText\Pdf->text()
    #1 /var/www/html/reader/test/index.php(10): Spatie\PdfToText\Pdf::getText('INVP0201_301345...')
    #2 {main}
      thrown in /var/www/html/reader/test/vendor/spatie/pdf-to-text/src/Pdf.php on line 73

    I'm using PHP 7.2 with the latest version of Spatie\PdfToText on the server.

    opened by amitverma-startbitsolutions 9
  • Namespace Error Usage of This code

    Hi,

    I am trying to use this code and here is the function that I am using

    setPdf('book.pdf') ->text(); echo text; } get_txt();

    I am getting this error here:

    PHP Fatal error: Uncaught Error: Class 'Spatie\PdfToText\Pdf' not found in /Users/aminialocal/Documents/project/pdftotext/t.php:18

    Can you help me fix this problem?

    opened by AidaAmini 9
  • Adding the possibility to use pdftotext options

    • the following method is added: Pdf::setOptions
    • Pdf::getText gained a third parameter for specifying options
    • Added an InvalidOption exception, thrown when an added option is malformed.

    An option is considered valid if it is a string which starts with a "-" hyphen.

    opened by nyamsprod 7
  • Make process running configurable (timeout)

    Just noticed Symfony's Process class is used in this package.

    https://github.com/spatie/pdf-to-text/blob/184907c6723b956f8f9b584587bb0c0665b8f7b8/src/Pdf.php#L55

    Some large PDF files take more than 60 seconds to be processed.

    It would be useful to be able to configure how the process is run.

    opened by robsontenorio 5
  • Different results in a Windows Local and Linux Prod

    Hi, I'm trying to read the same file on my CentOS VPN production server that I read on my local Windows development machine, but the results are different. Does anyone know how to resolve this?

    image CENTOS PROD

    image WINDOWS LOCAL

    opened by CasonWebDev 4
  • could not find pdf

    // $return = Storage::disk('local')->put('file.pdf', $request->filename);
    
    	$filename = $request->file('filename')->store('');
    	// $path = storage_path() . '\' . $name;
    	// $file = Storage::get('app/'.$name);
         $return = \Spatie\PdfToText\Pdf::getText('some/16fJLA9iGiOfRCcEot43S9LtSRAYE6mq4ZzeBwdM.pdf', '/mingw64/bin/pdftotext');
    
         return $return;
    

    I'm using Laravel 5.5 and the file is available in my public folder, but every time I hit submit I get this error:

    [2017-09-26 18:08:38] local.ERROR: could not find pdf /some/16fJLA9iGiOfRCcEot43S9LtSRAYE6mq4ZzeBwdM.pdf {"exception":"[object] (Spatie\\PdfToText\\Exceptions\\PdfNotFound(code: 0): could not find pdf /some/16fJLA9iGiOfRCcEot43S9LtSRAYE6mq4ZzeBwdM.pdf at E:\\laragon\\www\\precilyl\\vendor\\spatie\\pdf-to-text\\src\\Pdf.php:23)
    [stacktrace]
    #0 E:\\laragon\\www\\precilyl\\vendor\\spatie\\pdf-to-text\\src\\Pdf.php(46): Spatie\\PdfToText\\Pdf->setPdf('/some/16fJLA9iG...')
    #1 E:\\laragon\\www\\precilyl\\routes\\api.php(28): Spatie\\PdfToText\\Pdf::getText('/some/16fJLA9iG...', '/mingw64/bin/pd...')
    #2 E:\\laragon\\www\\precilyl\\vendor\\laravel\\framework\\src\\Illuminate\\Routing\\Route.php(190): Illuminate\\Routing\\Router->{closure}(Object(Illuminate\\Http\\Request))
    #3 E:\\laragon\\www\\precilyl\\vendor\\laravel\\framework\\src\\Illuminate\\Routing\\Route.php(164): Illuminate\\Routing\\Route->runCallable()
    #4 E:\\laragon\\www\\precilyl\\vendor\\laravel\\framework\\src\\Illuminate\\Routing\\Router.php(610): Illuminate\\Routing\\Route->run()
    #5 E:\\laragon\\www\\precilyl\\vendor\\laravel\\framework\\src\\Illuminate\\Routing\\Pipeline.php(30): Illuminate\\Routing\
    
    opened by vipin733 4
  • Could not read exception when reading from url

    I installed and configured everything and managed to get text from a local file. But when I try to get it from a public URL, it throws a "Could not read" exception.

    Here's my code (I am using a Mac):

    Throws exception:
    return Pdf::getText('{any public url to a pdf file}', '/usr/local/bin/pdftotext');

    Works fine:
    return Pdf::getText('{local file}', '/usr/local/bin/pdftotext');

    opened by TheAngelM97 3
  • Only to get the main content

    Hello

    Is it possible to detect Header and Footer from the PDF file and delete them before transferring them to the article? Only to get the main content.

    Thank you for this repo.

    opened by emresaracoglu 3
  • add timeout config for Symfony Process call

    Fixes #17

    I needed the ability to increase the timeout when calling Symfony Process. Some of the PDFs I am converting to text are rather large and require more time than the default of 60 seconds.

    A test was added to verify that the timeout is set on the Symfony Process by testing if Symfony throws an exception for a negative timeout value.

    I'm happy to modify anything that Spatie's team would want to see reworked or improved.

    Thanks!

    opened by jbraband 2
  • get text of docx file

    I am unable to read files with the docx extension. Things are working fine with doc files. This is the code I use to get the text:

    $file_content = Pdf::getText($user_data['profile_path'], env('WORDTOTEXT_PATH'));

    I have set WORDTOTEXT_PATH as an environment variable. I am using the antiword exe file:

    WORDTOTEXT_PATH='C:\Program Files\Git\mingw64\bin\antiword'

    opened by umairasif 2
  • #pdftotext.exe

    pdftotext is an .exe file, so I want to know whether it will run on a server and, if so, where to put pdftotext.exe in the cPanel file manager.

    $path = 'c:/Program Files/Git/mingw64/bin/pdftotext'; echo Pdf::getText('cosmic.pdf', $path);

    or is there any other alternative for this?

    opened by tvnr690 2
Releases(1.52.0)
  • 1.52.0(Jul 15, 2022)

  • 1.6.0(Jul 14, 2022)

    What's Changed

    • Add detailed Linux distribution about Fedora-based by @peter279k in https://github.com/spatie/pdf-to-text/pull/52
    • add timeout config for Symfony Process call by @jbraband in https://github.com/spatie/pdf-to-text/pull/54

    New Contributors

    • @jbraband made their first contribution in https://github.com/spatie/pdf-to-text/pull/54

    Full Changelog: https://github.com/spatie/pdf-to-text/compare/1.51...1.6.0

  • 1.51(Jan 1, 2022)

    What's Changed

    • Symfony 6.x support by @pich in https://github.com/spatie/pdf-to-text/pull/48

    New Contributors

    • @pich made their first contribution in https://github.com/spatie/pdf-to-text/pull/48

    Full Changelog: https://github.com/spatie/pdf-to-text/compare/1.5.0...1.51

  • 1.5.0(Dec 16, 2021)

    What's Changed

    • Correct file permissions by @peter279k in https://github.com/spatie/pdf-to-text/pull/39
    • Drop Support for PHP 7.3 & Add Support for PHP 8.1 by @nbayramberdiyev in https://github.com/spatie/pdf-to-text/pull/47

    New Contributors

    • @peter279k made their first contribution in https://github.com/spatie/pdf-to-text/pull/39
    • @nbayramberdiyev made their first contribution in https://github.com/spatie/pdf-to-text/pull/47

    Full Changelog: https://github.com/spatie/pdf-to-text/compare/1.4.0...1.5.0

  • 1.4.0(Nov 27, 2020)

  • 1.3.0(Mar 11, 2020)

  • 1.2.0(May 15, 2019)

  • 1.1.0(Mar 7, 2018)

  • 1.0.3(Feb 20, 2018)

  • 1.0.2(Nov 13, 2017)

  • 1.0.1(Mar 16, 2016)

  • 1.0.0(Dec 31, 2015)

  • 0.0.1(Dec 31, 2015)

Owner
Spatie
We create open source, digital products and courses for the developer community
Spatie
Gravity PDF is a GPLv2-licensed WordPress plugin that allows you to automatically generate, email and download PDF documents using Gravity Forms.

Gravity PDF Gravity PDF is a GPLv2-licensed WordPress plugin that allows you to automatically generate, email and download PDF documents using the pop

Gravity PDF 90 Nov 14, 2022
Magento 2 Invoice PDF Generator - helps you to customize the pdf templates for Magento 2

Magento 2 Invoice PDF Generator - helps you to customize the pdf templates for Magento 2. If you have an enabled template and a default template for the store you need your template the system will print the pdf template.

EAdesign 64 Oct 18, 2021
Convert HTML to PDF using Webkit (QtWebKit)

wkhtmltopdf and wkhtmltoimage wkhtmltopdf and wkhtmltoimage are command line tools to render HTML into PDF and various image formats using the QT Webk

wkhtmltopdf 13k Jan 4, 2023
HTML to PDF converter for PHP

Dompdf Dompdf is an HTML to PDF converter At its heart, dompdf is (mostly) a CSS 2.1 compliant HTML layout and rendering engine written in PHP. It is

null 9.3k Jan 1, 2023
PHP library generating PDF files from UTF-8 encoded HTML

mPDF is a PHP library which generates PDF files from UTF-8 encoded HTML. It is based on FPDF and HTML2FPDF (see CREDITS), with a number of enhancement

null 3.8k Jan 2, 2023
PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. Wrapper for wkhtmltopdf/wkhtmltoimage

Snappy Snappy is a PHP library allowing thumbnail, snapshot or PDF generation from a url or a html page. It uses the excellent webkit-based wkhtmltopd

KNP Labs 4.1k Dec 30, 2022
Official clone of PHP library to generate PDF documents and barcodes

TCPDF PHP PDF Library Please consider supporting this project by making a donation via PayPal category Library author Nicola Asuni [email protected] co

Tecnick.com LTD 3.6k Jan 6, 2023
TCPDF - PHP PDF Library - https://tcpdf.org

tc-lib-pdf PHP PDF Library UNDER DEVELOPMENT (NOT READY) UPDATE: CURRENTLY ALL THE DEPENDENCY LIBRARIES ARE ALMOST COMPLETE BUT THE CORE LIBRARY STILL

Tecnick.com LTD 1.3k Dec 30, 2022
Pdf and graphic files generator library written in php

Information Examples Sample documents are in the "examples" directory. "index.php" file is the web interface to browse examples, "cli.php" is a consol

Piotr Śliwa 335 Nov 26, 2022
Convert html to an image, pdf or string

Convert a webpage to an image or pdf using headless Chrome The package can convert a webpage to an image or pdf. The conversion is done behind the sce

Spatie 4.1k Jan 1, 2023
A PHP tool that helps you write eBooks in markdown and convert to PDF.

Artwork by Eric L. Barnes and Caneco from Laravel News ❤️ . This PHP tool helps you write eBooks in markdown. Run ibis build and an eBook will be gene

Mohamed Said 1.6k Jan 2, 2023
Laravel Snappy PDF

Snappy PDF/Image Wrapper for Laravel 5 and Lumen 5.1 This package is a ServiceProvider for Snappy: https://github.com/KnpLabs/snappy. Wkhtmltopdf Inst

Barry vd. Heuvel 2.3k Jan 2, 2023
Sign PDF files with valid x509 certificate

Sign PDF files with valid x509 certificate Require this package in your composer.json and update composer. This will download the package and the depe

Lucas Nepomuceno 175 Jan 2, 2023
Generate simple PDF invoices with PHP

InvoiScript Generate simple PDF invoices with PHP. Installation Run: composer require mzur/invoiscript Usage Example use Mzur\InvoiScript\Invoice; re

Martin Zurowietz 16 Aug 24, 2022
Convert a pdf to an image

Convert a pdf to an image This package provides an easy to work with class to convert PDF's to images. Spatie is a webdesign agency in Antwerp, Belgiu

Spatie 1.1k Dec 29, 2022
PHP library allowing PDF generation or snapshot from an URL or an HTML page. Wrapper for Kozea/WeasyPrint

PhpWeasyPrint PhpWeasyPrint is a PHP library allowing PDF generation from an URL or an HTML page. It's a wrapper for WeasyPrint, a smart solution help

Pontedilana 23 Oct 28, 2022
Generate pdf file with printable labels

printable_labels_pdf Generate pdf file with printable labels with PHP code. CREATE A PDF FILE WITH LABELS EASELY: You can get a pdf file with labels f

Rafael Martin Soto 5 Sep 22, 2022
A Laravel package for creating PDF files using LaTeX

LaraTeX A laravel package to generate PDFs using LaTeX · Report Bug · Request Feature For better visualization you can find a small Demo and the HTML

Ismael Wismann 67 Dec 28, 2022
Generate PDF invoices for your customers in laravel

What is Invoices? Invoices is a Laravel library that generates a PDF invoice for your customers. The PDF can be either downloaded or streamed in the b

Erik C. Forés 399 Jan 2, 2023