PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language

Last update: Jun 22, 2022

php-text-analysis

PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. There are tools in this library that can perform:

  • document classification
  • sentiment analysis
  • document comparison
  • frequency analysis
  • tokenization
  • stemming
  • collocations with Pointwise Mutual Information
  • lexical diversity
  • corpus analysis
  • text summarization

All the documentation for this project can be found in the book and wiki.

PHP Text Analysis Book & Wiki

A book is in the works and your contributions are needed. You can find the book at https://github.com/yooper/php-text-analysis-book

Documentation for the library also resides in the wiki: https://github.com/yooper/php-text-analysis/wiki

Installation Instructions

Add PHP Text Analysis to your project

composer require yooper/php-text-analysis

Tokenization

$tokens = tokenize($text);

To use a different tokenizer, pass the name of the tokenizer class as the second argument.

$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);

The default tokenizer is \TextAnalysis\Tokenizers\GeneralTokenizer::class. Some tokenizers require parameters to be set upon instantiation.
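
To make the idea of tokenization concrete, here is a minimal plain-PHP sketch. It is not the implementation of GeneralTokenizer or any other tokenizer in this library; the function name simple_tokenize and the regular expression are assumptions chosen for illustration. It splits text into word tokens and standalone punctuation tokens.

```php
<?php
// Hypothetical helper, not part of php-text-analysis.
// Words (letters, digits, apostrophes) become one token each;
// every other non-whitespace character becomes its own token.
function simple_tokenize(string $text): array
{
    preg_match_all("/[\p{L}\p{N}']+|[^\s\p{L}\p{N}']/u", $text, $matches);
    return $matches[0];
}

print_r(simple_tokenize("Don't panic, it works."));
```

A different splitting rule, such as whitespace only, would produce different tokens for the same text, which is why the choice of tokenizer class matters.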

Normalization

By default, normalize_tokens uses the function strtolower to lowercase all the tokens. To customize the normalize function, pass in either a function or a string to be used by array_map.

$normalizedTokens = normalize_tokens($tokens);
$normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');

$normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
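
Since the normalizer is described as something handed to array_map, the behaviour can be sketched in plain PHP. The helper apply_normalizer below is hypothetical and only mirrors that description; it is not the library's normalize_tokens.

```php
<?php
// Hypothetical helper (not the library's normalize_tokens): maps a callable
// over every token, defaulting to strtolower.
function apply_normalizer(array $tokens, $normalizer = 'strtolower'): array
{
    return array_map($normalizer, $tokens);
}

$tokens = ['The', 'QUICK', 'Fox'];

// A function name passed as a string.
$lower = apply_normalizer($tokens);
$upper = apply_normalizer($tokens, 'strtoupper');

// A closure works as well, e.g. lowercasing and stripping trailing periods.
$clean = apply_normalizer(['End.', 'Stop'], function ($token) {
    return rtrim(strtolower($token), '.');
});

print_r([$lower, $upper, $clean]);
```

For text containing multibyte characters, a multibyte-aware callable such as 'mb_strtolower' is the safer choice, because strtolower does not convert characters like É.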

Frequency Distributions

The call to freq_dist returns a FreqDist instance.

$freqDist = freq_dist(tokenize($text));
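
Conceptually, a frequency distribution is a count of how often each token occurs. The plain-PHP sketch below shows that idea with built-in array functions; count_frequencies is a hypothetical name, and the sketch does not reproduce the FreqDist class or its methods.

```php
<?php
// Hypothetical helper: counts token occurrences, most frequent first.
function count_frequencies(array $tokens): array
{
    $counts = array_count_values($tokens);
    arsort($counts); // sort by count, descending, keeping token keys
    return $counts;
}

$tokens = ['the', 'cat', 'saw', 'the', 'dog', 'the', 'cat'];
$counts = count_frequencies($tokens);

// Relative frequency of each token (its share of all tokens).
$total = array_sum($counts);
$weights = array_map(function ($count) use ($total) {
    return $count / $total;
}, $counts);

print_r($counts);
print_r($weights);
```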

Ngram Generation

By default bigrams are generated.

$bigrams = ngrams($tokens);

Customize the ngrams

// create trigrams with a pipe delimiter in between each word
$trigrams = ngrams($tokens, 3, '|');
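
An ngram is a sliding window of n consecutive tokens joined by a delimiter. The following plain-PHP sketch shows the technique; build_ngrams is a hypothetical function, and its default delimiter (a single space) is an assumption rather than a statement about the library's ngrams helper.

```php
<?php
// Hypothetical helper: joins every run of $n consecutive tokens.
function build_ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    $tokens = array_values($tokens); // ensure 0-based consecutive keys
    $ngrams = [];
    $lastStart = count($tokens) - $n;
    for ($i = 0; $i <= $lastStart; $i++) {
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }
    return $ngrams; // empty when there are fewer than $n tokens
}

$tokens = ['the', 'quick', 'brown', 'fox'];

$bigrams = build_ngrams($tokens);
$trigrams = build_ngrams($tokens, 3, '|');

print_r($bigrams);
print_r($trigrams);
```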

Stemming

By default, the stem function uses the Porter stemmer.

$stemmedTokens = stem($tokens);

To use a different stemmer, pass the name of the stemmer class as the second argument.

$stemmedTokens = stem($tokens, \TextAnalysis\Stemmers\MorphStemmer::class);

Keyword Extraction with Rake

There is a shortcut function for using the Rake algorithm. You will need to clean your data before using it. The second parameter is the ngram size of the keywords to extract.

$rake = rake($tokens, 3);
$results = $rake->getKeywordScores();

Sentiment Analysis with Vader

Need sentiment analysis with PHP? Use Vader (https://github.com/cjhutto/vaderSentiment). The PHP implementation can be invoked easily. Just normalize your data beforehand.

$sentimentScores = vader($tokens);

Document Classification with Naive Bayes

Need to do some document classification with PHP? Try the Naive Bayes implementation. An example of classifying movie reviews can be found in the unit tests.

$nb = naive_bayes();
$nb->train('mexican', tokenize('taco nacho enchilada burrito'));        
$nb->train('american', tokenize('hamburger burger fries pop'));  
$nb->predict(tokenize('my favorite food is a burrito'));
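
To show what a Naive Bayes text classifier computes, here is a small self-contained sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The class TinyNaiveBayes is hypothetical and is not the library's implementation; in particular, the shape of its predict() return value (label => log score, best first) is an assumption made for this example.

```php
<?php
// Hypothetical illustration of multinomial Naive Bayes, not library code.
final class TinyNaiveBayes
{
    private $docCounts = [];   // label => number of training documents
    private $tokenCounts = []; // label => [token => count]
    private $vocabulary = [];  // token => true

    public function train(string $label, array $tokens): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($tokens as $token) {
            $this->tokenCounts[$label][$token] =
                ($this->tokenCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    // Returns label => log score, highest (most likely) first.
    public function predict(array $tokens): array
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $scores = [];
        foreach ($this->docCounts as $label => $docCount) {
            $labelTokens = $this->tokenCounts[$label] ?? [];
            $labelTotal = array_sum($labelTokens);
            $score = log($docCount / $totalDocs); // class prior
            foreach ($tokens as $token) {
                if (!isset($this->vocabulary[$token])) {
                    continue; // words never seen in training carry no evidence
                }
                $count = $labelTokens[$token] ?? 0;
                // Add-one smoothing avoids log(0) for words unseen in this class.
                $score += log(($count + 1) / ($labelTotal + $vocabSize));
            }
            $scores[$label] = $score;
        }
        arsort($scores);
        return $scores;
    }
}

$nb = new TinyNaiveBayes();
$nb->train('mexican', explode(' ', 'taco nacho enchilada burrito'));
$nb->train('american', explode(' ', 'hamburger burger fries pop'));
$scores = $nb->predict(explode(' ', 'my favorite food is a burrito'));

print_r($scores);
```

In this sketch only "burrito" appears in the training vocabulary, so it alone decides the outcome and the 'mexican' label scores highest.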

GitHub

https://github.com/yooper/php-text-analysis
Comments
  • 1. issue with laravel composer

    I was checking if I could use this within my own project, but when trying to make a proof of concept it seemed to be unable to work with a clean laravel installation.

    macbookpro$ composer require yooper/php-text-analysis
    Using version ^1.3 for yooper/php-text-analysis
    ./composer.json has been updated
    Loading composer repositories with package information
    Updating dependencies (including require-dev)
    Your requirements could not be resolved to an installable set of packages.
    
      Problem 1
        - Installation request for yooper/php-text-analysis ^1.3 -> satisfiable by yooper/php-text-analysis[1.3].
        - Conclusion: remove symfony/console v4.0.4
        - Conclusion: don't install symfony/console v4.0.4
    

    Would it be possible to allow symfony/console?

    Reviewed by jesseyofferenvan at 2018-02-15 10:16
  • 2. How can I use the TF-IDF?

    Hi, I was experimenting around and found that this library has a TFIDF implementation. Can someone show me an example to get this to work?

    What should I put for the DocumentAbstract $document and the $token? And how can I see the result?

    Thanks.

    Reviewed by nafre at 2020-02-07 08:18
  • 3. Poor Vader Sentiment Accuracy. Lots of influential words missing from the vader_lexicon.txt

    So, I tried running this implementation of the Vader algorithm on this dataset: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

    Everything I do is: vader(normalize_tokens(tokenize('and . ' . $sample[0]))) (adding 'and . ' as a dummy first word as a workaround for a bug in the library)

    Here are the results:

    [
    "vader" => array:3 [
        "amazon_cells_labelled.txt" => array:9 [
          "positive" => 500
          "negative" => 500
          "matched-positive" => 367
          "failed-positive" => 133
          "matched-negative" => 223
          "failed-negative" => 277
          "matched-neutral" => 320
          "matched-%-positive" => 73.4
          "matched-%-negative" => 44.6
        ]
        "imdb_labelled.txt" => array:9 [
          "positive" => 500
          "negative" => 500
          "matched-positive" => 364
          "failed-positive" => 136
          "matched-negative" => 233
          "failed-negative" => 267
          "matched-neutral" => 261
          "matched-%-positive" => 72.8
          "matched-%-negative" => 46.6
        ]
        "yelp_labelled.txt" => array:9 [
          "positive" => 500
          "negative" => 500
          "matched-positive" => 358
          "failed-positive" => 142
          "matched-negative" => 178
          "failed-negative" => 322
          "matched-neutral" => 350
          "matched-%-positive" => 71.6
          "matched-%-negative" => 35.6
        ]
    ]
    

    I read how the algorithm works and I liked its simplicity.

    However, the accuracy in the example above seems to be extremely poor, mainly because of the lean lexicon.

    Are there fuller lexicons for the Vader algorithm? What else can I do to improve accuracy? As you can see, the accuracy when classifying negative sentences is beyond tragic.

    Reviewed by bdteo at 2018-10-30 16:51
  • 4. how can I use this code for finding text similarity?

    Hi, I am looking for a piece of code that simply finds the similarity between two comments. Each comment has 100-300 words. How can I use this code for cosine similarity or any other method for finding text similarity? My texts are in Persian; does that matter?

    thank you.

    Reviewed by mrmrn at 2017-10-06 17:15
  • 5. PHP 7.4 compatibility

    Is there any way that this package could be updated to require wamania/php-stemmer ~2 (instead of ~1) for php 7.4 compatibility? I can PR this if you'd like.

    Reviewed by tabennett at 2020-07-27 23:44
  • 6. Entity Extraction returns empty array

    Hey,

    I've started working with your wrapper for the "Stanford Named Entity Extraction", but all I get returned is an empty array. There are also no error messages.

    This is my Code:

                use TextAnalysis\Taggers\StanfordNerTagger;
                use TextAnalysis\Tokenizers\WhitespaceTokenizer;
    
                $jarpath = "[HIDDEN]/stanford-yooper/stanford-ner.jar";
                $classifierPath = "[HIDDEN]/stanford-yooper/classifiers/english.all.3class.distsim.crf.ser.gz";
         
                $engText = "Marquette County is a county located in the Upper Peninsula of the US state of Michigan. As of the 2010 census, the population was 67,077.";
                
                $document = new TokensDocument((new WhitespaceTokenizer())->tokenize($engText));
                $tagger = new StanfordNerTagger($jarpath,$classifierPath);
                $output = $tagger->tag($document->getDocumentData());
                var_dump($output); //empty Array
    
    
    Reviewed by Zera97 at 2019-11-05 09:57
  • 7. question: can I find a signature from text by this code?

    I have some texts from some authors. Each one has its own signature or link in the text.

    For example author1:
    text1:

    sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd

    @jhsad.sadas.com sdsdADSA sada

    text2:

    KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
    hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf

    text3:

    jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
    @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl

    How can I find @jhsad.sadas.com in the text?

    EDIT:
    @jhsad.sadas.com is an example signature. I don't know what the real signatures of the authors might be! They also have no fixed format: it can be @jhsad.sadas.com, or "visit my blog in fsfsd.sfsf.dfssd", or... What I have is some text from each author, and I know there is a unique signature from that author in their texts.

    IDEA: I think that by converting words to vectors and finding the similarity between the texts, we can use cosine similarity to find the signatures. I think the solution must be something like this.

    Reviewed by mrmrn at 2017-10-13 12:02
  • 8. I use your examples but it does not work

    Hi. I was looking for text mining code in Python and I saw this awesome PHP code. I installed the packages on Ubuntu 16.04.3: sudo apt-get install libpspell-dev php7.0-pspell aspell-en php7.0-enchant, then I ran composer install. After it finished I went to the tests folder:

    [email protected] ~/c/p/tests> php TestBaseCase.php 
    PHP Fatal error:  Class 'PHPUnit_Framework_TestCase' not found in /home/gn/code/php-text-analysis/tests/TestBaseCase.php on line 13
    

    I also used the tokenizer as below:

    <?php
    use TextAnalysis\Tokenizers\GeneralTokenizer;
    
    
            $tokenizer = new GeneralTokenizer();
            $text1 = $tokenizer->tokenize('hi, how are you');
            $text2 = $tokenizer->tokenize('hello, thank you')  ;
    

    and it returned:

    [email protected] ~/c/p/tests> php similarity.php 
    PHP Fatal error:  Uncaught Error: Class 'TextAnalysis\Tokenizers\GeneralTokenizer' not found in /home/gn/code/php-text-analysis/tests/similarity.php:6
    Stack trace:
    #0 {main}
      thrown in /home/gn/code/php-text-analysis/tests/similarity.php on line 6
    

    I also went to the src folder and created a similarity.php file:

    <?php
    
            require_once 'Tokenizers/GeneralTokenizer.php'; 
            $tokenizer = new \Tokenizers\GeneralTokenizer();
            $text1 = $tokenizer->tokenize('hi, how are you');
            $text2 = $tokenizer->tokenize('hello, thank you')  ;
    
    

    and it gave me this error: PHP Fatal error: Class 'TextAnalysis\Tokenizers\TokenizerAbstract' not found in /home/gn/code/php-text-analysis/src/Tokenizers/GeneralTokenizer.php on line 11

    What am I doing wrong, and how can I use the code correctly? Thanks.

    Reviewed by marn65 at 2017-10-08 01:00
  • 9. Composer problem OSX

    Hi Yooper. I tried to install the library on OSX but I got this error:

    composer require yooper/php-text-analysis
    Using version ^1.0 for yooper/php-text-analysis
    ./composer.json has been updated
    Loading composer repositories with package information
    Updating dependencies (including require-dev)
    Your requirements could not be resolved to an installable set of packages.

      Problem 1
        - Installation request for yooper/php-text-analysis ^1.0 -> satisfiable by yooper/php-text-analysis[v1.0].
        - yooper/php-text-analysis v1.0 requires yooper/stop-words dev-master -> satisfiable by yooper/stop-words[dev-master] but these conflict with your requirements or minimum-stability.

    Installation failed, reverting ./composer.json to its original content.

    --bd

    Reviewed by bluedogmilan at 2016-12-01 08:20
  • 10. Using the tokenizer

    Hi Dan I'm wondering how to implement the use of the tokenizer from this toolset. You suggest: $tokenizer = new GeneralTokenizer(); $tokens = $tokenizer->tokenize("some text") but how do I instantiate the toolset itself on the php page? I've tried various include/require lines, but none of them work. Thanks.

    Reviewed by donnekgit at 2016-11-05 15:27
  • 11. Trying to access array offset on value of type bool

    • PHP Version: 7.4.2

    • File: vendor/yooper/php-text-analysis/src/Sentiment/Vader.php, line 419

    • Code:

    foreach($rows as $row)
    {
        $this->lexicon[$row[0]] = $row[1];
    }
    
    • Problem: $row is not always an array

    • Temp Fix:

    foreach($rows as $row)
    {
        if(is_array($row)) {
          $this->lexicon[$row[0]] = $row[1];
        }
    }
    
    Reviewed by repat at 2020-02-18 12:34
  • 12. Notice & Warning on lines 216, 217, 219 WordnetCorpus.php

    I am trying out your awesome library and I found notices & warnings on lines 216, 217, 219 of php-text-analysis/src/corpus/WordnetCorpus.php

    it happens when you call stem() with MorphStemmer class with wordnet corpus: $stemmedTokens = stem($top_keywords, \TextAnalysis\Stemmers\MorphStemmer::class);

    Reviewed by mehroz1 at 2021-10-05 09:17
  • 13. Entity Text Parser

    I'm seeing this aspect of the package as being rather weak. Specifically, I'd like to be able to parse nouns/noun phrases and have a better categorization of them, similar to here: https://github.com/web64/laravel-nlp#summarization

    $entities = NLP::corenlp_entities($text);
    /*
    array:6 [
      "PERSON" => array:3 [
        0 => "John C. Breckinridge"
        1 => "James Buchanan"
      ]
      "STATE_OR_PROVINCE" => array:1 [
        0 => "Kentucky"
      ]
      "COUNTRY" => array:1 [
        0 => "United States"
      ]
      "ORGANIZATION" => array:1 [
        0 => "Confederate States of America"
      ]
      "DATE" => array:1 [
        0 => "1857"
      ]
      "TITLE" => array:1 [
        0 => "vice president"
      ]
    ]
    */

    Unfortunately, that package requires a Linux/Ubuntu platform, and I'm working on Windows. Is there a download or a different method for doing this here?

    Reviewed by jmichaelterenin at 2020-02-07 07:08