A high-level machine learning and deep learning library for the PHP language.

Overview

Rubix ML



  • Developer-friendly API is delightful to use
  • 40+ supervised and unsupervised learning algorithms
  • Support for ETL, preprocessing, and cross-validation
  • Open source and free to use commercially

Installation

Install Rubix ML into your project using Composer:

$ composer require rubix/ml

Requirements

  • PHP 7.4 or above

Recommended

Optional

Documentation

Read the latest docs here.

What is Rubix ML?

Rubix ML is a free, open-source machine learning (ML) library that allows you to build programs that learn from your data using the PHP language. We provide tools for the entire machine learning life cycle, from ETL to training, cross-validation, and production, with over 40 supervised and unsupervised learning algorithms. In addition, we provide tutorials and other educational content to help you get started using ML in your projects.

Getting Started

If you are new to machine learning, we recommend taking a look at the What is Machine Learning? section to get started. If you are already familiar with basic ML concepts, you can browse the basic introduction for a brief look at a typical Rubix ML project. From there, you can browse the official tutorials below which range from beginner to advanced skill level.

Tutorials & Example Projects

Check out these example projects using the Rubix ML library. Many come with instructions and a pre-cleaned dataset.

Interact With The Community

Funding

Rubix ML is funded by donations from the community. You can become a sponsor by making a contribution to one of our funding sources below.

Contributing

See CONTRIBUTING.md for guidelines.

License

The code is licensed MIT and the documentation is licensed CC BY-NC 4.0.

Comments
  • Handle multibyte string in text normalizer

    Handle multibyte string in text normalizer

    Problem

    TextNormalizer is unable to handle strings containing accents; characters like "é" or "è" are replaced by ?.

    Input: "Depuis qu’il avait emménagé à côté de chez elle, il y a de ça cinq ans."
    Output: "depuis qu’il avait emm?nag? ? c?t? de chez elle, il y a de ?a cinq ans."

    Fix

    Use mb_strtolower instead of strtolower.
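
    The fix can be illustrated with plain PHP (a minimal sketch, independent of the library; requires the mbstring extension):

```php
<?php

// strtolower() only lowercases the ASCII range, so multibyte characters
// such as "É" pass through unchanged (or end up mangled, as in the
// report above), whereas mb_strtolower() is multibyte-aware.

var_dump(strtolower('CÔTÉ'));              // accented characters are not lowered
var_dump(mb_strtolower('CÔTÉ', 'UTF-8'));  // "côté"

// Inside a text normalizer, the multibyte-safe call would be:
$normalized = mb_strtolower('Depuis qu’il avait emménagé à CÔTÉ de chez elle.', 'UTF-8');
var_dump($normalized);
```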

    opened by maximecolin 19
  • Undefined offset in metric class while training Multilayer Perceptron Classifier

    Undefined offset in metric class while training Multilayer Perceptron Classifier

    Describe the bug

    When attempting to train a Multilayer Perceptron classifier, I occasionally get the following type of exception. I have been able to replicate this with both the MCC and FBeta metrics. Unfortunately, this exception does not occur consistently, even with the same dataset.

    [2020-04-04 22:32:21] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
    [stacktrace]
    #0 /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php(107): Illuminate\\Foundation\\Bootstrap\\HandleExceptions->handleError()
    #1 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(414): Rubix\\ML\\CrossValidation\\Metrics\\MCC->score()
    #2 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(360): Rubix\\ML\\Classifiers\\MultilayerPerceptron->partial()
    #3 /[REDACTED]/vendor/rubix/ml/src/Pipeline.php(189): Rubix\\ML\\Classifiers\\MultilayerPerceptron->train()
    #4 /[REDACTED]/vendor/rubix/ml/src/PersistentModel.php(191): Rubix\\ML\\Pipeline->train()
    #5 /[REDACTED]/app/Console/Commands/TrainModel.php(89): Rubix\\ML\\PersistentModel->train()
    #6 [internal function]: App\\Console\\Commands\\TrainModel->handle()
    #7 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): call_user_func_array()
    #8 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()
    #9 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\\Container\\Util::unwrapIfClosure()
    #10 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\\Container\\BoundMethod::callBoundMethod()
    #11 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\\Container\\BoundMethod::call()
    #12 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\\Container\\Container->call()
    #13 /[REDACTED]/vendor/symfony/console/Command/Command.php(255): Illuminate\\Console\\Command->execute()
    #14 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Symfony\\Component\\Console\\Command\\Command->run()
    #15 /[REDACTED]/vendor/symfony/console/Application.php(912): Illuminate\\Console\\Command->run()
    #16 /[REDACTED]/vendor/symfony/console/Application.php(264): Symfony\\Component\\Console\\Application->doRunCommand()
    #17 /[REDACTED]/vendor/symfony/console/Application.php(140): Symfony\\Component\\Console\\Application->doRun()
    #18 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Symfony\\Component\\Console\\Application->run()
    #19 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(129): Illuminate\\Console\\Application->run()
    #20 /[REDACTED]/artisan(37): Illuminate\\Foundation\\Console\\Kernel->handle()
    #21 {main}
    "}
    

    To Reproduce

    The following code is capable of recreating this error occasionally.

    $estimator = new PersistentModel(
        new Pipeline(
            [
                new TextNormalizer(),
                new WordCountVectorizer(10000, 3, new NGram(1, 3)),
                new TfIdfTransformer(),
                new ZScaleStandardizer()
            ],
            new MultilayerPerceptron([
                new Dense(100),
                new PReLU(),
                new Dense(100),
                new PReLU(),
                new Dense(100),
                new PReLU(),
                new Dense(50),
                new PReLU(),
                new Dense(50),
                new PReLU(),
            ], 100, null, 1e-4, 1000, 1e-4, 10, 0.1, null, new MCC())
        ),
        new Filesystem($modelPath.'classifier.model')
    );
    
    $estimator->setLogger(new Screen('train-model'));
    
    $estimator->train($dataset);
    

    The labelled dataset used is a series of text files split into different directories that indicate their class names. This dataset is built using the following function.

        public static function buildLabeled(): Labeled
        {
            $samples = $labels = [];
    
            $directories = glob(storage_path('app/dataset/*'));
    
            foreach($directories as $directory) {
                foreach (glob($directory.'/*.txt') as $file) {
                    $text = file_get_contents($file);
                    $samples[] = [$text];
                    $labels[] = basename($directory);
                }
            }
    
            return Labeled::build($samples, $labels);
        }
    

    Expected behavior

    Training should complete without any errors within the metric class.

    bug 
    opened by DivineOmega 16
  • Potential bug when partially training a Gaussian Naive Bayes model

    Potential bug when partially training a Gaussian Naive Bayes model

    So I have the following code:

    $estimator = new PersistentModel(
        new Pipeline([
            new TextNormalizer(),
            new StopWordFilter(StopWords::$words),
            new WordCountVectorizer(10000, 1, new NGram(1, 2, new WordStemmer('romanian'))),
            new TfIdfTransformer(),
            new ZScaleStandardizer(),
        ], new GaussianNB()),
        new Filesystem($this->path.'/categorization.model', true)
    );
    
    $categories = Category::has('variants')->with('variants')->get();
    
    $samples = [];
    $labels = [];
    foreach ($categories as $category) {
        foreach ($category->variants as $variant) {
            $samples[] = [$variant->title];
            $labels[] = $category->slug;
        }
    }
    
    $foldsNo = 10;
    $dataset = new Labeled($samples, $labels);
    $folds = $dataset->fold($foldsNo);
    
    $estimator->train($folds[0]);
    for($i = 1; $i < $foldsNo; $i++) {
        $estimator->partial($folds[$i]);
    }
    
    $estimator->save();
    
    

    which tries to train a GaussianNB model on 26 thousand products and their associated labels, but for some reason I get the following error:

    
    ErrorException
    Undefined offset: 0
    

    in rubix\ml\src\Classifiers\GaussianNB.php:267, which corresponds to the following code: $means[$column] = (($n * $mean) + ($oldWeight * $oldMeans[$column])) / ($oldWeight + $n);. The problem seems to be $oldMeans[$column]: it tries to access an offset that doesn't exist, but I have no idea why it doesn't exist.

    I don't suspect that the training data has any errors, since fully training the same model in a single batch session with $estimator->train($dataset) works as expected.

    When I tried to debug the code, I found that lines 256, 257 and 259 have this code:

    $means = $oldMeans = $this->means[$class] ?? [];
    $variances = $oldVariances = $this->variances[$class] ?? [];
    
    $oldWeight = $this->weights[$class] ?? 0;
    
    

    which assigns an empty array [] if $this->means[$class] and $this->variances[$class] don't exist (and 0 for $this->weights[$class]). But if $oldMeans is an empty array, I think $oldMeans[$column] would throw that error. So a fix would be to use ($oldMeans[$column] ?? 0) in rubix\ml\src\Classifiers\GaussianNB.php:267 instead of $oldMeans[$column]. I am not 100% sure how that would affect the training of the model, though.
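
    The suggested guard can be sketched in isolation (a hypothetical standalone helper, not the library's actual code):

```php
<?php

// Weighted running-mean update with the proposed "?? 0" guard. When a
// class is seen for the first time, the old means array is empty and
// the old weight is 0, so the update reduces to the plain batch mean
// instead of raising an undefined offset error.

function updateMeans(array $oldMeans, int $oldWeight, array $batchMeans, int $n): array
{
    $means = [];

    foreach ($batchMeans as $column => $mean) {
        $means[$column] = (($n * $mean) + ($oldWeight * ($oldMeans[$column] ?? 0.0)))
            / ($oldWeight + $n);
    }

    return $means;
}

// First partial batch: no previous statistics for this class.
var_dump(updateMeans([], 0, [2.0, 4.0], 10));          // returns [2.0, 4.0]

// Second batch: blend old and new means by their sample counts.
var_dump(updateMeans([2.0, 4.0], 10, [4.0, 8.0], 10)); // returns [3.0, 6.0]
```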

    I hope that I am not mistaken.

    Thank you!

    bug 
    opened by cioroianudenis 15
  • Better visualization of decision trees (fixes #136)

    Better visualization of decision trees (fixes #136)

    Following the discussion on #136, this change builds on the rules() method from a51d3f197602d624ffdcbc227c2569a0cd1d64f5 to produce a tree description for the graphviz package.

    The generated string can be sent through the dot command to produce an image.

    A test case is added to ClassificationTreeTest.php.

    Running this on the divorce dataset produces output like:

    digraph Tree {
    node [shape=box, fontname=helvetica];
    edge [fontname=helvetica];
    N0 [label="Column_20 < 1"];
    N1 [label="Column_52 < 0"];
    N2 [label="Column_7 < 0"];
    N3 [label="Outcome=divorced",style="rounded,filled",fillcolor=gray];
    N2 -> N3;
    N4 [label="Outcome=divorced",style="rounded,filled",fillcolor=gray];
    N2 -> N4;
    N1 -> N2;
    ...
    

    Running dot -Tpng divorce.dot > divorce.png on that output renders the tree as an image (attached to the PR).

    Next steps: add a rulesGraphical() method that produces the image using an extra dependency. (It might be overkill to add a dependency on graphviz for every deployment of RubixML.)

    opened by DrDub 14
  • Added k-skip-n-gram word tokenizer

    Added k-skip-n-gram word tokenizer

    Hello @andrewdalpino, I've created a new branch, skip-gram-0.3.0, inheriting 0.3.0 as its parent. If everything is OK, I will close my first PR. Could you please update this PR with the correct CLA signature, allowing me to pass the CLA assistant?

    opened by absolutic 14
  • Old obsolete info

    Old obsolete info

    As a way to start learning machine learning with PHP using this library, I want to create a simple project that uses old lotto numbers to predict whether a randomly generated number sequence can contain winning numbers, and the probability that the sequence will be drawn. For now my dataset is composed of the last year's winning number series, but I'm planning to expand it to cover the last three years of draws. My question is how to apply the library's features: which feature of this library fits this problem best, and how do I train the model correctly? The dataset is unlabeled, because there is no label that can classify the number series; I only have the winning numbers. I'm reading the documentation and have read some of the tutorials, but some help on how to start with this problem would be appreciated.

    question outdated 
    opened by realrecordzLab 14
  • [RFC] Data Persistence in Rubix ML

    [RFC] Data Persistence in Rubix ML

    ✏️[RFC] Data Persistence in Rubix ML
    👨‍💻 Chris Simpson
    🗓 September - October 2020

    BACKGROUND:

    It is very common for a user of the RubixML library to load or save data.

    Within RubixML, most actions relating to data persistence are achieved via a Persister, either directly or via the PersistentModel class. The purpose of a Persister is to save a new Persistable object or to load a previously saved one.

    Whilst a Persister handles the persistence of objects (recursively [de]serializing the objects and their internal dependencies), the more general action of reading or writing data also appears in numerous other places within the library, e.g. the Encoding object has a write() method used to write data to the local filesystem, whilst the purpose of Extractors is to read data and transform it into Dataset objects.

    This contribution aims to introduce abstractions for data IO suitable for use across the entire codebase. It standardizes the mechanism of how data is persisted (saved/loaded) within RubixML.

    DESIGN GOALS:

    • Remove any tight-coupling to the local file system: Introduce a new abstraction modelling where data can be stored (local filesystem, remote/cloud storage, database etc). The default location can remain as the local filesystem, but alternative implementations should be used if provided.
    • Split the responsibility of reading and writing data. By design Extractors should be read-only, whilst Encoding objects need only have awareness of how to write the data that they contain.
    • Retain the simplicity of the Persister interface, whilst introducing lower-level abstractions to power all current and future data storage operations.
    • Allow for the incremental and memory-efficient reading and writing of data via the use of streams and generators wherever possible/practical.

    APPROACH:

    This PR introduces the Rubix\ML\Storage namespace. This namespace contains abstractions relating to the reading and writing of data: Datastore defines the behaviour of a generic data-storage repository. It gains the ability to perform read-related operations from the Rubix\ML\Storage\Reader interface, and the ability to perform write-related operations from the Rubix\ML\Storage\Writer interface. A Datastore fully implements both of these interfaces. This separation allows Reader to be type-hinted where read-only behaviour is intended, Writer to be used where a class only needs to write data, and Datastore, which encompasses both Reader and Writer, to be hinted by classes that require both mechanisms (e.g. Persister implementations).

    Reader:
    <?php
    
    namespace Rubix\ML\Storage;
    
    use Rubix\ML\Storage\Streams\Stream;
    
    /**
     * Reader
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    interface Reader
    {
        /**
         * Return if the target exists at $location.
         *
         * @param string $location
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         * @return bool
         */
        public function exists(string $location) : bool;
    
        /**
         * Open a stream of the target data at $location.
         *
         * @param string $location
         * @param string $mode
         * @throws \Rubix\ML\Storage\Exceptions\ReadError
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         * @return \Rubix\ML\Storage\Streams\Stream
         */
        public function read(string $location, string $mode = Stream::READ_ONLY) : Stream;
    }
    
    
    Writer:
    <?php
    
    namespace Rubix\ML\Storage;
    
    /**
     * Writer
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    interface Writer
    {
        /**
         * Write.
         *
         * @param string $location
         * @param mixed $data
         * @throws \Rubix\ML\Storage\Exceptions\WriteError
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         */
        public function write(string $location, $data) : void;
    
        /**
         * Move.
         *
         * NOTE: If supported by the underlying datastore this should be implemented as an atomic operation.
         *
         * @param string $from
         * @param string $to
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         */
        public function move(string $from, string $to) : void;
    
        /**
         * Delete.
         *
         * @param string $location
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         */
        public function delete(string $location) : void;
    }
    
    
    Datastore:
    <?php
    
    namespace Rubix\ML\Storage;
    
    use Stringable;
    
    /**
     * Datastore.
     *
     * Defines the behaviour of a generic storage repository (filesystem, database etc)
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    interface Datastore extends Reader, Writer, Stringable
    {
        //
    }
    
    
    Stream:

    Reader::read() returns an object implementing the Stream interface. This interface acts as an OO wrapper around PHP streams. This functionality is explicitly wrapped to enable typehinting (resource is not a valid language-level typehint) and to keep stream interaction consistent and non-repetitive. Stream represents a generic stream of data and implements all common stream operations via methods named after their common semantics.

    The Stream interface enables data to be read incrementally (in a 'cursorable' manner) from the underlying resource, and also implements IteratorAggregate, allowing the object to be iterated directly (yielding one line of content per iteration). Note: to read the entire contents of the stream in a single operation, see Stream::contents().
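
    The cursorable idea can be sketched in plain PHP with a generator (an illustration only, not the PR's actual Stream implementation):

```php
<?php

// Yield one line per iteration from an open resource so that large
// files never have to be loaded into memory in a single operation.

function lines($resource): \Generator
{
    while (($line = fgets($resource)) !== false) {
        yield rtrim($line, "\n");
    }
}

// Demonstrate with an in-memory stream.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "alpha\nbeta\ngamma\n");
rewind($handle);

foreach (lines($handle) as $line) {
    echo $line, PHP_EOL;
}

fclose($handle);
```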

    INTEGRATION:

    To illustrate the implementation and integration of these new objects I have provided an example: Rubix\ML\Storage\LocalFilesystem. This is the simplest of implementations and can be used to replace any existing interaction with data stored on the local filesystem for example Extractor implementations, Encoding objects, and the widely used Filesystem Persistor class.

    Rubix\ML\Storage\LocalFilesystem:
    <?php
    
    namespace Rubix\ML\Storage;
    
    use Rubix\ML\Storage\Streams\File;
    use Rubix\ML\Storage\Streams\Stream;
    use Rubix\ML\Storage\Exceptions\RuntimeException;
    
    /**
     * Local Datastore.
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    class LocalFilesystem implements Datastore
    {
        /**
         * @see \Rubix\ML\Storage\Reader::exists()
         * @inheritdoc
         *
         * @param string $location
         *
         * @return bool
         */
        public function exists(string $location) : bool
        {
            return file_exists($location);
        }
    
        /**
         * @see \Rubix\ML\Storage\Reader::read()
         * @inheritdoc
         *
         * @param string $location
         *
         * @return \Rubix\ML\Storage\Streams\Stream
         */
        public function read(string $location, string $mode = Stream::READ_ONLY) : Stream
        {
            $resource = fopen($location, $mode);
    
            if (!$resource) {
                throw new RuntimeException("Could not open $location.");
            }
    
            $stream = new File($resource);
    
            if (!$stream->readable()) {
                throw new RuntimeException("Stream with mode {$mode} cannot be read from");
            }
    
            return $stream;
        }
    
        /**
         * @see \Rubix\ML\Storage\Writer::write()
         * @inheritdoc
         *
         * @param string $location
         * @param \Rubix\ML\Storage\Streams\Stream|string $data
         */
        public function write(string $location, $data) : void
        {
            if ($data instanceof Stream) {
                $data = $data->contents();
            }
    
            file_put_contents($location, $data, LOCK_EX);
        }
    
        /**
         * @see \Rubix\ML\Storage\Writer::delete()
         * @inheritdoc
         *
         * @param string $location
         */
        public function delete(string $location) : void
        {
            unlink($location);
        }
    
        /**
         * @see \Rubix\ML\Storage\Writer::move()
         * @inheritdoc
         *
         * @param string $from
         * @param string $to
         */
        public function move(string $from, string $to) : void
        {
            rename($from, $to);
        }
    
        /**
         * Return the string representation of the object.
         *
         * @return string
         */
        public function __toString() : string
        {
            return 'Local Filesystem';
        }
    }
    
    
    Encoding:

    The Encoding::write() method now accepts an optional second parameter of type Rubix\ML\Storage\Writer. Omitting this argument causes an instance of Rubix\ML\Storage\LocalFilesystem to be used, so existing behaviour is transparently maintained and there are no BC breaks.

    Extractors:

    By design, Extractor objects are read-only. All existing implementations now accept a Reader in their constructors, and the implementations have been updated to show how the Stream object returned by Reader::read() can be used in place of the existing semantics.

    Note: Any Reader implementation can be used: so Extractors are now decoupled from the local filesystem and able to read data from a multitude of storage backends.


    DISCUSSION:

    Whilst I have given this implementation considerable thought, and gone through numerous iterations prior to opening this PR, I still think there is room for significant discussion here. The specific implementations are intended as illustrative, and I'm most interested in feedback on the design itself. I have a number of open questions myself, but I think this is mature enough to invite external feedback and inspection.

    • Naming is hard. Whilst I've tried to be as concise as possible, I think there is room to improve the naming of these abstractions! Which is preferable: Datastore? Repository? StorageOperator, StorageEngine, etc.? Happy to take suggestions! I would have initially used Backend, but didn't want to cause any confusion with the Backend interface used for parallel processing.

    • Future proof? I've tried to make these abstractions generic and multi-purpose. If you can anticipate a future use-case that wouldn't be supported then let's discuss.

    • Backwards Compatibility: Have I broken anything? Should I have? Let's talk about the options here.

    • Unit Tests: Whilst all current unit tests are passing, I have purposefully neglected to add further tests prior to confirming the design/discussion/direction in this PR.

    • Exceptions: The exceptions were added prior to the merge of https://github.com/RubixML/RubixML/pull/116, so they are more a rudimentary demonstration of potential flexibility than an active recommendation of how we should structure an exception hierarchy.

    • Pull Request: At the time of writing I have based this PR against 0.3.0, as it contains the most up-to-date 'nightly' code. This is just to minimise the size of the diff.


    RELATED:

    • https://github.com/RubixML/RubixML/issues/108: How can the persistence subsystem be extended across the codebase? This was my starting point for this PR. I realized that Persisters are actually very good at what they do (serializing objects and storing them), and that backing them with common storage abstractions (without changing the Persister interface) would be a positive step forward.
    • An early proof-of-concept showing how storage could be made pluggable using Flysystem 1.x: https://github.com/RubixML/RubixML/pull/106.
    • A Persistor backed by Flysystem 2.x (beta): https://github.com/RubixML/Extras/pull/3

    REVISIONS:

    2020-10-05:
    • Initial publish and request for comments.
    opened by simplechris 12
  • Flysystem Persister

    Flysystem Persister

    SUMMARY

    This PR adds a Flysystem Persister. This enables a user to load and save models located in a remote storage backend (such as Amazon S3, Azure Blob Storage, Google Cloud Storage, Dropbox, etc.).

    Closes #104 .


    A proof-of-concept PR (#106) instigated some interesting discussion about how to extend the concept of persistence in RubixML. A wider discussion around the introduction of a flexible persistence subsystem is now ongoing. In the meantime, this Flysystem Persister will allow connectivity to a large array of remote storage solutions 🚀

    enhancement 
    opened by simplechris 11
  • Installation error: [InvalidArgumentException] could not find a version of package rubix/ml matching your minimum stability

    Installation error: [InvalidArgumentException] could not find a version of package rubix/ml matching your minimum stability

    Last time I tried setting up Rubix ML on macOS, I got this error about not matching my minimum stability. I was later able to install it after setting "minimum-stability": "dev" in composer.json.

    Now I'm trying to set up Rubix ML on a Windows machine and I get the same error again. So far I've tried composer require rubix/ml:dev-master and composer require rubix/ml:"*", and I'm still not able to install it successfully.

    I might be missing something or doing something wrong. Can someone help me on this?
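
    This error usually means Composer's stability constraints exclude every available version of the package. As an illustration (the keys below are standard Composer settings, not project-specific advice), the stability floor can be relaxed in composer.json:

```json
{
    "minimum-stability": "dev",
    "prefer-stable": true
}
```

    With prefer-stable set, Composer still picks tagged releases whenever they satisfy the constraints; alternatively, requiring an explicit tagged version constraint avoids relying on stability flags at all.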

    question 
    opened by najiagul 10
  • fix BallTree edge case

    fix BallTree edge case

    We had another infinite loop problem in BallTree. Whenever a dataset contained the same sample repeated more times than the max leaf size, BallTree::grow() tried to split the same subtree over and over: since all nodes had the same left and right centroid values, it kept all the samples in the right subtree and kept trying to split it. I added exception-handling code that terminates the process in that case. It can result in a leaf node containing more than the max leaf size, but I am not sure what else to do here: all the samples in that leaf are identical, so there is no criterion to split them further.

    Added a test to cover this use case. If you run the test without the fix in the BallTree class, you will see an infinite loop occur.

    Let me know if you see other solutions here or any problems with this one. Thanks!
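
    The termination guard described above can be sketched standalone (hypothetical names, not the library's code):

```php
<?php

// If every sample in a node is identical, no centroid-based split can
// separate them, so growing must terminate and the node is kept as a
// leaf even when it exceeds the maximum leaf size.

function allIdentical(array $samples): bool
{
    $first = reset($samples);

    foreach ($samples as $sample) {
        if ($sample !== $first) {
            return false;
        }
    }

    return true;
}

$node = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]];

if (allIdentical($node)) {
    echo 'terminate: keep node as a leaf', PHP_EOL;
}
```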

    opened by kroky 8
  • Validate the shape of the input vector during inference

    Validate the shape of the input vector during inference

    After training my model, I tried to predict a single sample. I tried:

    $prediction = $estimator->predictSample([0,0,1,0,1,0,1]);

    And it showed me the prediction. Then I realized I had forgotten to include the label in that line, so I added it, ending up with:

    $prediction = $estimator->predictSample([0,0,1,0,1,0,1,"sim"]);

    And it showed me the prediction again. Then I got confused: one of those two lines should not work, because I was passing in data with a different shape than the training data. Then I tried adding more 0 and 1 items to predictSample, and it always predicts. No error shown!

    This makes predictSample unreliable, since it should only accept the same number of columns as the original training dataset. Also, do I need to provide the label or not? Will it make any difference? I think the label should not be provided: since this is a prediction, not training, the label should have no effect on the output of the prediction!
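
    The requested validation could look something like this sketch (a hypothetical helper, not the library's API):

```php
<?php

// Reject a sample whose feature count differs from the number of
// columns seen during training. The label is never part of the input
// at inference time.

function assertSampleWidth(array $sample, int $expected): void
{
    if (count($sample) !== $expected) {
        throw new InvalidArgumentException(sprintf(
            'Sample has %d features, expected %d.',
            count($sample),
            $expected
        ));
    }
}

assertSampleWidth([0, 0, 1, 0, 1, 0, 1], 7); // ok, 7 features

try {
    assertSampleWidth([0, 0, 1, 0, 1, 0, 1, 'sim'], 7); // label included by mistake
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```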

    enhancement 
    opened by batata004 8
  • How to Train only One Class?

    How to Train only One Class?

    @andrewdalpino Thanks for your incredible creation. As a PHP lover, I have been expecting this type of project for a long time.

    Though I'm new to ML, I need to train on only one label. I looked at the CIFAR-10 and MNIST examples, but I could not manage it; they require multiple labels. How can I do it? For example, I need to detect only pigeons in an image.

    opened by takielias 2
  • Use new PHP 8.0 features in version 3.0

    Use new PHP 8.0 features in version 3.0

    Rubix ML 3.0 will bump up the minimum PHP version from 7.4 to 8.0. As a result, we can now start to use new language features in the next major release. A comprehensive list of new PHP 8.0 features can be found here https://www.php.net/releases/8.0/en.php.

    For example ...

    • Define union and mixed types
    • Nullsafe operator
    • Allow ::class on objects
    • Weak Maps
    • Replace switches with new match syntax
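
    A few of the listed features in action (illustrative snippets only; requires PHP 8.0+):

```php
<?php

// match: an expression with strict comparison and no fall-through,
// replacing a more verbose switch statement.
function kind(int|float $value): string // union type, new in 8.0
{
    return match (true) {
        is_int($value) => 'integer',
        is_float($value) => 'float',
    };
}

// Constructor property promotion plus a nullable type.
class Wrapper
{
    public function __construct(public ?string $name = null)
    {
    }
}

echo kind(3), PHP_EOL;   // integer
echo kind(3.5), PHP_EOL; // float

// Nullsafe operator: short-circuits to null instead of raising an error.
$wrapper = null;
var_dump($wrapper?->name); // NULL

// ::class is now allowed on object instances.
echo (new Wrapper('a'))::class, PHP_EOL; // Wrapper
```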
    enhancement 
    opened by andrewdalpino 0
  • Fixed Array Memory Optimizations

    Fixed Array Memory Optimizations

    SPL Fixed Arrays have much better memory efficiency than the standard PHP array; however, they are slower and have a more verbose API. This makes them suited to places where we need to store something that is not accessed often and where the size of the array is known a priori. They do not, however, work well as part of our public API, since we do not want to force our users to use them. This task is to search the entire Rubix ML codebase for these scenarios and replace the standard PHP array with the more memory-efficient SPL Fixed Array. Where appropriate, please provide a benchmark for the changes if one does not already exist.

    See https://www.php.net/manual/en/class.splfixedarray.php
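
    A minimal look at the SplFixedArray API (standard PHP, nothing library-specific):

```php
<?php

// SplFixedArray has a fixed, pre-declared size and integer-only keys,
// which is what enables its tighter memory layout compared to a
// standard PHP array.

$fixed = new SplFixedArray(3);

$fixed[0] = 1.5;
$fixed[1] = 2.5;
$fixed[2] = 3.5;

echo $fixed->getSize(), PHP_EOL;            // 3
echo array_sum($fixed->toArray()), PHP_EOL; // 7.5

// memory_get_usage() can be sampled before and after allocating each
// structure to benchmark the savings for a given element count.
```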


    optimization 
    opened by andrewdalpino 1
  • Prune redundant Decision Tree leaf nodes

    Prune redundant Decision Tree leaf nodes

    The current Decision Tree grow() implementation does not prune pure leaf nodes that have the same class outcome and both stem from a common ancestor node. Pruning would involve replacing the Split node with a pure leaf node. See image.


    The problematic logic can be found here https://github.com/RubixML/ML/blob/master/src/Graph/Trees/DecisionTree.php#L188

    Instead of terminating after the Split node is added to the stack, we could detect a pure split containing only one class and replace the Split node with a leaf node right away.

    Here is the test that generated this Graphviz visual (except the number of bins was set to 3 instead of 5) https://github.com/RubixML/ML/blob/master/tests/Classifiers/ClassificationTreeTest.php#L194.

    This should speed up inference by reducing the number of splits that need to be evaluated, as well as reduce the memory and storage cost of trained Decision Tree models. Affects Classification/Regression Trees, Extra Trees, Gradient Boost, Logit Boost, Random Forest, and AdaBoost.

    help wanted optimization 
    opened by andrewdalpino 0
  • Add "Model Explanation" section to the User Guide

    Add "Model Explanation" section to the User Guide

    With the addition of Decision Tree Graphviz visualizations and given that some Learners already implement the RanksFeatures interface which provides a method to output the importance scores of each feature in the training set, we could start to build out a separate section of the User Guide dedicated to model explainability.

    I think a good place to start would be an Introduction, a Feature Importances section, and a Decision Tree visualization section. We could move the Feature Importances section over from the Training page (https://github.com/RubixML/ML/blob/master/docs/training.md#feature-importances). We should also include an image (png) of an example Decision Tree graph.

    The page should be written in Markdown like the rest of the documentation; see https://github.com/RubixML/ML/tree/master/docs.

    enhancement 
    opened by andrewdalpino 0
Releases(2.3.0)
  • 2.3.0(Dec 31, 2022)

  • 2.2.2(Dec 6, 2022)

  • 1.3.5(Dec 6, 2022)

  • 2.2.1(Oct 15, 2022)

  • 0.4.3(Oct 6, 2022)

  • 2.2.0(Oct 1, 2022)

    • Added Image Rotator transformer
    • Added One Vs Rest ensemble classifier
    • Added variance and range to the Dataset describe() report
    • Added Gower distance kernel
    • Added types() method to Dataset
    • Concatenator now accepts an iterator of iterators
    Source code(tar.gz)
    Source code(zip)
  • 2.1.1(Sep 13, 2022)

  • 2.1.0(Jul 30, 2022)

    Big thanks to @torchello and @DrDub for their huge contributions to this release!

    • Added Probabilistic Metric interface
    • Added Probabilistic and Top K Accuracy
    • Added Brier Score Probabilistic Metric
    • Export Decision Tree-based models in Graphviz "dot" format
    • Added Graphviz helper class
    • Graph subsystem memory and storage optimizations

    Warning: This release contains changes to the Graph subsystem that break backward compatibility for all Decision Tree-based learners saved with a previous version. Classification Tree, Extra Tree Classifier, Random Forest, Logit Boost, AdaBoost, Regression Tree, Extra Tree Regressor, and Gradient Boost are all affected.

    Note: Moving forward, changes that break the backward compatibility of saved objects will only ship in a major release unless they are part of a bug fix. See https://docs.rubixml.com/2.0/model-persistence.html#caveats for an explanation of why backward compatibility is harder to maintain for saved objects than for the API.

  • 2.0.2(Jun 3, 2022)

  • 1.3.4(Jun 3, 2022)

  • 2.0.1(Apr 3, 2022)

  • 2.0.0(Mar 30, 2022)

    • Gradient Boost now uses gradient-based subsampling
    • Allow Token Hashing Vectorizer custom hash functions
    • Gradient Boost base estimator no longer configurable
    • Move dummy estimators to the Extras package
    • Increase default MLP window from 3 to 5
    • Decrease default Gradient Boost window from 10 to 5
    • Rename alpha regularization parameter to L2 penalty
    • Added RBX serializer class property type change detection
    • Rename boosting estimators param to epochs
    • Neural net-based learners can now train for 0 epochs
    • Rename Labeled stratify() to stratifyByLabel()
    • Added Sparse Cosine distance kernel
    • Cosine distance now optimized for dense and sparse vectors
    • Word Count Vectorizer now uses min count and max ratio DFs
    • Numeric String Converter now handles NAN and INFs
    • Numeric String Converter is now Reversible
    • Removed Numeric String Converter NAN_PLACEHOLDER constant
    • Added MurmurHash3 and FNV1a 32-bit hashing functions to Token Hashing Vectorizer
    • Changed Token Hashing Vectorizer max dimensions to 2,147,483,647
    • Increase SQL Table Extractor batch size from 100 to 256
    • Ranks Features interface no longer extends Stringable
    • Verbose Learners now log change in loss
    • Numerical instability logged as a warning instead of info
    • Added header() method to CSV and SQL Table Extractors
    • Argmax() now throws an exception when undefined
    • MLP Learners recover from numerical instability with a snapshot
    • Rename Gzip serializer to Gzip Native
    • Change RBX serializer constructor argument from base to level
    • Rename Writeable extractor interface to Exporter
  • 1.3.2(Feb 22, 2022)

  • 0.4.2(Feb 12, 2022)

  • 1.3.1(Dec 8, 2021)

  • 1.3.0(Dec 4, 2021)

    • Switch back to the original fork of Tensor
    • Added maxBins hyper-parameter to CART-based learners
    • Added stream Deduplicator extractor
    • Added the SiLU activation function
    • Added Swish activation layer
  • 1.2.3(Nov 10, 2021)

  • 1.2.2(Oct 31, 2021)

  • 1.2.1(Oct 11, 2021)

  • 1.2.0(Aug 1, 2021)

  • 1.1.3(Jul 14, 2021)

  • 1.1.2(Jul 6, 2021)

    • Improved random floating point number precision
    • Deduplicate Preset seeder centroids
    • Fix Gradient Boost learning rate upper bound
    • Fix Loda histogram edge alignment
  • 1.1.1(Jul 5, 2021)

  • 1.1.0(Jul 5, 2021)

    • Update to Scienide Tensor 3.0
    • Added Nesterov's lookahead to Momentum Optimizer
    • Added Reversible transformer interface
    • MaxAbs, Z Score, and Robust scalers are now Reversible
    • Min Max Normalizer now implements Reversible
    • TF-IDF Transformer is now Reversible
    • Added Preset cluster seeder
    • Added Concatenator extractor
  • 1.0.3(Jun 19, 2021)

  • 1.0.2(May 26, 2021)

  • 1.0.1(May 25, 2021)

    • Fix AdaMax optimizer when tensor extension loaded
    • Prevent certain specification false negatives
    • Add extension minimum version specification
  • 1.0.0(May 8, 2021)

  • 1.0.0-rc1(May 3, 2021)

    • Added Token Hashing Vectorizer transformer
    • Added Word Stemmer tokenizer from Extras
    • Remove HTML Stripper and Whitespace Remover transformers
    • Rename steps() method to losses()
    • steps() now returns an iterable progress table with a header
    • Remove rules() method on CART
    • Removed results() and best() methods from Grid Search
    • Change string representation of NAN to match PHP
    • Added extra whitespace pattern to Regex Filter
  • 1.0.0-beta2(Apr 18, 2021)

    • Interval Discretizer now uses variable width histograms
    • Added TF-IDF sublinear TF scaling and document length normalization
    • Dataset filterByColumn() is now filter()
    • Added Lambda Function transformer from Extras
    • Rename Dataset column methods to feature
    • Added Dataset general sort() using callback
    • Confusion Matrix classes no longer selectable
    • Remove Recursive Feature Eliminator transformer
    • Metric range() now returns a Tuple object
Owner
Rubix
Machine Learning and Deep Learning for the PHP language.