A high-level machine learning and deep learning library for the PHP language.

Overview

Rubix ML



  • Developer-friendly API is delightful to use
  • 40+ supervised and unsupervised learning algorithms
  • Support for ETL, preprocessing, and cross-validation
  • Open source and free to use commercially

Installation

Install Rubix ML into your project using Composer:

$ composer require rubix/ml

Requirements

  • PHP 7.4 or above

Recommended

Optional

Documentation

Read the latest docs here.

What is Rubix ML?

Rubix ML is a free, open-source machine learning (ML) library that allows you to build programs that learn from your data using the PHP language. We provide tools for the entire machine learning life cycle, from ETL to training, cross-validation, and production, with over 40 supervised and unsupervised learning algorithms. In addition, we provide tutorials and other educational content to help you get started using ML in your projects.

Getting Started

If you are new to machine learning, we recommend taking a look at the What is Machine Learning? section to get started. If you are already familiar with basic ML concepts, you can browse the basic introduction for a brief look at a typical Rubix ML project. From there, you can browse the official tutorials below which range from beginner to advanced skill level.

Tutorials & Example Projects

Check out these example projects using the Rubix ML library. Many come with instructions and a pre-cleaned dataset.

Interact With The Community

Funding

Rubix ML is funded by donations from the community. You can become a sponsor by making a contribution to one of our funding sources below.

Contributing

See CONTRIBUTING.md for guidelines.

License

The code is licensed MIT and the documentation is licensed CC BY-NC 4.0.

Comments
  • Handle multibyte string in text normalizer

    Handle multibyte string in text normalizer

    Problem

    TextNormalizer is unable to handle strings containing accents; characters like "é" or "è" are replaced by ?.

    Input: "Depuis qu’il avait emménagé à côté de chez elle, il y a de ça cinq ans."
    Output: "depuis qu’il avait emm?nag? ? c?t? de chez elle, il y a de ?a cinq ans."

    Fix

    Use mb_strtolower instead of strtolower.
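
    The fix can be illustrated with plain PHP (a minimal sketch, independent of the library; requires the mbstring extension):

```php
<?php

// strtolower() only lowercases the ASCII range, so multibyte characters
// such as "É" pass through unchanged (or end up mangled, as in the
// report above), whereas mb_strtolower() is multibyte-aware.

var_dump(strtolower('CÔTÉ'));              // accented characters are not lowered
var_dump(mb_strtolower('CÔTÉ', 'UTF-8'));  // "côté"

// Inside a text normalizer, the multibyte-safe call would be:
$normalized = mb_strtolower('Depuis qu’il avait emménagé à CÔTÉ de chez elle.', 'UTF-8');
var_dump($normalized);
```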

    opened by maximecolin 19
  • Undefined offset in metric class while training Multilayer Perceptron Classifier

    Undefined offset in metric class while training Multilayer Perceptron Classifier

    Describe the bug

    When attempting to train a Multilayer Perceptron classifier, I occasionally get the following type of exception. I have been able to replicate this with both the MCC and FBeta metrics. Unfortunately, this exception does not occur consistently, even with the same dataset.

    [2020-04-04 22:32:21] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
    [stacktrace]
    #0 /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php(107): Illuminate\\Foundation\\Bootstrap\\HandleExceptions->handleError()
    #1 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(414): Rubix\\ML\\CrossValidation\\Metrics\\MCC->score()
    #2 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(360): Rubix\\ML\\Classifiers\\MultilayerPerceptron->partial()
    #3 /[REDACTED]/vendor/rubix/ml/src/Pipeline.php(189): Rubix\\ML\\Classifiers\\MultilayerPerceptron->train()
    #4 /[REDACTED]/vendor/rubix/ml/src/PersistentModel.php(191): Rubix\\ML\\Pipeline->train()
    #5 /[REDACTED]/app/Console/Commands/TrainModel.php(89): Rubix\\ML\\PersistentModel->train()
    #6 [internal function]: App\\Console\\Commands\\TrainModel->handle()
    #7 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): call_user_func_array()
    #8 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()
    #9 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\\Container\\Util::unwrapIfClosure()
    #10 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\\Container\\BoundMethod::callBoundMethod()
    #11 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\\Container\\BoundMethod::call()
    #12 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\\Container\\Container->call()
    #13 /[REDACTED]/vendor/symfony/console/Command/Command.php(255): Illuminate\\Console\\Command->execute()
    #14 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Symfony\\Component\\Console\\Command\\Command->run()
    #15 /[REDACTED]/vendor/symfony/console/Application.php(912): Illuminate\\Console\\Command->run()
    #16 /[REDACTED]/vendor/symfony/console/Application.php(264): Symfony\\Component\\Console\\Application->doRunCommand()
    #17 /[REDACTED]/vendor/symfony/console/Application.php(140): Symfony\\Component\\Console\\Application->doRun()
    #18 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Symfony\\Component\\Console\\Application->run()
    #19 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(129): Illuminate\\Console\\Application->run()
    #20 /[REDACTED]/artisan(37): Illuminate\\Foundation\\Console\\Kernel->handle()
    #21 {main}
    "}
    

    To Reproduce

    The following code is capable of recreating this error occasionally.

    $estimator = new PersistentModel(
        new Pipeline(
            [
                new TextNormalizer(),
                new WordCountVectorizer(10000, 3, new NGram(1, 3)),
                new TfIdfTransformer(),
                new ZScaleStandardizer()
            ],
            new MultilayerPerceptron([
                new Dense(100),
                new PReLU(),
                new Dense(100),
                new PReLU(),
                new Dense(100),
                new PReLU(),
                new Dense(50),
                new PReLU(),
                new Dense(50),
                new PReLU(),
            ], 100, null, 1e-4, 1000, 1e-4, 10, 0.1, null, new MCC())
        ),
        new Filesystem($modelPath.'classifier.model')
    );
    
    $estimator->setLogger(new Screen('train-model'));
    
    $estimator->train($dataset);
    

    The labelled dataset used is a series of text files split into different directories that indicate their class names. This dataset is built using the following function.

        public static function buildLabeled(): Labeled
        {
            $samples = $labels = [];
    
            $directories = glob(storage_path('app/dataset/*'));
    
            foreach($directories as $directory) {
                foreach (glob($directory.'/*.txt') as $file) {
                    $text = file_get_contents($file);
                    $samples[] = [$text];
                    $labels[] = basename($directory);
                }
            }
    
            return Labeled::build($samples, $labels);
        }
    

    Expected behavior

    Training should complete without any errors within the metric class.

    bug 
    opened by DivineOmega 16
  • Potential bug when partially training a Gaussian Naive Bayes model

    Potential bug when partially training a Gaussian Naive Bayes model

    So I have the following code:

    $estimator = new PersistentModel(
        new Pipeline([
            new TextNormalizer(),
            new StopWordFilter(StopWords::$words),
            new WordCountVectorizer(10000, 1, new NGram(1, 2, new WordStemmer('romanian'))),
            new TfIdfTransformer(),
            new ZScaleStandardizer(),
        ], new GaussianNB()),
        new Filesystem($this->path.'/categorization.model', true)
    );
    
    $categories = Category::has('variants')->with('variants')->get();
    
    $samples = [];
    $labels = [];
    foreach ($categories as $category) {
        foreach ($category->variants as $variant) {
            $samples[] = [$variant->title];
            $labels[] = $category->slug;
        }
    }
    
    $foldsNo = 10;
    $dataset = new Labeled($samples, $labels);
    $folds = $dataset->fold($foldsNo);
    
    $estimator->train($folds[0]);
    for($i = 1; $i < $foldsNo; $i++) {
        $estimator->partial($folds[$i]);
    }
    
    $estimator->save();
    
    

    which tries to train a GaussianNB model on 26 thousand products and their associated labels, but for some reason I get the following error:

    
    ErrorException
    Undefined offset: 0
    

    in rubix\ml\src\Classifiers\GaussianNB.php:267, which corresponds to the following code: $means[$column] = (($n * $mean) + ($oldWeight * $oldMeans[$column])) / ($oldWeight + $n);. The problem seems to be $oldMeans[$column]: it tries to access an offset that doesn't exist, but I have no idea why it doesn't exist.

    I don't suspect that the training data has any errors, since fully training the same model in a single batch session with $estimator->train($dataset) works as expected.

    When I tried to debug the code, I found that lines 256, 257 and 259 have this code:

    $means = $oldMeans = $this->means[$class] ?? [];
    $variances = $oldVariances = $this->variances[$class] ?? [];
    
    $oldWeight = $this->weights[$class] ?? 0;
    
    

    which assigns an empty array [] if $this->means[$class] and $this->variances[$class] don't exist (and 0 for $this->weights[$class]). But if $oldMeans is an empty array, I think $oldMeans[$column] would throw that error. So a fix would be to use ($oldMeans[$column] ?? 0) in rubix\ml\src\Classifiers\GaussianNB.php:267 instead of $oldMeans[$column]. I am not 100% sure how that would affect the training of the model, though.
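
    The suggested guard can be sketched in isolation (a hypothetical standalone helper, not the library's actual code):

```php
<?php

// Weighted running-mean update with the proposed "?? 0" guard. When a
// class is seen for the first time, the old means array is empty and
// the old weight is 0, so the update reduces to the plain batch mean
// instead of raising an undefined offset error.

function updateMeans(array $oldMeans, int $oldWeight, array $batchMeans, int $n): array
{
    $means = [];

    foreach ($batchMeans as $column => $mean) {
        $means[$column] = (($n * $mean) + ($oldWeight * ($oldMeans[$column] ?? 0.0)))
            / ($oldWeight + $n);
    }

    return $means;
}

// First partial batch: no previous statistics for this class.
var_dump(updateMeans([], 0, [2.0, 4.0], 10));          // returns [2.0, 4.0]

// Second batch: blend old and new means by their sample counts.
var_dump(updateMeans([2.0, 4.0], 10, [4.0, 8.0], 10)); // returns [3.0, 6.0]
```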

    I hope that I am not mistaken.

    Thank you!

    bug 
    opened by cioroianudenis 15
  • Better visualization of decision trees (fixes #136)

    Better visualization of decision trees (fixes #136)

    Following the discussion on #136, this change builds on the rules() method from a51d3f197602d624ffdcbc227c2569a0cd1d64f5 to produce a tree description for the graphviz package.

    The generated string can be sent through the dot command to produce an image.

    A test case is added to ClassificationTreeTest.php.

    Running this on the divorce dataset produces output like:

    digraph Tree {
    node [shape=box, fontname=helvetica];
    edge [fontname=helvetica];
    N0 [label="Column_20 < 1"];
    N1 [label="Column_52 < 0"];
    N2 [label="Column_7 < 0"];
    N3 [label="Outcome=divorced",style="rounded,filled",fillcolor=gray];
    N2 -> N3;
    N4 [label="Outcome=divorced",style="rounded,filled",fillcolor=gray];
    N2 -> N4;
    N1 -> N2;
    ...
    

    Running dot -Tpng divorce.dot > divorce.png on that output renders the tree as an image (attached to the PR).

    Next steps: add a rulesGraphical() method that produces the image using an extra dependency. (It might be overkill to add a dependency on graphviz for every deployment of RubixML.)

    opened by DrDub 14
  • Added k-skip-n-gram word tokenizer

    Added k-skip-n-gram word tokenizer

    Hello @andrewdalpino, I've created a new branch, skip-gram-0.3.0, inheriting 0.3.0 as its parent. If everything is OK, I will close my first PR. Could you please update this PR with the correct CLA signature, allowing me to pass the CLA assistant?

    opened by absolutic 14
  • Old obsolete info

    Old obsolete info

    As a way to start learning machine learning with PHP using this library, I want to create a simple project that uses old lotto numbers to predict whether a randomly generated number sequence can contain winning numbers, and the probability that the sequence will be drawn. For now my dataset is composed of the last year's winning number series, but I'm planning to expand it to cover the last three years of draws. My question is how to apply the library's features: which feature of this library fits this problem best, and how do I train the model correctly? The dataset is unlabeled, because there is no label that can classify the number series; I only have the winning numbers. I'm reading the documentation and have read some of the tutorials, but some help on how to start with this problem would be appreciated.

    question outdated 
    opened by realrecordzLab 14
  • [RFC] Data Persistence in Rubix ML

    [RFC] Data Persistence in Rubix ML

    ✏️[RFC] Data Persistence in Rubix ML
    👨‍💻 Chris Simpson
    🗓 September - October 2020

    BACKGROUND:

    It is very common for a user of the RubixML library to load or save data.

    Within RubixML, most actions relating to data persistence are achieved via a Persister, either directly or via the PersistentModel class. The purpose of a Persister is to save a new Persistable object or to load a previously saved one.

    Whilst a Persister handles the persistence of objects (recursively [de]serializing the objects and their internal dependencies), the more general action of reading or writing data also appears in numerous other places within the library, e.g. the Encoding object has a write() method used to write data to the local filesystem, whilst the purpose of Extractors is to read data and transform it into Dataset objects.

    This contribution aims to introduce abstractions for data IO suitable for use across the entire codebase. It standardizes the mechanism of how data is persisted (saved/loaded) within RubixML.

    DESIGN GOALS:

    • Remove any tight-coupling to the local file system: Introduce a new abstraction modelling where data can be stored (local filesystem, remote/cloud storage, database etc). The default location can remain as the local filesystem, but alternative implementations should be used if provided.
    • Split the responsibility of reading and writing data. By design Extractors should be read-only, whilst Encoding objects need only have awareness of how to write the data that they contain.
    • Retain the simplicity of the Persister interface, whilst introducing lower-level abstractions to power all current and future data storage operations.
    • Allow for the incremental and memory-efficient reading and writing of data via the use of streams and generators wherever possible/practical.

    APPROACH:

    This PR introduces the Rubix\ML\Storage namespace. This namespace contains abstractions relating to the reading and writing of data: Datastore defines the behaviour of a generic data-storage repository. It gains the ability to perform read-related operations from the Rubix\ML\Storage\Reader interface, and the ability to perform write-related operations from the Rubix\ML\Storage\Writer interface. A Datastore fully implements both of these interfaces. This separation allows Reader to be type-hinted where read-only behaviour is intended, Writer to be used where a class only needs to write data, and Datastore, which encompasses both Reader and Writer, to be hinted by classes that require both mechanisms (e.g. Persister implementations).

    Reader:
    <?php
    
    namespace Rubix\ML\Storage;
    
    use Rubix\ML\Storage\Streams\Stream;
    
    /**
     * Reader
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    interface Reader
    {
        /**
         * Return if the target exists at $location.
         *
         * @param string $location
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         * @return bool
         */
        public function exists(string $location) : bool;
    
        /**
         * Open a stream of the target data at $location.
         *
         * @param string $location
         * @param string $mode
         * @throws \Rubix\ML\Storage\Exceptions\ReadError
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         * @return \Rubix\ML\Storage\Streams\Stream
         */
        public function read(string $location, string $mode = Stream::READ_ONLY) : Stream;
    }
    
    
    Writer:
    <?php
    
    namespace Rubix\ML\Storage;
    
    /**
     * Writer
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    interface Writer
    {
        /**
         * Write.
         *
         * @param string $location
         * @param mixed $data
         * @throws \Rubix\ML\Storage\Exceptions\WriteError
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         */
        public function write(string $location, $data) : void;
    
        /**
         * Move.
         *
         * NOTE: If supported by the underlying datastore this should be implemented as an atomic operation.
         *
         * @param string $from
         * @param string $to
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         */
        public function move(string $from, string $to) : void;
    
        /**
         * Delete.
         *
         * @param string $location
         * @throws \Rubix\ML\Storage\Exceptions\StorageException
         */
        public function delete(string $location) : void;
    }
    
    
    Datastore:
    <?php
    
    namespace Rubix\ML\Storage;
    
    use Stringable;
    
    /**
     * Datastore.
     *
     * Defines the behaviour of a generic storage repository (filesystem, database etc)
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    interface Datastore extends Reader, Writer, Stringable
    {
        //
    }
    
    
    Stream:

    Reader::read() returns an object implementing the Stream interface. This interface acts as an OO wrapper around PHP streams. This functionality is explicitly wrapped to enable typehinting (resource is not a valid language-level typehint) and to keep stream interaction consistent and non-repetitive. Stream represents a generic stream of data and implements all common stream operations via methods named after their common semantics.

    The Stream interface enables data to be read incrementally (in a 'cursorable' manner) from the underlying resource, and also implements IteratorAggregate, allowing the object to be iterated directly (yielding one line of content per iteration). Note: to read the entire contents of the stream in a single operation, see Stream::contents().
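
    The cursorable idea can be sketched in plain PHP with a generator (an illustration only, not the PR's actual Stream implementation):

```php
<?php

// Yield one line per iteration from an open resource so that large
// files never have to be loaded into memory in a single operation.

function lines($resource): \Generator
{
    while (($line = fgets($resource)) !== false) {
        yield rtrim($line, "\n");
    }
}

// Demonstrate with an in-memory stream.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "alpha\nbeta\ngamma\n");
rewind($handle);

foreach (lines($handle) as $line) {
    echo $line, PHP_EOL;
}

fclose($handle);
```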

    INTEGRATION:

    To illustrate the implementation and integration of these new objects I have provided an example: Rubix\ML\Storage\LocalFilesystem. This is the simplest of implementations and can be used to replace any existing interaction with data stored on the local filesystem for example Extractor implementations, Encoding objects, and the widely used Filesystem Persistor class.

    Rubix\ML\Storage\LocalFilesystem:
    <?php
    
    namespace Rubix\ML\Storage;
    
    use Rubix\ML\Storage\Streams\File;
    use Rubix\ML\Storage\Streams\Stream;
    use Rubix\ML\Storage\Exceptions\RuntimeException;
    
    /**
     * Local Datastore.
     *
     * @category    Machine Learning
     * @package     Rubix/ML
     * @author      Chris Simpson
     */
    class LocalFilesystem implements Datastore
    {
        /**
         * @see \Rubix\ML\Storage\Reader::exists()
         * @inheritdoc
         *
         * @param string $location
         *
         * @return bool
         */
        public function exists(string $location) : bool
        {
            return file_exists($location);
        }
    
        /**
         * @see \Rubix\ML\Storage\Reader::read()
         * @inheritdoc
         *
         * @param string $location
         *
         * @return \Rubix\ML\Storage\Streams\Stream
         */
        public function read(string $location, string $mode = Stream::READ_ONLY) : Stream
        {
            $resource = fopen($location, $mode);
    
            if (!$resource) {
                throw new RuntimeException("Could not open $location.");
            }
    
            $stream = new File($resource);
    
            if (!$stream->readable()) {
                throw new RuntimeException("Stream with mode {$mode} cannot be read from");
            }
    
            return $stream;
        }
    
        /**
         * @see \Rubix\ML\Storage\Writer::write()
         * @inheritdoc
         *
         * @param string $location
         * @param \Rubix\ML\Storage\Streams\Stream|string $data
         */
        public function write(string $location, $data) : void
        {
            if ($data instanceof Stream) {
                $data = $data->contents();
            }
    
            file_put_contents($location, $data, LOCK_EX);
        }
    
        /**
         * @see \Rubix\ML\Storage\Writer::delete()
         * @inheritdoc
         *
         * @param string $location
         */
        public function delete(string $location) : void
        {
            unlink($location);
        }
    
        /**
         * @see \Rubix\ML\Storage\Writer::move()
         * @inheritdoc
         *
         * @param string $from
         * @param string $to
         */
        public function move(string $from, string $to) : void
        {
            rename($from, $to);
        }
    
        /**
         * Return the string representation of the object.
         *
         * @return string
         */
        public function __toString() : string
        {
            return 'Local Filesystem';
        }
    }
    
    
    Encoding:

    The Encoding::write() method now accepts an optional second parameter of type Rubix\ML\Storage\Writer. Omitting this argument causes an instance of Rubix\ML\Storage\LocalFilesystem to be used, so existing behaviour is transparently maintained and there are no BC breaks.

    Extractors:

    By design, Extractor objects are read-only. All existing implementations now accept a Reader in their constructors, and the implementations have been updated to show how the Stream object returned by Reader::read() can be used in place of the existing semantics.

    Note: Any Reader implementation can be used: so Extractors are now decoupled from the local filesystem and able to read data from a multitude of storage backends.


    DISCUSSION:

    Whilst I have given this implementation considerable thought, and gone through numerous iterations prior to opening this PR, I still think there is room for significant discussion here. The specific implementations are intended as illustrative, and I'm most interested in feedback on the design itself. I have a number of open questions myself, but I think this is mature enough to invite external feedback and inspection.

    • Naming is hard. Whilst I've tried to be as concise as possible, I think there is room to improve the naming of these abstractions! Which is preferable: Datastore? Repository? StorageOperator, StorageEngine, etc.? Happy to take suggestions! I would have initially used Backend, but didn't want to cause any confusion with the Backend interface used for parallel processing.

    • Future proof? I've tried to make these abstractions generic and multi-purpose. If you can anticipate a future use-case that wouldn't be supported then let's discuss.

    • Backwards Compatibility: Have I broken anything? Should I have? Let's talk about the options here.

    • Unit Tests: Whilst all current unit tests are passing, I have purposefully neglected to add further tests prior to confirming the design/discussion/direction in this PR.

    • Exceptions: The exceptions were added prior to the merge of https://github.com/RubixML/RubixML/pull/116, so they are more a rudimentary demonstration of potential flexibility than an active recommendation of how we should structure an exception hierarchy.

    • Pull Request: At the time of writing I have based this PR against 0.3.0, as it contains the most up-to-date 'nightly' code. This is just to minimise the size of the diff.


    RELATED:

    • https://github.com/RubixML/RubixML/issues/108: How can the persistence subsystem be extended across the codebase? This was my starting point for this PR. I realized that Persisters are actually very good at what they do (serializing objects and storing them), and that backing them with common storage abstractions (without changing the Persister interface) would be a positive step forward.
    • An early proof-of-concept showing how storage could be made pluggable using Flysystem 1.x: https://github.com/RubixML/RubixML/pull/106.
    • A Persistor backed by Flysystem 2.x (beta): https://github.com/RubixML/Extras/pull/3

    REVISIONS:

    2020-10-05:
    • Initial publish and request for comments.
    opened by simplechris 12
  • Flysystem Persister

    Flysystem Persister

    SUMMARY

    This PR adds a Flysystem Persister. This enables a user to load and save models located in a remote storage backend (such as Amazon S3, Azure Blob Storage, Google Cloud Storage, Dropbox, etc.).

    Closes #104 .


    A proof-of-concept PR (#106) instigated some interesting discussion about how to extend the concept of persistence in RubixML. A wider discussion around the introduction of a flexible persistence subsystem is now ongoing. In the meantime, this Flysystem Persister will allow connectivity to a large array of remote storage solutions 🚀

    enhancement 
    opened by simplechris 11
  • Installation error: [InvalidArgumentException] could not find a version of package rubix/ml matching your minimum stability

    Installation error: [InvalidArgumentException] could not find a version of package rubix/ml matching your minimum stability

    Last time I tried setting up Rubix ML on macOS, I got this error about not matching my minimum stability. I was later able to install it after setting "minimum-stability": "dev" in composer.json.

    Now I'm trying to set up Rubix ML on a Windows machine and I get the same error again. So far I've tried composer require rubix/ml:dev-master and composer require rubix/ml:"*", and I'm still not able to install it successfully.

    I might be missing something or doing something wrong. Can someone help me on this?
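
    This error usually means Composer's stability constraints exclude every available version of the package. As an illustration (the keys below are standard Composer settings, not project-specific advice), the stability floor can be relaxed in composer.json:

```json
{
    "minimum-stability": "dev",
    "prefer-stable": true
}
```

    With prefer-stable set, Composer still picks tagged releases whenever they satisfy the constraints; alternatively, requiring an explicit tagged version constraint avoids relying on stability flags at all.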

    question 
    opened by najiagul 10
  • fix BallTree edge case

    fix BallTree edge case

    We had another infinite loop problem in BallTree. Whenever a dataset contained the same sample repeated more times than the max leaf size, BallTree::grow() tried to split the same subtree over and over: since all nodes had the same left and right centroid values, it kept all the samples in the right subtree and kept trying to split it. I added exception-handling code that terminates the process in that case. It can result in a leaf node containing more than the max leaf size, but I am not sure what else to do here: all the samples in that leaf are identical, so there is no criterion to split them further.

    Added a test to cover this use case. If you run the test without the fix in the BallTree class, you will see an infinite loop occur.

    Let me know if you see other solutions here or any problems with this one. Thanks!
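
    The termination guard described above can be sketched standalone (hypothetical names, not the library's code):

```php
<?php

// If every sample in a node is identical, no centroid-based split can
// separate them, so growing must terminate and the node is kept as a
// leaf even when it exceeds the maximum leaf size.

function allIdentical(array $samples): bool
{
    $first = reset($samples);

    foreach ($samples as $sample) {
        if ($sample !== $first) {
            return false;
        }
    }

    return true;
}

$node = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]];

if (allIdentical($node)) {
    echo 'terminate: keep node as a leaf', PHP_EOL;
}
```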

    opened by kroky 8
  • Validate the shape of the input vector during inference

    Validate the shape of the input vector during inference

    After training my model, I tried to predict a single sample. I tried:

    $prediction = $estimator->predictSample([0,0,1,0,1,0,1]);

    And it showed me the prediction. Then I realized I had forgotten to include the label in that line, so I added it, ending up with:

    $prediction = $estimator->predictSample([0,0,1,0,1,0,1,"sim"]);

    And it showed me the prediction again. Then I got confused: one of those two lines should not work, because I was passing in data with a different shape than the training data. Then I tried adding more 0 and 1 items to predictSample, and it always predicts. No error shown!

    This makes predictSample unreliable, since it should only accept the same number of columns as the original training dataset. Also, do I need to provide the label or not? Will it make any difference? I think the label should not be provided: since this is a prediction, not training, the label should have no effect on the output of the prediction!
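
    The requested validation could look something like this sketch (a hypothetical helper, not the library's API):

```php
<?php

// Reject a sample whose feature count differs from the number of
// columns seen during training. The label is never part of the input
// at inference time.

function assertSampleWidth(array $sample, int $expected): void
{
    if (count($sample) !== $expected) {
        throw new InvalidArgumentException(sprintf(
            'Sample has %d features, expected %d.',
            count($sample),
            $expected
        ));
    }
}

assertSampleWidth([0, 0, 1, 0, 1, 0, 1], 7); // ok, 7 features

try {
    assertSampleWidth([0, 0, 1, 0, 1, 0, 1, 'sim'], 7); // label included by mistake
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```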

    enhancement 
    opened by batata004 8
  • How to Train only One Class?

    How to Train only One Class?

    @andrewdalpino Thanks for your incredible creation. As a PHP lover, I have been expecting this type of project for a long time.

    Though I'm new to ML, I need to train on only one label. I looked at the CIFAR-10 and MNIST examples, but I could not manage it; they require multiple labels. How can I do it? For example, I need to detect only pigeons in an image.

    opened by takielias 2
  • Use new PHP 8.0 features in version 3.0

    Use new PHP 8.0 features in version 3.0

    Rubix ML 3.0 will bump up the minimum PHP version from 7.4 to 8.0. As a result, we can now start to use new language features in the next major release. A comprehensive list of new PHP 8.0 features can be found here https://www.php.net/releases/8.0/en.php.

    For example ...

    • Define union and mixed types
    • Nullsafe operator
    • Allow ::class on objects
    • Weak Maps
    • Replace switches with new match syntax
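
    A few of the listed features in action (illustrative snippets only; requires PHP 8.0+):

```php
<?php

// match: an expression with strict comparison and no fall-through,
// replacing a more verbose switch statement.
function kind(int|float $value): string // union type, new in 8.0
{
    return match (true) {
        is_int($value) => 'integer',
        is_float($value) => 'float',
    };
}

// Constructor property promotion plus a nullable type.
class Wrapper
{
    public function __construct(public ?string $name = null)
    {
    }
}

echo kind(3), PHP_EOL;   // integer
echo kind(3.5), PHP_EOL; // float

// Nullsafe operator: short-circuits to null instead of raising an error.
$wrapper = null;
var_dump($wrapper?->name); // NULL

// ::class is now allowed on object instances.
echo (new Wrapper('a'))::class, PHP_EOL; // Wrapper
```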
    enhancement 
    opened by andrewdalpino 0
  • Fixed Array Memory Optimizations

    Fixed Array Memory Optimizations

    SPL Fixed Arrays have much better memory efficiency than the standard PHP array; however, they are slower and have a more verbose API. This makes them suited to places where we need to store something that is not accessed often and where the size of the array is known a priori. They do not, however, work well as part of our public API, since we do not want to force our users to use them. This task is to search the entire Rubix ML codebase for these scenarios and replace the standard PHP array with the more memory-efficient SPL Fixed Array. Where appropriate, please provide a benchmark for the changes if one does not already exist.

    See https://www.php.net/manual/en/class.splfixedarray.php
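
    A minimal look at the SplFixedArray API (standard PHP, nothing library-specific):

```php
<?php

// SplFixedArray has a fixed, pre-declared size and integer-only keys,
// which is what enables its tighter memory layout compared to a
// standard PHP array.

$fixed = new SplFixedArray(3);

$fixed[0] = 1.5;
$fixed[1] = 2.5;
$fixed[2] = 3.5;

echo $fixed->getSize(), PHP_EOL;            // 3
echo array_sum($fixed->toArray()), PHP_EOL; // 7.5

// memory_get_usage() can be sampled before and after allocating each
// structure to benchmark the savings for a given element count.
```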


    optimization 
    opened by andrewdalpino 1
  • Prune redundant Decision Tree leaf nodes

    Prune redundant Decision Tree leaf nodes

    The current Decision Tree grow() implementation does not prune pure leaf nodes that have the same class outcome and both stem from a common ancestor node. Pruning would involve replacing the Split node with a pure leaf node. See image.


    The problematic logic can be found here https://github.com/RubixML/ML/blob/master/src/Graph/Trees/DecisionTree.php#L188

    Instead of terminating after the Split node is added to the stack, we could detect a pure split containing only one class and replace the Split node with a leaf node right away.

    Here is the test that generated this Graphviz visual (except the number of bins was set to 3 instead of 5) https://github.com/RubixML/ML/blob/master/tests/Classifiers/ClassificationTreeTest.php#L194.

    This should speed up inference by reducing the number of splits that need to be evaluated, as well as reduce the memory and storage cost of trained Decision Tree models. Affects Classification/Regression Trees, Extra Trees, Gradient Boost, Logit Boost, Random Forest, and AdaBoost.

    help wanted optimization 
    opened by andrewdalpino 0
  • Add "Model Explanation" section to the User Guide

    Add "Model Explanation" section to the User Guide

    With the addition of Decision Tree Graphviz visualizations and given that some Learners already implement the RanksFeatures interface which provides a method to output the importance scores of each feature in the training set, we could start to build out a separate section of the User Guide dedicated to model explainability.

    I think a good place to start would be an Introduction, a Feature Importances section, and a Decision Tree visualization section. We could move the Feature Importances section over from the Training page (https://github.com/RubixML/ML/blob/master/docs/training.md#feature-importances). We should also include an image (png) of an example Decision Tree graph.

    The page should be written in Markdown like the rest of the documentation; see https://github.com/RubixML/ML/tree/master/docs.

    enhancement 
    opened by andrewdalpino 0
Releases(2.3.0)
  • 2.3.0(Dec 31, 2022)

  • 2.2.2(Dec 6, 2022)

  • 1.3.5(Dec 6, 2022)

  • 2.2.1(Oct 15, 2022)

  • 0.4.3(Oct 6, 2022)

  • 2.2.0(Oct 1, 2022)

    • Added Image Rotator transformer
    • Added One Vs Rest ensemble classifier
    • Added variance and range to the Dataset describe() report
    • Added Gower distance kernel
    • Added types() method to Dataset
    • Concatenator now accepts an iterator of iterators
    Source code(tar.gz)
    Source code(zip)
  • 2.1.1(Sep 13, 2022)

  • 2.1.0(Jul 30, 2022)

    Big thanks to @torchello and @DrDub for their huge contributions to this release!

    • Added Probabilistic Metric interface
    • Added Probabilistic and Top K Accuracy
    • Added Brier Score Probabilistic Metric
    • Export Decision Tree-based models in Graphviz "dot" format
    • Added Graphviz helper class
    • Graph subsystem memory and storage optimizations

    Warning: This release contains changes to the Graph subsystem that break backward compatibility for all Decision Tree-based learners saved with a previous version. Classification Tree, Extra Tree Classifier, Random Forest, Logit Boost, AdaBoost, Regression Tree, Extra Tree Regressor, and Gradient Boost are all affected.

    Note: Moving forward, changes that break the backward compatibility of saved objects will only ship in a major release unless they are part of a bug fix. See https://docs.rubixml.com/2.0/model-persistence.html#caveats for an explanation of why backward compatibility is harder to maintain for saved objects than for the API.

  • 2.0.2(Jun 3, 2022)

  • 1.3.4(Jun 3, 2022)

  • 2.0.1(Apr 3, 2022)

  • 2.0.0(Mar 30, 2022)

    • Gradient Boost now uses gradient-based subsampling
    • Allow Token Hashing Vectorizer custom hash functions
    • Gradient Boost base estimator no longer configurable
    • Move dummy estimators to the Extras package
    • Increase default MLP window from 3 to 5
    • Decrease default Gradient Boost window from 10 to 5
    • Rename alpha regularization parameter to L2 penalty
    • Added RBX serializer class property type change detection
    • Rename boosting estimators param to epochs
    • Neural net-based learners can now train for 0 epochs
    • Rename Labeled stratify() to stratifyByLabel()
    • Added Sparse Cosine distance kernel
    • Cosine distance now optimized for dense and sparse vectors
    • Word Count Vectorizer now uses min count and max ratio DFs
    • Numeric String Converter now handles NAN and INFs
    • Numeric String Converter is now Reversible
    • Removed Numeric String Converter NAN_PLACEHOLDER constant
    • Added MurmurHash3 and FNV1a 32-bit hashing functions to Token Hashing Vectorizer
    • Changed Token Hashing Vectorizer max dimensions to 2,147,483,647
    • Increase SQL Table Extractor batch size from 100 to 256
    • Ranks Features interface no longer extends Stringable
    • Verbose Learners now log change in loss
    • Numerical instability logged as a warning instead of info
    • Added header() method to CSV and SQL Table Extractors
    • Argmax() now throws an exception when undefined
    • MLP Learners recover from numerical instability with a snapshot
    • Rename Gzip serializer to Gzip Native
    • Change RBX serializer constructor argument from base to level
    • Rename Writeable extractor interface to Exporter
  • 1.3.2(Feb 22, 2022)

  • 0.4.2(Feb 12, 2022)

  • 1.3.1(Dec 8, 2021)

  • 1.3.0(Dec 4, 2021)

    • Switch back to the original fork of Tensor
    • Added maxBins hyper-parameter to CART-based learners
    • Added stream Deduplicator extractor
    • Added the SiLU activation function
    • Added Swish activation layer
  • 1.2.3(Nov 10, 2021)

  • 1.2.2(Oct 31, 2021)

  • 1.2.1(Oct 11, 2021)

  • 1.2.0(Aug 1, 2021)

  • 1.1.3(Jul 14, 2021)

  • 1.1.2(Jul 6, 2021)

    • Improved random floating point number precision
    • Deduplicate Preset seeder centroids
    • Fix Gradient Boost learning rate upper bound
    • Fix Loda histogram edge alignment
  • 1.1.1(Jul 5, 2021)

  • 1.1.0(Jul 5, 2021)

    • Update to Scienide Tensor 3.0
    • Added Nesterov's lookahead to Momentum Optimizer
    • Added Reversible transformer interface
    • MaxAbs, Z Score, and Robust scalers are now Reversible
    • Min Max Normalizer now implements Reversible
    • TF-IDF Transformer is now Reversible
    • Added Preset cluster seeder
    • Added Concatenator extractor
  • 1.0.3(Jun 19, 2021)

  • 1.0.2(May 26, 2021)

  • 1.0.1(May 25, 2021)

    • Fix AdaMax optimizer when tensor extension loaded
    • Prevent certain specification false negatives
    • Add extension minimum version specification
  • 1.0.0(May 8, 2021)

  • 1.0.0-rc1(May 3, 2021)

    • Added Token Hashing Vectorizer transformer
    • Added Word Stemmer tokenizer from Extras
    • Remove HTML Stripper and Whitespace Remover transformers
    • Rename steps() method to losses()
    • steps() now returns an iterable progress table with a header
    • Remove rules() method on CART
    • Removed results() and best() methods from Grid Search
    • Change string representation of NAN to match PHP
    • Added extra whitespace pattern to Regex Filter
  • 1.0.0-beta2(Apr 18, 2021)

    • Interval Discretizer now uses variable width histograms
    • Added TF-IDF sublinear TF scaling and document length normalization
    • Dataset filterByColumn() is now filter()
    • Added Lambda Function transformer from Extras
    • Rename Dataset column methods to feature
    • Added Dataset general sort() using callback
    • Confusion Matrix classes no longer selectable
    • Remove Recursive Feature Eliminator transformer
    • Metric range() now returns a Tuple object
Owner
Rubix
Machine Learning and Deep Learning for the PHP language.