✏️ | [RFC] Data Persistence in Rubix ML |
👨💻 | Chris Simpson |
🗓 | September - October 2020 |
BACKGROUND:
It is very common for a user of the RubixML library to load or save data.
Within RubixML, most actions relating to data persistence are achieved via a Persister, either directly or via the PersistentModel class. The purpose of a Persister is to save a new Persistable object or to load a previously saved object.
Whilst a Persister handles the persistence of objects (recursively [de]serializing the objects and their internal dependencies), the more general action of reading or writing data also appears in numerous other places within the library. For example, the Encoding object has a write() method used to write data to the local filesystem, whilst the purpose of Extractors is to read data and transform it into Dataset objects.
This contribution aims to introduce abstractions for data IO suitable for use across the entire codebase. It standardizes the mechanism of how data is persisted (saved/loaded) within RubixML.
DESIGN GOALS:
- Remove any tight coupling to the local filesystem: Introduce a new abstraction modelling where data can be stored (local filesystem, remote/cloud storage, database etc). The default location can remain the local filesystem, but alternative implementations should be used if provided.
- Split the responsibility of reading and writing data. By design, Extractors should be read-only, whilst Encoding objects need only be aware of how to write the data that they contain.
- Retain the simplicity of the Persister interface, whilst introducing lower-level abstractions to power all current and future data storage operations.
- Allow for the incremental and memory-efficient reading and writing of data via the use of streams and generators wherever possible/practical.
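The last goal can be illustrated with plain PHP streams and generators, independent of the abstractions proposed in this RFC. The following is only a minimal sketch: `rows()`, `writeRows()` and `readLines()` are hypothetical helpers invented for illustration and are not part of RubixML.

```php
<?php

/**
 * Lazily produce rows one at a time. (Hypothetical data source.)
 */
function rows(int $n) : Generator
{
    for ($i = 1; $i <= $n; ++$i) {
        yield [$i, $i * $i];
    }
}

/**
 * Write each row as soon as it is produced, so the full dataset never
 * has to be built up in memory before it is written.
 *
 * @param resource $resource
 * @param iterable<array> $rows
 */
function writeRows($resource, iterable $rows) : void
{
    foreach ($rows as $row) {
        fwrite($resource, implode(',', $row) . "\n");
    }
}

/**
 * Yield one line at a time so only a single line is held in memory.
 *
 * @param resource $resource
 */
function readLines($resource) : Generator
{
    while (($line = fgets($resource)) !== false) {
        yield rtrim($line, "\r\n");
    }
}

// php://temp keeps the demo data in memory, spilling to disk if it grows.
$resource = fopen('php://temp', 'r+');

writeRows($resource, rows(3));
rewind($resource);

$lines = iterator_to_array(readLines($resource), false);

fclose($resource);
```

Because both sides are lazy, peak memory use is bounded by the size of one row rather than the size of the dataset.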
APPROACH:
This PR introduces the Rubix\ML\Storage namespace, which contains abstractions relating to the reading and writing of data. Datastore defines the behaviour of a generic data-storage repository. It gains its ability to perform read-related operations from the Rubix\ML\Storage\Reader interface, and its ability to perform write-related operations from the Rubix\ML\Storage\Writer interface. A Datastore fully implements both of these interfaces. This separation allows Reader to be type-hinted where read-only behaviour is intended, and Writer to be used where a class only needs to write data. Datastore, which encompasses both Reader and Writer, can be type-hinted by classes that require both mechanisms (e.g. Persister implementations).
Reader:
<?php

namespace Rubix\ML\Storage;

use Rubix\ML\Storage\Streams\Stream;

/**
 * Reader
 *
 * @category    Machine Learning
 * @package     Rubix/ML
 * @author      Chris Simpson
 */
interface Reader
{
    /**
     * Return if the target exists at $location.
     *
     * @param string $location
     * @throws \Rubix\ML\Storage\Exceptions\StorageException
     * @return bool
     */
    public function exists(string $location) : bool;

    /**
     * Open a stream of the target data at $location.
     *
     * @param string $location
     * @param string $mode
     * @throws \Rubix\ML\Storage\Exceptions\ReadError
     * @throws \Rubix\ML\Storage\Exceptions\StorageException
     * @return \Rubix\ML\Storage\Streams\Stream
     */
    public function read(string $location, string $mode = Stream::READ_ONLY) : Stream;
}
Writer:
<?php

namespace Rubix\ML\Storage;

/**
 * Writer
 *
 * @category    Machine Learning
 * @package     Rubix/ML
 * @author      Chris Simpson
 */
interface Writer
{
    /**
     * Write.
     *
     * @param string $location
     * @param mixed $data
     * @throws \Rubix\ML\Storage\Exceptions\WriteError
     * @throws \Rubix\ML\Storage\Exceptions\StorageException
     */
    public function write(string $location, $data) : void;

    /**
     * Move.
     *
     * NOTE: If supported by the underlying datastore this should be implemented as an atomic operation.
     *
     * @param string $from
     * @param string $to
     * @throws \Rubix\ML\Storage\Exceptions\StorageException
     */
    public function move(string $from, string $to) : void;

    /**
     * Delete.
     *
     * @param string $location
     * @throws \Rubix\ML\Storage\Exceptions\StorageException
     */
    public function delete(string $location) : void;
}
Datastore:
<?php

namespace Rubix\ML\Storage;

use Stringable;

/**
 * Datastore.
 *
 * Defines the behaviour of a generic storage repository (filesystem, database etc)
 *
 * @category    Machine Learning
 * @package     Rubix/ML
 * @author      Chris Simpson
 */
interface Datastore extends Reader, Writer, Stringable
{
    //
}
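To show why the Reader/Writer split is useful in practice, here is a minimal sketch of an alternative Datastore together with consumers that type-hint only the capability they need. Everything in it is an assumption made for illustration: the interfaces are cut-down stand-ins for the ones proposed above (`read()` returns a plain string rather than a Stream, and Stringable is omitted), and `InMemory`, `exportReport()` and `importReport()` are hypothetical names that do not exist in this PR.

```php
<?php

// Cut-down stand-ins for the proposed interfaces. To stay short, read()
// returns a plain string here instead of a Stream object.
interface Reader
{
    public function exists(string $location) : bool;

    public function read(string $location) : string;
}

interface Writer
{
    public function write(string $location, $data) : void;

    public function move(string $from, string $to) : void;

    public function delete(string $location) : void;
}

interface Datastore extends Reader, Writer
{
    //
}

/**
 * Hypothetical array-backed datastore, e.g. for use in unit tests.
 */
class InMemory implements Datastore
{
    /** @var array<string,string> */
    protected $items = [];

    public function exists(string $location) : bool
    {
        return isset($this->items[$location]);
    }

    public function read(string $location) : string
    {
        if (!$this->exists($location)) {
            throw new RuntimeException("Nothing stored at $location.");
        }

        return $this->items[$location];
    }

    public function write(string $location, $data) : void
    {
        $this->items[$location] = (string) $data;
    }

    public function move(string $from, string $to) : void
    {
        $this->write($to, $this->read($from));
        $this->delete($from);
    }

    public function delete(string $location) : void
    {
        unset($this->items[$location]);
    }
}

// A consumer that only writes asks for a Writer ...
function exportReport(Writer $writer, string $report) : void
{
    $writer->write('report.txt', $report);
}

// ... and one that only reads asks for a Reader.
function importReport(Reader $reader) : string
{
    return $reader->read('report.txt');
}

$store = new InMemory();

exportReport($store, 'accuracy: 0.93');
$report = importReport($store);

$store->move('report.txt', 'archive/report.txt');
```

The same `$store` satisfies both signatures, yet neither function can reach operations outside the interface it asked for.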
Stream:
Reader::read() returns an object implementing the Stream interface. This interface acts as an OO wrapper around PHP streams. The functionality is explicitly wrapped to enable type-hinting (resource is not a valid language-level type hint) and to allow consistent, non-repetitive stream interaction. Stream represents a generic stream of data and exposes all common stream operations via methods named after their conventional semantics.
The Stream interface makes it possible to read the data from the underlying resource incrementally (in a 'cursorable' manner). It also extends IteratorAggregate, allowing the object to be iterated upon directly (yielding a line of content per iteration). Note: To read the entire contents of the stream in a single operation see Stream::contents().
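The Stream interface itself is not reproduced in this description, so the following is only a rough sketch of what such a wrapper might look like. The method names `readable()` and `contents()` and the line-per-iteration behaviour are taken from the text and from the LocalFilesystem example; the class name `ResourceStream`, the method bodies, and the choice to rewind before reading are assumptions.

```php
<?php

/**
 * Hypothetical, cut-down object wrapper around a PHP stream resource.
 */
class ResourceStream implements IteratorAggregate
{
    /** @var resource */
    protected $resource;

    /**
     * @param resource $resource
     */
    public function __construct($resource)
    {
        if (!is_resource($resource)) {
            throw new InvalidArgumentException('Expected an open stream resource.');
        }

        $this->resource = $resource;
    }

    /**
     * Return true if the mode the stream was opened with permits reading.
     */
    public function readable() : bool
    {
        $mode = stream_get_meta_data($this->resource)['mode'];

        // Modes containing 'r' or '+' allow reading (e.g. 'r', 'rb', 'w+').
        return strpbrk($mode, 'r+') !== false;
    }

    /**
     * Read the entire stream in one operation, starting from the beginning.
     */
    public function contents() : string
    {
        rewind($this->resource);

        return (string) stream_get_contents($this->resource);
    }

    /**
     * Yield one line of content per iteration.
     */
    public function getIterator() : Generator
    {
        rewind($this->resource);

        while (($line = fgets($this->resource)) !== false) {
            yield rtrim($line, "\r\n");
        }
    }
}

$resource = fopen('php://temp', 'w+');
fwrite($resource, "x,y\n1,2\n");

$stream = new ResourceStream($resource);

$lines = iterator_to_array($stream, false);
$all = $stream->contents();
```

Wrapping the resource in an object is what allows a signature such as `read(...) : Stream` to be enforced by the language, which a bare `resource` cannot offer.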
INTEGRATION:
To illustrate the implementation and integration of these new objects I have provided an example: Rubix\ML\Storage\LocalFilesystem. This is the simplest of implementations and can be used to replace any existing interaction with data stored on the local filesystem, for example in Extractor implementations, Encoding objects, and the widely used Filesystem Persister class.
Rubix\ML\Storage\LocalFilesystem:
<?php

namespace Rubix\ML\Storage;

use Rubix\ML\Storage\Streams\File;
use Rubix\ML\Storage\Streams\Stream;
use Rubix\ML\Storage\Exceptions\RuntimeException;

/**
 * Local Datastore.
 *
 * @category    Machine Learning
 * @package     Rubix/ML
 * @author      Chris Simpson
 */
class LocalFilesystem implements Datastore
{
    /**
     * @see \Rubix\ML\Storage\Reader::exists()
     * @inheritdoc
     *
     * @param string $location
     *
     * @return bool
     */
    public function exists(string $location) : bool
    {
        return file_exists($location);
    }

    /**
     * @see \Rubix\ML\Storage\Reader::read()
     * @inheritdoc
     *
     * @param string $location
     * @param string $mode
     *
     * @return \Rubix\ML\Storage\Streams\Stream
     */
    public function read(string $location, string $mode = Stream::READ_ONLY) : Stream
    {
        $resource = fopen($location, $mode);

        if (!$resource) {
            throw new RuntimeException("Could not open $location.");
        }

        $stream = new File($resource);

        if (!$stream->readable()) {
            throw new RuntimeException("Stream with mode {$mode} cannot be read from");
        }

        return $stream;
    }

    /**
     * @see \Rubix\ML\Storage\Writer::write()
     * @inheritdoc
     *
     * @param string $location
     * @param \Rubix\ML\Storage\Streams\Stream|string $data
     */
    public function write(string $location, $data) : void
    {
        if ($data instanceof Stream) {
            $data = $data->contents();
        }

        // file_put_contents() returns false on failure rather than throwing.
        if (file_put_contents($location, $data, LOCK_EX) === false) {
            throw new RuntimeException("Could not write to $location.");
        }
    }

    /**
     * @see \Rubix\ML\Storage\Writer::delete()
     * @inheritdoc
     *
     * @param string $location
     */
    public function delete(string $location) : void
    {
        if (!unlink($location)) {
            throw new RuntimeException("Could not delete $location.");
        }
    }

    /**
     * @see \Rubix\ML\Storage\Writer::move()
     * @inheritdoc
     *
     * @param string $from
     * @param string $to
     */
    public function move(string $from, string $to) : void
    {
        if (!rename($from, $to)) {
            throw new RuntimeException("Could not move $from to $to.");
        }
    }

    /**
     * Return the string representation of the object.
     *
     * @return string
     */
    public function __toString() : string
    {
        return 'Local Filesystem';
    }
}
Encoding:
The Encoding::write() method now accepts an optional second parameter of type Rubix\ML\Storage\Writer. Omitting this argument will cause an instance of Rubix\ML\Storage\LocalFilesystem to be used (so existing behaviour is transparently maintained, resulting in no BC breaks).
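The optional-argument pattern described above can be sketched as follows. This is not the code from the PR: `Writer`, `LocalFilesystem` and `Encoding` are reduced to hypothetical stand-ins containing only what is needed to show the default being applied, and the anonymous capturing writer is invented for the demonstration.

```php
<?php

// Stand-in for the proposed Writer interface (write() only).
interface Writer
{
    public function write(string $location, $data) : void;
}

// Stand-in for the proposed LocalFilesystem datastore (write() only).
class LocalFilesystem implements Writer
{
    public function write(string $location, $data) : void
    {
        if (file_put_contents($location, $data, LOCK_EX) === false) {
            throw new RuntimeException("Could not write to $location.");
        }
    }
}

// Simplified stand-in for the Encoding object.
class Encoding
{
    /** @var string */
    protected $data;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    /**
     * Write the encoding to $path. When no writer is given, fall back to
     * the local filesystem so existing callers keep working unchanged.
     */
    public function write(string $path, ?Writer $writer = null) : void
    {
        $writer = $writer ?? new LocalFilesystem();

        $writer->write($path, $this->data);
    }
}

$encoding = new Encoding('{"a":1}');

// Existing call style: no writer given, so the local filesystem is used.
$path = tempnam(sys_get_temp_dir(), 'rubix');
$encoding->write($path);
$written = file_get_contents($path);
unlink($path);

// New call style: any Writer can be supplied instead.
$capture = new class implements Writer {
    public $written = [];

    public function write(string $location, $data) : void
    {
        $this->written[$location] = $data;
    }
};

$encoding->write('remote/key.json', $capture);
```

Making the parameter nullable with a `null` default, rather than defaulting to an object, is required because PHP (before 8.1) does not allow `new` expressions as default parameter values.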
Extractors:
By design, Extractor objects are read-only. All existing implementations now accept a Reader in their constructors, and the implementations have been updated to show how the Stream object returned by Reader::read(): Stream can be used in place of the existing semantics.
Note: Any Reader implementation can be used, so Extractors are now decoupled from the local filesystem and able to read data from a multitude of storage backends.
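As a rough illustration of that decoupling, the sketch below injects a Reader into a simplified CSV extractor. It rests on several assumptions: the `Reader` here is a stand-in whose `read()` returns any iterable of lines (mirroring how the proposed Stream behaves when iterated), and `ArrayReader` and `CsvExtractor` are hypothetical classes, not the extractors shipped with RubixML.

```php
<?php

// Stand-in for the proposed Reader. read() returns an iterable of lines,
// which is how the proposed Stream behaves when iterated directly.
interface Reader
{
    public function exists(string $location) : bool;

    public function read(string $location) : iterable;
}

/**
 * Hypothetical reader serving fixed in-memory content instead of files.
 */
class ArrayReader implements Reader
{
    /** @var array<string,string> */
    protected $files;

    public function __construct(array $files)
    {
        $this->files = $files;
    }

    public function exists(string $location) : bool
    {
        return isset($this->files[$location]);
    }

    public function read(string $location) : iterable
    {
        if (!$this->exists($location)) {
            throw new RuntimeException("Nothing stored at $location.");
        }

        return explode("\n", trim($this->files[$location]));
    }
}

/**
 * Simplified extractor: knows how to parse CSV lines, but not where
 * those lines come from.
 */
class CsvExtractor implements IteratorAggregate
{
    /** @var string */
    protected $path;

    /** @var Reader */
    protected $reader;

    public function __construct(string $path, Reader $reader)
    {
        $this->path = $path;
        $this->reader = $reader;
    }

    public function getIterator() : Generator
    {
        foreach ($this->reader->read($this->path) as $line) {
            // Delimiter, enclosure and escape are passed explicitly.
            yield str_getcsv($line, ',', '"', '\\');
        }
    }
}

$reader = new ArrayReader([
    'iris.csv' => "5.1,3.5,setosa\n6.2,2.9,versicolor\n",
]);

$records = iterator_to_array(new CsvExtractor('iris.csv', $reader), false);
```

Swapping `ArrayReader` for a filesystem- or cloud-backed Reader would leave `CsvExtractor` untouched, which is the decoupling the note above describes.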
DISCUSSION:
Whilst I have given this implementation considerable thought, and gone through numerous iterations prior to opening this PR, I still think there is room for a significant discussion here. The specific implementations here are intended as illustrative, and I'm most interested in feedback on the design itself. I have a number of open questions myself, but think that this is mature enough to invite external feedback and inspection.
- Naming is hard. Whilst I've tried to be as concise as possible, I think there is room to improve the naming of these abstractions! Which is preferable: Datastore? Repository? StorageOperator? StorageEngine? Happy to take suggestions! I would have initially used Backend but obviously didn't want to cause any confusion with the Backend interface used in the parallel processing.
- Future proof? I've tried to make these abstractions generic and multi-purpose. If you can anticipate a future use-case that wouldn't be supported then let's discuss.
- Backwards Compatibility: Have I broken anything? Should I have? Let's talk about the options here.
- Unit Tests: Whilst all current unit tests are passing, I have purposefully neglected to add further tests prior to confirming the design/discussion/direction in this PR.
- Exceptions: The exceptions were added prior to the merge of https://github.com/RubixML/RubixML/pull/116, so they are more a rudimentary demonstration of potential flexibility than an active recommendation of how we should structure an exception hierarchy.
- Pull Request: At the time of writing I have based this PR against 0.3.0, as it contains the most up-to-date 'nightly' code. This is just to minimise the size of the diff.
RELATED:
- https://github.com/RubixML/RubixML/issues/108: How can the persistence subsystem be extended across the codebase? This was my starting point for this PR. I realized that Persisters are actually very good at what they do (serializing objects and storing them), and that backing them with the common storage abstractions (without changing the Persister interface) would be a positive step forwards.
- An early proof-of-concept showing how storage could be made pluggable using Flysystem 1.x: https://github.com/RubixML/RubixML/pull/106.
- A Persister backed by Flysystem 2.x (beta): https://github.com/RubixML/Extras/pull/3
REVISIONS:
2020-10-05:
- Initial publish and request for comments.