PHP implementation for reading and writing Apache Parquet files/streams

Last update: Oct 25, 2022

Overview

php-parquet

This is the first parquet file format reader/writer implementation in PHP, based on the Thrift sources provided by the Apache Foundation. Extensive parts of the code and concepts have been ported from parquet-dotnet (see https://github.com/elastacloud/parquet-dotnet and https://github.com/aloneguid/parquet-dotnet). Therefore, thanks go out to Ivan Gavryliuk (https://github.com/aloneguid).

This package enables you to read and write Parquet files/streams without exotic external extensions (unless you want to use exotic compression methods). Its PHPUnit test suite has (almost?) 100% test compatibility with parquet-dotnet regarding the core functionality.

Important

This repository (and associated package on Packagist) has been decoupled from the first release (https://github.com/Jocoon/php-parquet, Package jocoon/parquet) and re-branded under a future, fresh family name of packages.

Preamble

Some parts of this package required new patterns, as I couldn't find any implementation that met the requirements. In most cases, there weren't any implementations available at all.

Some highlights:

GZIP Stream Wrappers (that also write headers and checksums) for usage with fopen() and similar functions
Snappy Stream Wrappers (Snappy compression algorithm) for usage with fopen() and similar functions
Stream Wrappers that specify/open/wrap a resource id instead of (or in addition to) a file path or URI
TStreamTransport as a TTransport implementation for pure streaming Thrift data

Background

I started developing this library because there was simply no implementation for PHP.

At my company, we needed a quick solution to archive huge amounts of data from a database in a format that is still queryable, extensible from a schema perspective and fault-tolerant. We started testing live 'migrations' via AWS DMS to S3, which ended up crashing on certain amounts of data due to memory limitations. It was also simply too db-oriented, and it was easy to accidentally delete data from previous loads. As we have a heavily SDS-oriented and platform-agnostic architecture, storing data as a 1:1 clone of a database, like a dump, is not my preferred way. Instead, I wanted the ability to store data structured dynamically, the way I wanted, in the same way DMS was exporting to S3. Finally, the project died due to the reasons mentioned above.

But I couldn't get the parquet format out of my head..

The TOP 1 search result (https://stackoverflow.com/questions/44780419/how-to-create-orc-or-parquet-files-from-php-code) suggested that a PHP implementation would not take that much effort - but in fact, it did take some (about 2 weeks of non-consecutive work). For me, as a PHP and C# developer, parquet-dotnet was a perfect starting point - not only because its benchmarks are simply too compelling. But I did not expect the PHP implementation to reach these levels of performance, as this is an initial implementation showing the principle. And additionally, no one had done it before.

Raison d'être

As PHP has a huge share of web-related projects, this is a MUST-HAVE in times of growing need for big data applications and scenarios. For my personal motivation, this is a way to show PHP has (physically, virtually?) surpassed its reputation as a 'scripting language'. I think - or at least I hope - there are people out there that will benefit from this package and the message it transports. Not only Thrift objects. Pun intended.

Requirements

You'll need several extensions to use this library to the full extent.

bcmath (today, this should be a must-have anyway)
gmp (for working with arbitrary large integers - and indirectly huge decimals!)
zlib (for GZIP (de-)compression)
snappy (https://github.com/kjdev/php-ext-snappy - sadly, not published yet to PECL - you'll have to compile it yourself - see Installation)
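
A quick way to see which of the extensions listed above are present in a given PHP runtime is `extension_loaded()`. The following is a minimal sketch; the helper name `missingExtensions()` is hypothetical and not part of this package.

```php
<?php
// Hypothetical helper (not part of php-parquet): return the names of
// the required extensions that are NOT loaded in the current runtime.
function missingExtensions(array $required): array
{
    // array_values() re-indexes the filtered result from 0
    return array_values(array_filter(
        $required,
        static function (string $ext): bool {
            return !extension_loaded($ext);
        }
    ));
}

$missing = missingExtensions(['bcmath', 'gmp', 'zlib', 'snappy']);
echo $missing === []
    ? "All extensions are available\n"
    : 'Missing extensions: ' . implode(', ', $missing) . "\n";
```

If only `snappy` is reported as missing, the package should still be usable as long as no Snappy-compressed data is read or written.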

This library was originally developed for and with PHP 7.3, but it should work on PHP 7 and later and will be tested on 8, when released. At the moment, tests on PHP 7.1 and 7.2 fail due to some DateTime issues. I'll have a look at it. Tests fully pass on PHP 7.3 and 7.4. At the time of writing, 8.0.0 RC2 is also performing well.

This library highly depends on

apache/thrift for working with the Thrift-related objects and data
nelexa/buffer for reading and writing binary data (I decided not to do a C# BinaryWriter clone. UPDATE 2020-11-04: I just did my own clone, see below.)
pear/Math_BigInteger for working with binary stored arbitrary-precision decimals (paradox, I know)

As of v0.2, I've also switched to an implementation-agnostic approach of using readers and writers. Now, we're dealing with BinaryReader(Interface) and BinaryWriter(Interface) implementations that abstract the underlying mechanism. I've noticed mdurrant/php-binary-reader is just way too slow. I just didn't want to refactor everything just to try out Nelexa's reading powers. Instead, I've made those two interfaces mentioned above to abstract various packages delivering binary reading/writing. This finally leads to an optimal way of testing/benchmarking different implementations - and also mixing, e.g. using wapmorgan's package for reading while using Nelexa's for writing.
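
The idea of hiding the concrete binary-reading package behind an interface can be sketched as follows. This is a simplified illustration, not the package's actual code: the names `SketchBinaryReaderInterface`, `SketchStreamReader` and `readLengthPrefixedString()` are hypothetical, and the real interfaces most likely declare more methods.

```php
<?php
// Hypothetical, reduced interface: consumers only depend on this contract.
interface SketchBinaryReaderInterface
{
    public function readInt32(): int;
    public function readString(int $length): string;
}

// One possible implementation on top of a plain PHP stream resource.
final class SketchStreamReader implements SketchBinaryReaderInterface
{
    /** @var resource */
    private $stream;

    public function __construct($stream)
    {
        $this->stream = $stream;
    }

    public function readInt32(): int
    {
        // 'V' unpacks an unsigned 32-bit little-endian integer
        $value = unpack('V', fread($this->stream, 4))[1];
        // fold into the signed 32-bit range (assumes 64-bit PHP)
        return $value >= 0x80000000 ? $value - 0x100000000 : $value;
    }

    public function readString(int $length): string
    {
        // fread() does not accept a zero length, so handle it explicitly
        return $length > 0 ? (string) fread($this->stream, $length) : '';
    }
}

// Consumer code is written against the interface only, so another
// implementation (e.g. one wrapping a 3rd party package) can be swapped in.
function readLengthPrefixedString(SketchBinaryReaderInterface $reader): string
{
    return $reader->readString($reader->readInt32());
}

$stream = fopen('php://memory', 'r+');
fwrite($stream, pack('V', 5) . 'hello');
rewind($stream);
echo readLengthPrefixedString(new SketchStreamReader($stream)), "\n"; // prints "hello"
```

Because the consumer never names a concrete class, benchmarking two implementations against each other only requires constructing a different reader.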

As of v0.2.1 I've done the binary reader/writer implementations myself, as no implementation met the performance requirements. Especially for writing, this ultra-lightweight implementation delivers thrice* the performance of Nelexa's buffer.
* intended, I love this word

Alternative 3rd party binary reading/writing packages in scope:

nelexa/buffer
mdurrant/php-binary-reader (reading only)
wapmorgan/binary-stream

Installation

Install this package via composer, e.g.

composer require codename/parquet

The included Dockerfile gives you an idea of the needed system requirements. The most important step is to clone and install php-ext-snappy. At the time of writing, it has not been published to PECL yet.

...
# NOTE: this is a dockerfile snippet. Bare metal machines will be a little bit different

RUN git clone --recursive --depth=1 https://github.com/kjdev/php-ext-snappy.git \
  && cd php-ext-snappy \
  && phpize \
  && ./configure \
  && make \
  && make install \
  && docker-php-ext-enable snappy \
  && ls -lna

...

Please note: php-ext-snappy is a little bit quirky to compile and install on Windows, so this is just a short information for installation and usage on Linux-based systems. As long as you don't need the snappy compression for reading or writing, you can use php-parquet without compiling it yourself.

Helping tools to make life easier

I've found ParquetViewer (https://github.com/mukunku/ParquetViewer) by Mukunku to be a great way of looking into the data to be read or verifying some stuff on a Windows desktop machine. At least, this helps understanding certain mechanisms, as it more-or-less visually assists by simply displaying the data as a table.

API

Usage is almost the same as parquet-dotnet. Please note, we have no using ( ... ) { }, like in C#. So you have to make sure to close/dispose unused resources yourself or let PHP's GC handle it automatically by its refcounting algorithm. (This is the reason why I don't make use of destructors like parquet-dotnet does.)
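
One way to approximate C#'s `using` in PHP is a small helper built on `try`/`finally`, which closes the stream even if the work in between throws. This is only a sketch under assumptions: `withStream()` is a hypothetical helper, not something this package provides.

```php
<?php
// Hypothetical helper (not part of php-parquet): open a stream, hand it
// to $work and always close it afterwards, even when $work throws.
function withStream(string $path, string $mode, callable $work)
{
    $stream = fopen($path, $mode);
    if ($stream === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        return $work($stream);
    } finally {
        fclose($stream);
    }
}

// Usage with an in-memory stream; 'PAR1' is just a 4-byte payload here.
$length = withStream('php://memory', 'r+', static function ($stream): int {
    fwrite($stream, 'PAR1');
    return ftell($stream);
});
echo $length, "\n"; // prints 4
```

With this package, the callback would be the place to construct the reader or writer; when writing, the `finish()` calls would have to happen inside the callback, before the stream is closed.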

General remarks

As PHP's type system is completely different to C#'s, we have to make some additions on how to handle certain data types. For example, a PHP integer is nullable, somehow. An int in C# isn't. This is a point I'm still unsure how to deal with. For now, I've set int (PHP integer) to be nullable - parquet-dotnet treats it as not-nullable. You can always adjust this behaviour by manually setting ->hasNulls = true; on your DataField. Additionally, php-parquet uses a dual way of determining a type. In PHP, a primitive has its own type (integer, bool, float/double, etc.). For class instances (especially DateTime/DateTimeImmutable), the type returned by gettype() is always object. This is the reason a second property for the DataTypeHandlers exists to match, determine and process it: phpClass.
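
The two-step type lookup can be illustrated with plain PHP. This is a sketch only: `describePhpType()` and the `object:ClassName` key format are invented for illustration and do not reflect how the DataTypeHandlers are implemented.

```php
<?php
// Hypothetical sketch: primitives are identified by gettype() alone,
// while objects additionally need their class name to be told apart.
function describePhpType($value): string
{
    $type = gettype($value); // e.g. "integer", "boolean", "double", "string", "object"
    if ($type === 'object') {
        return 'object:' . get_class($value);
    }
    return $type;
}

echo describePhpType(42), "\n";                      // prints "integer"
echo describePhpType(1.5), "\n";                     // prints "double"
echo describePhpType(new DateTimeImmutable()), "\n"; // prints "object:DateTimeImmutable"
```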

At the time of writing, not every DataType supported by parquet-dotnet is supported here. For example, I've skipped Int16, SignedByte and some more, but it shouldn't be too complicated to extend to full binary compatibility.

At the moment, this library serves the core functionality needed for reading and writing parquet files/streams. It doesn't include parquet-dotnet's Table, Row, Enumerators/helpers from the C# namespace Parquet.Data.Rows.

Reading files

use codename\parquet\ParquetReader;

// open file stream (in this example for reading only)
$fileStream = fopen(__DIR__.'/test.parquet', 'r');

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get file schema (available straight after opening parquet reader)
// however, get only data fields as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate through row groups in this file
for($i = 0; $i < $parquetReader->getRowGroupCount(); $i++)
{
  // create row group reader
  $groupReader = $parquetReader->OpenRowGroupReader($i);
  // read all columns inside each row group (you have an option to read only
  // required columns if you need to).
  $columns = [];
  foreach($dataFields as $field) {
    $columns[] = $groupReader->ReadColumn($field);
  }

  // get first column, for instance
  $firstColumn = $columns[0];

  // $data member, accessible through ->getData() contains an array of column data
  $data = $firstColumn->getData();

  // Print data or do other stuff with it
  print_r($data);
}
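
The reader returns data column by column. If rows are needed instead, the column arrays can be transposed in plain PHP. The sketch below assumes the column data has already been collected into an array keyed by column name (for instance from `getData()`); `columnsToRows()` is a hypothetical helper, not part of this package.

```php
<?php
// Hypothetical helper: turn ['id' => [1, 2], 'city' => ['London', 'Derby']]
// into a list of associative rows. Assumes all columns have equal length.
function columnsToRows(array $columnsByName): array
{
    $rows = [];
    foreach ($columnsByName as $name => $values) {
        // array_values() guarantees 0-based positions as row indexes
        foreach (array_values($values) as $i => $value) {
            $rows[$i][$name] = $value;
        }
    }
    return $rows;
}

print_r(columnsToRows([
    'id'   => [1, 2],
    'city' => ['London', 'Derby'],
]));
```

Note that this materializes every row of a row group in memory, so for large row groups it may be preferable to iterate over the column arrays directly.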

Writing files

use codename\parquet\ParquetWriter;

use codename\parquet\data\Schema;
use codename\parquet\data\DataField;
use codename\parquet\data\DataColumn;

//create data columns with schema metadata and the data you need
$idColumn = new DataColumn(
  DataField::createFromType('id', 'integer'), // NOTE: this is a little bit different to C# due to the type system of PHP
  [ 1, 2 ]
);

$cityColumn = new DataColumn(
  DataField::createFromType('city', 'string'),
  [ "London", "Derby" ]
);

// create file schema
$schema = new Schema([$idColumn->getField(), $cityColumn->getField()]);

// create file handle with w+ flag, to create a new file - if it doesn't exist yet - or truncate, if it exists
$fileStream = fopen(__DIR__.'/test.parquet', 'w+');

$parquetWriter = new ParquetWriter($schema, $fileStream);

// create a new row group in the file
$groupWriter = $parquetWriter->CreateRowGroup();

$groupWriter->WriteColumn($idColumn);
$groupWriter->WriteColumn($cityColumn);

// As we have no 'using' in PHP, I implemented finish() methods
// for ParquetWriter and ParquetRowGroupWriter

$groupWriter->finish();   // finish inner writer(s)
$parquetWriter->finish(); // finish the parquet writer last

Performance

This package also provides the same benchmark as parquet-dotnet. These are the results on my machine:

	Parquet.Net (.NET Core 2.1)	php-parquet (bare metal 7.3)	php-parquet (dockerized* 7.3)	Fastparquet (python)	parquet-mr (Java)
Read	255ms	1'090ms	1'244ms	154ms**	untested
Write (uncompressed)	209ms	1'272ms	1'392ms	237ms**	untested
Write (gzip)	1'945ms	3'314ms	3'695ms	1'737ms**	untested

* Dockerized on a Windows 10 machine with bind-mounts, which slow down most of those high-IOPS processes.
** It seems fastparquet or Python does some internal caching - the original results on first file opening are way worse (~ 2'700ms)

In general, these tests were performed with gzip compression level 6 for php-parquet. The gzip write time roughly halves at level 1 (minimum compression) and almost doubles at level 9 (maximum compression). Note that the latter might not yield the smallest file size, but it always takes the longest compression time.
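
The effect of the compression level can be observed independently of this package with zlib's `gzencode()`. The following sketch uses a synthetic, highly repetitive payload, so the absolute numbers are not comparable to the benchmark above; it only shows how such a measurement could be set up.

```php
<?php
// Compress the same synthetic payload at three gzip levels and report
// the resulting size and the time taken (requires the zlib extension).
$payload = str_repeat("id,city\n1,London\n2,Derby\n", 20000);

foreach ([1, 6, 9] as $level) {
    $start = microtime(true);
    $compressed = gzencode($payload, $level);
    $elapsedMs = (microtime(true) - $start) * 1000;
    printf("level %d: %d bytes, %.2f ms\n", $level, strlen($compressed), $elapsedMs);
}
```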

Coding Style

As this is a partial port of a package from a completely different programming language, the programming style is pretty much a pure mess. I decided to keep most of the casing (e.g. $writer->CreateRowGroup() instead of ->createRowGroup()) to keep a certain 'visual compatibility' to parquet-dotnet. At least, this is a desirable state from my perspective, as it makes comparing and extending much easier during initial development stages.

Acknowledgements

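Some code parts and concepts have been ported from C#/.NET, see parquet-dotnet (https://github.com/aloneguid/parquet-dotnet and https://github.com/elastacloud/parquet-dotnet).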

License

php-parquet is licensed under the MIT license. See file LICENSE.

Contributing

You might do a PR, if you want. Info on how to contribute is coming soon.

Comments

Add support write file custom metadata

Added support for saving custom metadata

See in parquet-dotnet implementation: https://github.com/aloneguid/parquet-dotnet/blob/dd88943c900c7da2afc20f4bbd8466aeeae23100/docs/writing.md#custom-metadata

opened by eisberg 7

Issue when readColumn in Parquet file with large amount of data

Hi, thanks for your great library. It works well with small parquet files, but when I tried to read data from a Parquet file with ~500k rows, the array values returned by ReadColumn ->getData() became incorrect.

Here is my parquet file: https://dev-sc2-pn.s3.ap-northeast-1.amazonaws.com/sc2_area_master+(3).parquet

My parquet file has only 92 rows with project_id = '123456789012345678', but when I get the data from the column via getData(), it returns more than 300k rows with this project_id.

Here is my sample code. Do you have any idea about this issue?

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get file schema (available straight after opening parquet reader)
// however, get only data fields as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate through row groups in this file
for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
	// create row group reader
	$groupReader = $parquetReader->OpenRowGroupReader($i);
	$rowCount = $groupReader->getRowCount();

	// read all columns inside each row group (you have an option to read only
	// required columns if you need to.
	$columns = [];
	foreach ($dataFields as $field) {
		$columns[] = @$groupReader->ReadColumn($field);
	}

	// $data member, accessible through ->getData() contains an array of column data
	$projectIds = $columns[0]->getData();

	dd($columns[0]->getData(0));
}

opened by padi-pm-dungnt 4

Batch Writing - Example

Hi,

Does this support batch writing?

I have 100k+ data in my database and planning to do batch writing to a parquet file. I am thinking of chunking my database 5k of data at a time then performing a write to a parquet file.

I saw it in this link - https://github.com/aloneguid/parquet-dotnet/blob/master/doc/writing.md#appending-to-files that they have an example, but when I tried that on this library I am always getting this error not a Parquet file(tail is '\\\')

This is my code:

public function export($table, $startDate, $endDate)
    {
        $fields = [];
        $columns = [];

        $appendFile = false;

        $fileName = $table . '_' . str_replace(':', '-', $startDate) . '.parquet';
        $filePath = storage_path('/' . $fileName);
        $fileStream = fopen($filePath, 'a+');

        DB::table($table)->whereBetween('created_at', [$startDate, $endDate])->orderBy('id')->chunk(5000, function ($result) use (&$fields, &$columns, &$filePath, $table, $startDate, $endDate, &$parquetWriter, &$groupWriter, &$appendFile, &$fileStream) {
            if ($result) {
                $keyOnly = Arr::first($result);

                foreach ($keyOnly as $key => $value) {
                    $colType = $this->getColumnType($table, $key);
                    $dataColumn = new DataColumn(
                        DataField::createFromType($key, $colType),
                        $result->pluck($key)->toArray()
                    );

                    $columns[] = $dataColumn;
                    $fields[] = $dataColumn->getField();
                }

                $schema = new Schema($fields);

                $parquetWriter = new ParquetWriter($schema, $fileStream, null, $appendFile);

                $appendFile = true;

                // create a new row group in the file
                $groupWriter = $parquetWriter->CreateRowGroup();
                foreach ($columns as $col) {
                    $groupWriter->WriteColumn($col);
                }

                $groupWriter->finish();   // finish inner writer(s)
            }
        });

        $parquetWriter->finish(); // finish the parquet writer last

        return $filePath;
    }

Your help is greatly appreciated! Thanks.

opened by cedricfuturistech 3

Expected parameter of type '\codename\parquet\data\DataField', '\codename\parquet\data\Field' provided
Hi!

In your READ example, you have

$dataFields = $parquetReader->schema->GetDataFields();

and later

foreach($dataFields as $field) { $columns[] = $groupReader->ReadColumn($field); }

However, the return value of the GetDataFields() function is Field[], while the first parameter for the ReadColumn function is a DataField.
opened by darthf1 3

Reading Row Groups

Hey, based on this library, I'm trying to implement a parquet adapter for Flow PHP. I started by writing a few tests (row groups are that small just for testing purposes); you can find the code below:

Code Example

<?php

use codename\parquet\data\DataColumn;
use codename\parquet\data\DataField;
use codename\parquet\data\Schema;
use codename\parquet\ParquetWriter;

require_once __DIR__ . '/../vendor/autoload.php';

$id = DataField::createFromType('id', 'integer');

$schema = new Schema([$id]);

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'w+'), null, false);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [1, 2, 3, 4]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [5, 6, 7, 8]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [9, 10, 11, 12]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [13, 14, 15, 16]));
$rowGroup->finish();
$writer->finish();

But when I tried to read that using parquet-tools I'm getting following error:

parquet-tools cat --json test.parquet
{"id":1}
{"id":2}
{"id":3}
{"id":4}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Socket is closed by peer.

I also tried to check the file content through avro parquet viewer and I'm getting this:

Any idea what might be wrong here? If you could point me in the right direction, I can debug this issue further because I'm not that familiar with parquet format so any help is welcome.

Thanks for all your work to make parquet available in PHP!

opened by norberttech 2

Issue when trying to get empty string value from parquet file
Hi there, thank you for your library. It works fine, but I had a problem when trying to retrieve data from a file where there were empty values. The problem itself occurs in the CustomBinaryReader file in the readString function. If you add a check for length and default values, everything will work fine

/**
 * @inheritDoc
 */
public function readString($length)
{
  $this->position += $length;
  // ?
  return $length ? fread($this->stream, $length) : '';
}
opened by kleve-r 2
Unable to read RLE_DICTIONARY encoded columns

For a detailed description of the problem, see the .net repository: https://github.com/aloneguid/parquet-dotnet/issues/107

About Encodings: https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

opened by eisberg 1

ParquetReader - Get Value

Hi,

thanks for this package - I'm currently testing different parquet library in different languages to check which one could be our replacement for the current flask implementation.

It seems that your package has no problems with reading our packages ( the TS Parquet Package seems to support only Parquet 2.0 ).

I created quickly a laravel app to test it ( i have some other ideas, but the main feature should work before I start the developing ;) )

The snippet is the following:

$parquetPath = Storage::path('path/to/parquetfile.parquet');

    $parquetStream = fopen($parquetPath, 'r');

    $parquetReader = new ParquetReader($parquetStream);

    $dataFields = $parquetReader->schema->GetDataFields();

    $result = [];

    for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
        // create row group reader
        $groupReader = $parquetReader->OpenRowGroupReader($i);
        // read all columns inside each row group (you have an option to read only
        // required columns if you need to.
        $columns = [];
        foreach ($dataFields as $field) {
            $column = $groupReader->ReadColumn($field);
            $columns[$column->getField()->name] = $column->getData();
        }

        $result[] = $columns;
    }

    dd($result);

$result shows me the correct columns, but the column value which I got via getData is always the binary.

I checked the code, but wasn't able to find the relevant part to convert it back to the readable value.

Maybe you could give me a hint how to do this.

Thanks!

opened by noxify 1

When trying to decode an empty string with StringDataTypeHandler->plainDecode, getting an error from CustomBinaryReader->readString
I'm getting an exception when StringDataTypeHandler::plainDecode receives an empty string ("") in the $encoded parameter.

public function plainDecode(
  \codename\parquet\format\SchemaElement $tse,
  $encoded
) {
  if ($encoded === null) return null;
  $ms = fopen('php://memory', 'r+');
  fwrite($ms, $encoded);
  $br = BinaryReader::createInstance($ms);
  $element = $this->readSingleInternal($br, $tse, -1, false);
  return $element;
}

The current validation only checks for null, but the encoded value can also be an empty string. I think the resolution is to check for an empty value, not only for null.
opened by michael-0-1 2
