This is a simple, streaming parser for processing large JSON documents

Overview

Streaming JSON parser for PHP

This is a simple, streaming parser for processing large JSON documents. Use it for parsing very large JSON documents to avoid loading the entire thing into memory, which is how just about every other JSON parser for PHP works.

For more details, I've written up a longer explanation of the JSON streaming parser that talks about pros and cons vs. the standard PHP JSON parser.

If you've ever used a SAX parser for XML (or even JSON) in another language, that's what this is. Except for JSON in PHP.

This package is compliant with PSR-4, PSR-1, and PSR-2. If you notice compliance oversights, please send a patch via pull request.

Installation

To install JsonStreamingParser, either clone this repository or use Composer:

composer require salsify/json-streaming-parser

Usage

To use the JsonStreamingParser you just have to implement the \JsonStreamingParser\Listener interface. You then pass your Listener into the parser.

For example:

$stream = fopen('doc.json', 'r');
$listener = new YourListener();
try {
  $parser = new \JsonStreamingParser\Parser($stream, $listener);
  $parser->parse();
  fclose($stream);
} catch (Exception $e) {
  fclose($stream);
  throw $e;
}

That's it! Your Listener will receive events from the streaming parser as it works.

There is a complete example of this in example/example.php.

Running tests

make test

Projects using this library

JSON Collection Parser

JSON Objects

License

MIT License (c) Salsify, Inc.

Comments
  • Lack of namespaces / PSR-4 compliance

    Please make sure all of your classes have proper namespaces so that they may be safely used in a 3rd-party project. I would highly recommend adhering to the PSR-4 autoloading standard: http://www.php-fig.org/psr/psr-4/

    opened by hackel 13
  • Small performance additions (may be controversial)

    This changeset contains two changes, one of which may be disputed. First, I've removed all regex functions. This gave a very small improvement, almost negligible, but it's something.

    Secondly, I made the file_position method optional on the listener. This is probably a more controversial change, but the performance benefit is 15-20%. If it was removed entirely, the performance benefit would be between 20-25%. Unfortunately, if you actually rely on file_position, this change set has a performance penalty compared to master. For me this is a no-brainer, but assume others will have some opinions.

    Put differently: Parsing a 500MB file with ~1.2M entries, I now save 10 minutes.

    I did 10 test runs (for run in {1..10}; do php tests/performance.php; done) of the perf branch vs master, here are the results:

    | phaza/perf | salsify/master | gain |
    | --- | --- | --- |
    | 9.107668877 | 10.67672706 | 14.70% |
    | 8.819965124 | 11.25982618 | 21.67% |
    | 9.094146967 | 10.73865008 | 15.31% |
    | 8.585276127 | 10.50739408 | 18.29% |
    | 9.22768712 | 10.99455309 | 16.07% |
    | 9.032989979 | 10.81993604 | 16.52% |
    | 8.707053185 | 10.64168501 | 18.18% |
    | 9.159993887 | 10.44938397 | 12.34% |
    | 8.928499937 | 10.909729 | 18.16% |
    | 8.877639771 | 11.15229297 | 20.40% |
    | ----------------- | ------------------ | ---------- |
    | 8.954092097 | 10.81501775 | 17.16% |

    opened by phaza 10
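The gain column in the table above appears to be (master - perf) / master for each run, with the final row averaging each column. That formula is inferred from the numbers rather than stated by the author. A minimal sketch that reproduces the arithmetic from the posted timings (the function name `gainPercent` is made up for this illustration):

```php
<?php

// Timings in seconds copied from the table: [phaza/perf, salsify/master].
$runs = [
    [9.107668877, 10.67672706],
    [8.819965124, 11.25982618],
    [9.094146967, 10.73865008],
    [8.585276127, 10.50739408],
    [9.22768712, 10.99455309],
    [9.032989979, 10.81993604],
    [8.707053185, 10.64168501],
    [9.159993887, 10.44938397],
    [8.928499937, 10.909729],
    [8.877639771, 11.15229297],
];

// Relative time saved by the perf branch compared with master, in percent.
// Assumed formula: it matches the posted figures but is not given in the issue.
function gainPercent(float $perf, float $master): float
{
    return ($master - $perf) / $master * 100;
}

$gains = array_map(
    function (array $run): float {
        return gainPercent($run[0], $run[1]);
    },
    $runs
);

// The last table row looks like the mean of the per-run gains.
$meanGain = array_sum($gains) / count($gains);

printf("first run: %.2f%%, mean of per-run gains: %.2f%%\n", $gains[0], $meanGain);
```

Averaging the per-run gains gives a slightly different figure than computing the gain from the two averaged timings, which is worth keeping in mind when comparing against other benchmarks.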
  • Listener can't call stop() ?

    I might be missing something here, but it looks like the listener class can't call the parser's stop(), because it doesn't have a parser object. The perfect place to introduce them would be in Parser::__construct(), where both objects are ready for use.

    setParser(Parser $parser) would be optimal, but that would break bc for classes implementing the interface directly.

    What I did to 'fix' this, is throw and handle a ParsingStoppedException, but that seems silly since there is a stop() method...

    opened by rudiedirkx 9
  • Composer question

    I know that this is a stupid question, but how do I install JsonStreamingParser via Composer? I've tried "php-streaming-json-parser": "4.*" and "JsonStreamingParser" and whatnot.

    opened by gorankrgovic 8
  • Nested objects in a nested array - key disappears

    To reproduce the behavior, modify example.json to :

    [
      {
        "name": "example document for wicked fast parsing of huge json docs",
        "integer": 123,
        "totally sweet scientific notation": -123.123e-2,
        "unicode? you betcha!": "ú™£¢∞§\u2665",
        "zero character": "0",
        "null is boring": null,
        //this is the important part
        "lines": [
          { "test" : 1, "test2" : 2},
          { "test" : 1, "test2" : 2}
        ]
      },
      ...
    ]
    

    Using the example.php script, it'll produce:

      [0]=>
      array(7) {
        ...
        [0]=> //this key should be "lines" not "0"
        array(2) {
          [0]=>
          array(2) {
            ["test"]=>
            int(1)
            ["test2"]=>
            int(2)
          }
          [1]=>
          array(2) {
            ["test"]=>
            int(1)
            ["test2"]=>
            int(2)
          }
        }
      }
    

    I first thought it was because of the start_object method, but it's not. I haven't looked into the core script to see if I could fix this, but if someone has a clue it would be great! Is it a bug, or is this the intended behavior?

    opened by soyuka 8
  • Reading a 36 MB file fails

    I'm trying to read a 36 MB file and it fails; it takes too much time. By comparison, this is much faster than the jsonstreamingparser code given:

    $str = file_get_contents('./JString.json');
    $arr = json_decode($str, true); // decode the JSON into an associative array

    opened by ptl07 6
  • Pledge allegiance to semantic versioning :)

    Entirely up to the project owner, but it helps when reasoning about dependencies and maintenance if you know that a given dependency adheres to a particular versioning standard. The current tags show v1.0, v2.0, v3.0, v4.0 but I'm not sure if they represent breaking changes or just updates. I'm also keen to use the Name\Spaced\Update on master but not sure if master is really stable...

    http://semver.org/

    opened by robations 6
  • Still getting memory errors

    Can you provide better examples on how to use this? PHP still returns memory issues:

    Allowed memory size of 134217728 bytes exhausted

    The examples are lacking. All it shows is a way to read the entire file into memory, which is what I thought this code is supposed to be avoiding.

    opened by t7m 5
  • Provide a useful object reader in "core"

    Currently jsonstreamingparser is not a ready-to-use solution. It provides the following readers (listeners):

    • GeoJsonListener.php - reads GeoJSON
    • IdleListener.php - reads nothing
    • InMemoryListener.php - reads all in memory
    • SubsetConsumerListener.php - didn't get what it does as no description was provided.

    None of these can be used with arbitrary JSON data. Maybe that's because the author didn't find any really useful and generic reader (except for GeoJSON, which is certainly not generic)? That could be the case, yeah, and if we look at the project README, we clearly read the message "simply implement your Listener and it's done!". That easy!

    Is it? Well guys, I've tried to implement a so-called Listener and I can tell you: IT'S BLOODY DIFFICULT, cumbersome and time consuming. Honestly, I don't see any point in implementing something more difficult than just a reader of objects going one after another. Because if you really want to preprocess JSON, you'd definitely use a specialized tool like jq.

    So yeah, the generic case exists and it's called: reader of top-level objects.

    By good fortune it is already implemented here: https://github.com/MAXakaWIZARD/JsonCollectionParser

    My proposition is to include it in the list of "standard" readers (listeners).

    Also, let's:

    • consider not including its Parser as it doesn't seem to be generic;
    • finish the work started here: #60 (add support for a stream of json documents);
    • finish the work started here (the same support for document streams, also objects, objects in subarrays etc)

    Let's summon @MAXakaWIZARD for the discussion.

    opened by OnkelTem 5
  • Parse of 0x7F byte in strings

    Hi,

    I am consuming some third-party JSON which does not escape \x7F byte in strings.

    src/Parser.php#L218 considers this invalid, but I think other libraries do not.

    Is there any more information about this issue?

    opened by mappu 5
  • Error during import of UTF8 files.

    When we import a UTF-8 file, fread starts by reading the file's BOM signature, and parsing the stream then fails with the error "Document must start with object or array".

    I believe the problem is fully answered here and should be a fairly easy fix :smile:

    https://stackoverflow.com/questions/9126423/php-fread-function-returning-extra-charactors-at-the-front-on-utf-8-text-files

    Thanks in advance

    opened by rsanaie 5
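One way to work around the BOM problem described above, without changing the parser, is to advance the stream past the three BOM bytes before handing it over. The helper below is a hypothetical sketch (`skipUtf8Bom` is not part of the library) and assumes a seekable stream such as a regular file:

```php
<?php

/**
 * Advance a seekable stream past a leading UTF-8 BOM (EF BB BF), if present.
 * Hypothetical helper for illustration; not part of JsonStreamingParser.
 *
 * @param resource $stream
 * @return bool true when a BOM was found and skipped
 */
function skipUtf8Bom($stream): bool
{
    $start = ftell($stream);
    $head = fread($stream, 3);

    if ($head === "\xEF\xBB\xBF") {
        // The stream now points at the first real character of the document.
        return true;
    }

    // No BOM: seek back so that no document bytes are lost.
    fseek($stream, $start);
    return false;
}

// Demo using an in-memory stream in place of a file on disk.
$stream = fopen('php://memory', 'w+');
fwrite($stream, "\xEF\xBB\xBF" . '{"a":1}');
rewind($stream);

$skipped = skipUtf8Bom($stream);
$rest = stream_get_contents($stream); // what a parser would see next
fclose($stream);
```

After the call, the same stream resource can be passed to the parser as in the usage example. For non-seekable streams the three bytes would have to be buffered and re-fed instead, which this sketch does not attempt.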
  • Needs better explanation / Read.me

    Hi, thank you for this awesome tool; it was extremely helpful for parsing huge GeoJSON data. However, the explanation of the really important part is totally missing, and it was difficult to find out what's going on. The same happened when I went to your blog post: when I was looking for how to implement the listener, the blog post just says "It's pretty straightforward." It is, but not for somebody who gets here for the first time :)

    Here's my code snippet using the GeoJSON Listener, for other users who need a better example:

    <?php
    
      ini_set('max_execution_time', 0);
      ini_set('memory_limit', -1);
    
      $tmp_file_name = 'geojson.json';
    
      $stream = fopen('/tmp/'.$tmp_file_name, 'r');
    
      $listener = new \JsonStreamingParser\Listener\GeoJsonListener(function($json){
          if(!empty($json)){
              //var_dump($json['properties']);
              ////////////////// Do WHATEVER YOU WANT WITH THE FEATURE //////////////////
    
              //Storing in db
              $data = [
                  'title'     => !empty($json['properties']['shapeName']) ? $json['properties']['shapeName'] :$json['properties']['shapeISO'],
                  'iso'     => $json['properties']['shapeISO'],
                  'gb_id'     => $json['properties']['shapeID'],
                  'country_iso'     => $json['properties']['shapeGroup'],
                  'admin_level'     => str_replace('ADM', '', $json['properties']['shapeType']),
              ];
    
              $json['properties'] = $data;
    
              $data['geometry'] = json_encode($json);
    
              try{
                   //Mock DB insert
                  DB::table('geoboundaries')->insert($data);
              }catch (\Exception $exception){
                  dd(substr($exception->getMessage(), 0, 300));
              }
    
    
    
          }
    
      });
      try {
          $parser = new \JsonStreamingParser\Parser($stream, $listener);
          $parser->parse();
    
          fclose($stream);
          return true;
      } catch (Exception $e) {
          fclose($stream);
          throw $e;
      }
    
    
    opened by csimpi 1
  • resuming after `stop()` is broken

    So I'm trying to write a wrapper over this lib that would be used as a generator. Basically I want to yield all json-objects I need from a stream.

    The lib does not use yields or generators, so this is what I came up with:

    First, a simple listener for this example: let's collect all values from "foo" fields, given that they are all scalar.

    class FooJsonListener extends \JsonStreamingParser\Listener\IdleListener {
        protected $isReadingFoo = false;
        protected $onFooCallback;
        
        public function onFoo(callable $onFooCallback)
        {
            $this->onFooCallback = $onFooCallback;
        }
    
        public function key(string $key): void
        {
            $this->isReadingFoo = ($key === 'foo');
        }
    
        public function value($value): void
        {
            if ($this->isReadingFoo) {
                call_user_func($this->onFooCallback, $value);
                $this->isReadingFoo = false;
            }
        }
    }
    

    Now this function starts parser, then stops every time we collected a foo-value, yields it, then resumes parsing again:

    function fooIterator($stream) {
        $jsonListener = new FooJsonListener();
        $jsonParser = new JsonStreamingParser\Parser($stream, $jsonListener);
    
        $lastFoo = null;
        $shouldYieldFoo = false;
        $jsonListener->onFoo(function ($foo) use (&$lastFoo, &$shouldYieldFoo, $jsonParser) {
            $lastFoo = $foo;
            $shouldYieldFoo = true;
            $jsonParser->stop();
        });
    
        while (true) {
            $jsonParser->parse();
            if ($shouldYieldFoo) {
                yield $lastFoo;
                $shouldYieldFoo = false;
            } else {
                break;
            }
        }
    }
    

    Looks awful, but should work. But it does not. After the first yield, next parse() will throw:

    JsonStreamingParser\Exception\ParsingException: "Parsing error in [1:1]. Expected ',' or ']' while parsing array. Got: {"
    

    It works if we don't interrupt it with stop(), though.

    The problem: parse() reads the stream in chunks but parses character by character. Any character may trigger a call to the listener to register a parsed token, and after every character the parser checks the stopParsing flag raised by stop(). So in most cases, calling stop() from the listener interrupts parsing in the middle of a chunk; the leftover of the chunk is discarded, and the next parse() call proceeds from the next chunk, not from where parsing stopped. In addition, stopParsing is never reset to false.

    Possible fixes:

    a. Store the actual count of bytes read on stop(), and fseek to this offset on the next parse() (this would not work with non-seekable streams, though).
    b. Better: store the leftover of the chunk on stop() and prepend it to the data read on the next parse().

    Sadly, neither of these can be implemented by extending the classes, because the relevant members are private.

    opened by klkvsk 3
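Fix (b) from the issue above can be illustrated outside the library. The class below is a hypothetical stand-in for a chunked, character-by-character loop (`ResumableScanner` and its method names are invented for this sketch and are not the library's Parser). It keeps the unread remainder of the current chunk when stopped, and resets the stop flag on every restart:

```php
<?php

/**
 * Hypothetical chunked scanner showing stop()/resume with a leftover buffer.
 * It is NOT JsonStreamingParser\Parser; it only demonstrates the technique.
 */
final class ResumableScanner
{
    /** @var resource */
    private $stream;
    /** @var callable receives (string $char, ResumableScanner $scanner) */
    private $onChar;
    private int $chunkSize;
    private string $leftover = '';
    private bool $stopRequested = false;

    public function __construct($stream, callable $onChar, int $chunkSize = 8192)
    {
        $this->stream = $stream;
        $this->onChar = $onChar;
        $this->chunkSize = $chunkSize;
    }

    public function stop(): void
    {
        $this->stopRequested = true;
    }

    /** Returns true when the stream is exhausted, false when stopped early. */
    public function scan(): bool
    {
        $this->stopRequested = false; // reset the flag on every (re)start

        while (true) {
            if ($this->leftover !== '') {
                // Resume with the part of the previous chunk not yet processed.
                $chunk = $this->leftover;
                $this->leftover = '';
            } else {
                $chunk = fread($this->stream, $this->chunkSize);
                if ($chunk === false || $chunk === '') {
                    return true; // simplification: treat an empty read as end of input
                }
            }

            $length = strlen($chunk);
            for ($i = 0; $i < $length; $i++) {
                ($this->onChar)($chunk[$i], $this);
                if ($this->stopRequested) {
                    // Keep the unread remainder instead of discarding it.
                    $this->leftover = (string) substr($chunk, $i + 1);
                    return false;
                }
            }
        }
    }
}

// Demo: pause at every '|' with a chunk size that splits the input mid-token.
$stream = fopen('php://memory', 'w+');
fwrite($stream, 'abc|defghij|klm');
rewind($stream);

$current = '';
$scanner = new ResumableScanner(
    $stream,
    function (string $char, ResumableScanner $scanner) use (&$current): void {
        if ($char === '|') {
            $scanner->stop();
            return;
        }
        $current .= $char;
    },
    5
);

$pieces = [];
do {
    $current = '';
    $finished = $scanner->scan();
    $pieces[] = $current;
} while (!$finished);
fclose($stream);
```

Because nothing is re-read from the stream, this approach also works for non-seekable streams, which is the advantage over fix (a).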
  • Code creates a symlink to phpunit, but phpunit is missing

    When installing this code with Composer for Magento 2, it creates a symlink to phpunit at vendor/bin/phpunit. But since phpunit is not required in composer.json, it doesn't get installed, which leaves a broken symlink for Magento too.

    opened by mckellip 0
  • Need more explanation on how to use the library without composer

    I get this error after installing the library with composer:

    PHP Fatal error:  Uncaught Error: Class 'JsonStreamingParser\Listener\InMemoryListener' not found in /data/www/default/database/json_to_sql.php:14
    Stack trace:
    #0 {main}
      thrown in /data/www/default/database/json_to_sql.php on line 14
    

    json_to_sql.php

    require_once 'vendor/autoload.php';
    $json_file = dirname(__FILE__) . "file.json";
    
    # Load json data from the file into a variable
    Line 14 -> $listener = new \JsonStreamingParser\Listener\InMemoryListener();
    
    $stream = fopen($json_file, 'r');
    try {
        $parser = new \JsonStreamingParser\Parser($stream, $listener);
        $parser->parse();
        fclose($stream);
    } catch (Exception $e) {
        fclose($stream);
        throw $e;
    }
    
    $json_data = $listener->getJson();
    var_dump($json_data);
    

    I've installed the library with downloading the composer.json file and doing php composer.phar install.

    $ php composer.phar diagnose                                              
    Checking composer.json: OK
    Checking platform settings: OK
    Checking git settings: 
    
      [Symfony\Component\Process\Exception\RuntimeException]
      The Process class relies on proc_open, which is not available on your PHP installation.
    
    diagnose
    

    When I clone the repository it gives me the same error. I don't know if I have to do an include or something, there is no example.

    opened by ghost 1
  • No exception when json ends unexpectedly

    The parser doesn't give an error when the input suddenly stops. The current behavior results in "nothing more to process" and you think you are done. When it happens halfway through one of the blocks, that block simply disappears.

    This is easy to reproduce with several unit tests by cutting off a bit at the end of the test files. If you simply remove the "}" at the end of the file, the tests still pass. The tests only start to fail when you remove the "}" at level 2.

    The Parser should throw an exception when the json is not correctly closed, as this might be an indication you are missing part of a (potentially big) file.

    opened by doppynl 0
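The check requested above amounts to tracking how many objects and arrays are still open when the input ends. The function below is a standalone sketch of that idea under simplifying assumptions: `assertJsonIsClosed` and `streamFromString` are hypothetical names, the code is not taken from the library, and it checks only nesting and string termination rather than full JSON validity:

```php
<?php

/**
 * Scan a stream and throw if it ends while an object, array, or string
 * is still open. A structural check only, not a full JSON validator.
 *
 * @param resource $stream
 */
function assertJsonIsClosed($stream, int $chunkSize = 8192): void
{
    $depth = 0;        // number of currently open objects/arrays
    $inString = false; // inside a double-quoted string?
    $escaped = false;  // previous character was a backslash inside a string?

    while (($chunk = fread($stream, $chunkSize)) !== false && $chunk !== '') {
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];

            if ($inString) {
                if ($escaped) {
                    $escaped = false;
                } elseif ($char === '\\') {
                    $escaped = true;
                } elseif ($char === '"') {
                    $inString = false;
                }
                continue; // brackets inside strings do not affect depth
            }

            if ($char === '"') {
                $inString = true;
            } elseif ($char === '{' || $char === '[') {
                $depth++;
            } elseif ($char === '}' || $char === ']') {
                $depth--;
            }
        }
    }

    if ($inString || $depth > 0) {
        throw new RuntimeException(
            "Unexpected end of input: $depth container(s) still open"
        );
    }
}

/** Small helper for the demo: wrap a string in an in-memory stream. */
function streamFromString(string $json)
{
    $stream = fopen('php://memory', 'w+');
    fwrite($stream, $json);
    rewind($stream);
    return $stream;
}

// A complete document passes silently.
assertJsonIsClosed(streamFromString('{"items":[1,2,{"a":"}"}]}'));

// The same document with the final "}" cut off is reported.
$truncatedDetected = false;
try {
    assertJsonIsClosed(streamFromString('{"items":[1,2,{"a":"}"}]'));
} catch (RuntimeException $e) {
    $truncatedDetected = true;
}
```

In a streaming parser the same depth counter would already exist as part of the state machine, so the equivalent check would run once at end of input rather than as a separate pass over the stream.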
Releases(v8.2.0)
Owner
Salsify