ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Last update: May 23, 2022



It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees and also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.

Given day-job constraints, my time working on this project is limited so I'll have to focus first on fixing bugs rather than changing/improving the feature set. Likely I'll do it in bursts every few months. Please do not be offended if your bug or pull request does not yield a response! --parrt


Authors and major contributors

Useful information

You might also find the following pages useful, particularly if you want to mess around with the various target languages.

The Definitive ANTLR 4 Reference

Programmers run into parsing problems all the time. Whether it’s a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language—ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features.

You can buy the book The Definitive ANTLR 4 Reference at amazon or an electronic version at the publisher's site.

You will find the Book source code useful.

Additional grammars

This repository is a collection of grammars without actions where the root directory name is the all-lowercase name of the language parsed by the grammar. For example, java, cpp, csharp, c, etc...

  • 1. New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    Fixes #276 .

    This used to be a WIP PR, but it's now ready for review.

    This PR introduces a new extended Unicode escape \u{10ABCD} in ANTLR4 grammars to support Unicode literal values > U+FFFF.

    The serialized ATN represents any atom or range with a Unicode value > U+FFFF as a set. Any such set is serialized in the ATN with 32-bit arguments.

    I bumped the UUID, since this changes the serialized ATN format.

    I included lots of tests and made sure everything is passing on Linux, Mac, and Windows.
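For background (my summary, not part of the PR): Java's `\uXXXX` escape and UTF-16 chars can only name code points up to U+FFFF; anything larger occupies a surrogate pair, which is why grammars need the new `\u{...}` form for literals like U+10ABCD. A small self-contained Java sketch:

```java
// Demonstrates why a 16-bit \uXXXX escape cannot express code points
// beyond U+FFFF: they occupy two UTF-16 code units (a surrogate pair).
public class SupplementaryCodePoint {
    public static void main(String[] args) {
        int cp = 0x10ABCD; // the example code point from this PR (valid: <= U+10FFFF)
        String s = new String(Character.toChars(cp));
        System.out.println(s.length());                      // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true
    }
}
```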

    Reviewed by bhamiltoncx at 2017-01-27 00:59
  • 2. splitting version numbers for targets

    Hiya: @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos

    Eric has raised the point that it would be nice to be able to make quick patches to the various runtimes; e.g., there is currently a blocking bug in the JavaScript target. He proposes something along these lines:

    • any change in the tool or the runtime algorithm bumps the middle version number: 4.9 -> 4.10 -> 4.11
    • any bug fix in a runtime bumps the last digit of that runtime only: 4.9 -> 4.9.1 -> 4.9.2
    • a bug-fix bump of the Java runtime also bumps the tool, since the tool contains that runtime

    This is not optimal, as people have criticized me in the past for bumping, say, 4.6 to 4.7 for minor changes. It also has the problem that 4.9.x may not mean the same thing in two different targets, as each target will now have its own version number.

    Rather than break up all of the targets into separate repositories or similar, can you guys think of a better solution? Any suggestions? The goal here is to allow more rapid target releases, independent of my having to do a major release of the tool.

    Reviewed by parrt at 2020-12-18 22:04
  • 3. Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses an 8-bit buffer (for Unicode code points <= U+00FF), a 16-bit buffer (for code points <= U+FFFF), or a 32-bit buffer (for code points > U+FFFF).

    I split the internal storage into a CodePointBuffer class, whose CodePointBuffer.Builder contains the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

    I found the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

    Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

    To fix the perf, I made CodePointCharStream an abstract class and made three concrete subclasses: CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream which directly access the array of underlying code points in the CodePointBuffer without virtual method calls.
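    The width-upgrade idea can be sketched as follows; this is a simplified illustration, not the actual CodePointBuffer.Builder API (requiredWidth is a hypothetical helper):

```java
// Simplified sketch: pick the narrowest storage width that can hold
// every code point, mirroring the 8-bit -> 16-bit -> 32-bit upgrade
// described above. Names are illustrative, not the real ANTLR API.
public class CodePointStorage {
    /** Returns 8, 16, or 32: the buffer width needed for these code points. */
    static int requiredWidth(int[] codePoints) {
        int width = 8;
        for (int cp : codePoints) {
            if (cp > 0xFFFF) return 32; // supplementary plane: need 32 bits
            if (cp > 0xFF) width = 16;  // BMP beyond Latin-1: need at least 16
        }
        return width;
    }

    public static void main(String[] args) {
        System.out.println(requiredWidth(new int[] {'a', 'b'}));     // 8
        System.out.println(requiredWidth(new int[] {'a', 0x0416}));  // 16 (Cyrillic)
        System.out.println(requiredWidth(new int[] {'a', 0x1F600})); // 32 (emoji)
    }
}
```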

    Reviewed by bhamiltoncx at 2017-03-22 01:09
  • 4. initial discussion to start integration of new targets

    As promised, I am now ready to integrate the new ANTLR target languages you folks have been working on. This issue is meant to get everybody in sync, check status, discuss the proper order of integration, and resolve issues.

    There are two administrative details to get out of the way first:

    1. Please let me know if there is another github user that should be added to one of the categories. Or, of course, if you would like your user ID removed from this discussion.
    2. Nothing can be merged into antlr/antlr4 unless every single committer has added themselves to the contributors.txt file. It's onerous, particularly for simple commits, but it is a requirement for anything merged into master. Eclipse Foundation lawyers tell me that we have one of the cleanest licenses out there, and it contributes to ANTLR's widespread use because companies are not afraid to use the software. See the genesis of such heinous requirements in SCO v. IBM. This means lead target authors have to go back through their committer lists quickly and ask them to sign the contributors file with a new commit. Alternatively, they can remove that commit and write their own version of the functionality, being careful not to violate copyright on the previous one.

    As we proceed, please keep in mind that I have a difficult role, balancing the needs of multiple targets and keeping discussions in the civil and practical zone. Decisions I make come from the perspective of over 25 years managing and leading this project. I look forward to incorporating your hard work into the main antlr repo.

    C++ current location

    • @mike-lischke
    • @DanMcLaughlin
    • @nburles
    • @davesisson

    Go current location, previous discussion

    • @pboyer

    Swift current location: unclear, previous discussion

    • @jeffreyguenther
    • @hanjoes
    • @janyou
    • @ewanmellor

    Likely interested/supporting humans (scraped from github issues):

    • @RYDB3RG
    • @wjkohnen
    • @willfaught
    • @parrt
    • @sharwell
    • @ericvergnaud
    Reviewed by parrt at 2016-11-04 13:18
  • 5. Add a new CharStream that converts the symbols to upper or lower case.

    This is useful for many of the case-insensitive grammars found at , which assume the input is all upper or lower case. Related discussion can be found at .

    It would be used like so:

    is, _ := antlr.NewFileStream("filename")
    in := antlr.NewCaseChangingStream(is, true) // true forces upper-case symbols; false forces lower case
    lexer := parser.NewAbnfLexer(in)

    While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.

    I would love for the grammar to have an option saying the lexer should upper/lower-case all input; then this code could be moved into the generated lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).

    Reviewed by bramp at 2017-10-06 14:56
  • 6. Swift Target

    I did a quick search and I didn't see anything written about this yet. What's the likelihood of a Swift target for ANTLR?

    There are C#, JavaScript, and Python targets at the moment.

    What does it take to implement a target? Given that Swift is fairly Java-like, it seems like it should be possible. Maybe start with a code translator, if one exists for Java to Swift, and iterate toward a more idiomatic implementation.

    Reviewed by jeffreyguenther at 2015-06-30 21:09
  • 7. Clean up ATN serialization: rm UUID and shifting by value of 2

    • I think we don't need the UUID in the serialization, since it has not changed in a decade. We can bump the version number and remove the UUID.
    • I did some tests, and there seems to be no reason to shift the values in the serialized ATN by 2 for the purposes of improving the UTF-8 encoding for the Java target.

    If you guys agree, we can make this small change for cleanup purposes. I'm happy to do it if you guys don't want to. The second fix will require changes to each target, but they're trivial.
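    For background on the shift-by-2 trick (my understanding, not stated in the issue): the serialized ATN is embedded in generated Java code as a string, and UTF-8 encodes a char below U+0080 in one byte but a char at U+0800 or above in three, so remapping values toward the low range was meant to shrink the encoded form. A quick sketch of the per-char sizes:

```java
import java.nio.charset.StandardCharsets;

// Shows UTF-8 encoded sizes of single chars: the motivation for shifting
// serialized ATN values was that high char values cost 3 bytes each.
public class Utf8Sizes {
    static int utf8Bytes(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes((char) 0x0001)); // 1 byte
        System.out.println(utf8Bytes((char) 0x07FF)); // 2 bytes
        System.out.println(utf8Bytes((char) 0xFFFE)); // 3 bytes
    }
}
```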

    Reviewed by parrt at 2022-01-29 23:11
  • 8. Preparing for 4.9.3 release

    It's that time of year again! @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos Shall we do a 4.9.3 release?

    I went through and marked all of the merged PRs and related issues with 4.9.3 and tried to tag them according to their target. Would you guys like to go through the PRs to see if there's something that should be merged quickly?

    Reviewed by parrt at 2021-10-11 18:27
  • 9. [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    @ericvergnaud Hi, I modified the csproj options a bit; now I can get a working NuGet package locally without the issue we described in #2021. I added .NET 3.5 as a target to the "main" csproj along with netstandard, since it's easier to keep track of the requirements for both sets of APIs when editing code, and, ideally, both targets can be packed into a NuGet package with a single command. Right now that's possible only on Windows via msbuild /t:pack or Visual Studio; unfortunately, dotnet build/pack currently does not work for the .NET 3.5 target the way it should, so I adjusted the existing script to create packages from .nuspec and separate solutions for the different targets.

    Reviewed by listerenko at 2017-09-25 06:32
  • 10. A few updates to the Unicode documentation.

    It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream classes extended to support the full Unicode range.

    Reviewed by mike-lischke at 2017-03-18 12:02
  • 11. Replace edge representation in DFAState class with a fast hashmap (Java runtime).

    Note: PR comment is edited to reflect latest changes.

    This change fixes lexer performance degradation on non-ASCII inputs.

    @parrt @sharwell @bhamiltoncx @ahmetaa

    Background: The existing code uses a DFAState[] to represent edges from a DFAState to other states, and the LexerATNSimulator and ParserATNSimulator classes were responsible for creating this array and for inserting and retrieving elements.

    Lexer edges decide which state to go to next when a character is read: state --character--> otherState

    The current lexer implementation uses a 127-slot array for this lookup, which provides fast lookups for characters < 127 and a much slower path to determine the next state for all other characters.

    Parser: The parser uses the same DFAState class, but in this case edges are defined for tokens instead of characters: state --token--> otherState, so the parser uses the same kind of array, initialized with maxTokenType.

    Problems with current approach:

    1. The edge representation is not properly encapsulated; it is exposed publicly, and the lifetime of edges is controlled by the Lexer and Parser clients, resulting in a fragile and error-prone structure.

    2. Because the Lexer uses a limited-size lookup table, its performance degrades considerably on input with non-ASCII characters. This is not an uncommon case: a lot of code contains non-English comments, names, symbols, addresses, etc. The slowdown is worse than 10x even if the input contains only 10% non-ASCII characters.

    3. The existing lookup tables waste memory by design. Each Lexer DFAState object uses ~1KB of memory on 64-bit systems; Parser DFAState objects use 8*maxTokenType bytes each. For complex grammars, maxTokenType can exceed 100, and most of these tables are sparse. (See the histogram below.)

    4. The existing array-based solution requires non-negative integer keys, so it is modified to handle -1 as a key, complicating access, especially because edge access is not properly encapsulated.

    New solution: To represent DFAState transitions, we now use a small, fast <int, DFAState> map and remove the DFAState[].

    The proposed map is a very limited hashmap with int keys. It uses open addressing with linear probing and a power-of-two capacity for fast modulo lookups. I instrumented the map in a different branch, and in most cases it finds the object on the first probe.
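    Such a map might look roughly like this minimal sketch (open addressing, linear probing, power-of-two capacity; IntMap is a hypothetical name, and the real implementation also deals with concurrency and sizing policy). Note that masking with capacity - 1 also handles a -1 key naturally:

```java
// Minimal int-keyed open-addressing hash map with linear probing.
// Power-of-two capacity lets (key & (capacity - 1)) replace a modulo,
// and the mask yields a valid index even for negative keys like -1 (EOF).
// Illustrative sketch only, not the actual ANTLR code.
public class IntMap<V> {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel; assumes keys != MIN_VALUE
    private int[] keys;
    private Object[] values;
    private int size;

    public IntMap() {
        keys = new int[4];
        values = new Object[4];
        java.util.Arrays.fill(keys, EMPTY);
    }

    public void put(int key, V value) {
        if (size * 2 >= keys.length) grow(); // keep load factor <= 0.5
        int i = key & (keys.length - 1);
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & (keys.length - 1);
        if (keys[i] == EMPTY) size++;
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(int key) {
        int i = key & (keys.length - 1);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return (V) values[i];
            i = (i + 1) & (keys.length - 1);
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private void grow() {
        int[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new Object[keys.length];
        java.util.Arrays.fill(keys, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++)
            if (oldKeys[i] != EMPTY) put(oldKeys[i], (V) oldValues[i]);
    }

    public static void main(String[] args) {
        IntMap<String> edges = new IntMap<>();
        edges.put(-1, "EOF-edge"); // negative key works thanks to the mask
        edges.put('a', "state7");
        System.out.println(edges.get(-1));  // EOF-edge
        System.out.println(edges.get('a')); // state7
    }
}
```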

    Thread safety: ANTLR lexer and parser threads share the same DFAState graph, so graph access must be thread-safe. During parsing and lexing, reads outnumber writes by a large margin, so I opted for the cheap read-write lock trick described here (see item 5), which offers good read performance; some reads may see slightly stale data, but that is not a problem in ANTLR's case. (See @sharwell's comment below.)

    Performance: Performance is an important aspect of this patch; I tried to address the degradation while not regressing the fast cases.

    1. I downloaded the source code of some open-source projects (the Java 8 SDK, the Linux kernel, etc.). For each language, I concatenated the content of all source files into a single file. You can download these from here.

    2. I tokenized each file and only counted tokens, to isolate pure tokenizing performance. An example benchmark can be found here.

    3. I ran the tests a few times, on two different CPU architectures (AMD Ryzen 1700X and Intel Xeon E5-1650).


    The new approach keeps performance close to the existing one.

    • When the input is pure ASCII, performance is similar on AMD Ryzen, with up to a ~30% regression on Intel Xeon. Newer Intel architectures and ARM may give different results.

    • For non-ASCII input, it completely removes the performance regression and runs as fast as on pure ASCII input.

    Performance numbers are not very stable and change depending on many factors: JVM parameters, the active CPU governor, whether tests are run together or separately, etc.

    Also keep in mind that small differences in pure tokenizing performance might not matter much. To test this, I repeated the benchmarks but this time did slightly more meaningful work with the tokens read, e.g. counting individual token types and token lengths, and checking whether tokens contain only spaces. For pure ASCII inputs, the small performance differences between old and new diminish considerably.

    I did not particularly focus on parsing. As with the lexer on pure ASCII input, I do not expect a big performance regression, because parsing usually involves more work per state transition, diminishing the importance of a few cycles spent finding the next state (as explained above). I welcome more benchmarks focused on parsing as well.

    Memory usage: Because its capacity grows only when needed and the starting size is quite small (2), the hashmap may use memory more efficiently. This is not a big concern for the lexer, as even for complex grammars lexers tend to have <500 DFAStates and <10K edges.

    Java 8 lexer: 348 DFA states, 5805 DFA edges in total.

    The histogram of number of edges of all Lexer DFAState nodes:

    Edge count histogram (cumulative):
    (0..4]: 85 (41.67%)
    (4..8]: 20 (53.33%)
    (8..10]: 17 (58.06%)
    (10..16]: 12 (66.39%)
    (16..64]: 49 (93.61%)
    (64..100]: 20 (99.17%)
    (100..200]: 3 (100.00%)

    Edges on parser nodes tend to be even more sparse, and the number of state objects can get much bigger. I did not specifically measure memory usage for parser state nodes, but if the maximum number of tokens is larger than a certain amount, there could be gains there. For complex grammars it is not uncommon to have >100 tokens.

    Reviewed by mdakin at 2017-09-17 17:55
  • 12. Ensure that only the tokens needed are fetched

    Update the Go and Java runtimes to match the C++ runtime by syncing only the tokens needed to read the text of the supplied interval.

    The fill() method searches for EOF as the signal to conclude token syncing; however, on errant input EOF may be missing, and fill() can then re-read the entire token stream for each error encountered in the input. fill() has therefore been replaced with sync() calls, which provide equivalent functionality.
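    The difference can be modeled with a toy lazy buffer (illustrative only; not the runtime's actual BufferedTokenStream): fill() consumes the whole source, while sync(i) fetches just enough tokens to make index i valid:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of lazy token buffering: fill() reads the entire source,
// sync(i) fetches only up to index i. Illustrative sketch only.
public class LazyTokenBuffer {
    private final Iterator<String> source;
    private final List<String> tokens = new ArrayList<>();

    LazyTokenBuffer(Iterator<String> source) { this.source = source; }

    /** Fetch tokens until index i exists (or the source is exhausted). */
    void sync(int i) {
        while (tokens.size() <= i && source.hasNext()) tokens.add(source.next());
    }

    /** Fetch every remaining token. */
    void fill() {
        while (source.hasNext()) tokens.add(source.next());
    }

    int buffered() { return tokens.size(); }

    public static void main(String[] args) {
        LazyTokenBuffer b = new LazyTokenBuffer(List.of("a", "b", "c", "d", "e").iterator());
        b.sync(1);                        // need tokens [0..1] only
        System.out.println(b.buffered()); // 2
        b.fill();
        System.out.println(b.buffered()); // 5
    }
}
```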

    Signed-off-by: Tristan Swadell [email protected]

    Reviewed by TristonianJones at 2022-05-18 19:43
  • 13. Java 2 Security issue running with Hibernate

    I am testing using org.antlr:antlr4-runtime:4.9.1 in Hibernate 6.0.0 and I encountered a Java 2 Security issue with ANTLR:

    Current Java 2 Security policy reported a potential violation of a Java 2 Security Permission. The application needs the following permission added:

    ("java.lang.RuntimePermission" "getenv.TURN_OFF_LR_LOOP_ENTRY_BRANCH_OPT")
    Stack: access denied ("java.lang.RuntimePermission" "getenv.TURN_OFF_LR_LOOP_ENTRY_BRANCH_OPT")

    I believe the issue is due to incorrect behavior when accessing environment variables in ParserATNSimulator.getSafeEnv(String):

    public static String getSafeEnv(String envName) {
    	try {
    		return System.getenv(envName);
    	}
    	catch (SecurityException e) {
    		// use the default value
    	}
    	return null;
    }
    Instead, you should use the doPrivileged API:

    public static String getSafeEnv(String envName) {
    	return AccessController.doPrivileged(new PrivilegedAction<String>() {
    		public String run() {
    			return System.getenv(envName);
    		}
    	});
    }
    #2069 seems to have also observed this issue, but the fix was not correct. You shouldn't just catch the security issue and do nothing.

    Reviewed by dazey3 at 2022-05-17 21:58
  • 14. Terrible Golang performance


    Google group:

    Example code:

    A simple rule such as:

    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2

    takes exponentially longer to parse as more 1 EQ 2 OR clauses are added. This does not happen in Python (by my testing) or in C#, Dart, or Java (per a Stack Overflow comment).

    On my machine, # of lines vs parse time:

    11: 0.5s
    12: 1.2s
    13: 3.2s
    14: 8.1s
    15: 21.9s
    16: 57.5s

    Given that Python doesn't face this issue I can't imagine I'm doing something terrible in my grammar.

    The issue goes away if I parenthesize things, but that's not a real solution.

    On 4.10.1, first noticed with 4.9.1.

    Any help is greatly appreciated. Surprised I can't find others with this issue.

    Reviewed by movelazar at 2022-05-17 20:57
  • 15. ANTLR tool takes 6s to process 1772 line parser grammar

    The issue was identified in the plugin by @KitsuneAlex.

    This FerrousParser.g4 takes 6s to process with the ANTLR tool. It seems to be stuck in SLL(1) static analysis; the culprit could be the murmur hash or, more likely, LL1Analyzer.


    @sharwell This likely affects your optimized fork as well.

    Simple test rig:

    import java.io.IOException;

    import org.antlr.runtime.ANTLRFileStream;
    import org.antlr.runtime.ANTLRStringStream;
    import org.antlr.v4.Tool;
    import org.antlr.v4.tool.Grammar;
    import org.antlr.v4.tool.ast.GrammarRootAST;

    public class TestANTLRParse {
        public static void main(String[] args) throws IOException {
            long start = System.nanoTime();
            runANTLR(args[0]);
            long time_ns = System.nanoTime() - start;
            double parseTimeMS = time_ns/(1000.0*1000.0);
            System.err.println("Exec time to process "+args[0]+": "+parseTimeMS+"ms");
        }

        private static void runANTLR(String grammarFileName) throws IOException {
            Tool antlr = new Tool();
            ANTLRStringStream in = new ANTLRFileStream(grammarFileName);
            GrammarRootAST grammarRootAST = antlr.parse(grammarFileName, in);
            // Create a grammar from the AST so we can figure out what type it is
            Grammar g = antlr.createGrammar(grammarRootAST);
            antlr.process(g, false);
        }
    }

    I see this on M1 mac mini:

    Exec time to process /Users/parrt/tmp/FerrousParser.g4: 5984.826209ms
    Reviewed by parrt at 2022-05-15 18:54
  • 16. [JavaScript runtime] Bad field name, bad comments

    In this SO question, I took a look at the JavaScript runtime to see whether buildParseTrees is defined for a parser. Indeed it is, but I noticed one applied occurrence of the field erroneously spelled _buildParseTrees. Of course, in the warped world of JS, this is caught neither by the compiler nor at runtime.

    Also, there is no "getBuildParseTree()" noted here and here.

    I realize "code reviews" are old school, but relying on a terrible programming language like JavaScript to find errors is not going to work.

    Reviewed by kaby76 at 2022-05-15 12:08