Note: This PR comment has been edited to reflect the latest changes.
This change fixes lexer performance degradation for non-ASCII inputs.
@parrt @sharwell @bhamiltoncx @ahmetaa
The existing code uses a DFAState[] array to represent edges from a DFAState to other states; the LexerATNSimulator and ParserATNSimulator classes are responsible for creating this array and for inserting and retrieving its elements.
Lexer edges decide which state to go to next when a character is read:
state --character--> otherState
The current lexer implementation uses a 127-slot array to look up the next state. This provides fast lookups for characters < 127 and a much slower path for all other characters.
Parser: The parser uses the same DFAState class, but its edges are defined over tokens instead of characters (state --token--> otherState), so it uses the same kind of array, sized by maxTokenType.
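For illustration, a simplified sketch of how the existing array-based lookup behaves in the lexer (helper names here are hypothetical stand-ins, not the actual simulator code):

```java
import org.antlr.v4.runtime.dfa.DFAState;

// Simplified illustration of the existing array-based edge lookup.
// Only characters that fit in the fixed-size edge array hit the fast
// path; every other character, e.g. any non-ASCII input, falls through
// to the slow path.
class OldEdgeLookup {
    static final int MAX_DFA_EDGE = 127;   // size of the per-state edge array

    DFAState nextState(DFAState s, int c) {
        if (s.edges != null && c >= 0 && c < MAX_DFA_EDGE) {
            return s.edges[c];             // fast path: direct array index
        }
        return computeTargetState(s, c);   // slow path: full ATN simulation
    }

    DFAState computeTargetState(DFAState s, int c) {
        // Stand-in for the expensive ATN-based computation of the next state.
        return null;
    }
}
```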
Problems with the current approach:
The edge representation is not properly encapsulated: it is exposed publicly, and the lifetime of the edges is controlled by the Lexer and Parser clients, resulting in a fragile and error-prone structure.
Because the lexer uses a limited-size lookup table, its performance degrades considerably when it is exposed to input with non-ASCII characters. This is not an uncommon case: a lot of code contains non-English comments, names, symbols, addresses, etc. The slowdown is worse than 10x even if only 10% of the input is non-ASCII.
The existing lookup tables waste memory by design. Each lexer DFAState object uses ~1KB of memory on 64-bit systems (127 slots × 8-byte references). Parser DFAState objects use 8*maxTokenType bytes each, and for complex grammars maxTokenType can be >100. Most of these tables are sparse (see the histogram below).
The existing array-based solution requires non-negative integer keys, so the code is modified to handle -1 as a key, complicating access, especially because edge access is not properly encapsulated.
Solution: To represent DFAState transitions, we now use a small and fast <int, DFAState> map and remove the DFAState[] edges array.
The proposed map is a very limited hashmap with int keys. It uses open addressing with linear probing and a 2^n capacity for fast modulo lookups. I instrumented the map in a different branch, and in most cases it finds the object at the first probe.
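As an illustration, a minimal sketch of such a map (names and details are hypothetical, not the actual patch): the power-of-two capacity lets a bit mask replace the modulo, and lookups probe linearly from the hashed slot.

```java
import java.util.Arrays;

// Minimal sketch of an int-keyed open-addressing map (hypothetical, not the
// actual patch). Capacity is always a power of two so that
// `key & (capacity - 1)` replaces the expensive modulo operation. The load
// factor is kept at or below 0.5 so linear probes stay short.
final class IntMap<T> {
    // Sentinel marking an empty slot; chosen so that -1 (EOF) stays usable
    // as a key. A real implementation must forbid this value as a key.
    private static final int EMPTY = Integer.MIN_VALUE;

    private int[] keys;
    private T[] values;
    private int capacity;   // always a power of two
    private int size;

    @SuppressWarnings("unchecked")
    IntMap() {
        capacity = 2;       // small initial size, grows on demand
        keys = new int[capacity];
        Arrays.fill(keys, EMPTY);
        values = (T[]) new Object[capacity];
    }

    T get(int key) {
        int slot = key & (capacity - 1);          // fast modulo via bit mask
        while (keys[slot] != EMPTY) {             // linear probing
            if (keys[slot] == key) {
                return values[slot];
            }
            slot = (slot + 1) & (capacity - 1);
        }
        return null;
    }

    void put(int key, T value) {
        if (size * 2 >= capacity) {
            grow();                               // keep load factor <= 0.5
        }
        int slot = key & (capacity - 1);
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & (capacity - 1);
        }
        if (keys[slot] == EMPTY) {
            size++;
        }
        keys[slot] = key;
        values[slot] = value;
    }

    @SuppressWarnings("unchecked")
    private void grow() {
        int[] oldKeys = keys;
        T[] oldValues = values;
        capacity *= 2;
        keys = new int[capacity];
        Arrays.fill(keys, EMPTY);
        values = (T[]) new Object[capacity];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                put(oldKeys[i], oldValues[i]);    // reinsert into the larger table
            }
        }
    }
}
```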
ANTLR lexer and parser threads share the same DFAState graph, so graph access must be thread safe. During parsing and lexing, reads outnumber writes by a large margin as more input is processed, so I opted for the cheap read-write lock trick described here (see item 5), which offers good read performance. Some reads may see slightly stale data, but that is not a problem in ANTLR's case. (See @sharwell's comment below.)
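Roughly, the trick looks like the sketch below (a hypothetical illustration, not the patch itself): reads go through a volatile field without locking, while writers serialize through synchronized and publish a fresh snapshot. Here java.util.HashMap stands in for the int-keyed map above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the cheap read-write lock trick applied to an edge map.
// Readers take no lock; writers synchronize and publish a new snapshot
// through the volatile field, so a reader sees either the old or the new
// complete map, never a torn one; at worst it misses the newest edge,
// which the simulator simply recomputes and re-adds.
final class SharedEdgeMap<V> {
    private volatile Map<Integer, V> edges = new HashMap<>();

    V get(int symbol) {
        return edges.get(symbol);                        // lock-free read
    }

    synchronized void put(int symbol, V target) {
        Map<Integer, V> updated = new HashMap<>(edges);  // copy-on-write
        updated.put(symbol, target);
        edges = updated;             // volatile store publishes the update
    }
}
```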
Performance is an important aspect of this patch; I tried to address the performance degradation without regressing the fast cases.
I downloaded the source code of some open source projects (Java 8 SDK, Linux kernel, etc.). For each language, I copied the content of all source files into a single file. You can download these from here.
I tokenized each file and only counted the tokens, to isolate pure tokenization performance. An example benchmark can be found here.
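A minimal version of such a benchmark might look like this (Java8Lexer is a stand-in name for whatever generated lexer is used; the linked benchmark is the authoritative one):

```java
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.Token;

// Minimal token-counting benchmark (hypothetical sketch). Counting tokens
// and doing nothing else keeps the measurement focused on pure
// tokenization speed.
public class TokenizeBenchmark {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName(args[0]);
        Java8Lexer lexer = new Java8Lexer(input);
        long count = 0;
        long start = System.nanoTime();
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            count++;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(count + " tokens in " + elapsedMs + " ms");
    }
}
```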
I ran the tests a few times on two different CPU architectures (AMD Ryzen 1700X and Intel Xeon E5-1650).
The new approach keeps performance close to the existing approach. When the input is pure ASCII, performance is similar on the AMD Ryzen, with up to a ~30% regression on the Intel Xeon; newer Intel architectures and ARM may show different results. For non-ASCII input, it completely removes the performance degradation and runs as fast as on pure ASCII input.
Performance numbers are not very stable and change depending on many factors, such as JVM parameters, the active CPU governor, and whether the tests are run together or separately.
Also keep in mind that small differences in pure tokenization performance might not matter much in practice. To test this, I repeated the benchmarks, but this time did slightly more meaningful work with the tokens read: counting individual token types and token lengths, and checking whether tokens contain only spaces. For pure ASCII inputs, the small performance differences between the old and new approaches diminish considerably.
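For instance, the counting loop of the benchmark above might be extended along these lines (again a hypothetical sketch):

```java
// Hypothetical extension of the benchmark loop: do a little real work per
// token so that raw tokenization speed is not the only cost being measured.
int[] typeCounts = new int[lexer.getVocabulary().getMaxTokenType() + 1];
long totalLength = 0;
long blankTokens = 0;
for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
    typeCounts[t.getType()]++;           // count individual token types
    String text = t.getText();
    totalLength += text.length();        // accumulate token lengths
    if (text.trim().isEmpty()) {         // check for whitespace-only tokens
        blankTokens++;
    }
}
```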
I did not particularly focus on parsing. As with the lexer on pure ASCII input, I do not expect a big performance regression, because parsing usually involves more work per state transition, diminishing the importance of a few cycles spent finding the next state (as explained above). I welcome more benchmarks focused on parsing as well.
Memory usage: Because its capacity grows only when needed and the initial size is quite small (2), the hashmap may use memory more efficiently. This is not a big concern for the lexer, as even for complex grammars lexers tend to have <500 DFAStates and <10K edges. For example:
Total DFA states: 348
Total DFA edges: 5805
The histogram of the number of edges across all lexer DFAState nodes (states per bucket, cumulative %):

```
(0..4]    :  85   41.67%
(4..8]    :  20   53.33%
(8..10]   :  17   58.06%
(10..16]  :  12   66.39%
(16..64]  :  49   93.61%
(64..100] :  20   99.17%
(100..200]:   3  100.00%
```
Edges on parser nodes tend to be even more sparse, and the number of state objects can get much larger.
I did not specifically measure the memory usage of parser state nodes, but if the maximum number of tokens is larger than a certain amount, there could be gains there as well. For complex grammars it is not uncommon to have >100 tokens.