One great feature of LPeg is that it's binary-safe, meaning that (unlike regular expressions) it can be safely used to parse binary data! This makes it an excellent tool for parsing binary protocols, especially network communication protocols, such as the Action Message Format (used by Adobe Flash for making remote calls and even in FLV movie files). I'll leave it to you to explore the possibilities...
Beware that from here on, I assume that you know your way around Lua, LPeg and how they work.
The problem
That being said, this article is actually about an unusual roadblock I hit while using LPeg to build a Lua-based AMF parser, and the various solutions I found and/or came up with to overcome it (you didn't think that I mentioned AMF before by accident, did you?).
The issue is LPeg's implementation of repetitive patterns: in particular, its inability to match (or capture) a fixed number of occurrences of a certain pattern, although it can match a minimum or a maximum number of such occurrences, which is perfect for stream-oriented parsing (such as parsing programming languages) but insufficient for binary data.
Just to clarify, here's a small list of LPeg patterns which correspond to the typical PCRE repetitive constructs (in each case we're trying to match the string 'cloth'):
Nr. | Matching occurrences of 'cloth' | PCRE pattern | LPeg pattern |
1 | 0 or more (at least 0) | [cci_text]/(cloth)*/[/cci_text] | [cci_lua]lpeg.P'cloth'^0[/cci_lua] |
2 | 1 or more (at least 1) | [cci_text]/(cloth)+/[/cci_text] | [cci_lua]lpeg.P'cloth'^1[/cci_lua] |
3 | X or more (at least X) | [cci_text]/(cloth){X,}/[/cci_text] | [cci_lua]lpeg.P'cloth'^X[/cci_lua] |
4 | 1 or less (at most 1) | [cci_text]/(cloth)?/[/cci_text] | [cci_lua]lpeg.P'cloth'^-1[/cci_lua] |
5 | X or less (at most X) | [cci_text]/(cloth){,X}/[/cci_text] | [cci_lua]lpeg.P'cloth'^-X[/cci_lua] |
6 | precisely X (no more, no less) | [cci_text]/(cloth){X,X}/[/cci_text] | [cci_lua]-- not implemented --[/cci_lua] |
7 | anywhere between X and Y | [cci_text]/(cloth){X,Y}/[/cci_text] | [cci_lua]-- not implemented --[/cci_lua] |
For cases 6 and 7, LPeg does not offer any simple constructs so we have to find a complex one. But let's put case 7 aside for a while, and try to tackle case 6, then we'll see...
Attempt 1 (brute force)
The first solution was rather straight-forward: we explicitly compose (i.e. append) a pattern for every occurrence we need to match. I call this the brute force method (for obvious reasons). For example, to match the word 'cloth' exactly three times we would use the pattern [cci_lua]lpeg.P'cloth' * lpeg.P'cloth' * lpeg.P'cloth'[/cci_lua].
Naturally, this is quite inflexible: what if we need to match 183 occurrences? To achieve that, we use a pattern generation function. Here's a complete working example:
[cc_lua]
local lpeg = require'lpeg'
-- Version 1.
-- Generates a big pattern where every interation has
-- a corresponding instance of the initial pattern. The
-- resulting pattern captures all occurrences in a table.
function multiply_1(item, count)
local set = lpeg.P(true)
while count > 0 do
set = set * item
count = count - 1
end
return lpeg.Ct(set)
end
-- Generate subject ('cloth,cloth,...').
local subject = string.rep('cloth,', 10)
-- A capture of the word 'cloth' optionally followed by a comma.
local item = lpeg.C(lpeg.P'cloth') * lpeg.P','^-1
-- Match exactly 10 items.
local result = lpeg.match(multiply_1(item, 10), subject)
-- Display capture table.
print(unpack(result))
[/cc_lua]
This function adds a composition operation to the big pattern for each occurrence. One method of improving this is to reduce the amount of compositions to the a bare minimum. We do this by duplicating (i.e. growing in powers of 2) the pattern until the required size is reached (also used by LPeg's regexp emulator):
[cc_lua]
-- Version 1(b).
-- Generates a (potentially large) pattern using only
-- the minimum possible amount of compositions.
function multiply_1b(item, count)
local set = lpeg.P(true)
while count >= 1 do
if count % 2 >= 1 then
set = set * item
end
item = item * item
count = count / 2
end
return lpeg.Ct(set)
end
[/cc_lua]
Looks good, however, this method presents a serious problem: the resulting pattern can get very large, forcing the limits of the Lua stack. It works well enough for small jobs, but it is unwise to generate more than a few hundred repetitions using this method (and even that may be pushing it). This means that the brute force method is unusable for heavy binary parsing, where we would often need to match even thousands of pattern repetitions.
We need to do better...
Attempt 2 (capture folding)
The next approach is a bit more complex and makes use of the more advanced features of LPeg: the lpeg.Cf() and lpeg.Cmt() captures.
In essence, we match at most the desired number of occurrences (using the [cci_lua]^-N[/cci_lua] operation described in the table at the top) and on every match (i.e. iteration) we execute a folding function (via [cci_lua]lpeg.Cf()[/cci_lua]) that adds the item to the result table and also decreases a counter. At the end, we execute a checking function (via [cci_lua]lpeg.Cmt()[/cci_lua]) which will test if the counter reached 0 (thus ensuring exact matching) and will fail if not.
[cc_lua]
-- Version 2.
-- Somewhat better approach, which starts with an empty table and gradually
-- folds every occurring capture of the initial pattern into this table, at
-- the same time decrementing the counter. Eventually, it also checks if the
-- counter reached 0 and fails otherwise.
function multiply_2(item, count)
return lpeg.Cmt(
-- Fold capture.
lpeg.Cf(
lpeg.Ct'' * item^-count,
-- Append capture.
function(set, item)
count = count - 1
table.insert(set, item)
return set
end),
-- Final check.
function(s, i, set)
if count > 0 then
return false
end
return i, set
end)
end
[/cc_lua]
Note that the capture starts with an empty table created from the empty string (i.e. [cci_lua]lpeg.Ct''[/cci_lua]). This table is the first parameter given to the folding function (i.e. [cci_lua]set[/cci_lua]); after all the iterations, it will contain all captured items and will then be given to the check-up function for evaluation. Eventually, if all goes well, it will be returned back as the final result of multiply().
Overall this method will behave much better than the first; the generated pattern is much more elegant. However, the generated pattern can still grow quite large in memory, even though it's not that obvious (i.e. I was surprised to find that it also crashed when generating more than a few thousand iterations).
The issue is the [cci_lua]item^-count[/cci_lua] bit. Checking the C code of the LPeg implementation, quickly reveals that whenever [cci_lua]count[/cci_lua] is not [cci_lua]0[/cci_lua] (i.e. which is a special case), the engine is forced to allocate [cci_lua]count[/cci_lua] objects in memory, and while these objects are lighter than a complete item pattern (as in the first solution) they are still scaling-up the problem badly!
Eventually, this method only allows generating repetitive patterns of a few thousand iterations (that is as far as I was able to push it). But, what if we need to handle a rogue pattern of 35 000 occurrences? What then? No!... we still need better!
Attempt 3 (classic loop)
The only thing left to try is a (slightly) more rudimentary technique. Basically, it is the same as the second approach but with a significant modification: instead of using [cci_lua]lpeg.Cf()[/cci_lua] to "accumulate" (i.e. fold) every occurrence of item to the result table, we just use a straight-forward loop that will advance through the input stream at every iteration.
That makes the pattern completely dynamic, i.e. totally non-proportional to the number of matched occurrences. As a result, this method allows for an unlimited number of iterations with no memory wastes.
[cc_lua]
-- Version 3.
-- Best method so far, capable of an unlimited number of iterations. It uses
-- an explicit loop to walk over all all iterations at match-time, so there
-- is no potentially huge pattern ever created. This method simply populates
-- the result table at each iteration, or fails the entire set if the count
-- is not reached precisely.
function multiply_3(item, count)
return lpeg.Cmt(lpeg.P(true),
function(s, i)
local set, offset = {}, i
for j = 1, count do
set[j], offset = lpeg.match(item * lpeg.Cp(), s, offset)
if not offset then
return false
end
end
return offset, set
end)
end
[/cc_lua]
When applied to a string, this will start matching at the current position (i.e. due to the [cci_lua]lpeg.P(true)[/cci_lua] construct). The [cci_lua]lpeg.Cmt()[/cci_lua] capture will trigger the iteration function immediately and there, in turn, will loop for 'count' number of times, matching the item at every iteration and appending it to the results table. At the same time, at each iteration, the offset is advanced by capturing the position immediately following the matched item (i.e. the [cci_lua]lpeg.Cp()[/cci_lua] construct). If anything goes wrong, then the entire match fails (i.e. via [cci_lua]return false[/cci_lua]).
This method works flawlessly (as far as I could tell). The generated pattern is minimal (memory-wise), the source code is very small, and the function is overall very fast (only slightly slower than the first two approaches, when matching small-sized patterns). But most importantly, with this version, matching huge numbers of item repetitions is trivial (e.g. tens of thousands, millions...).
Intervals
From here, implementing case 7 (i.e. [cci_lua]multiply(X, Y)[/cci_lua]) is quite easy...
One approach would be to use [cci_lua]multiply_3(item, X)[/cci_lua] to ensure the minimum number of occurrences (i.e. X) and then append the pattern [cci_lua]item^-(Y-X)[/cci_lua] which will optionally match the remaining occurrences up to the maximum amount allowed. This method will work, and can also be used with the first two solutions, but will still (potentially) waste memory when [cci_lua](Y-X)[/cci_lua] is large.
A better (and still easy) approach would be to modify the loop inside [cci_lua]multiply_3()[/cci_lua] to also accept an optional number of iterations (above the minimum). Here's how...
[cc_lua]
-- Version 3(b).
-- The final variant with support for interval matching
-- based on the third capturing approach (classic loop).
function multiply_3b(item, min, max)
return lpeg.Cmt(lpeg.P(true),
function(s, i)
local set, offset, attempt = {}, i, 0
for j = 1, max or min do
set[j], attempt = lpeg.match(item * lpeg.Cp(), s, offset)
if not attempt then
if j > min then
break
else
return false
end
end
offset = attempt
end
return offset, set
end)
end
[/cc_lua]
In conclusion...
So, that was... quite a trip! Anyway, in hope that someone will find all this useful, I close with the usual...
Enjoy! :)
P.S. I would be very surprised if anyone actually reads this ridiculous banter; if you are still here, I take my hat off to you, sir (or madam)!
Niciun comentariu:
Trimiteți un comentariu