-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
INI: Rewrite parser to avoid regular expressions #5442
INI: Rewrite parser to avoid regular expressions #5442
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nitpicks
src/ini.cr
Outdated
key, _, value = line.partition('=') | ||
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line | ||
ini[section] ||= {} of String => String | ||
ini[section][key.rstrip] = value.strip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
section = ini[section]? || Hash(String, String).new
section[key.rstrip] = value.strip
ini[section] = section
is very slightly faster.
src/ini.cr
Outdated
ini[section] = {} of String => String | ||
else | ||
key, _, value = line.partition('=') | ||
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about checking if value == ""
since that's how partition
works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, perhaps it's cleaner just to call index
manually...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value == ""
won't work because foo=
is a valid key-value pair.
But the currently discarded 2nd return value could be checked to equal "="
.
src/ini.cr
Outdated
# | ||
# ``` | ||
# INI.parse("[foo]\na = 1") # => {"foo" => {"a" => "1"}} | ||
# ``` | ||
def self.parse(str) : Hash(String, Hash(String, String)) | ||
ini = {} of String => Hash(String, String) | ||
|
||
lines = str.lines.map(&.lstrip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not perform this lstrip
inside the each_with_index
. In fact, you can do without the lines
call (which means a very large array allocation on large files) entirely.
spec/std/ini_spec.cr
Outdated
@@ -3,14 +3,38 @@ require "ini" | |||
|
|||
describe "INI" do | |||
describe "parse from string" do | |||
it "fails on malformed section" do | |||
expect_raises(INI::ParseException, /unterminated section/) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use a string here and in the following examples, no need for a regex (the point of this PR is reducing them^^).
src/ini.cr
Outdated
section = $1 | ||
|
||
lines.each_with_index(1) do |line, lineno| | ||
next if line.empty? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could all be a case
expression to make the code better understandable:
case line[0]?
when nil, '#', ';'
next
when '['
...
else
...
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the empty case to be outside the case
, but apart from that, yes.
src/ini.cr
Outdated
|
||
if line[0] == '[' | ||
end_idx = line.index(']') | ||
raise ParseException.new("unterminated section", lineno, line.size - 1) unless end_idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably also raise if end_idx
is not the last character in the line. [foo] bar
should be invalid.
src/ini.cr
Outdated
ini[section] = {} of String => String | ||
else | ||
key, _, value = line.partition('=') | ||
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value == ""
won't work because foo=
is a valid key-value pair.
But the currently discarded 2nd return value could be checked to equal "="
.
src/ini.cr
Outdated
else | ||
key, _, value = line.partition('=') | ||
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line | ||
ini[section] ||= {} of String => String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed. All named sections are created by the previous section header and the default empty section could be initialized from the start (and deleted at the end if unused).
crystal needs benchmark suite like go. then every change (compiler, std) you can run all benchs and compare what happened. regression or improvement. benchs in prs are good, but forgotten |
957bdb6
to
3f2502c
Compare
I noticed that both parsers currently accept |
3f2502c
to
34aeeb5
Compare
src/ini.cr
Outdated
elsif line =~ /\[(.*)\]/ | ||
section = $1 | ||
|
||
str.lines.each_with_index(1) do |oline, lineno| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use String#each_line
instead. String#lines
still causes unnecessary allocation.
I'd prefer to use it directly without each_with_index
and do the line number counter directly. This is even slightly faster than iterator chaining str.each_line.each_with_index
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And please use more expressive variable names: oline
is completely cryptic. I can'd even imagine what this is supposed to mean. Just call it line
. There is no need to change the name after stripping.
section
could be renamed to current_section_name
(or similar) to get rid of sect
later.
src/ini.cr
Outdated
section = $1 | ||
|
||
str.lines.each_with_index(1) do |oline, lineno| | ||
line = oline.strip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply stripping whitespace renders the column indices in error messages meaningless. You'll have to skip and count whitespace to get a valid offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's what I was going for with oline
(which is supposed to be original line), but I realized that's not sufficient. I'll turn skip-and-count into a private helper method, since it's needed in a few places.
34aeeb5
to
3198df5
Compare
Latest benchmark (https://gist.github.com/woodruffw/7b1001c8a29ef4796c48b2ad59ba6bf7):
|
src/ini.cr
Outdated
lineno += 1 | ||
next if line.empty? | ||
|
||
line, skip = strip_and_lskip(line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this as a separate method? This is probably not going to be used anywhere else. It doesn't make a big difference, but I don't think it's necessary for such a simple task.
But I guess it should better be implemented directly without employing String.lstrip
and comparing the resulting string size. This is unneded extra work (mostly memory allocations).
Consider something like this:
offset = 0
line.each_char do |char|
break unless char.ascii_whitespace?
offset += 1
end
case line[offset]
# ...
when '['
end_idx = line.index(']', offset)
# ...
src/ini.cr
Outdated
|
||
line, skip = strip_and_lskip(line) | ||
|
||
case line[0]? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can omit the ?
because you've already checked that line
is not empty.
3198df5
to
e083c63
Compare
src/ini.cr
Outdated
next if line.empty? | ||
|
||
offset = 0 | ||
while line[offset].ascii_whitespace? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that was my initial suggestion, but it was wrong. Looping over line[offset]
is very inefficient because the character position needs to be translatet to byte position internally. I've edited my previous comment to use line.each_char
instead right after publishing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
e083c63
to
5d1b114
Compare
src/ini.cr
Outdated
raise ParseException.new("unterminated section", lineno, line.size) unless end_idx | ||
raise ParseException.new("data after section", lineno, end_idx + 1) unless end_idx == line.size - 1 | ||
|
||
current_section_name = line[1...end_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be line[(offset + 1)...end_idx]
src/ini.cr
Outdated
key, eq, value = line.partition('=') | ||
raise ParseException.new("expected declaration", lineno, key.size) if eq != "=" | ||
|
||
section = ini[current_section_name]? || Hash(String, String).new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just store the current_section
hash and then we only have to do a single hash op for each k=v, instead of 3 hash operations.
ini = Hash(String, Hash(String, String)).new
current_section_name = ""
current_section = ini[current_section_name] = Hash(String, String).new
then in [
just
current_section_name = line[1...end_idx]
current_section = ini[current_section_name] = Hash(String, String).new
and that simplifies else
to just
current_section[key.strip] = value.strip
and you're done!
Simpler and faster in every way :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea.
This doesn't even need current_section_name
as a variable outside the loop.
ini = Hash(String, Hash(String, String)).new
current_section = ini[""] = Hash(String, String).new
I'd keep it in the [
section for better readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with
current_section = ini[current_section_name] ||= Hash(String, String).new
To allow for reopened sections, since the original parser allows them (but didn't test for them in the spec).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
cb9f1f7
to
6685a5f
Compare
src/ini.cr
Outdated
end | ||
end | ||
|
||
ini.delete("") if ini[""].empty? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's consistent with the current parser's behavior: if there's nothing in the default/global section, it gets omitted entirely.
I can remove it, but some of the specs will need to be updated, e.g.:
Failure/Error: INI.parse("[section]").should eq({"section" => {} of String => String})
Expected: {"section" => {}}
got: {"" => {}, "section" => {}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I still need to be convinced either way on this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the current behavior for a few reasons:
- Neither
inih
nor Python's ConfigParser includes the default section if it's empty. - Parsing an empty string would result in
{"" => {}}
rather than just{}
, which I personally find unintuitive. - As above,
empty?
would befalse
on the hash returned byparse("")
, which I also find unintuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we should delete all empty sections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit of having empty sections is that they're at least explicit, e.g. if I really did want an empty default section, I could do this:
[]
[foo]
bar=baz
But inih
and other INI parsers don't seem to do that consistently, so I can delete all empty sections if you think that's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@woodruffw but adding []
doesn't make an empty default section in your code, it'd be deleted regardless. But other empty sections with non-empty names wouldn't get deleted.
We shouldn't treat the default section differently is all i'm pointing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. I was mixing it up with the old []
behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be checked if the default section was just used implicitly or explicitly specified with []
and in the latter case, keep it even if empty.
But the current solution to delete all empty sections is probably fine as well.
`INI.parse` now parses an INI-formatted string into a `Hash` without using any regular expressions.
6685a5f
to
dba92b8
Compare
It would probably be better to not go line by line but parse char by char. But then again, |
`INI.parse` now parses an INI-formatted string into a `Hash` without using any regular expressions.
INI.parse
now parses an INI-formatted string into aHash
without using any regular expressions.Key differences:
#
and;
)INI::ParseException
in cases that were previously ignored (broken section definitions, lines that are neither comments nor declarations)key =
becomes{ "key" => "" }
)Other than the above, the implementation handles the same inputs as the previous regular-expression based one. I've added some test cases to round things out.
Benchmarks (source):