Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INI: Rewrite parser to avoid regular expressions #5442

Merged
merged 1 commit into from
Jan 2, 2018

Conversation

woodruffw
Copy link
Contributor

@woodruffw woodruffw commented Dec 23, 2017

INI.parse now parses an INI-formatted string into a Hash without using any regular expressions.

Key differences:

  • Comments are now handled explicitly (both # and ;)
  • Parsing now raises an INI::ParseException in cases that were previously ignored (broken section definitions, lines that are neither comments nor declarations)
  • Empty values are permitted (key = becomes { "key" => "" })

Other than the above, the implementation handles the same inputs as the previous regular-expression based one. I've added some test cases to round things out.

Benchmarks (source):

$ crystal build --release --no-debug ini_bench.cr
$ ./ini_bench
INI parse w/ RE  89.12k ( 11.22µs) (±25.86%)  2.03× slower
INI parse w/o RE 180.92k (  5.53µs) (±27.43%)       fastest

Copy link
Contributor

@RX14 RX14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nitpicks

src/ini.cr Outdated
key, _, value = line.partition('=')
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line
ini[section] ||= {} of String => String
ini[section][key.rstrip] = value.strip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

section = ini[section]? || Hash(String, String).new
section[key.rstrip] = value.strip
ini[section] = section

is very slightly faster.

src/ini.cr Outdated
ini[section] = {} of String => String
else
key, _, value = line.partition('=')
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking if value == "" since that's how partition works?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, perhaps it's cleaner just to call index manually...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value == "" won't work because foo= is a valid key-value pair.
But the currently discarded 2nd return value could be checked to equal "=".

src/ini.cr Outdated
#
# ```
# INI.parse("[foo]\na = 1") # => {"foo" => {"a" => "1"}}
# ```
def self.parse(str) : Hash(String, Hash(String, String))
ini = {} of String => Hash(String, String)

lines = str.lines.map(&.lstrip)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not perform this lstrip inside the each_with_index. In fact, you can do without the lines call (which means a very large array allocation on large files) entirely.

@@ -3,14 +3,38 @@ require "ini"

describe "INI" do
describe "parse from string" do
it "fails on malformed section" do
expect_raises(INI::ParseException, /unterminated section/) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a string here and in the following examples, no need for a regex (the point of this PR is reducing them^^).

src/ini.cr Outdated
section = $1

lines.each_with_index(1) do |line, lineno|
next if line.empty?
Copy link
Member

@straight-shoota straight-shoota Dec 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could all be a case expression to make the code better understandable:

case line[0]?
when nil, '#', ';'
  next
when '['
  ...
else
  ...
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer the empty case to be outside the case, but apart from that, yes.

src/ini.cr Outdated

if line[0] == '['
end_idx = line.index(']')
raise ParseException.new("unterminated section", lineno, line.size - 1) unless end_idx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably also raise if end_idx is not the last character in the line. [foo] bar should be invalid.

src/ini.cr Outdated
ini[section] = {} of String => String
else
key, _, value = line.partition('=')
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value == "" won't work because foo= is a valid key-value pair.
But the currently discarded 2nd return value could be checked to equal "=".

src/ini.cr Outdated
else
key, _, value = line.partition('=')
raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line
ini[section] ||= {} of String => String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed. All named sections are created by the previous section header and the default empty section could be initialized from the start (and deleted at the end if unused).

@larubujo
Copy link
Contributor

crystal needs benchmark suite like go. then every change (compiler, std) you can run all benchs and compare what happened. regression or improvement. benchs in prs are good, but forgotten

@woodruffw woodruffw force-pushed the ini-parser-without-re branch from 957bdb6 to 3f2502c Compare December 23, 2017 17:12
@woodruffw
Copy link
Contributor Author

I noticed that both parsers currently accept [] as a valid section name (i.e., an explicit version of the default section). Is that behavior intentional in the original parser?

@woodruffw woodruffw force-pushed the ini-parser-without-re branch from 3f2502c to 34aeeb5 Compare December 23, 2017 17:28
src/ini.cr Outdated
elsif line =~ /\[(.*)\]/
section = $1

str.lines.each_with_index(1) do |oline, lineno|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use String#each_line instead. String#lines still causes unnecessary allocation.
I'd prefer to use it directly without each_with_index and do the line number counter directly. This is even slightly faster than iterator chaining str.each_line.each_with_index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And please use more expressive variable names: oline is completely cryptic. I can'd even imagine what this is supposed to mean. Just call it line. There is no need to change the name after stripping.
section could be renamed to current_section_name (or similar) to get rid of sect later.

src/ini.cr Outdated
section = $1

str.lines.each_with_index(1) do |oline, lineno|
line = oline.strip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply stripping whitespace renders the column indices in error messages meaningless. You'll have to skip and count whitespace to get a valid offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's what I was going for with oline (which is supposed to be original line), but I realized that's not sufficient. I'll turn skip-and-count into a private helper method, since it's needed in a few places.

@woodruffw woodruffw force-pushed the ini-parser-without-re branch from 34aeeb5 to 3198df5 Compare December 23, 2017 21:31
@woodruffw
Copy link
Contributor Author

woodruffw commented Dec 23, 2017

Latest benchmark (https://gist.github.com/woodruffw/7b1001c8a29ef4796c48b2ad59ba6bf7):

INI parse w/ RE  87.17k ( 11.47µs) (±26.56%)  1.91× slower
INI parse w/o RE 166.86k (  5.99µs) (±30.73%)       fastest

src/ini.cr Outdated
lineno += 1
next if line.empty?

line, skip = strip_and_lskip(line)
Copy link
Member

@straight-shoota straight-shoota Dec 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this as a separate method? This is probably not going to be used anywhere else. It doesn't make a big difference, but I don't think it's necessary for such a simple task.

But I guess it should better be implemented directly without employing String.lstrip and comparing the resulting string size. This is unneded extra work (mostly memory allocations).
Consider something like this:

offset = 0
line.each_char do |char|
  break unless char.ascii_whitespace?
  offset += 1
end

case line[offset]
# ...
when '['
  end_idx = line.index(']', offset)
# ...

src/ini.cr Outdated

line, skip = strip_and_lskip(line)

case line[0]?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can omit the ? because you've already checked that line is not empty.

@woodruffw woodruffw force-pushed the ini-parser-without-re branch from 3198df5 to e083c63 Compare December 23, 2017 21:59
src/ini.cr Outdated
next if line.empty?

offset = 0
while line[offset].ascii_whitespace?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that was my initial suggestion, but it was wrong. Looping over line[offset] is very inefficient because the character position needs to be translatet to byte position internally. I've edited my previous comment to use line.each_char instead right after publishing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@woodruffw woodruffw force-pushed the ini-parser-without-re branch from e083c63 to 5d1b114 Compare December 23, 2017 23:53
src/ini.cr Outdated
raise ParseException.new("unterminated section", lineno, line.size) unless end_idx
raise ParseException.new("data after section", lineno, end_idx + 1) unless end_idx == line.size - 1

current_section_name = line[1...end_idx]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be line[(offset + 1)...end_idx]

src/ini.cr Outdated
key, eq, value = line.partition('=')
raise ParseException.new("expected declaration", lineno, key.size) if eq != "="

section = ini[current_section_name]? || Hash(String, String).new
Copy link
Contributor

@RX14 RX14 Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just store the current_section hash and then we only have to do a single hash op for each k=v, instead of 3 hash operations.

ini = Hash(String, Hash(String, String)).new
current_section_name = ""
current_section = ini[current_section_name] = Hash(String, String).new

then in [ just

current_section_name = line[1...end_idx]
current_section = ini[current_section_name] = Hash(String, String).new

and that simplifies else to just

current_section[key.strip] = value.strip

and you're done!

Simpler and faster in every way :)

Copy link
Member

@straight-shoota straight-shoota Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea.

This doesn't even need current_section_name as a variable outside the loop.

ini = Hash(String, Hash(String, String)).new
current_section = ini[""] = Hash(String, String).new

I'd keep it in the [ section for better readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with

current_section = ini[current_section_name] ||= Hash(String, String).new

To allow for reopened sections, since the original parser allows them (but didn't test for them in the spec).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

@woodruffw woodruffw force-pushed the ini-parser-without-re branch 2 times, most recently from cb9f1f7 to 6685a5f Compare December 24, 2017 18:03
src/ini.cr Outdated
end
end

ini.delete("") if ini[""].empty?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is what we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's consistent with the current parser's behavior: if there's nothing in the default/global section, it gets omitted entirely.

I can remove it, but some of the specs will need to be updated, e.g.:

Failure/Error: INI.parse("[section]").should eq({"section" => {} of String => String})

       Expected: {"section" => {}}
            got: {"" => {}, "section" => {}}

Copy link
Contributor

@RX14 RX14 Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I still need to be convinced either way on this issue.

Copy link
Contributor Author

@woodruffw woodruffw Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the current behavior for a few reasons:

  • Neither inih nor Python's ConfigParser includes the default section if it's empty.
  • Parsing an empty string would result in {"" => {}} rather than just {}, which I personally find unintuitive.
  • As above, empty? would be false on the hash returned by parse(""), which I also find unintuitive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we should delete all empty sections.

Copy link
Contributor Author

@woodruffw woodruffw Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit of having empty sections is that they're at least explicit, e.g. if I really did want an empty default section, I could do this:

[]

[foo]
bar=baz

But inih and other INI parsers don't seem to do that consistently, so I can delete all empty sections if you think that's better.

Copy link
Contributor

@RX14 RX14 Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@woodruffw but adding [] doesn't make an empty default section in your code, it'd be deleted regardless. But other empty sections with non-empty names wouldn't get deleted.

We shouldn't treat the default section differently is all i'm pointing out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. I was mixing it up with the old [] behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be checked if the default section was just used implicitly or explicitly specified with [] and in the latter case, keep it even if empty.
But the current solution to delete all empty sections is probably fine as well.

`INI.parse` now parses an INI-formatted string
into a `Hash` without using any regular expressions.
@woodruffw woodruffw force-pushed the ini-parser-without-re branch from 6685a5f to dba92b8 Compare December 24, 2017 23:34
@asterite
Copy link
Member

asterite commented Jan 2, 2018

It would probably be better to not go line by line but parse char by char. But then again, INI should probably be a shard, not in the std.

@RX14 RX14 added this to the Next milestone Jan 2, 2018
@RX14 RX14 merged commit 68a7ce8 into crystal-lang:master Jan 2, 2018
@woodruffw woodruffw deleted the ini-parser-without-re branch January 2, 2018 16:43
lukeasrodgers pushed a commit to lukeasrodgers/crystal that referenced this pull request Jan 7, 2018
`INI.parse` now parses an INI-formatted string
into a `Hash` without using any regular expressions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants