INI: Rewrite parser to avoid regular expressions #5442

woodruffw · 2017-12-23T05:53:08Z

INI.parse now parses an INI-formatted string into a Hash without using any regular expressions.

Key differences:

Comments are now handled explicitly (both # and ;)
Parsing now raises an INI::ParseException in cases that were previously ignored (broken section definitions, lines that are neither comments nor declarations)
Empty values are permitted (key = becomes { "key" => "" })

Other than the above, the implementation handles the same inputs as the previous regular-expression based one. I've added some test cases to round things out.
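A minimal sketch of the three differences listed above, assuming the behavior described in this PR. The input strings are invented for illustration, and the exact exception messages may differ from what is printed by the final code.

```crystal
require "ini"

# Comments starting with '#' or ';' are skipped explicitly.
p INI.parse("# a comment\n; another comment\n[foo]\na = 1")

# An empty value is permitted and maps to an empty string.
p INI.parse("[foo]\nkey =")

# A line that is neither a comment, a section header, nor a declaration
# raises instead of being silently ignored.
begin
  INI.parse("[foo]\nnot a declaration")
rescue ex : INI::ParseException
  puts "parse error: #{ex.message}"
end

# A broken (unterminated) section header raises as well.
begin
  INI.parse("[foo")
rescue ex : INI::ParseException
  puts "parse error: #{ex.message}"
end
```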

Benchmarks (source):

$ crystal build --release --no-debug ini_bench.cr
$ ./ini_bench
INI parse w/ RE  89.12k ( 11.22µs) (±25.86%)  2.03× slower
INI parse w/o RE 180.92k (  5.53µs) (±27.43%)       fastest

RX14

Just a few nitpicks

RX14 · 2017-12-23T10:33:17Z

src/ini.cr

+        key, _, value = line.partition('=')
+        raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line
+        ini[section] ||= {} of String => String
+        ini[section][key.rstrip] = value.strip


    sect = ini[section]? || Hash(String, String).new
    sect[key.rstrip] = value.strip
    ini[section] = sect

is very slightly faster.

RX14 · 2017-12-23T10:33:52Z

src/ini.cr

        ini[section] = {} of String => String
+      else
+        key, _, value = line.partition('=')
+        raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line


How about checking if value == "" since that's how partition works?

Actually, perhaps it's cleaner just to call index manually...

value == "" won't work because foo= is a valid key-value pair.
But the currently discarded 2nd return value could be checked to equal "=".
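The point about the second return value can be sketched as follows. `declaration?` is a hypothetical helper name used only for illustration; `String#partition` itself returns a tuple of the text before the separator, the separator, and the text after it, with an empty separator when nothing was found.

```crystal
# Hypothetical helper: true when the line contains an '=' separator.
def declaration?(line : String) : Bool
  parts = line.partition('=')
  # parts[1] is "=" when the separator was found and "" otherwise,
  # so an empty value ("foo=") is still recognized as a declaration.
  parts[1] == "="
end

puts declaration?("foo=bar") # => true
puts declaration?("foo=")    # => true
puts declaration?("foo")     # => false
```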

RX14 · 2017-12-23T10:35:27Z

src/ini.cr

  #
  # ```
  # INI.parse("[foo]\na = 1") # => {"foo" => {"a" => "1"}}
  # ```
  def self.parse(str) : Hash(String, Hash(String, String))
    ini = {} of String => Hash(String, String)
-
+    lines = str.lines.map(&.lstrip)


Why not perform this lstrip inside the each_with_index? In fact, you can do without the lines call entirely (it means a very large array allocation on large files).

straight-shoota · 2017-12-23T12:46:42Z

spec/std/ini_spec.cr

@@ -3,14 +3,38 @@ require "ini"

 describe "INI" do
  describe "parse from string" do
+    it "fails on malformed section" do
+      expect_raises(INI::ParseException, /unterminated section/) do


You could use a string here and in the following examples, no need for a regex (the point of this PR is reducing them^^).

straight-shoota · 2017-12-23T12:54:51Z

src/ini.cr

-        section = $1
+
+    lines.each_with_index(1) do |line, lineno|
+      next if line.empty?


This could all be a case expression to make the code easier to understand:

    case line[0]?
    when nil, '#', ';'
      next
    when '['
      ...
    else
      ...
    end

I'd prefer the empty case to be outside the case, but apart from that, yes.

straight-shoota · 2017-12-23T12:57:25Z

src/ini.cr

+
+      if line[0] == '['
+        end_idx = line.index(']')
+        raise ParseException.new("unterminated section", lineno, line.size - 1) unless end_idx


This should probably also raise if end_idx is not the last character in the line. [foo] bar should be invalid.
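A minimal sketch of that check, assuming the line has already been identified as starting with `[`. `section_name` is a hypothetical helper that returns nil where the parser would raise.

```crystal
# Hypothetical helper: returns the section name, or nil when the header is
# unterminated or followed by trailing data.
def section_name(line : String) : String?
  end_idx = line.index(']')
  return nil if end_idx.nil?                 # "[foo"      -> unterminated
  return nil unless end_idx == line.size - 1 # "[foo] bar" -> data after section
  line[1...end_idx]
end

p section_name("[foo]")     # => "foo"
p section_name("[foo] bar") # => nil
p section_name("[foo")      # => nil
```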

straight-shoota · 2017-12-23T13:01:48Z

src/ini.cr

        ini[section] = {} of String => String
+      else
+        key, _, value = line.partition('=')
+        raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line


value == "" won't work because foo= is a valid key-value pair.
But the currently discarded 2nd return value could be checked to equal "=".

straight-shoota · 2017-12-23T13:03:49Z

src/ini.cr

+      else
+        key, _, value = line.partition('=')
+        raise ParseException.new("expected declaration", lineno, key.size - 1) if key == line
+        ini[section] ||= {} of String => String


This shouldn't be needed. All named sections are created by the previous section header and the default empty section could be initialized from the start (and deleted at the end if unused).

larubujo · 2017-12-23T14:31:42Z

Crystal needs a benchmark suite like Go's. Then for every change (compiler, std) you could run all the benchmarks and compare what happened: regression or improvement. Benchmarks in PRs are good, but they get forgotten.

woodruffw · 2017-12-23T17:15:00Z

I noticed that both parsers currently accept [] as a valid section name (i.e., an explicit version of the default section). Is that behavior intentional in the original parser?

straight-shoota · 2017-12-23T21:00:10Z

src/ini.cr

-      elsif line =~ /\[(.*)\]/
-        section = $1
+
+    str.lines.each_with_index(1) do |oline, lineno|


You should use String#each_line instead. String#lines still causes unnecessary allocation.
I'd prefer to use it directly without each_with_index and do the line number counter directly. This is even slightly faster than iterator chaining str.each_line.each_with_index.

And please use more expressive variable names: oline is completely cryptic. I can't even imagine what this is supposed to mean. Just call it line. There is no need to change the name after stripping.
section could be renamed to current_section_name (or similar) to get rid of sect later.
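A small sketch of the suggested loop shape, with an invented input string: `String#each_line` yields one line at a time without building an array, and the line number is tracked by hand.

```crystal
str = "[foo]\na = 1\n\nb = 2"

lineno = 0
str.each_line do |line|
  lineno += 1 # count every line, including blank ones
  next if line.empty?
  puts "#{lineno}: #{line}"
end
```

Blank lines still advance the counter, so the numbers reported in error messages keep matching the input.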

straight-shoota · 2017-12-23T21:05:36Z

src/ini.cr

-        section = $1
+
+    str.lines.each_with_index(1) do |oline, lineno|
+      line = oline.strip


Simply stripping whitespace renders the column indices in error messages meaningless. You'll have to skip and count whitespace to get a valid offset.

Yeah, that's what I was going for with oline (which is supposed to be original line), but I realized that's not sufficient. I'll turn skip-and-count into a private helper method, since it's needed in a few places.

woodruffw · 2017-12-23T21:33:48Z

Latest benchmark (https://gist.github.com/woodruffw/7b1001c8a29ef4796c48b2ad59ba6bf7):

INI parse w/ RE  87.17k ( 11.47µs) (±26.56%)  1.91× slower
INI parse w/o RE 166.86k (  5.99µs) (±30.73%)       fastest

straight-shoota · 2017-12-23T21:47:38Z

src/ini.cr

+      lineno += 1
+      next if line.empty?
+
+      line, skip = strip_and_lskip(line)


Why do you need this as a separate method? This is probably not going to be used anywhere else. It doesn't make a big difference, but I don't think it's necessary for such a simple task.

But I guess it would be better to implement this directly, without calling String#lstrip and comparing the resulting string sizes. That is unneeded extra work (mostly memory allocations).
Consider something like this:

    offset = 0
    line.each_char do |char|
      break unless char.ascii_whitespace?
      offset += 1
    end

    case line[offset]
    # ...
    when '['
      end_idx = line.index(']', offset)
      # ...

straight-shoota · 2017-12-23T21:48:26Z

src/ini.cr

+
+      line, skip = strip_and_lskip(line)
+
+      case line[0]?


You can omit the ? because you've already checked that line is not empty.

straight-shoota · 2017-12-23T23:19:43Z

src/ini.cr

+      next if line.empty?
+
+      offset = 0
+      while line[offset].ascii_whitespace?


Sorry, that was my initial suggestion, but it was wrong. Looping over line[offset] is very inefficient because the character position needs to be translated to a byte position internally. I edited my previous comment to use line.each_char instead right after publishing it.
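A runnable variant of the each_char approach, wrapped in a hypothetical `leading_whitespace` helper for illustration. It also shows one consequence of skipping whitespace this way: a whitespace-only line leaves the offset at the end of the string, so the sketch reads the first significant character with `[]?`, which returns nil instead of raising.

```crystal
# Hypothetical helper: counts leading whitespace in a single pass over
# the characters, without allocating a stripped copy of the line.
def leading_whitespace(line : String) : Int32
  offset = 0
  line.each_char do |char|
    break unless char.ascii_whitespace?
    offset += 1
  end
  offset
end

["  [foo]", "a = 1", "   "].each do |line|
  offset = leading_whitespace(line)
  # `[]?` returns nil past the end, so whitespace-only lines are safe here.
  first = line[offset]?
  puts "offset=#{offset} first=#{first.inspect}"
end
```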

straight-shoota · 2017-12-24T13:08:44Z

src/ini.cr

+        raise ParseException.new("unterminated section", lineno, line.size) unless end_idx
+        raise ParseException.new("data after section", lineno, end_idx + 1) unless end_idx == line.size - 1
+
+        current_section_name = line[1...end_idx]


This should be line[(offset + 1)...end_idx]

RX14 · 2017-12-24T13:12:16Z

src/ini.cr

+        key, eq, value = line.partition('=')
+        raise ParseException.new("expected declaration", lineno, key.size) if eq != "="
+
+        section = ini[current_section_name]? || Hash(String, String).new


We can just store the current_section hash and then we only have to do a single hash op for each k=v, instead of 3 hash operations.

    ini = Hash(String, Hash(String, String)).new
    current_section_name = ""
    current_section = ini[current_section_name] = Hash(String, String).new

then in [ just

    current_section_name = line[1...end_idx]
    current_section = ini[current_section_name] = Hash(String, String).new

and that simplifies else to just

current_section[key.strip] = value.strip

and you're done!

Simpler and faster in every way :)

Great idea.

This doesn't even need current_section_name as a variable outside the loop.

    ini = Hash(String, Hash(String, String)).new
    current_section = ini[""] = Hash(String, String).new

I'd keep it in the [ section for better readability.

I went with

current_section = ini[current_section_name] ||= Hash(String, String).new

To allow for reopened sections, since the original parser allows them (but didn't test for them in the spec).
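A sketch of what `||=` buys here, using invented section names and keys: a second header with the same name picks up the existing hash instead of replacing it, and each key-value pair still costs a single hash write.

```crystal
ini = Hash(String, Hash(String, String)).new

# First "[foo]" header creates the section.
current_section = ini["foo"] ||= Hash(String, String).new
current_section["a"] = "1"

# "[bar]" switches to another section.
current_section = ini["bar"] ||= Hash(String, String).new
current_section["b"] = "2"

# A second "[foo]" header reopens the existing hash; "a" is preserved.
current_section = ini["foo"] ||= Hash(String, String).new
current_section["c"] = "3"

p ini # => {"foo" => {"a" => "1", "c" => "3"}, "bar" => {"b" => "2"}}
```

With plain `=` in place of `||=`, the third step would discard `"a" => "1"`.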

RX14 · 2017-12-24T20:16:01Z

src/ini.cr

      end
    end
+
+    ini.delete("") if ini[""].empty?


I'm not sure this is what we want.

It's consistent with the current parser's behavior: if there's nothing in the default/global section, it gets omitted entirely.

I can remove it, but some of the specs will need to be updated, e.g.:

    Failure/Error: INI.parse("[section]").should eq({"section" => {} of String => String})
      Expected: {"section" => {}}
           got: {"" => {}, "section" => {}}

Hmm, I still need to be convinced either way on this issue.

I prefer the current behavior for a few reasons:

Neither inih nor Python's ConfigParser includes the default section if it's empty.

Parsing an empty string would result in {"" => {}} rather than just {}, which I personally find unintuitive.

As above, empty? would be false on the hash returned by parse(""), which I also find unintuitive.

Then we should delete all empty sections.

The benefit of having empty sections is that they're at least explicit, e.g. if I really did want an empty default section, I could do this:

    []
    [foo]
    bar=baz

But inih and other INI parsers don't seem to do that consistently, so I can delete all empty sections if you think that's better.

@woodruffw but adding [] doesn't make an empty default section in your code, it'd be deleted regardless. But other empty sections with non-empty names wouldn't get deleted.

We shouldn't treat the default section differently, is all I'm pointing out.

Ah, you're right. I was mixing it up with the old [] behavior.

It could be checked if the default section was just used implicitly or explicitly specified with [] and in the latter case, keep it even if empty.
But the current solution to delete all empty sections is probably fine as well.
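For illustration, deleting every empty section regardless of its name could look like the sketch below. It uses `Hash#reject!` on an invented hash and is not necessarily the call used in the PR.

```crystal
ini = {
  ""        => {} of String => String, # unused default section
  "empty"   => {} of String => String, # named section without keys
  "section" => {"key" => "value"},
}

# Drop every section that ended up without keys, default or named alike.
ini.reject! { |_name, section| section.empty? }

p ini # => {"section" => {"key" => "value"}}
```

Under this rule the default section gets no special treatment, and a result with no populated sections is simply an empty hash.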

asterite · 2018-01-02T13:40:37Z

It would probably be better to not go line by line but parse char by char. But then again, INI should probably be a shard, not in the std.

RX14 requested changes Dec 23, 2017

straight-shoota reviewed Dec 23, 2017

woodruffw force-pushed the ini-parser-without-re branch from 957bdb6 to 3f2502c Compare December 23, 2017 17:12

woodruffw force-pushed the ini-parser-without-re branch from 3f2502c to 34aeeb5 Compare December 23, 2017 17:28

straight-shoota reviewed Dec 23, 2017

woodruffw force-pushed the ini-parser-without-re branch from 34aeeb5 to 3198df5 Compare December 23, 2017 21:31

straight-shoota reviewed Dec 23, 2017

woodruffw force-pushed the ini-parser-without-re branch from 3198df5 to e083c63 Compare December 23, 2017 21:59

straight-shoota reviewed Dec 23, 2017

woodruffw force-pushed the ini-parser-without-re branch from e083c63 to 5d1b114 Compare December 23, 2017 23:53

straight-shoota reviewed Dec 24, 2017

RX14 requested changes Dec 24, 2017

woodruffw force-pushed the ini-parser-without-re branch 2 times, most recently from cb9f1f7 to 6685a5f Compare December 24, 2017 18:03

RX14 requested changes Dec 24, 2017

INI: Rewrite parser to avoid regular expressions

dba92b8

`INI.parse` now parses an INI-formatted string into a `Hash` without using any regular expressions.

woodruffw force-pushed the ini-parser-without-re branch from 6685a5f to dba92b8 Compare December 24, 2017 23:34

RX14 approved these changes Dec 24, 2017

asterite approved these changes Jan 2, 2018

RX14 added kind:refactor performance topic:stdlib labels Jan 2, 2018

RX14 added this to the Next milestone Jan 2, 2018

RX14 merged commit 68a7ce8 into crystal-lang:master Jan 2, 2018

woodruffw deleted the ini-parser-without-re branch January 2, 2018 16:43

lukeasrodgers pushed a commit to lukeasrodgers/crystal that referenced this pull request Jan 7, 2018

INI: Rewrite parser to avoid regular expressions (crystal-lang#5442)

34b35f7

`INI.parse` now parses an INI-formatted string into a `Hash` without using any regular expressions.
