Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements #365

Closed
TurpentineDistillery opened this issue Nov 22, 2016 · 30 comments
Closed

Performance improvements #365

TurpentineDistillery opened this issue Nov 22, 2016 · 30 comments

Comments

@TurpentineDistillery
Copy link

TurpentineDistillery commented Nov 22, 2016

I propose some minor changes that yield substantial performance improvements.

diff --git a/src/json.hpp b/src/json.hpp
index a302bb0..0a8ab78 100644
--- a/src/json.hpp
+++ b/src/json.hpp
@@ -8755,10 +8755,10 @@ basic_json_parser_66:
                 {
                     // copy unprocessed characters to line buffer
                     m_line_buffer.clear();
-                    for (m_cursor = m_start; m_cursor != m_limit; ++m_cursor)
-                    {
-                        m_line_buffer.append(1, static_cast<const char>(*m_cursor));
-                    }
+                    m_line_buffer.append(
+                            reinterpret_cast<const typename string_t::value_type*>(m_start),
+                            static_cast<size_t>(m_limit - m_start));
+                    m_cursor = m_limit;
                 }

                 // append n characters to make sure that there is sufficient
@@ -8770,11 +8770,13 @@ basic_json_parser_66:
             {
                 // delete processed characters from line buffer
                 m_line_buffer.erase(0, static_cast<size_t>(offset_start));
+
                 // read next line from input stream
-                std::string line;
-                std::getline(*m_stream, line, '\n');
+                m_line_buffer_tmp.clear();
+                std::getline(*m_stream, m_line_buffer_tmp, '\n');
                 // add line with newline symbol to the line buffer
-                m_line_buffer += line + "\n";
+                m_line_buffer += m_line_buffer_tmp;
+                m_line_buffer.push_back('\n');
             }

             // set pointers
@@ -8861,9 +8863,15 @@ basic_json_parser_66:
             // iterate the result between the quotes
             for (const lexer_char_t* i = m_start + 1; i < m_cursor - 1; ++i)
             {
-                // process escaped characters
-                if (*i == '\\')
-                {
+                // number of non-escaped characters
+                const size_t n = static_cast<size_t>(std::find(i, m_cursor - 1, '\\') - i);
+
+                if(n != 0) {
+                    result.append(reinterpret_cast<const typename string_t::value_type*>(i), n);
+                    i += n-1; // -1 because will ++i
+                } else {
+                    // processing escaped character
+
                     // read next character
                     ++i;

@@ -8950,12 +8958,6 @@ basic_json_parser_66:
                         }
                     }
                 }
-                else
-                {
-                    // all other characters are just copied to the end of the
-                    // string
-                    result.append(1, static_cast<typename string_t::value_type>(*i));
-                }
             }

             return result;
@@ -9139,6 +9141,8 @@ basic_json_parser_66:
         std::istream* m_stream = nullptr;
         /// line buffer buffer for m_stream
         string_t m_line_buffer {};
+        /// used for filling m_line_buffer
+        string_t m_line_buffer_tmp {};
         /// the buffer pointer
         const lexer_char_t* m_content = nullptr;
         /// pointer to the beginning of the current symbol
Before
parse jeopardy.json                           1  1746079858 ns/op
parse canada.json                            50    49808100 ns/op
parse citm_catalog.json                      50    30050798 ns/op
parse twitter.json                          100    15523826 ns/op
parse numbers/floats.json                     5   539260495 ns/op
parse numbers/signed_ints.json                5   346619225 ns/op
parse numbers/unsigned_ints.json              5   348103862 ns/op
dump jeopardy.json                            1  1882424034 ns/op
dump jeopardy.json with indent                1  2333816314 ns/op

After:
parse jeopardy.json                           1  1392954141 ns/op
parse canada.json                            50    47828487 ns/op
parse citm_catalog.json                     100    19988139 ns/op
parse twitter.json                          100    10175643 ns/op
parse numbers/floats.json                     5   428099836 ns/op
parse numbers/signed_ints.json                5   232719693 ns/op
parse numbers/unsigned_ints.json              5   234741840 ns/op
dump jeopardy.json                            1  1871500322 ns/op
dump jeopardy.json with indent                1  2512231626 ns/op
@nlohmann
Copy link
Owner

Very nice!

@nlohmann
Copy link
Owner

I can confirm the improvements. I shall create a feature branch.

nlohmann added a commit that referenced this issue Nov 22, 2016
@nlohmann nlohmann added this to the Release 2.0.8 milestone Nov 22, 2016
@nlohmann
Copy link
Owner

nlohmann commented Nov 22, 2016

There are tests failing with Linux: https://travis-ci.org/nlohmann/json/builds/177884351

@gregmarr
Copy link
Contributor

dump jeopardy.json with indent
Any idea why this one is almost 10% slower?

@nlohmann
Copy link
Owner

@gregmarr When I ran the benchmarks, I did not notice the deviation. However, the numbers differed from run to run. I shall have a second look and post my results here.

Right now, I need to understand why the tests fail on Linux...

@TurpentineDistillery
Copy link
Author

TurpentineDistillery commented Nov 22, 2016

@gregmarr, this must be experimental variance.
BTW, adding the following snippet to benchmarking code may make the results more consistent.

#include <pthread.h>
#include <thread>
struct StartUp
{
    StartUp()
    {
#ifndef __llvm__
        // pin thread to a single CPU
        cpu_set_t cpuset;
        pthread_t thread;
        thread = pthread_self();
        CPU_ZERO(&cpuset);
        CPU_SET(std::thread::hardware_concurrency() - 1, &cpuset);
        pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
#endif
    }
};
StartUp startup;
#endif

Why errors though... The changes looked so functionally-invariant to me.

@gregmarr
Copy link
Contributor

Ah, I had assumed that this already did multiple runs and averaged the results to eliminate the variances.

@gregmarr
Copy link
Contributor

gregmarr commented Nov 22, 2016

I'm not sure if it's causing the failures here, but the reinterpret_cast might be a problem.

-               result.append(1, static_cast<typename string_t::value_type>(*i));

+               result.append(reinterpret_cast<const typename string_t::value_type*>(i), n);

The former essentially converts a character from the lexer type to the string type. If the two types are different sizes, the compiler converts from one to the other.

The latter treats a pointer to an array of lexer character types to a pointer to an array of string character types. If the two character types are different sizes, then this is a major bug as the string data is interpreted improperly. Additionally, if the string character type is larger than the lexer character type, then this is an array out of bounds access.

You probably want to use this overload instead:

template<class InputIt> basic_string& append(InputIt first, InputIt last);

            // find next escape character
            auto e = std::find(i, m_cursor - 1, '\\');

            if(e != i) {
                result.append(i, e);
                i = e-1; // -1 because will ++i
            } else {

Similarly, here:

                m_line_buffer.clear();
                m_line_buffer.append(
                        reinterpret_cast<const typename string_t::value_type*>(m_start),
                        static_cast<size_t>(m_limit - m_start));

should probably be

                m_line_buffer.assign(m_start, m_limit);

Note that with assign you don't need to clear first.

@nlohmann
Copy link
Owner

The tests seem to succeed, but the benchmarks are not that stable:

[~/.../Repositories/json] ./json_benchmarks_develop --benchtime 60
parse jeopardy.json                          50  2465207545 ns/op
parse canada.json                           500   127668921 ns/op
parse citm_catalog.json                    1000    60734985 ns/op
parse twitter.json                         5000    28945175 ns/op
parse numbers/floats.json                   100   976098389 ns/op
parse numbers/signed_ints.json              100   769079242 ns/op
parse numbers/unsigned_ints.json            100  1079325121 ns/op
dump jeopardy.json                           50  1805855545 ns/op
dump jeopardy.json with indent               50  1806103893 ns/op
./json_benchmarks_develop 1502.94s

[~/.../Repositories/json] ./json_benchmarks_after --benchtime 60
parse jeopardy.json                          50  2137069074 ns/op
parse canada.json                           500   146579088 ns/op
parse citm_catalog.json                    1000    69727750 ns/op
parse twitter.json                         2000    37405801 ns/op
parse numbers/floats.json                    50  1764961957 ns/op
parse numbers/signed_ints.json              100   858041927 ns/op
parse numbers/unsigned_ints.json            100   819768276 ns/op
dump jeopardy.json                           50  1474386834 ns/op
dump jeopardy.json with indent               50  1721278953 ns/op
./json_benchmarks_after 1334.51s

I won't merge this right now. I think there should be an improvement, but it seems to be time to change the benchmarking tool... Any ideas?

@TurpentineDistillery
Copy link
Author

TurpentineDistillery commented Nov 22, 2016

WRT benchmarking:
Adding pin-thread-to-cpu code snippet in benchmarks.cpp reduces the measurement variance dramatically for me (needs to be compiled with -pthread).
Also, the input files should be memory-mapped or in warm-cache as to remove the overhead of reading from disk during the first execution.

WRT changes:
The code in the if-clause under "//skip this part if we are already using the line buffer" is super-dangerous. We are modifying a string (m_line_buffer) from pointers-into-the-very-same string! Let's say the string's implementation decides to shrink or increase the underlying storage during the modification, or a programmer decides to call reserve( ) or something - the pointers are now invalidated while they're still being used. Another scenario is that if source and destination ranges overlap, some source bytes may be overwritten before they are read.

I'm not saying that there's an active bug, but that it is fraught with danger.

Perhaps make use of m_line_buffer_tmp there?

                    // copy unprocessed characters to line buffer
                    m_line_buffer_tmp.assign(m_start, m_limit);
                    std::swap(m_line_buffer, m_line_buffer_tmp);

@gregmarr
Copy link
Contributor

gregmarr commented Nov 23, 2016

I think the intent is that the code under skip this part if we are already using the line buffer is not executed when start points into m_line_buffer. Otherwise, this code would make no sense:

                // copy unprocessed characters to line buffer
                m_line_buffer.clear();
                for (m_cursor = m_start; m_cursor != m_limit; ++m_cursor)
                {
                    m_line_buffer.append(1, static_cast<const char>(*m_cursor));
                }

However, the current test doesn't really cover that properly. The m_start can point into m_line_buffer without being at the start.

@TurpentineDistillery
Copy link
Author

@gregmarr, the following innocent-looking change causes json_benchmarks to segfault for me, so here-be-dragons.

--- json.hpp.1	2016-11-23 08:05:24.107402000 -0500
+++ json.hpp	2016-11-23 08:07:30.069845000 -0500
@@ -8733,6 +8733,7 @@
                 if (m_start != reinterpret_cast<const lexer_char_t*>(m_line_buffer.data()))
                 {
                     // copy unprocessed characters to line buffer
+                    m_line_buffer.reserve(static_cast<size_t>(n + m_limit - m_start));
                     m_line_buffer.assign(m_start, m_limit);
                     m_cursor = m_limit;
                 }

@nlohmann
WRT "reducing benchmark variance" commit:
When I was talking about factoring-out disk access I meant more along the lines of
a) Copy the file into a stringstream, and read from stringstream in the benchmarking loop.
b) OR turn the timer off during the first iteration of the loop, hoping that OS will keep the file's pages in RAM on subsequent iterations of the loop.

I'd hope this would do more good than pin-thread-to-cpu, but I have not tried it.

@nlohmann
Copy link
Owner

I'll have a look. It may take some time, because I got a new computer and will need some time to setup...

@gregmarr
Copy link
Contributor

That change is unnecessary but does point out that the check needs to be improved to detect more cases where the buffer is in use.

@TurpentineDistillery
Copy link
Author

I refactored the benchmarking code as follows; I think this would yield the timings as consistently as we can reasonably expect

#define BENCHPRESS_CONFIG_MAIN

#include <fstream>
#include <sstream>
#include <benchpress.hpp>
#include <json.hpp>

#include <pthread.h>
#include <thread>
struct StartUp
{
    StartUp()
    {
#ifndef __llvm__
        cpu_set_t cpuset;
        pthread_t thread;
        thread = pthread_self();
        CPU_ZERO(&cpuset);
        CPU_SET(std::thread::hardware_concurrency() - 1, &cpuset);
        pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
#endif
    }
};
StartUp startup;

enum class EMode { input, output_no_indent, output_with_indent };
static void bench(
        benchpress::context& ctx,
        const std::string& in_path,
        const EMode mode)
{
    // using stringstreams for benchmarking
    // to factor-out cold-cache disk access.

    std::stringstream istr;
    {
        std::ifstream input_file(in_path);
        istr << input_file.rdbuf();
        input_file.close();

        // read the stream once
        nlohmann::json j;
        j << istr;
        istr.clear(); // clear flags
        istr.seekg(0);
    }

    ctx.reset_timer();
    if(mode == EMode::input) {
        // benchmarking input
        for (size_t i = 0; i < ctx.num_iterations(); ++i)
        {
            istr.clear(); // clear flags
            istr.seekg(0);
            nlohmann::json j;
            j << istr;
        }
    } else {
        // benchmarking output
        nlohmann::json j;
        j << istr;
        std::stringstream ostr;
        ctx.reset_timer();
        for (size_t i = 0; i < ctx.num_iterations(); ++i)
        {
            if(mode == EMode::output_no_indent) {
                ostr << j;
            } else {
                ostr << std::setw(4) << j;
            }
            ostr.str(std::string()); // reset data;
        }
    }
}

#define BENCHMARK_I(mode, title, in_path)       \
BENCHMARK((title), [](benchpress::context* ctx) \
{                                               \
    bench(*ctx, (in_path), (mode));    \
})


BENCHMARK_I(EMode::input, "parse jeopardy.json",              "benchmarks/files/jeopardy/jeopardy.json")
BENCHMARK_I(EMode::input, "parse canada.json",                "benchmarks/files/nativejson-benchmark/canada.json");
BENCHMARK_I(EMode::input, "parse citm_catalog.json",          "benchmarks/files/nativejson-benchmark/citm_catalog.json");
BENCHMARK_I(EMode::input, "parse twitter.json",               "benchmarks/files/nativejson-benchmark/twitter.json");
BENCHMARK_I(EMode::input, "parse numbers/floats.json",        "benchmarks/files/numbers/floats.json");
BENCHMARK_I(EMode::input, "parse numbers/signed_ints.json",   "benchmarks/files/numbers/signed_ints.json");
BENCHMARK_I(EMode::input, "parse numbers/unsigned_ints.json", "benchmarks/files/numbers/unsigned_ints.json");
//unsigned_ints can be yanked, as it is largerly similar to signed_ints

BENCHMARK_I(EMode::output_no_indent,   "dump jeopardy.json",             "benchmarks/files/jeopardy/jeopardy.json");
BENCHMARK_I(EMode::output_with_indent, "dump jeopardy.json with indent", "benchmarks/files/jeopardy/jeopardy.json");
BENCHMARK_I(EMode::output_no_indent,   "dump numbers/floats.json",       "benchmarks/files/numbers/floats.json");
BENCHMARK_I(EMode::output_no_indent,   "dump numbers/signed_ints.json",  "benchmarks/files/numbers/signed_ints.json");

Observations: after @gregmarr's fixes the performance improvement is still there, except for parse jeopardy.json, whose performance is in-between that of develop branch (slowest) and the head of issue365 branch (my original proposal with the sloppy reinterpret_cast).

@nlohmann
Copy link
Owner

Indeed - the numbers of the modified benchmark program are much more stable. I'll commit an update in a minute. I added the code for the unsigned integers, because I was planning to split the integer conversion for a signed and an unsigned case.

@nlohmann
Copy link
Owner

Ok, I let some more benchmarks run and the code in https://github.com/nlohmann/json/tree/feature/issue365 is faster than the one in the development branch.

So where should we go from here? Do you still see potential errors?

@nlohmann nlohmann added the state: help needed the issue needs help to proceed label Nov 24, 2016
@TurpentineDistillery
Copy link
Author

I'll take a closer look at the code section that looked suspicious.

@TurpentineDistillery
Copy link
Author

Please see below.
Keep whichever #if branch is more to your liking.

index 8704134..33afb99 100644
--- a/src/json.hpp
+++ b/src/json.hpp
@@ -8719,23 +8719,47 @@ basic_json_parser_66:
         */
         void fill_line_buffer(size_t n = 0)
         {
+            assert(m_line_buffer.empty()
+                || m_content == reinterpret_cast<const lexer_char_t*>(m_line_buffer.data()));
+
+            assert(m_line_buffer.empty()
+                || m_limit   == m_content + m_line_buffer.size());
+
+            assert(m_content <= m_start);
+            assert(m_start   <= m_cursor);
+            assert(m_cursor  <= m_limit);
+            assert(m_marker  <= m_limit || !m_marker);
+
             // number of processed characters (p)
-            const auto offset_start = m_start - m_content;
+            const size_t num_processed_chars = static_cast<size_t>(m_start - m_content);
+
             // offset for m_marker wrt. to m_start
             const auto offset_marker = (m_marker == nullptr) ? 0 : m_marker - m_start;
+
             // number of unprocessed characters (u)
             const auto offset_cursor = m_cursor - m_start;

             // no stream is used or end of file is reached
             if (m_stream == nullptr or m_stream->eof())
             {
-                // skip this part if we are already using the line buffer
-                if (m_start != reinterpret_cast<const lexer_char_t*>(m_line_buffer.data()))
+
+#if 1 //https://stackoverflow.com/questions/28142011/can-you-assign-a-substring-of-a-stdstring-to-itself
+
+                // m_start may or may not be pointing into
+                // m_line_buffer at this point. We trust the
+                // standand library to do the right thing.
+                m_line_buffer.assign(m_start, m_limit);
+#else
+                if(m_line_buffer.empty())
                 {
-                    // copy unprocessed characters to line buffer
+                    // start using the buffer
                     m_line_buffer.assign(m_start, m_limit);
-                    m_cursor = m_limit;
+                } else {
+                    // delete processed characters from line buffer.
+                    // semantically this is the same as the .assign(...) above
+                    m_line_buffer.erase(0, num_processed_chars);
                 }
+#endif

                 // append n characters to make sure that there is sufficient
                 // space between m_cursor and m_limit
@@ -8745,7 +8769,8 @@ basic_json_parser_66:
             else
             {
                 // delete processed characters from line buffer
-                m_line_buffer.erase(0, static_cast<size_t>(offset_start));
+                m_line_buffer.erase(0, num_processed_chars);
+
                 // read next line from input stream
                 m_line_buffer_tmp.clear();
                 std::getline(*m_stream, m_line_buffer_tmp, '\n');
@@ -8756,7 +8781,7 @@ basic_json_parser_66:
             }

             // set pointers
-            m_content = reinterpret_cast<const lexer_char_t*>(m_line_buffer.c_str());
+            m_content = reinterpret_cast<const lexer_char_t*>(m_line_buffer.data());
             assert(m_content != nullptr);
             m_start  = m_content;
             m_marker = m_start + offset_marker;

@gregmarr
Copy link
Contributor

gregmarr commented Nov 25, 2016

The standard says that the assign from iterators is valid, as it is equivalent to creating a temporary string from the iterators and assigning from that temporary:

template<class InputIterator>
basic_string& assign(InputIterator first, InputIterator last);
    Effects: Equivalent to assign(basic_string(first, last)).

@TurpentineDistillery
Copy link
Author

@gregmarr, I see now.
My concern was that string::assign(...) may invalidate any related iterators, pointers, or references, so the effect would be equivalent to assigning from possibly invalidated iterators or pointers.
However, if the standard requires that as-if temporary is created before the call to assign(...), then that's safe.

@TurpentineDistillery
Copy link
Author

One more thing:

diff --git a/src/json.hpp b/src/json.hpp
index 8704134..8ced3e7 100644
--- a/src/json.hpp
+++ b/src/json.hpp
@@ -8843,7 +8843,9 @@ basic_json_parser_66:
                 auto e = std::find(i, m_cursor - 1, '\\');
                 if (e != i)
                 {
-                    result.append(i, e);
+                    for(auto k = i; k < e; k++) {
+                        result.push_back(static_cast<typename string_t::value_type>(*k));
+                    }
                     i = e - 1; // -1 because of ++i
                 }
                 else

The original intent with using reinterpret cast was to hint the compiler: "hey, you can optimize moving a contiguous chunk of bytes from here to there instead of shuffling them one-byte-at-a-time". Unfortunately, that turned out to be a problem.
append(i, e), on the other hand, while functionally correct, actually makes things worse performance-wise: under the hood the standard library implementation actually allocates a temporary string to do the work. Eeek! Using a simple loop instead gives additional 20% speedup for jeopardy.json

@gregmarr
Copy link
Contributor

@TurpentineDistillery I wasn't sure if assign was safe, so I went to the standard to check just before I posted that. :)

append(i, e), on the other hand, while functionally correct, actually makes things worse performance-wise:

Which STL are you testing with? That sounds like it could be a point of variance among libraries.

@TurpentineDistillery
Copy link
Author

@gregmarr, very likely the case.
That was with gcc 4.9.3

nlohmann added a commit that referenced this issue Nov 26, 2016
@nlohmann
Copy link
Owner

@TurpentineDistillery @gregmarr Thanks for the further insights. I pushed them to the feature branch (see 2773038).

@TurpentineDistillery
Copy link
Author

@nlohmann,
Which version of re2c are you using? I tried make re2c with version 0.15.3 and it generated different, more compact, lexer code.
I was thinking about splittig the number token-type into floating-point and integer cases, such that when it comes to parsing we'll know what type of a number we're dealing with a-priori, so that section of code can be simplified.

@nlohmann
Copy link
Owner

@TurpentineDistillery, I am using re2c 0.16.

I once started on this myself - maybe you can recycle some code from branch feature/unfolded_string_to_int - but it's been a while since I worked on this.

@TurpentineDistillery
Copy link
Author

I looked at the code in that branch, and it looked basically what I envisioned it would look like if I were to make the changes, so I don't think there's anything else for me to do there.

I reviewed the implementation of the parser. In the doc you expressed hope that LALR would be more efficient than recursive-descent. I'm not an authority on parsers, but I'm pretty sure that's not the case here. There's no place in the grammar where the recursive-descent may go "off-track" by more than one token, and that may happen only rarely. The above leads me to believe that the implementation of the parser is optimal from the theoretic standpoint.

Elsewhere you said that you think that the lexer code is near-optimal, but looking at the expansion of the auto-generated re2c code I can neither confirm or deny that : ). re2c folks say " the generated code is as good as a carefully tuned hand-crafted C/C++ lexer", but I dunno...

@nlohmann
Copy link
Owner

@TurpentineDistillery Thanks for checking back. I shall merge the branch. If anything else comes up, we can have a fresh look.

There is one reason I would like to have an LALR parser: By using a parser generator such as Bison, the actual written code would be concise and easy to verify compared to the recursive descent. In that sense, I am happy to use re2c, because it is easy to check the used regular expressions. However, I haven't found any tool that generates code that is easy to put into a header-only library. At the same time, my confidence in the current parser is quite high.

I had some deeper looks into the code generated by re2c when I tried to improve the code coverage. It is not straightforward, but in the end they work with a reduced DFA, and I haven't seen anything odd there.

@nlohmann
Copy link
Owner

Merged: a8522f3.

@nlohmann nlohmann removed the state: help needed the issue needs help to proceed label Nov 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants