Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak / double free in address_parse_normalize_string #160

Closed
brianmacy opened this issue Feb 9, 2017 · 7 comments
Closed

Memory leak / double free in address_parse_normalize_string #160

brianmacy opened this issue Feb 9, 2017 · 7 comments

Comments

@brianmacy
Copy link

==26728== 1,077 bytes in 32 blocks are definitely lost in loss record 1,344 of 2,063
==26728==    at 0x4C2BB4A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==26728==    by 0x4FCCE938: utf8proc_map (utf8proc.c:610)
==26728==    by 0x4FCD35F1: normalize_string_utf8 (normalize.c:35)
==26728==    by 0x4FCD366E: normalize_string_latin (normalize.c:57)
==26728==    by 0x4FCDAF1C: address_parser_normalize_string (address_parser.c:153)
==26728==    by 0x4FCDAF1C: address_parser_parse (address_parser.c:812)
==26728==    by 0x4FCB6D8B: parse_address (libpostal.c:1043)

utf8proc_map uses realloc and parse_address has no way to deal with the pointer changing and the original being freed. If the pointer does change then parse_address has freed the address passed in without the client knowing and the new pointer is never freed.

Below is a potential fix.

diff --git a/src/address_parser.c b/src/address_parser.c
index 187bada..253b182 100644
--- a/src/address_parser.c
+++ b/src/address_parser.c
@@ -809,10 +809,12 @@ address_parser_response_t *address_parser_parse(char *address, char *language, c
return NULL;
}

  • char *normalized = address_parser_normalize_string(address);
  • char *cloned_address = strdup(address); // must do this as this call uses realloc and
  •                                        // we have no way to return a changed pointer
    
  • char *normalized = address_parser_normalize_string(cloned_address);
    bool is_normalized = normalized != NULL;
    if (!is_normalized) {
  •    normalized = address;
    
  •    normalized = cloned_address;
    

    }

    averaged_perceptron_t *model = parser->model;
    @@ -887,7 +889,7 @@ address_parser_response_t *address_parser_parse(char *address, char *language, c
    char **single_label = malloc(sizeof(char *));
    single_label[0] = label;
    char **single_component = malloc(sizeof(char *));

  •        single_component[0] = strdup(normalized);
    
  •        single_component[0] = normalized; // safe as we cloned the input
    
           response->num_components = 1;
           response->labels = single_label;
    

@@ -899,6 +901,8 @@ address_parser_response_t *address_parser_parse(char *address, char *language, c
}
}

  • free(normalized);

  • cstring_array *token_labels = cstring_array_new_size(tokens->n);

    char *prev_label = NULL;

@albarrentine
Copy link
Contributor

albarrentine commented Feb 9, 2017

Thanks for reporting. This version frees up the memory without making an additional copy: 38c6c26.

The use of realloc in utf8proc_map itself is not a problem per se because the function takes a double-pointer utf8proc_uint8_t **dstptr as an argument, so it can reset the pointer that was passed in if say the caller preallocated some memory and the function needed more. When we call it in libpostal, the initial pointer is NULL (realloc on NULL behaves like malloc), so it's always either going to return NULL or a pointer to a new string.

@brianmacy
Copy link
Author

My first patch was similar in logic. Here is the issue I believe you still have...
normalize_string_latin returns a non-NULL pointer in either of 2 cases:

  • have_utf8proc_options and options & NORMALIZE_STRING_REPLACE_HYPHENS

In the case of have_utf8proc_options, it does indeed return the realloc'd pointer which is changed or not. So in one case it returns the pointer to the original address owned by the caller of parse_address. In the other case, it freed that pointer and returns a new one.

I believe the code you added will still allow utf8proc_map to free the client's buffer... potentially causing heap corruption. If it doesn't do that then it is still possible that pointer you are deleting is the client's buffer.

Or there is a third option, that I'm missing something but valgrind seems to agree.

@brianmacy
Copy link
Author

Ah... nvm. I understand now. So the issue still occurs with this option. The code modifies the passed in string (the client buffer) and then returns a pointer to it. That is the pointer I believe you are deleting.

if (options & NORMALIZE_STRING_REPLACE_HYPHENS) {
    string_replace(str, '-', ' ');
    normalized = str;
}

@brianmacy
Copy link
Author

I didn't want to do it this way but it may be the best option...

diff --git a/src/address_parser.c b/src/address_parser.c
index 187bada..b2e836a 100644
--- a/src/address_parser.c
+++ b/src/address_parser.c
@@ -895,10 +895,15 @@ address_parser_response_t *address_parser_parse(char *address, char *language, c

         token_array_destroy(tokens);
         tokenized_string_destroy(tokenized_str);
  •        if (normalized != address)
    
  •          free(normalized);
           return response;
       }
    

    }

  • if (normalized != address)

  •  free(normalized);
    
  • cstring_array *token_labels = cstring_array_new_size(tokens->n);

    char *prev_label = NULL;

@albarrentine
Copy link
Contributor

With the current implementation of string_trim, that would free the caller's buffer if there are leading spaces in the input (thinking more of the expand_address calls, as the options used by the parser are more limited). The string_trim implementation, as well as that of string_replace (also currently modifies the pointer in-place, although it does not change the pointer's location) make copies in the upcoming parser-data release, and as such the functions in normalize.c clean up after all of their potential internal copies.

To reiterate re: utf8proc_map, it does not write to the input buffer. The pointer it's realloc-ing is not str (which is only read from), but uint8_t *utf8proc_normalized which is initialized as NULL and passed in as a double pointer so utf8proc_map can set/reset what it points to (if the user passed in a pre-allocated, non-NULL buffer, which we do not, and as noted before realloc-ing NULL is essentially just malloc).

Also, note that str in normalize_string_utf8 is not a double-pointer, so the effect of setting it is limited to the function's scope. The goal here is to chain several function calls depending on the options passed in, so we set str to the output of the last function for the next function to work on.

In my tests the pushed commits fixed all valgrind errors in the command-line client for a variety of inputs. Is this hypotethical or is there some input string that's causing this error for you?

@brianmacy
Copy link
Author

brianmacy commented Feb 10, 2017 via email

@albarrentine
Copy link
Contributor

Only need to free memory if more than one allocation takes place internally. If the options were simply NORMALIZE_STRING_TRIM, only string_trim gets called and allocates memory, so only one copy would be made in normalize_string_utf8, and that gets returned to be freed by the caller.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants