Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More principled handling of string<->numeric conversions. #1796

Merged
merged 1 commit into from
Oct 27, 2017

Conversation

lgritz
Copy link
Collaborator

@lgritz lgritz commented Oct 27, 2017

This refactor comes after a belated realization that our numeric
conversion routines (principally from_string<> and format) are
"locale-dependent" via their use of strtod, as well as atof() scattered
through the code.

In particular, this means that when software using this code is running
with a locale in which the decimal mark is something other than '.'
(e.g., many European locales that use ',' for decimal mark), these
functions will incorrectly parse text floating point numbers that use
the standard period as decimal mark. For example, if the locale is
"fr_FR", atof("123.45") will return 123.0, rather than the correct
123.45.

After much debate, and a few implementations, we are declaring that
since the whole point of OIIO is about reading and writing persistent
data (image files!), that in order to be consistent, ALL of OIIO's
string parsing and formatting, including that done by utility library
functions, will be locale-independent by using the "C" locale (in
particular, we will use the usual North American default of '.' for
decimal separator).

Additionally:

  • Added a new stof(), stoi(), stoul() to replace the clunkier template
    syntax of from_string<>. As explained, these are hard-coded to "C" locale,
    not the native/global locale.

  • Much more attention to string_view safety, finding a number of subtle
    bugs (including Strutil::parse_int, etc.) where it was definitely not
    safe to assume that the string_view passed was the null-terminated edge
    of the string, and so, for example if a string_view pointed to the
    characters "12" but the very next characters were "345" parse_int would
    return 12345 instead of the correct 12.

  • Strutil::format and ustring::format (and our copy of tinyformat
    underneath) were touched up so that the versions of format() that return
    strings will use the "C" locale, versions added that return strings and
    use an explicitly-passed locale, and the versions that append their
    results to existing streams continue to operate as before by honoring
    the existing locale conventions of the streams they are using.
    We may revise this later depending on what the upstream tinyformat
    chooses to do about the issue.

  • Several places where we assembled strings using std::ostringstream,
    I forced the stream to use classic "C" locale in cases where it was
    likely to have any floating-point data output.

  • For maketx and oiiotool, globally set the locale to classic. In these
    cases, we control the whole app, so this is safe to do.

If we ever decide that we want versions of these functions that take
explicit locales, it's not hard to add.

@lgritz
Copy link
Collaborator Author

lgritz commented Oct 27, 2017

This PR is a counter-proposal to #1785. In this one, we are not adding any API calls that take explicit locales, instead just changing the documentation on existing APIs to specify that OIIO uses "C" locale exclusively for formatting numerical values. Internally, of course we still play some games with locales to get that right. And we still fix all the bugs discovered with string_view and whatnot that were part of the other proposal.

OIIO_CHECK_EQUAL (Strutil::stoi("+123"), 123);
OIIO_CHECK_EQUAL (Strutil::stoi(" 123 "), 123);
OIIO_CHECK_EQUAL (Strutil::stoi("123.45"), 123);
OIIO_CHECK_EQUAL (Strutil::stoi("12345678901234567890"), std::numeric_limits<int>::max());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some overflow cases too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an overflow case on line 392. What did you mean?

// Locale-independent quickie ASCII digit and alphanum tests, good enough
// for our parsing.
inline int isupper (char c) { return c >= 'A' && c <= 'Z'; }
inline int islower (char c) {return c >= 'a' && c <= 'z'; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: missing space

@fpsunflower
Copy link
Contributor

LGTM

Eventually I would love to see us roll our own strtod/strtof -- but they are harder to write, so this seems like a good start.

This refactor comes after a belated realization that our numeric
conversion routines (principally from_string<> and format) are
"locale-dependent" via their use of strtod, as well as atof() scattered
through the code.

In particular, this means that when software using this code is running
with a locale in which the decimal mark is something other than '.'
(e.g., many European locales that use ',' for decimal mark), these
functions will incorrectly parse text floating point numbers that use
the standard period as decimal mark.  For example, if the locale is
"fr_FR", atof("123.45") will return 123.0, rather than the correct
123.45.

After much debate, and a few implementations, we are declaring that
since the whole point of OIIO is about reading and writing persistent
data (image files!), that in order to be consistent, ALL of OIIO's
string parsing and formatting, including that done by utility library
functions, will be locale-independent by using the "C" locale (in
particular, we will use the usual North American default of '.' for
decimal separator).

Additionally:

* Added a new stof(), stoi(), stoul() to replace the clunkier template
syntax of from_string<>. As explained, these are hard-coded to "C" locale,
not the native/global locale.

* Much more attention to string_view safety, finding a number of subtle
bugs (including Strutil::parse_int, etc.)  where it was definitely not
safe to assume that the string_view passed was the null-terminated edge
of the string, and so, for example if a string_view pointed to the
characters "12" but the very next characters were "345" parse_int would
return 12345 instead of the correct 12.

* Strutil::format and ustring::format (and our copy of tinyformat
underneath) were touched up so that the versions of format() that return
strings will use the "C" locale, versions added that return strings and
use an explicitly-passed locale, and the versions that append their
results to existing streams continue to operate as before by honoring
the existing locale conventions of the streams they are using.
We may revise this later depending on what the upstream tinyformat
chooses to do about the issue.

* Several places where we assembled strings using std::ostringstream,
I forced the stream to use classic "C" locale in cases where it was
likely to have any floating-point data output.

* For maketx and oiiotool, globally set the locale to classic. In these
cases, we control the whole app, so this is safe to do.

If we ever decide that we want versions of these functions that take
explicit locales, it's not hard to add.
@lgritz
Copy link
Collaborator Author

lgritz commented Oct 27, 2017

We can easily drop in an alternate strtod implementation if we find a suitable one. Suitable means:

  • 100% sure it matches system strtod in all edge cases, never has LSB errors even in unusual circumstances.

  • Outperforms the current scheme.

  • Is straightforward to modify to work directly on a string_view, without risking introducing bugs in the process.

@lgritz lgritz merged commit 89f855c into AcademySoftwareFoundation:master Oct 27, 2017
@lgritz lgritz deleted the lg-string2 branch October 27, 2017 22:39
lgritz added a commit to lgritz/OpenImageIO that referenced this pull request Nov 1, 2017
…twareFoundation#1796)

This refactor comes after a belated realization that our numeric
conversion routines (principally from_string<> and format) are
"locale-dependent" via their use of strtod, as well as atof() scattered
through the code.

In particular, this means that when software using this code is running
with a locale in which the decimal mark is something other than '.'
(e.g., many European locales that use ',' for decimal mark), these
functions will incorrectly parse text floating point numbers that use
the standard period as decimal mark.  For example, if the locale is
"fr_FR", atof("123.45") will return 123.0, rather than the correct
123.45.

After much debate, and a few implementations, we are declaring that
since the whole point of OIIO is about reading and writing persistent
data (image files!), that in order to be consistent, ALL of OIIO's
string parsing and formatting, including that done by utility library
functions, will be locale-independent by using the "C" locale (in
particular, we will use the usual North American default of '.' for
decimal separator).

Additionally:

* Added a new stof(), stoi(), stoul() to replace the clunkier template
syntax of from_string<>. As explained, these are hard-coded to "C" locale,
not the native/global locale.

* Much more attention to string_view safety, finding a number of subtle
bugs (including Strutil::parse_int, etc.)  where it was definitely not
safe to assume that the string_view passed was the null-terminated edge
of the string, and so, for example if a string_view pointed to the
characters "12" but the very next characters were "345" parse_int would
return 12345 instead of the correct 12.

* Strutil::format and ustring::format (and our copy of tinyformat
underneath) were touched up so that the versions of format() that return
strings will use the "C" locale, versions added that return strings and
use an explicitly-passed locale, and the versions that append their
results to existing streams continue to operate as before by honoring
the existing locale conventions of the streams they are using.
We may revise this later depending on what the upstream tinyformat
chooses to do about the issue.

* Several places where we assembled strings using std::ostringstream,
I forced the stream to use classic "C" locale in cases where it was
likely to have any floating-point data output.

* For maketx and oiiotool, globally set the locale to classic. In these
cases, we control the whole app, so this is safe to do.

If we ever decide that we want versions of these functions that take
explicit locales, it's not hard to add.
@devernay
Copy link
Contributor

Would be great if this could be backported to 1.8 and maybe 1.7 too (I'm stuck on 1.7 on Windows because of #1795).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants