Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Convert Currencies stored as text during formula calculations #3165

Closed
1 of 8 tasks
fdjohnston opened this issue Nov 11, 2022 · 8 comments
Closed
1 of 8 tasks

Comments

@fdjohnston
Copy link
Contributor

This is:

- [ ] a bug report
- [x] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

Hi, it's me again. Just ran into a situation similar to the percentage case we just worked through, but this time with currencies. My specific case from the Formula Parser logs is:

Evaluating 1.07 + "$0.00"
Evaluation Result is a a #VALUE! error

My initial thoughts are we could do something similar to what we just did for percentages, but for currencies. I did some playing around in Excel and it seems that it is much more restrictive about what is considered to be a currency than it is with percentages, though I wonder how localization settings are playing into this. I expect one would have to handle cases with the currency symbol before or after the value, probably very similar to what we ran into with percentages. I haven't thought this all the way through, but I wonder if we could even get away with turning the symbol into a capture group in the regexp and depending on if it's a currency symbol or a % have the convertToNumberIfPercent() function omit the division by 100. There's no way it could be that simple, but I wonder if that is a direction that bears exploring.

I going to assume that I'm underestimating the complexity of this like I did with percentages, but I thought it was worth pointing out.

As before, I'd be happy to contribute a PR if this is something the maintainers think is worth fixing, or would this just be opening a can of worms?

What features do you think are causing the issue

  • Reader
  • Writer
  • Styles
  • Data Validations
  • Formula Calculations
  • Charts
  • AutoFilter
  • Form Elements
@MarkBaker
Copy link
Member

We would certainly take a similar approach (have a look at PR #3164 to see how I've extracted that logic into a Calculation Engine Helper).

Excel is a lot stricter about what does and doesn't qualify as a currency; and that is very much locale-based: I'm based in Europe, and not in the United States, so....
image!image
But it doesn't take locale decimal and thousands separators into consideration
image
It would take a lot more effort to identify all the variations that are accepted, and under which circumstances.


Excel also handles strings containing recognised date and time formats
image
And again, there are a lot of idiosyncracies in the precise formats that are recognised as date/time values


With any of these changes, we would want it to work for a majority of people, rather than with a very specific format that would trigger those developers who used different formats to raise errors.

@fdjohnston
Copy link
Contributor Author

I agree, we would want a solution that worked for as many locales as possible (ideally all of them). To further complicate things, I'm in English speaking Canada, so our currency notation is the same as the US, but in Quebec the dollar sign is displayed after the value. Even though we're in the same country, my Excel thinks a trailing dollar sign is invalid:
image

I did a bit of experimenting with the regular expression from my previous PR. By leveraging StringHelper::getCurrencyCode() we could do something like this:

        $match = [];
        $currencyCode = StringHelper::getCurrencyCode();
        $STRING_REGEXP_PERCENT = '~^(?:(?: *(?<PrefixedSign>[-+])? *(?<PrefixSymbol>[\%\\' . $currencyCode .']) *(?<PrefixedSign2>[-+])? *(?<PrefixedValue>[0-9]+\.?[0-9*]*(?:E[-+]?[0-9]*)?) *)|(?: *(?<PostfixedSign>[-+])? *(?<PostfixedValue>[0-9]+\.?[0-9]*(?:E[-+]?[0-9]*)?) *(?<PostfixedSymbol>[\%\\' . $currencyCode .']) *))$~i';
        if (preg_match($STRING_REGEXP_PERCENT, $operand, $match, PREG_UNMATCHED_AS_NULL)) {
            //Calculate the percentage
            $sign = ($match['PrefixedSign'] ?? $match['PrefixedSign2'] ?? $match['PostfixedSign']) ?? '';
            $symbol = $match['PrefixSymbol'] ?? $match['PostfixedSymbol'];
            $operand = (float) ($sign . ($match['PostfixedValue'] ?? $match['PrefixedValue'])) / ($symbol === '%' ? 100 : 1);

            return true;
        }

The big issue with this approach is that it will match some false positives. We might be able to use the p_cs_precedes and n_cs_precedes values from localeconv() to manipulate the regex before it runs. Then the big question will be how closely return values from localeconv() match what Excel does internally.

@MarkBaker
Copy link
Member

You wouldn't be able to concatenate the $currencyCode value directly into the regexp, because there's at least one very common currency code that has a specific meaning when it appears in a regexp; you'd have to preg_quote() it first.

The Euro symbol can occur before or after the value; and for some Euro countries, it can be before for currency and after for accounting formats:
image
However, even in the Netherlands there is a difference in the placement of the sign in relation to the currency symbol between currency and accounting formats:
image
and both are handled correctly when used in a formula
For France:
image
and for the Republic of Ireland
image

@fdjohnston
Copy link
Contributor Author

In my example regexp I threw a \ in before the concatenation to handle $ but your point is well taken; safer to use preg_quote() for sure.

I hadn't considered accounting formats. Unfortunately localeconv() only gives us information about currency formats, not accounting formats. For the Netherlands:

>>> localeconv()
=> [
     "decimal_point" => ",",
     "thousands_sep" => ".",
     "int_curr_symbol" => "EUR",
     "currency_symbol" => b"€",
     "mon_decimal_point" => ",",
     "mon_thousands_sep" => ".",
     "positive_sign" => "",
     "negative_sign" => "-",
     "int_frac_digits" => 2,
     "frac_digits" => 2,
     "p_cs_precedes" => 1,
     "p_sep_by_space" => 1,
     "n_cs_precedes" => 1,
     "n_sep_by_space" => 1,
     "p_sign_posn" => 4,
     "n_sign_posn" => 4,
     "grouping" => [
       3,
     ],
     "mon_grouping" => [
       3,
     ],
   ]

Without doing some kind of exhaustive enumeration process with Excel I can't think of a way to list all the different formats by locale.
One option would be to accept a currency symbol before or after the number, and acknowledge that this will create some false positives in some cases... but that feels dirty.

Looking around in the PHPSpreadsheet code I noticed that in src/PhpSpreadsheet/Style/NumberFormat.php only USD and EUR accounting formats are currently available. Perhaps a good starting place would be handle those two cases for now, and expand on them later?

@MarkBaker
Copy link
Member

The reason that only USD and EUR formats are available is because those were pre-defined formats in MS Excel in the early years; Excel 95 didn't store them as a string mask in the file, but as an id for an entry in a pre-defined table.

Earlier this year, I looked at using PHP's Intl extension to see if I could build a Number Format Wizard based around that. The ICU database holds all the information for both accounting and currency (as well as date, and other formats); but Intl may not be available in PHP if the build didn't enable it; and even then accounting is only available from PHP 7.4.1, and if built against ICU >= 53.0
Open/LibreOffice keeps this much simpler: it handles the percentages in the same way as Excel, but doesn't seem to handle currency strings at all

@fdjohnston
Copy link
Contributor Author

That makes sense. Thanks for the context.

Well, I'll leave this one up to you. As I mentioned above, I'd be happy to build out a PR using a regular expression that handles a general case, and then we can go back and forth a bit tightening it up around certain edge cases. Or do you think it's not worth perusing? The idea of a number format Wizard is interesting, but as you point out, would only work for certain builds. Perhaps a more general regexp based approach could be used to start, and then replaced with a more precise Intl based approach later?
Let me know.

@MarkBaker
Copy link
Member

I'd say go ahead with the more simplistic approach, submit a PR with the code that you posted earlier using the currency code from locale, and the before and after position regexp,and we'll experiment to see how many false positives we can generate.

Given how long it took before anybody even noticed that string percentages weren't being converted to numbers, I don't think it affects many people, so I don't think that false positives will be too much of a problem.

Make sure that you do a pull from master first, so that you have the refactoring that I did, and do the changes on a feature branch

@fdjohnston
Copy link
Contributor Author

Sounds good. I'll submit a PR later this week. I'll use a new form from master so that I get your changes.
Thanks again.

MarkBaker added a commit that referenced this issue Dec 21, 2022
### Added

- Extended flag options for the Reader `load()` and Writer `save()` methods
- Apply Row/Column limits (1048576 and XFD) in ReferenceHelper [PR #3213](#3213)
- Allow the creation of In-Memory Drawings from a string of binary image data, or from a stream. [PR #3157](#3157)
- Xlsx Reader support for Pivot Tables [PR #2829](#2829)
- Permit Date/Time Entered on Spreadsheet to be calculated as Float [Issue #1416](#1416) [PR #3121](#3121)

### Changed

- Nothing

### Deprecated

- Direct update of Calculation::suppressFormulaErrors is replaced with setter.
- Font public static variable defaultColumnWidths replaced with constant DEFAULT_COLUMN_WIDTHS.
- ExcelError public static variable errorCodes replaced with constant ERROR_CODES.
- NumberFormat constant FORMAT_DATE_YYYYMMDD2 replaced with existing identical FORMAT_DATE_YYYYMMDD.

### Removed

- Nothing

### Fixed

- Fixed handling for `_xlws` prefixed functions from Office365 [Issue #3245](#3245) [PR #3247](#3247)
- Conditionals formatting rules applied to rows/columns are removed [Issue #3184](#3184) [PR #3213](#3213)
- Treat strings containing currency or accounting values as floats in Calculation Engine operations [Issue #3165](#3165) [PR #3189](#3189)
- Treat strings containing percentage values as floats in Calculation Engine operations [Issue #3155](#3155) [PR #3156](#3156) and [PR #3164](#3164)
- Xlsx Reader Accept Palette of Fewer than 64 Colors [Issue #3093](#3093) [PR #3096](#3096)
- Use Locale-Independent Float Conversion for Xlsx Writer Custom Property [Issue #3095](#3095) [PR #3099](#3099)
- Allow setting AutoFilter range on a single cell or row [Issue #3102](#3102) [PR #3111](#3111)
- Xlsx Reader External Data Validations Flag Missing [Issue #2677](#2677) [PR #3078](#3078)
- Reduces extra memory usage on `__destruct()` calls [PR #3092](#3092)
- Additional properties for Trendlines [Issue #3011](#3011) [PR #3028](#3028)
- Calculation suppressFormulaErrors fix [Issue #1531](#1531) [PR #3092](#3092)
- Permit Date/Time Entered on Spreadsheet to be Calculated as Float [Issue #1416](#1416) [PR #3121](#3121)
- Incorrect Handling of Data Validation Formula Containing Ampersand [Issue #3145](#3145) [PR #3146](#3146)
- Xlsx Namespace Handling of Drawings, RowAndColumnAttributes, MergeCells [Issue #3138](#3138) [PR #3136](#3137)
- Generation3 Copy With Image in Footer [Issue #3126](#3126) [PR #3140](#3140)
- MATCH Function Problems with Int/Float Compare and Wildcards [Issue #3141](#3141) [PR #3142](#3142)
- Fix ODS Read Filter on number-columns-repeated cell [Issue #3148](#3148) [PR #3149](#3149)
- Problems Formatting Very Small and Very Large Numbers [Issue #3128](#3128) [PR #3152](#3152)
- XlsxWrite preserve line styles for y-axis, not just x-axis [PR #3163](#3163)
- Xlsx Namespace Handling of Drawings, RowAndColumnAttributes, MergeCells [Issue #3138](#3138) [PR #3137](#3137)
- More Detail for Cyclic Error Messages [Issue #3169](#3169) [PR #3170](#3170)
- Improved Documentation for Deprecations - many PRs [Issue #3162](#3162)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants