Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert percentages stored as strings to numerics in formula calculations #3156

Merged
merged 10 commits into from
Nov 10, 2022
Merged

Convert percentages stored as strings to numerics in formula calculations #3156

merged 10 commits into from
Nov 10, 2022

Conversation

fdjohnston
Copy link
Contributor

@fdjohnston fdjohnston commented Nov 4, 2022

This is:

- [x] a bugfix
- [ ] a new feature
- [ ] refactoring
- [ ] additional unit tests

Checklist:

Why this change is needed?

This is a pull request to address #3155

A new function was added that checks it a string contains a percentage, and if so, converts it to a numeric which can be used in formula calculations.

Further to #3155, I noticed that percentages stored as strings in Excel were still being used in formulas; Excel is apparently converting them behind the scenes.  The same spreadsheet imported into PHPSpreadsheet was outputting a #VALUE! error.

Added a function, modeled on the `convertToNumberIfFraction()` function, to detect if a string contains a percentage, and if so, convert it to a numeric.  If the last character of the string is a % sign, the function will string the % sign and divide the value by 100, overwriting the original param with the new value.

New helper function was added to `StringHelper.php` along the original faction helper.
Adding two test cases:
1) Test the function in `StringHelper.php`.  Modeled this test on the test for the `convertToNumberIfFraction` function.
2) Test a spreadsheet with a string percentage in a formula to see if it  calculates the formula correctly.
PHPStan failed because of a binary operation "/" between string and 100.0.  Fixed this by casting the result of `str_replace` to a float.
*/
public static function convertToNumberIfPercent(string &$operand): bool
{
if (strpos($operand, '%', -1) !== false) {
Copy link
Member

@MarkBaker MarkBaker Nov 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this check should be tightened up, because a string that contains only '%' will return a true here, even with no actual numeric value to convert. MS Excel will treat this as a string, and it won't be treated as 0.0%, but will give a #VALUE! error if used in an arithmetic formula.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And MS Excel also treats a string containing a leading percentage sign (e.g. '%2.5') as a numeric percentage value in numeric calculations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may also be whitespace before a leadingg percentage sign, or after a trailing percentage sign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great feedback, thanks. Was only considering my particular case.
Just pushed a commit that switches to a regular expression. I've also attached an Excel sheet that tests a bunch of different permutations and combinations.
Let me know what you think.
Excel Test Cases.xlsx

Also just realized that I'm working on master on my fork instead of a new branch; sorry about that.

As suggested, the initial check has been tightened up using a regular expression.
Expressing allows for leading and trailing spaces, leading and trailing percentage symbols, with spaces mixed throughout.
Added 57 permutations of leading and trailing spaces, leading and trailing percent symbols, decimals, and negatives.  All seem to be working.  I will also attach an Excel worksheet that verifies the behaviour of all 57 test cases aligns with Excel.
*/
public static function convertToNumberIfPercent(string &$operand): bool
{
if (preg_match(self::STRING_REGEXP_PERCENT, $operand, $match)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of little tricks that we can do with the regexp to extract the numeric value directly, rather than doing the str_replace() call.

~^(?: *-? *\% *(?<PrefixedValue>[0-9]+\.?[0-9*]*) *| *\-? *(?<PostfixedValue>[0-9]+\.?[0-9]*) *\% *)$~

To avoid capture of the complete string, I've set that as a non-capture group, because we're only interested in whether the % character appears, we don't need to actually capture it or any padding around it; and then I've set capture groups for the actual numeric part of the string. I've also named these groups (PrefixedValue and PostfixedValue) so that they can be referenced by name in the $matches array.

Then, if a match is identified, we can extract the numeric part directly by name, whether it's prefixed by the % or postfixed.

$poperand = (float) ($match['PostfixedValue'] ?? $match['PrefixedValue']) / 100.0;

The null coalescence identifies whether the number is prefixed or postfixed (using the named groups from $matches); and then does the division by 100 and casts the result to a float. There's no function call overhead; just PHP operations, so it should be more efficient if it's being called in a loop thatr checks a lot of cell values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great trick with the regexp. I agree that eliminating the str_replace would be a big improvement.
Unfortunately the changes to the regexp break the negative percentage cases. I'll see if I can run with your idea and get it working for negative values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes! I was being a bit to quick and casual with the regexp changes. If the capture group is expanded to encompass the sign as well, then we're potentially also capturing whitespace too, which would break other things.

One option would be to have additional pre/postfix capture groups for the sign, and a match there would trigger a * -1 operation

~^(?: *(?<PrefixedNegative>-?) *\% *(?<PrefixedValue>[0-9]+\.?[0-9*]*) *| *(?<PrefixedNegative>\-?) *(?<PostfixedValue>[0-9]+\.?[0-9]*) *\% *)$~

and then

$sign = $match['PostfixedNegative'] ?? $match['PrefixedNegative'];
$poperand *= ($sign === '-') ? -1.0 : 1.0;

So we're still using only operations, and no function calls.

Just to be bloody minded and awkward, MS Excel recognises a % prefix with the sign before or after the %, so %-2.5 and -%2.5 are both recognised as numeric values. Fortunately with a postfix %, the sign must be before the numeric part.
And, of course, the sign could be a + as well as a -; and the numeric part could be in scientific format, -2.5E-4%.

But if you get the basic negative working, I'll merge and then do some more tweaking myself to handle a few of those edge cases; and refactor it all into a separate helper.

I did say right from the start that this was a lot more complicated once you start to look at it in detail. ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did warn me! I can't imagine what the code for Excel's formula parser must look like.
I'll take a crack at improving the regexp like you suggest.

Should we be concerned about any of the other spreadsheet applications? I'm focused entirely on Excel because it was the root of my issue, but I suppose it's worth considering how Open Office or Libre Office works? Won't that be a nightmare if they handle these various combinations of %, +, and - differently.... I don't have either installed at the moment but I'll download a copy tonight and see how they behave.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry about other spreadsheet applications: once it's merged, I'll take your code and do some reworking for the additional Excel cases that I know about while I'm refactoring it all into a helper class, and I'll do a systematic check using Open/Libre Office, Gnumeric and even WPS while I'm doing that.

There are sometimes discrepancies between the different Office Suite spreadsheets, which we handle using the compatibility mode - mostly in the implementation of functions - but we treat Excel itself as the default. The worst is that they aren't consistent between versions, even MS Excel has made changes between versions. We don't cater for that; but we always strive for the latest version... and that means retesting everything whenever there's a new major release of Excel/LibreOffice/Gnumeric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay sounds good. I was able to get it working for combinations of +, -, and scientific notation by expanding on the approach you suggested.
I broke out all the different pieces of the final calculation because it was getting a bit difficult to follow as single line, owing to all the null coalescing checks. We could simplify it by moving all those checks inline.
Note that, as I mention in the commit, I did have to add the PREG_UNMATCHED_AS_NULL flag to the preg_match, otherwise I was getting '' back when a capture group wasn't found.
Let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated my Excel sheet with various permutations.
Excel Test Cases.xlsx

Implemented the suggested named capture groups (nice trick) in the regexp to pull out only the values we need to perform the calculation.  Making use of the null coalescing operator to keep things nice and clean.  Note that I did have to add the `PREG_UNMATCHED_AS_NULL` to `preg_match` to ensure PHP returned null (since null coalescing only works on null and not '') when a match wasn't found for a capture group.

Also extended the regular expression to handle scientific notation (both positive and negative, and positive and negative exponents) as well as a negative symbol before or after a leading percentage sign.
Originally thought I was being clever by using zero in `pow` to get 1, however after thinking about it all day I think it's probably better to avoid the call to `pow` if the exponent is zero; one less function call that has to be made.

I didn't take the time to look into benchmarks of `pow` to the zero vs the ternary to multiply by 1, but my gut tells me this is likely slightly faster.
@MarkBaker
Copy link
Member

Of course, you don't need to use the pow() function at all, PHP has a power operator that can be used instead

But I'm sure that casting to float will convert a string like 1.23e-4 to a float value without needing that additional complexity, and that the regexp can be simplified to extract just sign and numeric value rather than sign, mantissa, exponent sign and exponent.
Also, E and e can both be used for scientific format, so the regexp needs to be case-insensitive with the i switch

As was pointed out, the entire `pow` piece, as well as extracting exponent and exponent sign, was not necessary as casting to `float` will correctly handle scientific notation in a string.
Also added case-insensitive flag to the regexp to handle upper and lower case `e`.
@fdjohnston
Copy link
Contributor Author

Yep, you're right. Thanks for your patience on this PR. Really appreciate your taking the time to point this stuff out.
I think this most recent push should take care of everything.

@MarkBaker MarkBaker merged commit 6f71efc into PHPOffice:master Nov 10, 2022
@MarkBaker
Copy link
Member

Thank you for your work on this PR; it is appreciated.
I know that I migfht have been asking you to do more than you needed for your own use case, but that makes it more beneficial to other developers that may encounter the same issue

@fdjohnston
Copy link
Contributor Author

Thanks for your patience working through the PR with me. I learned a couple new tricks with regular expressions. Happy to put in the extra work to improve PHPSpreadsheet. This is a fantastic library and I appreciate all the work that you and the other maintainers do to keep it up to date. Was happy to be able to contribute.

MarkBaker added a commit that referenced this pull request Dec 21, 2022
### Added

- Extended flag options for the Reader `load()` and Writer `save()` methods
- Apply Row/Column limits (1048576 and XFD) in ReferenceHelper [PR #3213](#3213)
- Allow the creation of In-Memory Drawings from a string of binary image data, or from a stream. [PR #3157](#3157)
- Xlsx Reader support for Pivot Tables [PR #2829](#2829)
- Permit Date/Time Entered on Spreadsheet to be calculated as Float [Issue #1416](#1416) [PR #3121](#3121)

### Changed

- Nothing

### Deprecated

- Direct update of Calculation::suppressFormulaErrors is replaced with setter.
- Font public static variable defaultColumnWidths replaced with constant DEFAULT_COLUMN_WIDTHS.
- ExcelError public static variable errorCodes replaced with constant ERROR_CODES.
- NumberFormat constant FORMAT_DATE_YYYYMMDD2 replaced with existing identical FORMAT_DATE_YYYYMMDD.

### Removed

- Nothing

### Fixed

- Fixed handling for `_xlws` prefixed functions from Office365 [Issue #3245](#3245) [PR #3247](#3247)
- Conditionals formatting rules applied to rows/columns are removed [Issue #3184](#3184) [PR #3213](#3213)
- Treat strings containing currency or accounting values as floats in Calculation Engine operations [Issue #3165](#3165) [PR #3189](#3189)
- Treat strings containing percentage values as floats in Calculation Engine operations [Issue #3155](#3155) [PR #3156](#3156) and [PR #3164](#3164)
- Xlsx Reader Accept Palette of Fewer than 64 Colors [Issue #3093](#3093) [PR #3096](#3096)
- Use Locale-Independent Float Conversion for Xlsx Writer Custom Property [Issue #3095](#3095) [PR #3099](#3099)
- Allow setting AutoFilter range on a single cell or row [Issue #3102](#3102) [PR #3111](#3111)
- Xlsx Reader External Data Validations Flag Missing [Issue #2677](#2677) [PR #3078](#3078)
- Reduces extra memory usage on `__destruct()` calls [PR #3092](#3092)
- Additional properties for Trendlines [Issue #3011](#3011) [PR #3028](#3028)
- Calculation suppressFormulaErrors fix [Issue #1531](#1531) [PR #3092](#3092)
- Permit Date/Time Entered on Spreadsheet to be Calculated as Float [Issue #1416](#1416) [PR #3121](#3121)
- Incorrect Handling of Data Validation Formula Containing Ampersand [Issue #3145](#3145) [PR #3146](#3146)
- Xlsx Namespace Handling of Drawings, RowAndColumnAttributes, MergeCells [Issue #3138](#3138) [PR #3136](#3137)
- Generation3 Copy With Image in Footer [Issue #3126](#3126) [PR #3140](#3140)
- MATCH Function Problems with Int/Float Compare and Wildcards [Issue #3141](#3141) [PR #3142](#3142)
- Fix ODS Read Filter on number-columns-repeated cell [Issue #3148](#3148) [PR #3149](#3149)
- Problems Formatting Very Small and Very Large Numbers [Issue #3128](#3128) [PR #3152](#3152)
- XlsxWrite preserve line styles for y-axis, not just x-axis [PR #3163](#3163)
- Xlsx Namespace Handling of Drawings, RowAndColumnAttributes, MergeCells [Issue #3138](#3138) [PR #3137](#3137)
- More Detail for Cyclic Error Messages [Issue #3169](#3169) [PR #3170](#3170)
- Improved Documentation for Deprecations - many PRs [Issue #3162](#3162)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants