XML-Reader: support rich text #4001

Closed
SlowFox71 opened this issue Apr 25, 2024 · 1 comment · Fixed by #4007

Comments

@SlowFox71

This is:

- [x] a feature request

What is the expected behavior?

Accept formatting instructions, if present

What is the current behavior?

Everything is parsed as text

What are the steps to reproduce?

Read the attached file with XML reader and save as XLSX.

What features do you think are causing the issue?

- [x] Reader

Does an issue affect all spreadsheet file formats? If not, which formats are affected?

XML reader only

I implemented the desired behaviour in Xml::loadIntoExisting() in a rather brute-force way, as follows (untested); there may be a much better way to extract the inner content of the SimpleXMLElement:

    case 'String':
        $type = DataType::TYPE_STRING;

        // Rich text is stored as HTML child elements in this namespace
        $rich = $cellData->children('http://www.w3.org/TR/REC-html40');
        if ($rich) {
            // In case of HTML content we extract the payload (everything
            // between the opening and closing Data tags) and convert it
            // into a rich text object
            $content = (string) $cellData->asXML();
            // Start after the opening tag so text before the first child tag is kept
            $start = strpos($content, '>') + 1;
            $end = strrpos($content, '<');
            $content = substr($content, $start, $end - $start);

            $html = new Helper\Html();
            $cellValue = $html->toRichTextObject($content);
        }

        break;
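
One possible alternative to the string slicing above is to go through the DOM extension and serialize the child nodes one by one. This is only a sketch: `innerXml()` is a hypothetical helper, not part of PhpSpreadsheet, and the sample `ss:Data` element is made up for illustration.

```php
<?php

// Hypothetical helper (not a PhpSpreadsheet function): returns the
// serialized child nodes of an element, i.e. its "inner XML".
function innerXml(SimpleXMLElement $element): string
{
    // Switch from SimpleXML to the DOM view of the same node
    $node = dom_import_simplexml($element);
    $inner = '';
    foreach ($node->childNodes as $child) {
        // Serialize text nodes and element nodes alike
        $inner .= $node->ownerDocument->saveXML($child);
    }

    return $inner;
}

// Illustrative cell content only; real files contain a full workbook
$xml = '<ss:Data xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"'
    . ' ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40">'
    . 'plain <B>bold</B> tail</ss:Data>';
$cellData = simplexml_load_string($xml);
$inner = innerXml($cellData);
echo $inner, PHP_EOL;
```

Compared with `strpos`/`strrpos`, this does not depend on where the first and last `<` characters happen to fall, so leading and trailing text in the cell is kept.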

richtext_xml.txt

@oleibman
Collaborator

PR #4003 addresses several relatively simple Xml Reader issues. This one is a lot more complicated than the others, and will require more thought.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue May 1, 2024
Fix PHPOffice#4001. Thanks to @SlowFox71, who reported the problem and developed most of the solution. This PR adds Rich Text support to the XML reader. The Xml Spreadsheet format stores Rich Text as Html tags that are children of the `ss:Data` tag and use a specific namespace. These can be parsed into a RichText object using the existing method `Helper/Html::toRichTextObject`. Two items need special attention.

First, for attributes like bold or italic, Excel uses the appropriate Html tag (e.g. `<B>`). However, for an attribute like color, Excel uses `<Font html:Color="#FF0000">`, with a prefix on the Color attribute. PhpSpreadsheet's Html parser cannot cope with the prefix. The parser is changed to strip `html:` from attribute names for the Font tag.
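
As a rough illustration of the prefix stripping described above (not the code from the PR), the normalization could look like this. The function name `stripHtmlPrefix()` and the sample attribute list are assumptions made for the example.

```php
<?php

// Hypothetical helper: drops a leading "html:" prefix from an attribute name
function stripHtmlPrefix(string $name): string
{
    return preg_replace('/^html:/', '', $name) ?? $name;
}

// Illustrative attributes as they might appear on a Font tag
$attributes = ['html:Color' => '#FF0000', 'html:Size' => '12', 'Face' => 'Arial'];

$normalized = [];
foreach ($attributes as $name => $value) {
    // Names without the prefix pass through unchanged
    $normalized[stripHtmlPrefix($name)] = $value;
}

print_r($normalized);
```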

Second, the example cited by the user used a `<BR />` to indicate a line break in the data. However, it appears that, at least some of the time, Excel will instead use `&#10;` to indicate a line break. The existing parser reduces one or more whitespace characters in the text to a single space, so the line break represented by `&#10;` is lost. I am not sure why the existing code does this, but I do know that I am not willing to break it. Instead, I've added an optional boolean parameter `$preserveWhiteSpace` to `toRichTextObject`. If false (default), the existing logic will be used; but if true, substitution for whitespace characters in the text will not happen.
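
The effect of such a flag can be sketched in isolation. The function below is a hypothetical stand-in, not the library's implementation; it only shows how collapsing whitespace swallows a line feed and how an opt-in flag avoids that.

```php
<?php

// Hypothetical stand-in for the parser's text handling (not PhpSpreadsheet code)
function normalizeText(string $text, bool $preserveWhiteSpace = false): string
{
    if ($preserveWhiteSpace) {
        // Leave line feeds and other whitespace untouched
        return $text;
    }

    // Default behaviour: collapse each run of whitespace into one space
    return preg_replace('/\s+/', ' ', $text) ?? $text;
}

// "\n" is what the character reference &#10; becomes once the XML is parsed
$text = "line one\nline two";

$collapsed = normalizeText($text);
$preserved = normalizeText($text, true);

var_dump($collapsed === 'line one line two');
var_dump($preserved === $text);
```

Keeping the default at false means existing callers see no change in behaviour, and only the XML reader needs to opt in.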
oleibman mentioned this issue May 1, 2024