Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of ElementTree changes HTML output #558

Closed
matthewhegarty opened this issue Jul 23, 2023 · 13 comments
Closed

Addition of ElementTree changes HTML output #558

matthewhegarty opened this issue Jul 23, 2023 · 13 comments

Comments

@matthewhegarty
Copy link
Contributor

matthewhegarty commented Jul 23, 2023

I noticed that this change breaks the CI build in django-import-export. This is for cases when we are escaping harmful characters such as script tags.

If there is already an escaped character in content, then the use of ElementTree will double escape, which will not format correctly. This test (added to HTMLTests) will reproduce the issue:

    def test_html_export_with_special_chars(self):
        self.founders = tablib.Dataset(headers=self.headers, title='Founders')
        self.founders.append(('J & J', 'A', 90))
        self.assertIn("J & J", self.founders.html)

(this test passes in v3.3.0)

This test fails because the '&' string is exported as '&' So if someone has 'J & J' in their content (i.e. rendered in escape & amp;), they will now get "double escaping" 'J & amp;amp; J'.

@claudep
Copy link
Contributor

claudep commented Jul 23, 2023

I understand this is a breaking change (if not reverted, it should be documented). However I think that security-wise, it's a much better default behavior than the previous one. Maybe we should add an option to the format so that the client code can specify that it already cared for escaping the content (and thus avoid the double-escaping)?

@matthewhegarty
Copy link
Contributor Author

As long as the escaping is done we can handle it ok (so no need for an additional option), however it would be ideal if we could opt into the new ElementTree format when we are ready. Perhaps to do that you could add a deprecation path using a settings flag so that clients can opt into ElementTree? I think without this you could well get complaints that exports have changed when the next release goes out. Then it could be documented and the change explained (it took me a little while to figure out why our CI build was broken).

@matthewhegarty
Copy link
Contributor Author

Hi Claude - are you ok with me submitting a PR to try to make this new dependency on ElementTree optional? Then we can plan the deprecation in django-import-export. Please let me know and I'll pull it together.

@claudep
Copy link
Contributor

claudep commented Sep 11, 2023

Submitting PRs doesn't require any permission 😄

@matthewhegarty
Copy link
Contributor Author

Understood... Just wanted to double-check it was something you would consider before I started.

@matthewhegarty
Copy link
Contributor Author

I looked into whether we could control escaping of text elements using a parameter or similar, but this is not possible (certain characters always get escaped).

@matthewhegarty
Copy link
Contributor Author

@claudep would be grateful if you could take a look at #562

@claudep
Copy link
Contributor

claudep commented Sep 11, 2023

Thanks for suggesting the pull request. However, I'm not thrilled at all by generating HTML by hand as a string (both performance and security-wise). Hopefully other contributors will chime in and tell us their feelings.

Did you try on the django-import-export side to conditionally escape based on tablib version? Would it be doable?

@matthewhegarty
Copy link
Contributor Author

Thanks Claude, I did think there must be a good reason why the HTML has not been generated by hand previously, as it seems like the simplest approach considering we are outputting a table. Perhaps escaping the values by default (i.e. with html.escape()) would allay the security concerns, because then you would always end up with an HTML table containing safe values, and this passes all current tests. If escaping by default, I can't see how this would create any security issues, and would allow users to disable escaping if they needed to.

I cannot see performance being an issue except maybe for very large datasets, and even then I don't think it would be too dissimilar to existing versions.

Did you try on the django-import-export side to conditionally escape based on tablib version? Would it be doable?

It would not be doable because the current version of tablib will always escape (because it uses ElementTree). We'd have to find a way to conditionally load a previous version of tablib (i.e. with MarkupPy) to ensure we didn't break any existing implementations.

It might be that we have to release our new version with the ElementTree dependency, and see if anyone complains.

@claudep
Copy link
Contributor

claudep commented Sep 11, 2023

I see django-import-export supports html export without escaping content. Do you think that there are valid use cases for that capability? Is deprecating this an option for you?

@matthewhegarty
Copy link
Contributor Author

Do you think that there are valid use cases for that capability?

I think there is a valid use case, which is if users already have escaped content in their db, and they want to export without double escaping. However, I don't know if anyone does this, and I think it is unlikely that they do.

Is deprecating this an option for you?

Yes, I think this is the way to go. I will pin the latest release of tablib in import-export, and then add a note to advise that escaping will be mandatory in a future release. If anyone complains, I can think again, but otherwise I would then be free to upgrade to the latest version of tablib in future.

Thanks for your help with this. Feel free to close the PR if you wish.

@matthewhegarty
Copy link
Contributor Author

For reference, the import-export PR is here

@hugovk
Copy link
Member

hugovk commented Sep 12, 2023

I think it might be easier all around to pin to 3.5.0 until you're ready to upgrade, and that the next Tablib should be 4.0.0, and mention the change clearly in the release notes.

Thank you though for testing the main branch in your CI, it's rare but very welcome to get pre-release feedback! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants