Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset.xhtml and user input of special characters #8018

Closed
donsizemore opened this issue Jul 22, 2021 · 9 comments · Fixed by #8242
Closed

dataset.xhtml and user input of special characters #8018

donsizemore opened this issue Jul 22, 2021 · 9 comments · Fixed by #8242

Comments

@donsizemore
Copy link
Contributor

dataset.xhtml in Dataverse-5.5 allows any old user to insert special characters into dataset metadata on dataset creation, which render the dataset metadata un-editable thereafter. On our instance nothing is written to server.log; instead the browser console conveys the rendering complaint when the user attempts to edit metadata. Here's a sample broken dataset on demo.dataverse.org:

Screen Shot 2021-07-22 at 13 26 16

Tough to say which UTF characters render the XML invalid, but it would be nice to sanitize them on user input.

@qqmyers
Copy link
Member

qqmyers commented Jul 22, 2021

FWIW: Don found #4442 which looks like the same (but closed) issue.

@donsizemore
Copy link
Contributor Author

To reproduce on a Mac:

Edit -> Emoji & Symbols, Customize List, enable Unicode, double-click U+000C to copy to clipboard

@pdurbin
Copy link
Member

pdurbin commented Oct 15, 2021

Today we saw a case of \x0C (form feed) preventing edits.

@donsizemore
Copy link
Contributor Author

This morning we removed a \U+0002

@sekmiller sekmiller self-assigned this Oct 18, 2021
@djbrooke
Copy link
Contributor

  • This causes a response in the Ajax that causes the page to not load. Either handle this or strip them out. Preference is for the former, but the latter may be easier.
  • Flyway or other script would be needed if we wanted to change the data instead of handling it better.
  • Is there a blacklist of characters? Yes, we have two above
  • We should use the validation logic in the application so that we get the fix for the UI and API.

@pdurbin pdurbin self-assigned this Oct 29, 2021
@pdurbin
Copy link
Member

pdurbin commented Oct 29, 2021

I can easily replicate the issue with a "big list of naughty strings" file I found at https://github.com/minimaxir/big-list-of-naughty-strings

Steps:

Same XML error as above:

XML Parsing Error: not well-formed
Location: http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/YR38GT&version=DRAFT
Line Number 340, Column 2:

Screenshot:

Screen Shot 2021-10-29 at 4 20 39 PM

@pdurbin
Copy link
Member

pdurbin commented Nov 2, 2021

I've been hacking around with a dataset with form feed (\f) in the title. In order to get the "Edit Dataset -> Metadata" button to display the page, I found that if I hard-code a value ("FIXME" below) in the right spot, I could get the page to load.

This isn't a solution, of course, but I thought I'd mentioned it.

I'm moving on to detecting special characters and providing a way to remove them.

$ git diff
diff --git a/src/main/webapp/datasetFieldForEditFragment.xhtml b/src/main/webapp/datasetFieldForEditFragment.xhtml
index 26fb70342d..ca7237d9d7 100644
--- a/src/main/webapp/datasetFieldForEditFragment.xhtml
+++ b/src/main/webapp/datasetFieldForEditFragment.xhtml
@@ -11,7 +11,7 @@
         <!-- input text start UPDATE: UI:REMOVE applied due to duplicate ID errors, left code as reference incase test scripts complain -->
         <span id="pre-input-#{dsf.datasetFieldType.name}" class="pre-input-tag"/></ui:remove>
     <ui:fragment rendered="#{cvoc==null}">
-        <p:inputText value="#{dsfv.valueForEdit}" id="inputText" pt:aria-required="#{dsf.required}"
+        <p:inputText value="FIXME" id="inputText" pt:aria-required="#{dsf.required}"
                  styleClass="form-control #{dsfv.datasetField.datasetFieldType.name == 'title' and DatasetPage.editMode == 'CREATE'  ? 'datasetfield-title' : ''}"
                  rendered="#{!dsfv.datasetField.datasetFieldType.controlledVocabulary
                              and (dsfv.datasetField.datasetFieldType.fieldType == 'TEXT'

@scolapasta helped me find the above by digging around in Chrome devtools. Some screenshots:

Screen Shot 2021-11-02 at 3 36 36 PM
Screen Shot 2021-11-02 at 5 01 09 PM

@pdurbin
Copy link
Member

pdurbin commented Nov 3, 2021

Per standup the plan is something like this:

  • Write Flyway SQL scripts to remove bad characters from metadata (dataset description, for example).
  • Add a constraint to prevent bad characters from getting into future metadata.
  • As before, we play to maintain a blacklist/blocklist of characters we don't want (formfeed, etc.)
  • No need for any APIs.

V4.11.0.1__5565-sanitize-directory-labels.sql seems to have a good example to start with:

-- strip (and replace with a .) any characters that are no longer allowed in the directory labels:
UPDATE filemetadata SET directoryLabel = regexp_replace(directoryLabel, '\.\.+', '.', 'g');

@qqmyers
Copy link
Member

qqmyers commented Nov 3, 2021

Having just been bitten, a pre-emptive note - presumably the release notes on this would need to recommend re-index as well to get bad values out of solr?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants