Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LTD-5730-eu-xml-bad-chars #2287

Merged
merged 4 commits into from
Jan 7, 2025
Merged

LTD-5730-eu-xml-bad-chars #2287

merged 4 commits into from
Jan 7, 2025

Conversation

depsiatwal
Copy link
Contributor

https://uktrade.atlassian.net/browse/LTD-5730

This is attempting to prevent an issue that manifests itself when the EU XML failed causing a P1.

This is the because the end user address typically has ASCII -- Nonprintable Characters
which has accidentally made it's way in during cut/paste in the address field.

I have for now added the validation to address/name as these have been the most problematic.


class SpecialCharacterStringValidator:
message = "There's a invalid special charactor in this field."
regex_string = r"^[\000-\031]"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking for comments around this as this may not be the best regex
doesn't cover everything in the table. But for some reason couldn't get the more generic ones to work as expected.

@depsiatwal
Copy link
Contributor Author

Since These Characters can't be seen by the end user.
Rather than validation errors this has been changed to silently remove these bad characters.

@depsiatwal depsiatwal force-pushed the LTD-5730-eu-xml-bad-chars branch from 0a2ff7a to 84faff4 Compare December 31, 2024 09:42
def clean_address(self):
address = self.cleaned_data["address"]
return remove_non_printable_characters(address)

def get_layout_fields(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be adding these for exporter address and name inputs too? Does that have an impact on the xml data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those form components are different during registration .
I guess potentially it's possible but never seen the issue and the risk is much lower since only non uk based companies can free enter the address.

I guess the risk here comes from the fact they copy and paste party details from other forms during Licence submission.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. You're probably doing the right thing in only fixing one thing at a time. Gives comfort that you've investigated though.



def remove_non_printable_characters(str):
return "".join([c for c in str if ord(c) > 31 or ord(c) in [9, 10, 13]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaning the data is the best approach I can think of as the user probably wouldn't know how to react to the validation message.

If I understand this part correctly though it seems to me that the tab, newline and carriage return ascii are being removed completely rather than being replaced by something that can be parsed in python e.g spaces, /n or /r

I know that this is to output correctly in xml but I'm think our data should show as an accurate representation of what the user inputs maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it's the other way round anything less 31 with the exception of [9, 10, 13] is removed from printable chars.
If you have a look in the tests it confirms these chars stay but everything else is removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything less than 31 and 9, 10 and 13 is being removed but they aren't being replaced with anything. Your test does confirm this but what I'm saying is that it might better represent user input if they are replaced with equivalent python parseable version of what the ascii char represents

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No still not correct anything less then 31 is removed (because these are meaningless unix chars i.e /x02

how we want [9, 10, 13] to stay as this are carriage return linefeed , tab etc and hence stays as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I suggest including the character instead of ascii value so it is clear what the character is?
eg ord('\n') for 10

If that is not possible for all characters then add a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh yes added comment to make it clearer what we are doing here.

form = parties.EndUserAddressForm(request=request, data=data)

form.is_valid()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an assert missing for the expected variable here here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this test doesn't check if form is valid as everything is valid but we need to call .is_valid() so I can access cleaned_data and check the output.

Copy link
Contributor

@markj0hnst0n markj0hnst0n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand better what's going on with this now. Looks like it does the job.

@depsiatwal depsiatwal merged commit f20b5fd into dev Jan 7, 2025
12 checks passed
@depsiatwal depsiatwal deleted the LTD-5730-eu-xml-bad-chars branch January 7, 2025 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants