Spike: Downstream effects of adding characters to dataset titles #1975
Comments
Might be worth checking these special characters aren't going to cause problems with DynamoDB entries, Lambda, etc.
There are simply too many caveats to cater for with this approach. After exploring this, I feel we should further restrict what users are allowed to submit (rather than relaxing the rules here), to prevent unexpected and unforeseen behaviour downstream. Case in point: do we want to allow users to submit Chinese characters and emojis in their dataset titles? What happens if they do? Do we want to test this across all components? EDIT: Examples
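To make the "restrict rather than relax" idea concrete, here is a minimal validation sketch assuming an explicit allowlist. The pattern, the set of allowed characters and the function name are illustrative only and would need to match whatever list is actually agreed with the data managers.

```python
import re

# Hypothetical allowlist: ASCII letters, digits, spaces, hyphens, underscores,
# and the macronated vowels used in te reo Maori place names. Illustrative only.
TITLE_PATTERN = re.compile(r"^[A-Za-z0-9 _\-ĀāĒēĪīŌōŪū]+$")

def is_valid_dataset_title(title: str) -> bool:
    """Return True only if the title contains explicitly allowed characters."""
    return bool(TITLE_PATTERN.fullmatch(title))

print(is_valid_dataset_title("Whanganui aerial photos"))  # True
print(is_valid_dataset_title("🌏 datasets/2021"))          # False: emoji and slash rejected
```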
Perhaps a better approach would be to have a list of characters we want to allow (e.g. macrons), and work our way backward to figure out how to cater for these edge cases. My recommendation after reading this thread is to reconsider and revisit Proposal 4. We would then build an index of some sort that maps each unique ID to the appropriate dataset. Users should be able to browse and search this index and filter down to the datasets they are interested in. S3 is built as an object store, not so much a file / directory browser. The concept of having a

Setting up an index also has the added benefit of allowing name changes (e.g. Wanganui to Whanganui). This would be a tedious task on S3, as we would need to copy (and rename) then delete each object, which would be a pain since our bucket has versioning enabled. Another benefit of this approach is flexibility and scalability. When a data set is added

Perhaps a discussion should be had here. Would appreciate @AmrouEmad's and @l0b0's thoughts on this.

Edit: Not to mention that S3 pagination should be taken into consideration when we allow users to browse the geostore S3 bucket. Might not be a problem now, but it will be once we try to scale.
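On the pagination point: `list_objects_v2` returns at most 1,000 keys per call, so browsing the bucket directly means handling continuation tokens. A small sketch of what that looks like with boto3; the bucket name and prefix here are placeholders, not the real geostore bucket.

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Each page holds up to 1,000 keys; the paginator follows continuation tokens for us.
for page in paginator.paginate(Bucket="geostore-example-bucket", Prefix="datasets/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])
```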
Do you have any examples? I'm familiar with some tools not dealing with spaces (especially at the start or end of a filename), but I've not encountered one where multiple spaces are problematic. Also, do you mean multiple consecutive spaces, like
Probably not, but only because we're building this for the needs of Toitū Te Whenua LINZ. But the point is a good one, because this is not really different from allowing macrons. Macrons (like emojis and other characters) even have additional complexities, since they have multiple representations which look and are semantically identical: "a" followed by U+0304 (combining macron) versus the precomposed "ā" (U+0101).
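A quick way to see the problem in Python: the two representations compare as unequal until they are normalised, so two visually identical titles could end up as different S3 keys unless we pick and enforce a normalisation form (NFC below is just one option, not a decision).

```python
import unicodedata

decomposed = "a\u0304"   # "a" followed by U+0304 COMBINING MACRON
precomposed = "\u0101"   # "ā", U+0101 LATIN SMALL LETTER A WITH MACRON

print(decomposed == precomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: identical after NFC
print([hex(ord(c)) for c in unicodedata.normalize("NFD", precomposed)])  # ['0x61', '0x304']
```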
Yeah, this is a tricky one. We've gone through several iterations already:
IMO we should've stuck with the original idea of using UUIDs to make sure people use the index rather than S3. But we never got to the point where the index was productionised, so we've had to compromise and spend heaps of time rearchitecting, rather than spending time creating a usable index.
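For context, a rough sketch of what the UUIDs-plus-index idea could look like: objects live under an opaque UUID prefix, and a separate index record maps the human-readable title to that prefix, so renaming only touches the index. The record shape and field names below are illustrative assumptions, not the geostore's actual schema.

```python
import uuid

# Opaque storage prefix: no user-supplied characters ever reach the S3 key.
dataset_id = str(uuid.uuid4())
s3_prefix = f"{dataset_id}/"

# Hypothetical index record (e.g. a DynamoDB item); renaming a dataset
# only means updating "title" here, not copying objects in S3.
index_record = {
    "dataset_id": dataset_id,
    "title": "Whanganui aerial photos",
    "s3_prefix": s3_prefix,
}
print(index_record)
```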
I should probably edit this for clarity.
This. Example problem. The bug has already been fixed, but we should probably avoid allowing something like this to go through. Escaping this on multiple layers will be a pain to deal with. I found another thread raising this issue, but I seem to have lost the URL. Will update when I find it again.
I would vote in favour of revisiting this idea... 😄
Attached is a brief summary of what I found in this spike, so that it is documented. This is a non-comprehensive list; there are simply too many edge cases to cover with this approach. Most of the concerns have already been covered in the discussion above.
Thanks Jim. Closing this for now. And adding a link to the issue to actually make the changes.
Enabler
So that we know more about what 'safe' AWS characters, macronated characters and slashes will do in downstream systems, we want to do 1 day's research and testing into what the characters in S3 prefixes that we want to add in #1974 and #1928 will do for typical data analyst downstream users on Windows, Linux and Mac.
See the Confluence page for the current list of characters that the data managers want. On top of what the data managers have asked for, it would be useful to be as permissive as possible with the characters.
They typically use standard file system tools, GIS browsers (QGIS, ArcGIS), analysis tools (GDAL, ArcGIS) and Python for scripting.
We should see if anyone has already tested this somewhere, e.g. in a blog post or the AWS docs.
If we're still not satisfied that the risks are mitigated, we could test some things ourselves.
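If we do test things ourselves, one cheap starting point could be a local smoke test like the sketch below, run on each target OS, to see which candidate characters the file system itself accepts before worrying about GDAL/QGIS behaviour. The candidate names are illustrative assumptions, not an agreed test plan.

```python
from pathlib import Path
import tempfile

# Illustrative candidates: plain, spaces, macrons, and an embedded slash.
candidates = ["plain_title", "title with spaces", "tītle_with_mācrons", "title/with/slash"]

with tempfile.TemporaryDirectory() as tmp:
    for name in candidates:
        target = Path(tmp) / name
        try:
            target.write_text("test")
            print(f"OK:      {name!r}")
        except OSError as err:
            # e.g. the slash case fails because it is treated as a directory separator
            print(f"FAILED:  {name!r} ({err})")
```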
Some resources:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
https://en.wikipedia.org/wiki/Filename
https://stackoverflow.com/questions/4814040/allowed-characters-in-filename
https://en.wikipedia.org/wiki/Filename#Encoding_indication_interoperability
http://handbook.datalad.org/en/0.15/intro/filenaming.html
Acceptance Criteria
Additional context
Tasks
Definition of Ready
Definition of Done
CODING guidelines
increase 10% every year
validated, imported and stored within 24 hours
maintenance windows < 4 hours and does not include operational support
< 12 hours