This repository has been archived by the owner on Nov 23, 2023. It is now read-only.

Spike: Downstream effects of adding characters to dataset titles #1975

Closed
35 tasks
billgeo opened this issue Aug 30, 2022 · 9 comments
Assignees
Labels
documentation (Document something), enabler story (Enable the team to improve)

Comments

@billgeo (Contributor) commented Aug 30, 2022

Enabler

So that we know more about what 'safe' AWS characters, macronated characters and slashes will do in downstream systems, we want to do one day's research and testing into how the characters we want to add to S3 prefixes in #1974 and #1928 will behave for typical data analyst downstream users on Windows, Linux and Mac.

See the Confluence page for the current list of characters that the data managers want. On top of what the data managers have asked for, it would be useful to be as permissive as possible with the characters.

They typically use standard file system tools, GIS browsers (QGIS, ArcGIS), analysis tools (GDAL, ArcGIS) and Python for scripting.

We should see if anyone has already tested this somewhere, e.g. in a blog post or the AWS docs.
If we're still not satisfied that the risks are mitigated, we could test some things ourselves.

Some resources:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
https://en.wikipedia.org/wiki/Filename
https://stackoverflow.com/questions/4814040/allowed-characters-in-filename
https://en.wikipedia.org/wiki/Filename#Encoding_indication_interoperability
http://handbook.datalad.org/en/0.15/intro/filenaming.html

Acceptance Criteria

Additional context

Tasks

  • ...
  • ...

Definition of Ready

  • This story is ready to work on
    • Negotiable (team can decide how to design and implement)
    • Valuable (from a user perspective)
    • Estimate value applied (agreed by team)
    • Small (so as to fit within an iteration)
    • Testable (in principle, even if there isn't a test for it yet)
    • Environments are ready to meet definition of done
    • Resources required to implement will be ready
    • Everyone understands and agrees with the tasks to complete the story
    • Release value (e.g. Iteration 3) applied
    • Sprint value (e.g. Aug 1 - Aug 15) applied

Definition of Done

  • This story is done:
    • Acceptance criteria completed
    • Automated tests are passing
    • Code is peer reviewed and pushed to master
    • Deployed successfully to test environment
    • Checked against
      CODING guidelines
    • Relevant new tasks are added to backlog and communicated to the team
    • Important decisions recorded in the issue ticket
    • Readme/Changelog/Diagrams are updated
    • Product Owner has approved acceptance criteria as complete
    • Meets non-functional requirements:
      • Scalability (data): Can scale to 300 TB of data and 100,000,000 files, with the
        ability to increase 10% every year
      • Scalability (users): Can scale to 100 concurrent users
      • Cost: Data can be stored at < 0.5 NZD per GB per year
      • Performance: A large dataset (500 GB and 50,000 files - e.g. Akl aerial imagery) can be
        validated, imported and stored within 24 hours
      • Accessibility: Can be used from LINZ networks and the public internet
      • Availability: System available 24 hours a day, 7 days a week; this does not include
        maintenance windows (< 4 hours) or operational support
      • Recoverability: RPO of fully imported datasets < 4 hours, RTO of a single 3 TB dataset
        < 12 hours
billgeo added the enabler story (Enable the team to improve) and needs refinement (Needs to be discussed by the team) labels Aug 30, 2022
@Jimlinz (Contributor) commented Aug 30, 2022

Might be worth checking that these special characters aren't going to cause problems with DynamoDB entries, Lambda, etc...

billgeo changed the title from "Spike: Downstream effects of adding charactes to Dataset titles." to "Spike: Downstream effects of adding characters to dataset titles" Aug 31, 2022
billgeo removed the needs refinement (Needs to be discussed by the team) label Sep 5, 2022
mfwightman moved this from 📋 Backlog to 🔖 Ready in Data Infrastructure Squad Sep 20, 2022
AmrouEmad self-assigned this Sep 26, 2022
AmrouEmad moved this from 🔖 Ready to 🏗 Doing / Implementing in Data Infrastructure Squad Sep 29, 2022
Jimlinz assigned Jimlinz and unassigned AmrouEmad Oct 4, 2022
@Jimlinz (Contributor) commented Oct 5, 2022

There are simply way too many caveats to cater for with this approach. Different OSes have different rules about which filenames and directory names are allowed. While we are able to include / (forward slash) as part of an S3 object name, most OSes won't allow it in a filename, which raises the question: how would users even create such a name in the first place to add to Geostore?

After exploring this, I feel we should further restrict what users are allowed to submit (rather than relaxing the rules here), to prevent unexpected and unforeseen behaviour downstream. Case in point: do we want to allow users to submit Chinese characters and emojis in their dataset titles? What happens if they do? Do we want to test this across all components?

EDIT: Examples

  • A full stop (period) shouldn't be allowed at the start or end of the filename.
  • Spaces should be limited to one (multiple spaces could be problematic with some OSes or applications) [edit: removed OS]

Perhaps a better approach would be to have a list of characters we want to allow (e.g. macrons), and work our way backwards to figure out how to cater for these edge cases.
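To make that concrete, here's a minimal sketch of the allowlist idea in Python. The permitted character set, the helper name and the example titles are assumptions for illustration, not an agreed rule set:

```python
import re

# Hypothetical allowlist: ASCII alphanumerics, macronated vowels, full stop,
# hyphen, underscore and space. Purely illustrative; not the Geostore rules.
ALLOWED = re.compile(r"[0-9A-Za-zĀāĒēĪīŌōŪū ._-]+")

def is_valid_title(title: str) -> bool:
    return (
        ALLOWED.fullmatch(title) is not None   # only allowlisted characters
        and not title.startswith((".", " "))   # no leading full stop or space
        and not title.endswith((".", " "))     # no trailing full stop or space
        and "  " not in title                  # no consecutive spaces
    )

assert is_valid_title("Canterbury elevation 2016")
assert not is_valid_title(".hidden")           # leading full stop
assert not is_valid_title("double  space")     # consecutive spaces
```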

My recommendation after reading this thread is to reconsider and revisit Proposal 4. We would then build an index of some sort that maps each unique ID to the appropriate dataset. Users should be able to browse and search this index and filter down to the datasets they are interested in. S3 is built as an object store, not so much a file/directory browser. The concept of a path or directory doesn't even exist on S3; it is merely a façade to make things a bit more user friendly. We probably shouldn't be building on top of this; rather, we should adhere to how it was originally built and intended, by decoupling the concepts of storage and index.

Setting up an index also has the added benefit of allowing name changes (e.g. Wanganui to Whanganui). This would be a tedious task on S3, as we would need to copy (and rename) then delete each object, which would be a pain since our bucket has versioning enabled. Another benefit of this approach is flexibility and scalability. When a dataset such as elevation_canterbury_canterbury_2016_dem_1m is added, users can only browse under Canterbury; with a proper index (with tagging allowed), they should be able to browse for Waitaha (Canterbury) or Christchurch as well.
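As a sketch of what that decoupling could look like, assuming a hypothetical DynamoDB table named dataset-index with a dataset_id partition key (none of these names are the actual Geostore schema):

```python
import boto3

# Hypothetical index: an opaque UUID/ULID maps to a human-readable title and
# free-form tags. Renaming or re-tagging is then a single item update rather
# than an S3 copy-and-delete of every object under a prefix.
table = boto3.resource("dynamodb").Table("dataset-index")

def register_dataset(dataset_id: str, title: str, tags: list[str]) -> None:
    table.put_item(Item={"dataset_id": dataset_id, "title": title, "tags": tags})

def lookup_dataset(dataset_id: str) -> dict:
    return table.get_item(Key={"dataset_id": dataset_id}).get("Item", {})

# register_dataset("01ARZ3NDEKTSV4RRFFQ69G5FAV",
#                  "elevation_canterbury_canterbury_2016_dem_1m",
#                  ["Canterbury", "Waitaha", "Christchurch"])
```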

Perhaps a discussion should be had here. Would appreciate @AmrouEmad's and @l0b0's thoughts on this.

Edit: Not to mention that S3 pagination needs to be taken into consideration when we allow users to browse the Geostore S3 bucket. It might not be a problem now, but it will be once we try to scale.
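For reference, listing objects with boto3 already forces pagination: list_objects_v2 returns at most 1,000 keys per call. A minimal sketch (bucket and prefix names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Delimiter="/" groups keys into pseudo-directories (CommonPrefixes),
# i.e. the directory façade discussed above.
for page in paginator.paginate(Bucket="example-geostore", Prefix="elevation_", Delimiter="/"):
    for common in page.get("CommonPrefixes", []):
        print("dir:", common["Prefix"])
    for obj in page.get("Contents", []):
        print("obj:", obj["Key"], obj["Size"])
```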

@l0b0 (Contributor) commented Oct 7, 2022

  • multiple spaces could be problematic with some OSes or applications

Do you have any examples? I'm familiar with some tools not dealing with spaces (especially at the start or end of a filename), but I've not encountered one where multiple spaces are problematic. Also, do you mean multiple consecutive spaces, like foo[space][space]bar, or just more than one space anywhere, as in foo[space]bar[space]baz?

@l0b0 (Contributor) commented Oct 7, 2022

Case in point: do we want to allow users to submit Chinese characters and emojis in their dataset titles?

Probably not, but only because we're building this for the needs of Toitū Te Whenua LINZ. But the point is a good one, because this is not really different from allowing macrons. Macrons (like emojis and other characters) even have additional complexities, since they have multiple representations which look identical and are semantically identical: "a" followed by U+0304 (a&#x304; in HTML) is, for human consumption purposes, identical to "ā".
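A quick illustration in Python (unicodedata is in the standard library, so this runs as-is):

```python
import unicodedata

precomposed = "\u0101"   # "ā": LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"   # "a" followed by COMBINING MACRON

# The two strings render identically but compare unequal until normalised.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```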

@l0b0 (Contributor) commented Oct 7, 2022

My recommendation after reading this thread is to reconsider and revisit Proposal 4. We would then build an index of some sort that maps each unique ID to the appropriate dataset. Users should be able to browse and search this index and filter down to the datasets they are interested in. S3 is built as an object store, not so much a file/directory browser. The concept of a path or directory doesn't even exist on S3; it is merely a façade to make things a bit more user friendly. We probably shouldn't be building on top of this; rather, we should adhere to how it was originally built and intended, by decoupling the concepts of storage and index.

Setting up an index also has the added benefit of allowing name changes (e.g. Wanganui to Whanganui). This would be a tedious task on S3, as we would need to copy (and rename) then delete each object, which would be a pain since our bucket has versioning enabled. Another benefit of this approach is flexibility and scalability. When a dataset such as elevation_canterbury_canterbury_2016_dem_1m is added, users can only browse under Canterbury; with a proper index (with tagging allowed), they should be able to browse for Waitaha (Canterbury) or Christchurch as well.

Yeah, this is a tricky one. We've gone through several iterations already:

  1. UUIDs only, as recommended by @imincik and myself.
  2. Replaced UUIDs with ULIDs, to make them ordered by time.
  3. Added the dataset title to the ULID, to make it possible to browse the S3 bucket without any index.
  4. Removed the ULID, at which point we're back to where renaming a dataset is a major operation, and we can't allow special characters in dataset names because of filename issues.

IMO we should've stuck with the original idea of using UUIDs to make sure people use the index rather than S3. But we never got to the point where the index was productionised, so we've had to compromise and spend heaps of time rearchitecting, rather than spending time creating a usable index.
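To make the four iterations concrete, a rough sketch of the key shapes involved (the IDs and title are made up, and the ULID is a literal placeholder since generating one needs a third-party library):

```python
import uuid

title = "elevation_canterbury_canterbury_2016_dem_1m"
ulid = "01ARZ3NDEKTSV4RRFFQ69G5FAV"  # placeholder ULID

key_v1 = str(uuid.uuid4())      # 1. UUID only: opaque, forces use of the index
key_v2 = ulid                   # 2. ULID: still opaque, but sorts by creation time
key_v3 = f"{ulid}-{title}"      # 3. ULID + title: browsable without an index
key_v4 = title                  # 4. title only: renaming means copying every object
```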

@Jimlinz (Contributor) commented Oct 7, 2022

spaces should be limited to one (multiple spaces could be problematic with some OSes or applications)

I should probably edit this for clarity.

do you mean multiple consecutive spaces, like foo[space][space]bar

This.

Example problem. The bug has already been fixed, but we should probably avoid allowing something like this to go through. Escaping this on multiple layers will be a pain to deal with.

I found another thread flagging this issue, but I seem to have lost the URL. Will update when I find it again.

@Jimlinz (Contributor) commented Oct 7, 2022

IMO we should've stuck with the original idea of using UUIDs to make sure people use the index rather than S3.

I would vote in favour of revisiting this idea... 😄

@Jimlinz (Contributor) commented Oct 11, 2022

Attached is a brief summary of what I found in this spike, so that it is documented. This is a non-comprehensive list; there are simply too many edge cases to cover with this approach. Most of the concerns have already been covered in the discussion above.

safe_char.txt

mfwightman moved this from 🏗 Doing / Implementing to 👀 Reviewing in Data Infrastructure Squad Oct 11, 2022
@billgeo (Contributor, Author) commented Oct 12, 2022

Thanks Jim. Closing this for now, and adding a link to the issue to actually make the changes.

billgeo closed this as completed Oct 12, 2022
billgeo moved this from 👀 Reviewing to ✅ Closed in Data Infrastructure Squad Oct 12, 2022