This repository has been archived by the owner on Nov 23, 2023. It is now read-only.

Spike: Downstream effects of adding characters to dataset titles #1975

Closed
35 tasks
billgeo opened this issue Aug 30, 2022 · 9 comments
Assignees
Labels
documentation (Document something), enabler story (Enable the team to improve)

Comments

@billgeo (Contributor) commented Aug 30, 2022

Enabler

So that we know more about what 'safe' AWS characters, macronated characters and slashes will do in downstream systems, we want to do one day's research and testing into how the characters we want to add to S3 prefixes in #1974 and #1928 will behave for typical data analyst downstream users on Windows, Linux and Mac.

See the Confluence page for the current list of characters that the data managers want. On top of what the data managers have asked for, it would be useful to be as permissive as possible with the characters.

They typically use standard file system tools, GIS browsers (QGIS, ArcGIS), analysis tools (GDAL, ArcGIS) and Python for scripting.

We should see if anyone has already tested this somewhere, e.g. in a blog post or the AWS docs.
If we're still not satisfied that the risks are mitigated, we could test some things ourselves.

Some resources:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
https://en.wikipedia.org/wiki/Filename
https://stackoverflow.com/questions/4814040/allowed-characters-in-filename
https://en.wikipedia.org/wiki/Filename#Encoding_indication_interoperability
http://handbook.datalad.org/en/0.15/intro/filenaming.html

Acceptance Criteria

Additional context

Tasks

  • ...
  • ...

Definition of Ready

  • This story is ready to work on
    • Negotiable (team can decide how to design and implement)
    • Valuable (from a user perspective)
    • Estimate value applied (agreed by team)
    • Small (so as to fit within an iteration)
    • Testable (in principle, even if there isn't a test for it yet)
    • Environments are ready to meet definition of done
    • Resources required to implement will be ready
    • Everyone understands and agrees with the tasks to complete the story
    • Release value (e.g. Iteration 3) applied
    • Sprint value (e.g. Aug 1 - Aug 15) applied

Definition of Done

  • This story is done:
    • Acceptance criteria completed
    • Automated tests are passing
    • Code is peer reviewed and pushed to master
    • Deployed successfully to test environment
    • Checked against
      CODING guidelines
    • Relevant new tasks are added to backlog and communicated to the team
    • Important decisions recorded in the issue ticket
    • Readme/Changelog/Diagrams are updated
    • Product Owner has approved acceptance criteria as complete
    • Meets non-functional requirements:
      • Scalability (data): Can scale to 300 TB of data and 100,000,000 files, with the
        ability to increase 10% every year
      • Scalability (users): Can scale to 100 concurrent users
      • Cost: Data can be stored at < 0.5 NZD per GB per year
      • Performance: A large dataset (500 GB and 50,000 files - e.g. Akl aerial imagery) can be
        validated, imported and stored within 24 hours
      • Accessibility: Can be used from LINZ networks and the public internet
      • Availability: System available 24 hours a day, 7 days a week; this does not include
        maintenance windows (< 4 hours) or operational support
      • Recoverability: RPO of fully imported datasets < 4 hours, RTO of a single 3 TB dataset
        < 12 hours
billgeo added the enabler story (Enable the team to improve) and needs refinement (Needs to be discussed by the team) labels Aug 30, 2022
@Jimlinz (Contributor) commented Aug 30, 2022

Might be worth checking that these special characters aren't going to cause problems with DynamoDB entries, Lambda, etc...

billgeo changed the title from "Spike: Downstream effects of adding charactes to Dataset titles." to "Spike: Downstream effects of adding characters to dataset titles" Aug 31, 2022
billgeo removed the needs refinement (Needs to be discussed by the team) label Sep 5, 2022
mfwightman moved this from 📋 Backlog to 🔖 Ready in Data Infrastructure Squad Sep 20, 2022
AmrouEmad self-assigned this Sep 26, 2022
AmrouEmad moved this from 🔖 Ready to 🏗 Doing / Implementing in Data Infrastructure Squad Sep 29, 2022
Jimlinz assigned Jimlinz and unassigned AmrouEmad Oct 4, 2022
@Jimlinz (Contributor) commented Oct 5, 2022

There are simply way too many caveats to cater for with this approach. Different OSes have different rules about which filenames and directory names are allowed. While we are able to include / (forward slash) as part of an S3 object name, most OSes won't allow it in a filename, which raises the question: how would users even create such a name in the first place to add to Geostore?

After exploring this, I feel we should further restrict what users are allowed to submit (rather than relaxing the rules here), to prevent unexpected and unforeseen behaviour downstream. Case in point: do we want to allow users to submit Chinese characters and emojis in their dataset titles? What happens if they do? Do we want to test this across all components?

EDIT: Examples

  • A full stop (period) shouldn't be allowed at the start or end of the filename.
  • Spaces should be limited to one (multiple spaces could be problematic with some OSes or applications) [edit: removed OS]

Perhaps a better approach would be to have a list of characters we want to allow (e.g. macrons), and work our way backwards to figure out how to cater for these edge cases.
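To make that concrete, here's a minimal sketch of the allowlist idea in Python. The permitted character set, the helper name and the example titles are assumptions for illustration, not an agreed rule set:

```python
import re

# Hypothetical allowlist: ASCII alphanumerics, macronated vowels, full stop,
# hyphen, underscore and space. Purely illustrative; not the Geostore rules.
ALLOWED = re.compile(r"[0-9A-Za-zĀāĒēĪīŌōŪū ._-]+")

def is_valid_title(title: str) -> bool:
    return (
        ALLOWED.fullmatch(title) is not None   # only allowlisted characters
        and not title.startswith((".", " "))   # no leading full stop or space
        and not title.endswith((".", " "))     # no trailing full stop or space
        and "  " not in title                  # no consecutive spaces
    )

assert is_valid_title("Canterbury elevation 2016")
assert not is_valid_title(".hidden")           # leading full stop
assert not is_valid_title("double  space")     # consecutive spaces
```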

My recommendation after reading this thread is to reconsider and revisit Proposal 4. We would then build an index of some sort that maps each unique ID to the appropriate dataset. Users should be able to browse and search this index and filter down to the datasets they are interested in. S3 is built as an object store, not so much a file/directory browser. The concept of a path or directory doesn't even exist on S3; it is merely a façade to make things a bit more user friendly. We probably shouldn't be building on top of this; rather, we should adhere to how it was originally built and intended, by decoupling the concepts of storage and index.

Setting up an index also has the added benefit of allowing name changes (e.g. Wanganui to Whanganui). This would be a tedious task on S3, as we would need to copy (and rename) then delete each object, which would be a pain since our bucket has versioning enabled. Another benefit of this approach is flexibility and scalability. When a dataset such as elevation_canterbury_canterbury_2016_dem_1m is added, users can only browse under Canterbury; with a proper index (with tagging allowed), they should be able to browse for Waitaha (Canterbury) or Christchurch as well.
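As a sketch of what that decoupling could look like, assuming a hypothetical DynamoDB table named dataset-index with a dataset_id partition key (none of these names are the actual Geostore schema):

```python
import boto3

# Hypothetical index: an opaque UUID/ULID maps to a human-readable title and
# free-form tags. Renaming or re-tagging is then a single item update rather
# than an S3 copy-and-delete of every object under a prefix.
table = boto3.resource("dynamodb").Table("dataset-index")

def register_dataset(dataset_id: str, title: str, tags: list[str]) -> None:
    table.put_item(Item={"dataset_id": dataset_id, "title": title, "tags": tags})

def lookup_dataset(dataset_id: str) -> dict:
    return table.get_item(Key={"dataset_id": dataset_id}).get("Item", {})

# register_dataset("01ARZ3NDEKTSV4RRFFQ69G5FAV",
#                  "elevation_canterbury_canterbury_2016_dem_1m",
#                  ["Canterbury", "Waitaha", "Christchurch"])
```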

Perhaps a discussion should be had here. Would appreciate @AmrouEmad's and @l0b0's thoughts on this.

Edit: Not to mention that S3 pagination needs to be taken into consideration when we allow users to browse the Geostore S3 bucket. It might not be a problem now, but it will be once we try to scale.
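For reference, listing objects with boto3 already forces pagination: list_objects_v2 returns at most 1,000 keys per call. A minimal sketch (bucket and prefix names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Delimiter="/" groups keys into pseudo-directories (CommonPrefixes),
# i.e. the directory façade discussed above.
for page in paginator.paginate(Bucket="example-geostore", Prefix="elevation_", Delimiter="/"):
    for common in page.get("CommonPrefixes", []):
        print("dir:", common["Prefix"])
    for obj in page.get("Contents", []):
        print("obj:", obj["Key"], obj["Size"])
```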

@l0b0 (Contributor) commented Oct 7, 2022

  • multiple spaces could be problematic with some OSes or applications

Do you have any examples? I'm familiar with some tools not dealing with spaces (especially at the start or end of a filename), but I've not encountered one where multiple spaces are problematic. Also, do you mean multiple consecutive spaces, like foo[space][space]bar, or just more than one space anywhere, as in foo[space]bar[space]baz?

@l0b0 (Contributor) commented Oct 7, 2022

Case in point: do we want to allow users to submit Chinese characters and emojis in their dataset titles?

Probably not, but only because we're building this for the needs of Toitū Te Whenua LINZ. But the point is a good one, because this is not really different from allowing macrons. Macrons (like emojis and other characters) even have additional complexities, since they have multiple representations which look identical and are semantically identical: "a" followed by U+0304 (a&#x304; in HTML) is, for human consumption purposes, identical to "ā".
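A quick illustration in Python (unicodedata is in the standard library, so this runs as-is):

```python
import unicodedata

precomposed = "\u0101"   # "ā": LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"   # "a" followed by COMBINING MACRON

# The two strings render identically but compare unequal until normalised.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```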

@l0b0 (Contributor) commented Oct 7, 2022

My recommendation after reading this thread is to reconsider and revisit Proposal 4. We would then build an index of some sort that maps each unique ID to the appropriate dataset. Users should be able to browse and search this index and filter down to the datasets they are interested in. S3 is built as an object store, not so much a file/directory browser. The concept of a path or directory doesn't even exist on S3; it is merely a façade to make things a bit more user friendly. We probably shouldn't be building on top of this; rather, we should adhere to how it was originally built and intended, by decoupling the concepts of storage and index.

Setting up an index also has the added benefit of allowing name changes (e.g. Wanganui to Whanganui). This would be a tedious task on S3, as we would need to copy (and rename) then delete each object, which would be a pain since our bucket has versioning enabled. Another benefit of this approach is flexibility and scalability. When a dataset such as elevation_canterbury_canterbury_2016_dem_1m is added, users can only browse under Canterbury; with a proper index (with tagging allowed), they should be able to browse for Waitaha (Canterbury) or Christchurch as well.

Yeah, this is a tricky one. We've gone through several iterations already:

  1. UUIDs only, as recommended by @imincik and myself.
  2. Replaced UUIDs with ULIDs, to make them ordered by time.
  3. Added the dataset title to the ULID, to make it possible to browse the S3 bucket without any index.
  4. Removed the ULID, at which point we're back to where renaming a dataset is a major operation, and we can't allow special characters in dataset names because of filename issues.

IMO we should've stuck with the original idea of using UUIDs to make sure people use the index rather than S3. But we never got to the point where the index was productionised, so we've had to compromise and spend heaps of time rearchitecting, rather than spending time creating a usable index.
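To make the four iterations concrete, a rough sketch of the key shapes involved (the IDs and title are made up, and the ULID is a literal placeholder since generating one needs a third-party library):

```python
import uuid

title = "elevation_canterbury_canterbury_2016_dem_1m"
ulid = "01ARZ3NDEKTSV4RRFFQ69G5FAV"  # placeholder ULID

key_v1 = str(uuid.uuid4())      # 1. UUID only: opaque, forces use of the index
key_v2 = ulid                   # 2. ULID: still opaque, but sorts by creation time
key_v3 = f"{ulid}-{title}"      # 3. ULID + title: browsable without an index
key_v4 = title                  # 4. title only: renaming means copying every object
```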

@Jimlinz (Contributor) commented Oct 7, 2022

spaces should be limited to one (multiple spaces could be problematic with some OSes or applications)

I should probably edit this for clarity.

do you mean multiple consecutive spaces, like foo[space][space]bar

This.

Example problem. The bug has already been fixed, but we should probably avoid allowing something like this to go through. Escaping this on multiple layers will be a pain to deal with.

I found another thread flagging this issue, but I seem to have lost the URL. Will update when I find it again.

@Jimlinz (Contributor) commented Oct 7, 2022

IMO we should've stuck with the original idea of using UUIDs to make sure people use the index rather than S3.

I would vote in favour of revisiting this idea... 😄

@Jimlinz (Contributor) commented Oct 11, 2022

Attached is a brief summary of what I found in this spike, so that it is documented. This is a non-comprehensive list; there are simply too many edge cases to cover with this approach. Most of the concerns have already been covered in the discussion above.

safe_char.txt

mfwightman moved this from 🏗 Doing / Implementing to 👀 Reviewing in Data Infrastructure Squad Oct 11, 2022
@billgeo (Contributor, Author) commented Oct 12, 2022

Thanks Jim. Closing this for now, and adding a link to the issue to actually make the changes.

billgeo closed this as completed Oct 12, 2022
billgeo moved this from 👀 Reviewing to ✅ Closed in Data Infrastructure Squad Oct 12, 2022