Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement split and join for String #2752

Closed
darkdrag00nv2 opened this issue Aug 27, 2023 · 14 comments
Closed

Implement split and join for String #2752

darkdrag00nv2 opened this issue Aug 27, 2023 · 14 comments
Assignees

Comments

@darkdrag00nv2
Copy link
Contributor

darkdrag00nv2 commented Aug 27, 2023

Issue to be solved

A few more functions can be implemented for Strings.

Suggested Solution

This issue proposes adding two functions:

  1. split(_ s: String, _ delimiter: String): [String]: Returns a string array after splitting the provided string based on the provided delimiter. Delimiter can be made optional and , can be used as default.
  2. join(_ strs: [String], _ separator: String): String: Returns a string value after concatenating elements of the provided array of string with the given separator. Separator can be made optional and , can be used as default.

Taken from https://github.com/green-goo-dao/flow-utils/blob/main/cadence/contracts/StringUtils.cdc.

Will file issues for more functions in the future.

Epic: #1972

@turbolent
Copy link
Member

Good ideas!

  • join seems easy to implement
  • split is actually not as straight-forward to implement, given the semantics of strings and characters in Cadence:
    In Cadence, string equality is semantic and not byte-wise. That means that functions that search, like split, but also e.g. a replace function, need to consider this. I explained this in more detail in a previous PR that attempted to add a replace function: added replace to string #1353 (comment)

@darkdrag00nv2
Copy link
Contributor Author

Nice, thank you for sharing.

You mentioned on the replace PR that unicode normalization might be used in these cases. I agree with that too. Seems like we could leverage the unicode/norm package from the Go standard library: https://go.dev/blog/normalization. Along with the conversion to normalized form, it also provides an iterator (norm.Iter) over them which would be useful to map the characters back to the original string.

I'll give it a shot.

@darkdrag00nv2
Copy link
Contributor Author

Regarding which normalized mode to use, I think any of the two canonical equivalence modes - NFC and NFD could be used. I don't think we should use the two Compatibility equivalence modes since I assume we would like to differentiate between say a superscript and subscript letter.

@turbolent
Copy link
Member

Yeah, for String and Character equality, we already use NFC via norm.NFC.

@darkdrag00nv2
Copy link
Contributor Author

@turbolent @SupunS
Just a note, seems like in Dart - the replace function doesn't take the equivalence into account.

You can try out the following on https://dartpad.dev/.

print('Caf\u00E9'.replaceAll(RegExp(r'\u00E9'), 'X'));  // CafX
print('Caf\u0065\u0301'.replaceAll(RegExp(r'\u0065\u0301'), 'X'));  // CafX
print('Caf\u0065\u0301'.replaceAll(RegExp(r'\u00E9'), 'X'));  // Café

The same with Kotlin: https://pl.kotl.in/_F5jCJfXN

fun main() {
    val s = "Caf\u00E9"
    println(s)  // Café
    println(s.replace("\u0065\u0301", "X"))  // Café
    println(s.replace("\u00E9", "X"))  // CafX
}

Seems like only Swift handles this. This makes me wonder why we want to handle equivalence in Cadence. If backwards compatibility is an issue, could we leverage stable cadence as a way to make a breaking change?

@darkdrag00nv2
Copy link
Contributor Author

darkdrag00nv2 commented Sep 3, 2023

The same in Rust actually. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4f656607f30f56e8bd4340f022f36fa4

fn main() {
    let s = String::from("Caf\u{00E9}");
    println!("{}", s);  // Café
    println!("{}", s.replace("\u{0065}\u{0301}", "X"));  // Café
    println!("{}", s.replace("\u{00E9}", "X"));  // CafX
}
let s = String::from("Caf\u{00E9}");
let s2 = String::from("Caf\u{0065}\u{0301}");
println!("{}", s == s2);  // false

So I am wondering why we want to treat equivalent characters as the same (Swift) instead of treating them at the byte level like (Go, Rust, Kotlin, Dart). Is there any reason for it?

@SupunS
Copy link
Member

SupunS commented Sep 8, 2023

The reason would be because Cadence String type is unicode-compliant. And the equality of \u{0065}\u{0301} and \u{00E9} is originating from the Unicode-spec, rather than the implementing language. So Cadence is only following the Unicode spec in this case.

@turbolent
Copy link
Member

turbolent commented Sep 8, 2023

Reposting from the Discord discussion:

Cadence aims to make programs safer, by preventing bugs. One area of bugs is string handling.
Most languages focus on performance instead of safety, so only provide some primitive functionality for strings.
However, developers still need to properly handle strings, and only providing primitive functionality does not solve the problem.

For example, in the extreme case, e.g in C, there are no strings at all, and the developer must handle all nuances of string operations (encodings).

Some languages have some sort of "string" type, but again, keep the type simple for performance reasons.
For example, in JavaScript or Java, at least assume a particular encoding (UTF-16), and some languages, like Go, provide language support and standard library functions to operate on higher-level concepts, like Unicode codepoints (Go's rune).
These simple types and functions are still often used incorrectly, leading to all sorts of bugs.
Most of the time the developers are not even aware of the issues, they often just appear as UX issues for users.
For example, "👩‍💻🎉".length in JS is 7, but I doubt a normal user would assume that, and it's likely surprising for many developers.

Swift, and also Cadence, take it a step further, and provide safe string types that operate on the level where operations on strings reflect expectations for humans.
I'd like Cadence to keep providing safe types and functions, it's the very reason why it exists. Other languages burden the developer with these problems (here, handling Unicode properly), and I bet developers would prefer focusing on the logic and functionality of their programs, than wrangle with all sorts of string manipulation gotchas.

I know that means the burden is then on the Cadence implementation ("us"), but that's the price we have to pay.

@turbolent
Copy link
Member

As long as both input and replacement are normalized, it should be possible to use Go's strings.Replace functions: https://go.dev/play/p/MTPs4rl-f3V

@turbolent turbolent assigned darkdrag00nv2 and unassigned turbolent Sep 8, 2023
@bluesign
Copy link
Contributor

bluesign commented Sep 10, 2023

As long as both input and replacement are normalized, it should be possible to use Go's strings.Replace functions: https://go.dev/play/p/MTPs4rl-f3V

I am noob in this subject, but why don't we normalize at String level? Can't we just say Cadence normalizes string. (when you create the string) Doesn't this solve all remaining problems?

@darkdrag00nv2
Copy link
Contributor Author

Can't we just say Cadence normalizes string.

@bluesign
Yes, if we can say that Cadence normalizes string, then the problem becomes really easy.

For the implementation of replace, right now I am struggling to understand how to map back from the normalized string to the original. This way I could use normalization for comparison but keep the original string in places where there was no replacement. Otherwise source.replace(target, replacement) call would always end up normalizing source even if there was no match for target in it.

The Swift implement is not that easy to read - https://github.com/apple/swift-corelibs-foundation/blob/9f53cc551e065d73b327a80147895822bc8f89e0/CoreFoundation/String.subproj/CFString.c#L3084

@turbolent
Copy link
Member

turbolent commented Sep 11, 2023

@bluesign

I am noob in this subject, but why don't we normalize at String level? Can't we just say Cadence normalizes string. (when you create the string) Doesn't this solve all remaining problems?

I'm a noob here, too, few can say they're experts in this area.

Yes, we already do normalize, that's what I tried to point out above. However, we do not (yet) normalize at creation, but only lazily when needed (e.g. to check equality).

We should keep the lazy initialization of a normalized version, so we do not have to pay for it unless necessary, but also store the result, by replacing the original (currently the normalization is recomputed each time and thrown away after use).

I'll open a PR for it.

@turbolent turbolent mentioned this issue Sep 11, 2023
6 tasks
@turbolent
Copy link
Member

Opened #2777. Note that this against the Stable Cadence feature branch, as it requires a storage migration.

@darkdrag00nv2
Copy link
Contributor Author

Both split and join are implemented. join was added to master branch while split was added to the stable-cadence branch and will eventually make its way to the master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants