Implement `split` and `join` for String #2752

darkdrag00nv2 · 2023-08-27T16:00:26Z

Issue to be solved

A few more functions can be implemented for Strings.

Suggested Solution

This issue proposes adding two functions:

split(_ s: String, _ delimiter: String): [String]: Returns a string array after splitting the provided string based on the provided delimiter. Delimiter can be made optional and , can be used as default.
join(_ strs: [String], _ separator: String): String: Returns a string value after concatenating elements of the provided array of string with the given separator. Separator can be made optional and , can be used as default.

Taken from https://github.com/green-goo-dao/flow-utils/blob/main/cadence/contracts/StringUtils.cdc.

Will file issues for more functions in the future.

Epic: #1972

The text was updated successfully, but these errors were encountered:

turbolent · 2023-08-29T18:53:27Z

Good ideas!

join seems easy to implement
split is actually not as straight-forward to implement, given the semantics of strings and characters in Cadence:
In Cadence, string equality is semantic and not byte-wise. That means that functions that search, like split, but also e.g. a replace function, need to consider this. I explained this in more detail in a previous PR that attempted to add a replace function: added replace to string #1353 (comment)

darkdrag00nv2 · 2023-08-30T17:46:17Z

Nice, thank you for sharing.

You mentioned on the replace PR that unicode normalization might be used in these cases. I agree with that too. Seems like we could leverage the unicode/norm package from the Go standard library: https://go.dev/blog/normalization. Along with the conversion to normalized form, it also provides an iterator (norm.Iter) over them which would be useful to map the characters back to the original string.

I'll give it a shot.

darkdrag00nv2 · 2023-08-30T17:51:02Z

Regarding which normalized mode to use, I think any of the two canonical equivalence modes - NFC and NFD could be used. I don't think we should use the two Compatibility equivalence modes since I assume we would like to differentiate between say a superscript and subscript letter.

turbolent · 2023-09-01T21:22:13Z

Yeah, for String and Character equality, we already use NFC via norm.NFC.

darkdrag00nv2 · 2023-09-02T18:31:04Z

@turbolent @SupunS
Just a note, seems like in Dart - the replace function doesn't take the equivalence into account.

You can try out the following on https://dartpad.dev/.

print('Caf\u00E9'.replaceAll(RegExp(r'\u00E9'), 'X'));  // CafX
print('Caf\u0065\u0301'.replaceAll(RegExp(r'\u0065\u0301'), 'X'));  // CafX
print('Caf\u0065\u0301'.replaceAll(RegExp(r'\u00E9'), 'X'));  // Café

The same with Kotlin: https://pl.kotl.in/_F5jCJfXN

fun main() {
    val s = "Caf\u00E9"
    println(s)  // Café
    println(s.replace("\u0065\u0301", "X"))  // Café
    println(s.replace("\u00E9", "X"))  // CafX
}

Seems like only Swift handles this. This makes me wonder why we want to handle equivalence in Cadence. If backwards compatibility is an issue, could we leverage stable cadence as a way to make a breaking change?

darkdrag00nv2 · 2023-09-03T10:32:03Z

The same in Rust actually. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4f656607f30f56e8bd4340f022f36fa4

fn main() {
    let s = String::from("Caf\u{00E9}");
    println!("{}", s);  // Café
    println!("{}", s.replace("\u{0065}\u{0301}", "X"));  // Café
    println!("{}", s.replace("\u{00E9}", "X"));  // CafX
}

let s = String::from("Caf\u{00E9}");
let s2 = String::from("Caf\u{0065}\u{0301}");
println!("{}", s == s2);  // false

So I am wondering why we want to treat equivalent characters as the same (Swift) instead of treating them at the byte level like (Go, Rust, Kotlin, Dart). Is there any reason for it?

SupunS · 2023-09-08T18:40:17Z

The reason would be because Cadence String type is unicode-compliant. And the equality of \u{0065}\u{0301} and \u{00E9} is originating from the Unicode-spec, rather than the implementing language. So Cadence is only following the Unicode spec in this case.

turbolent · 2023-09-08T19:25:04Z

Reposting from the Discord discussion:

Cadence aims to make programs safer, by preventing bugs. One area of bugs is string handling.
Most languages focus on performance instead of safety, so only provide some primitive functionality for strings.
However, developers still need to properly handle strings, and only providing primitive functionality does not solve the problem.

For example, in the extreme case, e.g in C, there are no strings at all, and the developer must handle all nuances of string operations (encodings).

Some languages have some sort of "string" type, but again, keep the type simple for performance reasons.
For example, in JavaScript or Java, at least assume a particular encoding (UTF-16), and some languages, like Go, provide language support and standard library functions to operate on higher-level concepts, like Unicode codepoints (Go's rune).
These simple types and functions are still often used incorrectly, leading to all sorts of bugs.
Most of the time the developers are not even aware of the issues, they often just appear as UX issues for users.
For example, "👩‍💻🎉".length in JS is 7, but I doubt a normal user would assume that, and it's likely surprising for many developers.

Swift, and also Cadence, take it a step further, and provide safe string types that operate on the level where operations on strings reflect expectations for humans.
I'd like Cadence to keep providing safe types and functions, it's the very reason why it exists. Other languages burden the developer with these problems (here, handling Unicode properly), and I bet developers would prefer focusing on the logic and functionality of their programs, than wrangle with all sorts of string manipulation gotchas.

I know that means the burden is then on the Cadence implementation ("us"), but that's the price we have to pay.

turbolent · 2023-09-08T23:11:20Z

As long as both input and replacement are normalized, it should be possible to use Go's strings.Replace functions: https://go.dev/play/p/MTPs4rl-f3V

bluesign · 2023-09-10T13:27:47Z

As long as both input and replacement are normalized, it should be possible to use Go's strings.Replace functions: https://go.dev/play/p/MTPs4rl-f3V

I am noob in this subject, but why don't we normalize at String level? Can't we just say Cadence normalizes string. (when you create the string) Doesn't this solve all remaining problems?

darkdrag00nv2 · 2023-09-11T19:07:50Z

Can't we just say Cadence normalizes string.

@bluesign
Yes, if we can say that Cadence normalizes string, then the problem becomes really easy.

For the implementation of replace, right now I am struggling to understand how to map back from the normalized string to the original. This way I could use normalization for comparison but keep the original string in places where there was no replacement. Otherwise source.replace(target, replacement) call would always end up normalizing source even if there was no match for target in it.

The Swift implement is not that easy to read - https://github.com/apple/swift-corelibs-foundation/blob/9f53cc551e065d73b327a80147895822bc8f89e0/CoreFoundation/String.subproj/CFString.c#L3084

turbolent · 2023-09-11T19:39:48Z

@bluesign

I am noob in this subject, but why don't we normalize at String level? Can't we just say Cadence normalizes string. (when you create the string) Doesn't this solve all remaining problems?

I'm a noob here, too, few can say they're experts in this area.

Yes, we already do normalize, that's what I tried to point out above. However, we do not (yet) normalize at creation, but only lazily when needed (e.g. to check equality).

We should keep the lazy initialization of a normalized version, so we do not have to pay for it unless necessary, but also store the result, by replacing the original (currently the normalization is recomputed each time and thrown away after use).

I'll open a PR for it.

turbolent · 2023-09-11T21:33:19Z

Opened #2777. Note that this against the Stable Cadence feature branch, as it requires a storage migration.

darkdrag00nv2 · 2023-09-26T17:41:52Z

Both split and join are implemented. join was added to master branch while split was added to the stable-cadence branch and will eventually make its way to the master branch.

darkdrag00nv2 added Feature Feedback labels Aug 27, 2023

darkdrag00nv2 assigned turbolent Aug 27, 2023

darkdrag00nv2 mentioned this issue Sep 3, 2023

Introduce String.join function #2762

Merged

6 tasks

turbolent assigned darkdrag00nv2 and unassigned turbolent Sep 8, 2023

turbolent mentioned this issue Sep 11, 2023

Cache normalized string #2777

Closed

6 tasks

turbolent mentioned this issue Sep 12, 2023

Eagerly normalize String and Character values at construction time #2781

Merged

6 tasks

darkdrag00nv2 mentioned this issue Sep 16, 2023

Introduce String.split function #2791

Merged

6 tasks

darkdrag00nv2 closed this as completed Sep 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `split` and `join` for String #2752

Implement `split` and `join` for String #2752

darkdrag00nv2 commented Aug 27, 2023 •

edited

Loading

turbolent commented Aug 29, 2023

darkdrag00nv2 commented Aug 30, 2023

darkdrag00nv2 commented Aug 30, 2023

turbolent commented Sep 1, 2023

darkdrag00nv2 commented Sep 2, 2023

darkdrag00nv2 commented Sep 3, 2023 •

edited

Loading

SupunS commented Sep 8, 2023

turbolent commented Sep 8, 2023 •

edited

Loading

turbolent commented Sep 8, 2023

bluesign commented Sep 10, 2023 •

edited

Loading

darkdrag00nv2 commented Sep 11, 2023

turbolent commented Sep 11, 2023 •

edited

Loading

turbolent commented Sep 11, 2023

darkdrag00nv2 commented Sep 26, 2023

Implement split and join for String #2752

Implement split and join for String #2752

Comments

darkdrag00nv2 commented Aug 27, 2023 • edited Loading

Issue to be solved

Suggested Solution

turbolent commented Aug 29, 2023

darkdrag00nv2 commented Aug 30, 2023

darkdrag00nv2 commented Aug 30, 2023

turbolent commented Sep 1, 2023

darkdrag00nv2 commented Sep 2, 2023

darkdrag00nv2 commented Sep 3, 2023 • edited Loading

SupunS commented Sep 8, 2023

turbolent commented Sep 8, 2023 • edited Loading

turbolent commented Sep 8, 2023

bluesign commented Sep 10, 2023 • edited Loading

darkdrag00nv2 commented Sep 11, 2023

turbolent commented Sep 11, 2023 • edited Loading

turbolent commented Sep 11, 2023

darkdrag00nv2 commented Sep 26, 2023

Implement `split` and `join` for String #2752

Implement `split` and `join` for String #2752

darkdrag00nv2 commented Aug 27, 2023 •

edited

Loading

darkdrag00nv2 commented Sep 3, 2023 •

edited

Loading

turbolent commented Sep 8, 2023 •

edited

Loading

bluesign commented Sep 10, 2023 •

edited

Loading

turbolent commented Sep 11, 2023 •

edited

Loading