Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

YAML: let String handle numbers too #7809

Merged
merged 1 commit into from
May 22, 2019

Conversation

asterite
Copy link
Member

Fixes #5798

I already gave a lot of reasons why this is a good/correct change (every other statically typed language does this), plus we recently bumped into this with something like:

version: 2.8

If I say I want to read the version as a string, there's no reason why I have to go and modify the yaml file to add an annotation like !string or put it in quotes. People don't do that, at all. And there's no point forcing people to do that (and sometimes it's impossible because the yaml is not in their control).

Or imagine have some kind of code:

code: abc123

That looks like a String. Now this (imagine someone is editing it manually, typing it):

code: 12e45

Oops! You just input something that looks like a floating point number. You should be more careful when that happens and put quotes around it. Right? No... people don't do that, they don't care about how the machine will work in this case, they are just editing a string in their mind. And I think that's fine.

@asterite asterite added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:serialization labels May 22, 2019
Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve been using String | Int64 on mappings and then a to_s manually. So 👍 on my side

Copy link
Member

@jhass jhass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@asterite asterite added this to the 0.29.0 milestone May 22, 2019
@asterite asterite merged commit 349b0a1 into crystal-lang:master May 22, 2019
@asterite asterite deleted the feature/yaml-string branch May 22, 2019 14:59
@straight-shoota
Copy link
Member

This change is unacceptable. It makes it impossible to use type safe YAML parsing.
I understand that the implemented behaviour can be desired in many cases, but in other cases it is necessary to distinguish between explicit YAML types.
I'm a bit disappointed this change was merged within hours despite my objections in #5798.

The previous behaviour is not a bug. It doesn't need to be fixed.
Instead, we should discuss how both use cases can be made possible.

@asterite
Copy link
Member Author

@straight-shoota Could you provide a use case where this behavior is not desired?

@asterite
Copy link
Member Author

I'm also having a hard time understanding why this is something desirable:

I’ve been using String | Int64 on mappings and then a to_s manually

@asterite
Copy link
Member Author

Or another question: how would you map the version field above without having to use an ugly hack like String | Int64 | Float64 and then calling to_s? We can't use String because it will choke. And we don't want to tell users to put quotes around values that look like numbers (because our users might not be programmers!).

@Blacksmoke16
Copy link
Member

Blacksmoke16 commented May 22, 2019

There is https://crystal-lang.org/api/master/String/RawConverter.html, but it is currently only for JSON. Having that support YAML as well would allow for this. They would just have to specify that converter.

@asterite
Copy link
Member Author

Right, but in JSON every type has a different representation. In YAML that's not exactly like that because you have schemas, and also a proof of this is that using libyaml scalars are just scalars (strings): how to interpret them is left to the user of the library. In this case, the user is Crystal, and by saying "I want this scalar to be a string" we are effectively not treating 2.8 as a number, which it was never in the first place (it's just a scalar, according to libyaml).

@Sija
Copy link
Contributor

Sija commented May 22, 2019

Specifying the type should be enough in resolving ambiguities, if there are such. There's nothing wrong in having user-friendly, loose (in terms of YAML spec) parser, which works "just right" in such cases, with no negative side-effects (perhaps I'm wrong here, but I don't see any...).

@asterite
Copy link
Member Author

Now if you parse with YAML.parse, well, that has to use some default schema, for the lack of type annotations, and then 2.8 is parsed as a float, not a string.

But I think I'm just repeating what I said in the other thread...

@straight-shoota
Copy link
Member

we are effectively not treating 2.8 as a number, which it was never in the first place (it's just a scalar, according to libyaml).

libyaml emits a scalar, but this is very low level. YAML is based on elaborate schemas which describe how to interpret scalars.

While having the same string representation, 2.8 and "2.8" are different values in the Core schema, the most commonly used YAML schema (and also the default in Crystal).

The "misinterpretation" described in #5798 is actually a mismatch between data format and schema.
This can be fixed by either using an appropriate schema or change the data format to represent a string instead of a number in the Core schema.

Could you provide a use case where this behavior is not desired?

Every time someone intends to use YAML as a type safe data format.

Even YAML's null value is suddenly interpreted as a string. Mapping String? only works correctly because Nil is lexically sorted before String. The same goes for the float and int types. But unions with UInt and Time demonstrate unexpected behaviour:

(String | Time).from_yaml("2019-05-27 16:38:42.536517000").class # => String
(String | UInt32).from_yaml("1").class # => String

Crystal uses the Core schema as default and according to that schema, these values should be interpreted as Time and UInt32, respectively. And 2.8 is a number and mapping it to String should error.

For the use case this PR intends to solve, using the Failsafe schema would be an appropriate solution. This schema only contains the most basic data types sequence, mapping and scalar. Any value, that is not a sequence or mapping is interpreted as a string. This is somewhat similar to RawConverter for JSON, but encoded in an official specification describing this minimal mapping interface.

Currently, the Failsafe schema is only supported for parsing a YAML::Any, not with from_yaml methods.
So the only way to use it is like this:

YAML::Schema::FailSafe.parse("2.8").as_s # => "2.8"

A logical expansion would be to provide a way to use the Failsafe scheme with from_yaml for String, Hash, Array and other sequence/mapping types. We can discuss this in a separate issue.

This PR should be reverted because it introduces incorrect behaviour.

@asterite
Copy link
Member Author

YAML is based on elaborate schemas which describe how to interpret scalars.

Exactly! There's the core schema, there's the fail-safe schema. And in my mind, YAML::Serializable or YAML.mapping define a schema: how you want to map a given YAML to Crystal types. And in that case, if I want to map a scalar to a String, every scalar can be mapped into a String.

While having the same string representation, 2.8 and "2.8" are different values in the Core schema, the most commonly used YAML schema (and also the default in Crystal).

Yes, in the Core schema and when you do YAML.parse, where you can't define any schema at all. That's why the default one (Core) is used. But using YAML.mapping changes that.

Even YAML's null value is suddenly interpreted as a string. Mapping String? only works correctly because Nil is lexically sorted before String

I think null is not common at all in YAML. If you want to omit a value you omit it. And we can simply fix th Nil behavior in unions by considering it first.

(String | Time).from_yaml("2019-05-27 16:38:42.536517000").class # => String
(String | UInt32).from_yaml("1").class # => String

I will say that union types in YAML are pretty rare. I wouldn't expect, ever, to have a value that can either be a String or a Time. It's always one value. When it's a union it's usually a composite type. But that type is then usually distinguished by another key.

And we can always fix this by considering String last in unions.

For the use case this PR intends to solve, using the Failsafe schema would be an appropriate solution

I think the problem of wanting to read something as a String but not being able to do that because it looks like a number is not specific to a specific key-value. It's a general problem.

Finally: why C#, Java and Go behave like this PR?

@straight-shoota
Copy link
Member

if I want to map a scalar to a String, every scalar can be mapped into a String.

Until this PR, you could only map matching types. And that's a great default because it fails when there is an unexpected value instead of simply digesting it, which might result in unexpected behaviour later on.

Yes, in the Core schema and when you do YAML.parse, where you can't define any schema at all. That's why the default one (Core) is used. But using YAML.mapping changes that.

I don't follow. What does YAML.mapping change? It uses the Core schema as well.
And the solution I'm proposing is to make .from_yaml (and following that YAML.mapping/YAML::Serializable) aware of the schema to provide different implementations, depending on the intended use case.

I think the problem of wanting to read something as a String but not being able to do that because it looks like a number is not specific to a specific key-value. It's a general problem.

Yes, you wouldn't use the Failsafe schema for mapping a specific key, but for parsing the entire YAML document.

why C#, Java and Go behave like this PR?

Do you have examples for that?

@asterite
Copy link
Member Author

@straight-shoota

I don't have time to set up a Java project right now, but if you can try it I'm pretty sure mapping a String works like this PR.

@asterite
Copy link
Member Author

asterite commented May 27, 2019

I just tried it in Java:

import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.Constructor;

public class Main {
	public static class Customer {
		public String firstName;
	}

	public static void main(String[] args) throws Exception {
		new Main().main();
	}

	public void main() throws Exception {
		Yaml yaml = new Yaml(new Constructor(Customer.class));

		Customer customer = (Customer) yaml.load("firstName: 123");
		System.out.print(customer.firstName);
	}

}

Works just fine :-)

@straight-shoota
Copy link
Member

straight-shoota commented May 28, 2019

Ooops, sry. I couldn't recall these examples. Thanks for pointing it out.

At least YAMLDotNet and snakeyaml only implement YAML 1.1, which means untagged nodes are not resolved and their interpretation is completely application-defined. This mostly resembles the fail safe schema of YAML 1.2
go-yaml supports YAML 1.2 but it also seems to employ the fail safe schema and only apply type conversions on top of that.

For example, an explicitly tagged string can be mapped to a time instance:

time := time.Time{}
yaml.Unmarshal([]byte("!!str 2006-01-02T15:04:05Z"), &time)

In Crystal, this does not work, because Time.from_yaml expects a YAML time type:

Time.from_yaml("!!str 2006-01-02T15:04:05Z") # Error: Expected Time, not 2006-01-02T15:04:05Z

But String.from_yaml now accepts literally anything:

String.from_yaml("!!float not even a valid float") # => "not even a valid float"

This should definitely be an error. I suppose this could be fixed, though.

But what about this:

String.from_yaml("!!float 1.00") # => "1.00"
String.from_yaml("!!float 1")    # => "1"

According to the YAML specification, !!float 1.00 and !!float 1 represent the same value. But when you ask to map it to a string, they suddenly become different values?

YAMLDotNet for example treats them equally:

deserializer.Deserialize<String>(new StringReader("!!float 1.00")) // => "1"
deserializer.Deserialize<String>(new StringReader("!!float 1"))    // => "1"

This means it correctly resolves the value according to the specified tag and then maps that value (a YAML float) to a .NET string.


The approach of using the fail safe schema + custom type mappings can certainly be useful. But Crystal's entire YAML library is based on the core schema which is recommended as default schema. It provides tag resolution for the most common data types.

The current state after this PR is an inconclusive mixture between failsafe and core schema, which is not a desirable state. It should be either one, or even better, optional support for both of them (plus eventually JSON schema).

@asterite
Copy link
Member Author

Maybe we should fail on explicitly tagged valued. But I disagree about everything else. We should take pragmatism into account. For me that @bcardiff had to use a union and then call to_s is not acceptable, and that people keep bumping into this issue isn't acceptable either. Let's keep this PR merged for a couple of releases. I'm pretty sure nobody will complain about this anymore, practically (theoretically we could continue discussing what's right or wrong).

@straight-shoota
Copy link
Member

straight-shoota commented May 28, 2019

I'm not at all against a pragmatic solution. And I totally agree that union + to_s is not practical, it can even be incorrect (when a value is normalized).

But please let's introduce a standard-conform solution, not a weird hack that essentially creates a hybrid between fail safe and core schema, specific to the String type.
YAML is already complex enough at times, we should avoid imposing more complexity and surprises on the developer.

A proper solution is to fully support YAML parsing using the fail safe schema. This would fix #5798 and work similar to the implementations in other languages as mentioned above, being fully compliant to the YAML specification.

@asterite
Copy link
Member Author

@straight-shoota But doesn't the failsafe schema just support strings? Not even floats or ints.

I really don't think there's any conflict here. YAML.parse uses the core schema. There String.from_yaml isn't used at all. Then YAML.mapping defines the schema to use on top of YAML. The core schema isn't involved at all there.

They are two different schemas with no interaction.

I also want to add the note that tagging isn't common at all and I've never seen it used (mainly because YAML is mainly for humans, and humans don't worry/think about tags). So yes, we could theoretically worry about it and try to make a best effort, but maybe it's not worth it.

@straight-shoota
Copy link
Member

straight-shoota commented May 28, 2019

But doesn't the failsafe schema just support strings? Not even floats or ints.

No. The failsafe schema supports any data type in YAML. It just doesn't provide any default tag resolution for untagged scalars and leaves that to the application. So, for example when the scalar value 2.8 can be interpreted as a string or a float, depending on whether you call String.from_yaml or Float64.from_yaml.

That's how go-yaml, YAMLDotNet and snakeyaml work and the proper implementation for the problem described in #5798.

In the core schema on the other hand, 2.8 is defined to be resolved to a float and it can be parsed with Float64.from_yaml. But when asking for a string (String.from_yaml) it should error because there is a type conflict.

YAML.mapping defines the schema to use on top of YAML

YAML.mapping defines the mapping from YAML types to Crystal types. All .from_yaml implementations are based on the YAML types which are resolved according to the Core schema.

For example: Int32.from_yaml only works with a plain scalar like 1 which resolves to the YAML int type and thus provides a matching data type. It does not work with a quoted scalar like "1" which resolves to the YAML str type and is not a matching data type.

They are two different schemas with no interaction.

Exactly. But this PR combines behaviour of the fail safe schema (all plain scalars map to string) with the core schema (plain scalars resolve to a specific type depending on their format).

@asterite
Copy link
Member Author

No. The failsafe schema supports any data type in YAML. It just doesn't provide any default tag resolution for untagged scalars and leaves that to the application. So, for example when the scalar value 2.8 can be interpreted as a string or a float, depending on whether you call String.from_yaml or Float64.from_yaml.
That's how go-yaml, YAMLDotNet and snakeyaml work and the proper implementation for the problem described in #5798.

However, when I parse with go-yaml into map[interface{}]interface{}, meaning I don't know what types I'll get, when I parse 123, using reflection, I get an int. When I parse 123.45 I get a float. But when I map it to a string I always get a string.

But maybe I shouldn't base my thoughts on how it should work based on other implementations.

I guess... I'll stop commenting here. Feel free to come up with a good implementation and submit a PR or an RFC. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:serialization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

YAML ParseException parsing strings that look like numbers
6 participants