provide data types for semver and semver_range to enable indexing and querying semantic version values #48878
Comments
Pinging @elastic/es-search (:Search/Mapping) |
The discussion around integer/long mappings seems relevant to this issue. |
Yes please!! This would be super helpful. |
ECS captures versions in straightforward My ideal scenario would be to have a datatype we can add directly to existing Here's a few things I'd like to see supported:
I don't currently have a clear use case for semver_range, but this could also prove useful. |
Thanks @webmat, this is very useful. I agree ranges are important. I think we could support versions with the following restrictions:
The build number doesn't need to be part of the indexed representation since it doesn't matter for ordering. This would allow storing versions as a 16-byte integer, which is the maximum that Lucene supports. Do these restrictions feel ok? |
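To illustrate why a fixed-width representation keeps ordering cheap, here is a minimal Python sketch. It assumes each of the three numbers is limited to 2 bytes (as discussed in this thread) and packs them big-endian so that plain byte comparison matches numeric precedence. The function name pack_version and the exact layout are assumptions for illustration only; the label portion and Lucene's actual encoding are not modeled.

```python
import struct

# Assumption: each numeric part is capped at 2 bytes (0..65535).
MAX_PART = 0xFFFF


def pack_version(major: int, minor: int, patch: int) -> bytes:
    """Pack major.minor.patch into 6 big-endian bytes (hypothetical layout)."""
    for part in (major, minor, patch):
        if not 0 <= part <= MAX_PART:
            raise ValueError(f"version part out of range: {part}")
    # Big-endian unsigned shorts: byte-wise order equals numeric order.
    return struct.pack(">HHH", major, minor, patch)


versions = [(11, 0, 0), (2, 0, 0), (2, 10, 1), (2, 9, 7)]
encoded = sorted(pack_version(*v) for v in versions)
decoded = [struct.unpack(">HHH", e) for e in encoded]
print(decoded)
```

Sorting the packed bytes puts 2.9.7 before 2.10.1 and 2.0.0 before 11.0.0, which a plain string sort would not do.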
I think the 2 bytes restriction per number and the overall 16 byte integer representation are reasonable, yes. Looking at your comment before edits, I was about to say I didn't expect us digging into or interpreting the labels that much (the number in alpha.1). I would specifically avoid trying to parse labels, actually. There's conventions there, but people do weird things with these labels. You'll note that I'm not even calling them "pre-release", but specifically "label" which doesn't impose a semantic meaning. 2 examples to support this:
Whenever we reference this part of a version string, I would suggest we name it "label" or something that doesn't impose a semantic meaning. To confirm, is it correct to say that we would parse and index up to the first 3 numbers, but not the 4th (e.g. in longer versions such as Chrome's "78.0.3904.108")? I'm good with this, as long as the 4 number versions do get parsed successfully. If someone is looking for an exact build number, they should search for the exact match of the version string, not I'm also good with truncating the label at 13 bytes. Or were you suggesting |
My worry is that the specification is very clear about how these labels should be used to compute precedence, and I worry that users will be caught by surprise if Regarding versions that have long labels, I believe we would be able to not reject them by storing the prefix in the index and the whole thing in doc values. Then range queries would use the index for the first 13 characters and fall back to doc values when version labels are longer, which should be rare. How to handle non-semver version schemes is a good question. For instance, when there is no ambiguity, we could support additional schemes directly e.g.
|
Point no 9 of Semver states:
Supporting that part makes sense and is straightforward. It states that all versions that include "a pre-release label" are lower precedence -- or earlier version -- than the same version without a pre-release label. But to me the spec seems to give up on specifying anything in the meaning of the acceptable labels, in terms of comparing one label to another. If you focus on the last two example labels they give:
These examples are actually all over the place. So I think working to support sensible sorting based on the label part of the version may end up being a lot of effort, for something that's not actually defined by SemVer 2. I think a useful approach to handling the sorting of the labels would actually be to also "give up" and just do lexical sorting. Pros:
Cons:
Glancing at version numbers of various package lists like my workstation's, homebrew's or a Linux distro's is a good reminder of how narrow a corner case sensible interpretation of the labels is. A significant percentage seem to not follow SemVer (maybe 10%?), the majority seems to follow a sensible major.minor.patch format, but few actually have labels; those that do rarely seem to follow a logic close to I'd rather we spend our cycles on making
I'm also ok if we don't implement a fall back like described above, as long as version parsing of an unsupported format doesn't fail indexing. That part is absolutely critical, IMO. |
I don't have the same reading. These sentences in particular are only about how versions with the same major, minor and patch should compare depending on their labels:
Let me try to think more about whether I can find a way to index versions that never fails yet still honors precedence rules of semver. |
You're right, I overlooked point no 11, which does a better job at defining precedence between labels of a same version. I think my point still stands on how much of a corner case sorting semver labels is, vs their "correct" usage in the wild, however. |
I agree with this statement, but I know some people can be very picky about it. I also like the fact that we could just direct people to the semver docs to explain how precedence works vs. having to define our own rules. I played a bit with encoding, it looks possible to accept arbitrary strings yet honor precedence rules for versions that use the semver scheme. Precedence would also work as expected for variants like
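For readers who want the precedence rules from point 11 of the SemVer 2.0.0 spec in concrete form, the following Python sketch implements them as a comparison function. It is not the encoding discussed in this thread, only a reference comparator; the helper names are made up for illustration, and input is assumed to be ASCII and well-formed.

```python
from functools import cmp_to_key


def parse(version: str):
    """Split into numeric core and pre-release identifiers; build metadata is ignored."""
    without_build = version.split("+", 1)[0]
    core, _, pre = without_build.partition("-")
    numbers = tuple(int(p) for p in core.split("."))
    labels = pre.split(".") if pre else []
    return numbers, labels


def compare_identifiers(a: str, b: str) -> int:
    a_num, b_num = a.isdigit(), b.isdigit()
    if a_num and b_num:
        # Purely numeric identifiers compare numerically.
        return (int(a) > int(b)) - (int(a) < int(b))
    if a_num != b_num:
        # Numeric identifiers have lower precedence than alphanumeric ones.
        return -1 if a_num else 1
    # Alphanumeric identifiers compare in ASCII order.
    return (a > b) - (a < b)


def compare(v1: str, v2: str) -> int:
    n1, l1 = parse(v1)
    n2, l2 = parse(v2)
    if n1 != n2:
        return -1 if n1 < n2 else 1
    if not l1 or not l2:
        # A version without a pre-release label ranks above one with a label.
        return bool(l2) - bool(l1)
    for a, b in zip(l1, l2):
        result = compare_identifiers(a, b)
        if result:
            return result
    # All shared identifiers equal: the longer label list ranks higher.
    return (len(l1) > len(l2)) - (len(l1) < len(l2))


# Example ordering taken from the SemVer 2.0.0 spec, item 11.
spec_order = [
    "1.0.0-alpha", "1.0.0-alpha.1", "1.0.0-alpha.beta", "1.0.0-beta",
    "1.0.0-beta.2", "1.0.0-beta.11", "1.0.0-rc.1", "1.0.0",
]
result = sorted(reversed(spec_order), key=cmp_to_key(compare))
print(result == spec_order)
```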
|
Yes, I agree with these restrictions. From the POV of ECS, this datatype would likely be used as a multi-field appended to the various pre-existing And thinking more about the crazy version strings, I guess they would kind of work anyway. The "r2917" above is a real version seen on Homebrew. If this parses to 0.0.0 + pre-release label "r2917", I guess the sorting will work as expected, when sorted against "r2916" and "r2918" 🤷♂ 🙂 |
@webmat What are the reasons why you would not only index the field as a version? The encoding strategies would have sorted |
Right now the fields are already defined as From ECS' point of view, this type incompatibility would be a breaking change. It's something we would only consider doing when ECS turns 2.0, which we would align with Elastic Stack 8.0. |
This is something we could make work. |
To be clear, search would work out of the box, only aggregations would require some work to make sure you can do e.g. a |
Would there be a way to query in general for versions that have a pre-release label? E.g. if I look into my infrastructure's package and software versions, I'd like to be able to query for any version number that includes a label, no matter what the label is. |
It wouldn't come for free. In my opinion, we should either require this to be done on top (possibly with an ingest processor) with a separate field, e.g. Or we could do it under the hood by indexing hidden fields like |
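As a rough idea of what the "done on top" option could look like before documents reach Elasticsearch, here is a Python sketch that splits a version string into separate fields, including a flag for whether a pre-release label is present. All field names (version_major, version_is_prerelease, and so on) are hypothetical, as is the choice to pass unparsable strings through untouched instead of rejecting them.

```python
import re

# Strict major.minor.patch with optional pre-release label and build metadata.
SEMVER_RE = re.compile(
    r"(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+([0-9A-Za-z.-]+))?"
)


def split_version(value: str) -> dict:
    """Return a document fragment with hypothetical component fields."""
    match = SEMVER_RE.fullmatch(value)
    if match is None:
        # Never fail on unsupported schemes; keep the raw string only.
        return {"version": value}
    major, minor, patch, pre, build = match.groups()
    doc = {
        "version": value,
        "version_major": int(major),
        "version_minor": int(minor),
        "version_patch": int(patch),
        "version_is_prerelease": pre is not None,
    }
    if pre is not None:
        doc["version_prerelease"] = pre
    if build is not None:
        doc["version_build"] = build
    return doc


print(split_version("1.2.3-alpha.1+exp.sha"))
print(split_version("r2917"))
```

With a boolean field like this, "any version that has a label" becomes a simple term query on that field.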
We have 2 use cases that would benefit from the fields mentioned above: major, minor, patch and build. |
I started looking into possible encoding schemes for this and have a POC that would allow using a wider range of version schemes, including but not limited to the Semantic Versioning scheme. My current assumptions are:
Where only In extension to the strict SemVer specs and its precedence rules, we could easily further support:
We can encode the optional “build” part into the same field to allow exact matching on it if we simply say that we don’t ensure any specific ordering for that part. The SemVer specs say that “Build metadata must be ignored when determining version precedence. Thus two versions that differ only in the build metadata, have the same precedence.”, but I’d say from a practical point of view it’s enough if we don’t guarantee any precedence here. When e.g. sorting values one has to decide on some sort of ordering anyway. The options sketched out above would still not allow for versions like the RedHat “5.el6” mentioned earlier on this issue. It would be possible to also allow alphanumeric ids in the part, but I wonder how frequent these cases would be. To keep the number of options low, I wonder if for those cases it wouldn’t be better to solve cases like that on ingestion / preprocessing to convert to something like “5-el6” which we could handle with the POC encoding. The POC allows for exact searching, ranges (like |
This reminded me of natsort. Just thought I'd share, in case it's unknown. Might be useful for inspiration. |
This PR adds a new 'version' field type that allows indexing string values representing software versions similar to the ones defined in the Semantic Versioning definition (semver.org). The field behaves very similarly to a 'keyword' field but allows efficient sorting and range queries that take into account the special ordering needed for version strings. For example, the main version parts are sorted numerically (i.e. 2.0.0 < 11.0.0) whereas this wouldn't be possible with 'keyword' fields today. Valid version values are similar to the Semantic Versioning definition, with the notable exception that in addition to the "main" version consisting of major.minor.patch, we allow fewer or more than three numeric identifiers, i.e. "1.2" or "1.4.6.123.12" are treated as valid too. Relates to elastic#48878
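The difference between 'keyword' ordering and the numeric ordering described in the PR can be seen in a small Python sketch. It only handles the dot-separated numeric part and is not the field's actual implementation; numeric_key is an illustrative helper.

```python
versions = ["11.0.0", "2.0.0", "1.4.6.123.12", "1.2", "10.1.0"]

# What a 'keyword' field effectively does: compare strings character by character.
keyword_order = sorted(versions)


def numeric_key(version: str) -> tuple:
    """Compare each dot-separated part as an integer (any number of parts)."""
    return tuple(int(part) for part in version.split("."))


# Numeric ordering of the main version parts, as the PR describes.
version_order = sorted(versions, key=numeric_key)

print(keyword_order)
print(version_order)
```

In the string sort, "2.0.0" lands after "11.0.0" because "2" sorts after "1"; the numeric key places it before "10.1.0".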
The PR adding the main "version" field type was merged with #62692. |
Can't wait for this feature @cbuescher. When do you think it will be generally available? I guess the feature has to be added to the different SDKs also |
The main new field type has been merged to the 7.x branch, which should go out with the upcoming 7.10 release if there are no last-minute changes. We don't give estimates of release dates here, but the minor version release cadence has been pretty stable, so you can guesstimate.
I'm not sure I fully understand this. What SDKs do you mean? |
SDKs as in support in language clients? |
@geekpete @cbuescher yeah, the NEST C# client for example. |
@fredeil Thanks for the clarification. The "version" field type should be usable like any other specialized field type via the REST interface. If the client API has special methods for creating fields in index mappings for every possible field type, it might need updating. I only know the Java high level client well enough to answer this, but there you provide a map of field definitions that's not strongly typed, so no updates are needed there. My guess would be that it's similar for the C# client. |
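To make the "untyped map" point concrete, here is a Python sketch that builds the request bodies a client would send over the REST interface. The field name package_version is a placeholder, and no HTTP call is made; the only detail taken from this thread is the field type name "version" and that it supports range queries.

```python
import json

# Body for creating an index whose mapping uses the new field type.
# "package_version" is a hypothetical field name.
mapping_body = {
    "mappings": {
        "properties": {
            "package_version": {"type": "version"},
        }
    }
}

# Body for a range query that relies on version-aware ordering.
range_query_body = {
    "query": {
        "range": {
            "package_version": {"gte": "1.0.0", "lt": "2.0.0"},
        }
    }
}

mapping_json = json.dumps(mapping_body, indent=2)
query_json = json.dumps(range_query_body, indent=2)
print(mapping_json)
print(query_json)
```

Because the mapping is a plain dictionary, a client that accepts untyped maps needs no change to send it.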
I wanted to circle back on an earlier point in the discussion: #48878 (comment) The exchange suggested that After experimenting with the With either path, ECS is still looking forward to adopting |
Keeping this issue open because I'd still like to add a |
The version field has been added; version range is tracked separately in #83995. Closing this issue. |
Describe the feature:
Provide dedicated types to index and query against semantic versions with a semver and a semver_range type.
Use of Semantic Versioning is so widespread that a dedicated semver type and semver_range type will be useful primitives to add to Elasticsearch for varied use cases.
More advanced range combinations might be interesting to implement, such as how Python's pip tool allows versions to be specified: https://www.python.org/dev/peps/pep-0440/#version-specifiers