Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cassandra 5.0 vector type CREATE/INSERT support #1020

Merged
merged 1 commit into from
Nov 19, 2024

Conversation

rukai
Copy link
Contributor

@rukai rukai commented Jun 24, 2024

makes progress towards: #1014

The vector type is introduced by the currently in beta cassandra 5.
See: https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html
Scylla does not support vector types and so the tests are setup to only compile/run with a new cassandra_tests config.

This commit does not add support for retrieving the data via a SELECT.
That was omitted to reduce scope and will be implemented in follow up work.

I have split off support for CREATE/INSERT since I value the early feedback of small PRs.
I didnt realize that @pkolaczk had shown interest in implementing this until after I'd went to submit my PR.
Hopefully I've not stepped on any toes.

The cassandra docker-compose.yaml is changed to 5.0 beta cassandra, if this is undesirable I can introduce a second docker-compose.yaml specifically for 5.0

image
Confirmed that the test is running in CI

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@github-actions github-actions bot added the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label Jun 24, 2024
Copy link

github-actions bot commented Jun 24, 2024

cargo semver-checks detected some API incompatibilities in this PR.
Checked commit: 2685ab9

See the following report for details:

cargo semver-checks output
./scripts/semver-checks.sh --baseline-rev f59908c54e6b6407112311a3745e32d4bd218d0c
+ cargo semver-checks -p scylla -p scylla-cql --baseline-rev f59908c54e6b6407112311a3745e32d4bd218d0c
     Cloning f59908c54e6b6407112311a3745e32d4bd218d0c
     Parsing scylla v0.15.0 (current)
      Parsed [  25.056s] (current)
     Parsing scylla v0.15.0 (baseline)
      Parsed [  22.998s] (baseline)
    Checking scylla v0.15.0 -> v0.15.0 (no change)
     Checked [   0.125s] 94 checks: 93 pass, 1 fail, 0 warn, 0 skip

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.36.0/src/lints/enum_variant_added.ron

Failed in:
  variant CqlType:Vector in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/transport/topology.rs:249

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  48.241s] scylla
     Parsing scylla-cql v0.4.0 (current)
      Parsed [  11.178s] (current)
     Parsing scylla-cql v0.4.0 (baseline)
      Parsed [  11.453s] (baseline)
    Checking scylla-cql v0.4.0 -> v0.4.0 (no change)
     Checked [   0.107s] 94 checks: 94 pass, 0 skip
     Summary no semver update required
    Finished [  22.787s] scylla-cql
make: *** [Makefile:61: semver-rev] Error 1

@rukai rukai marked this pull request as ready for review June 24, 2024 07:28
@wprzytula wprzytula self-requested a review June 24, 2024 07:40
Copy link
Collaborator

@wprzytula wprzytula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I see, looks good, thank you for contribution!

However, what I see in this PR is support for parsing schema containing Vector type. What is completely missing is serialization and deserialization support (the latter, you said, you leave for a follow-up). See #1014 (comment) for hints.

scylla/src/transport/topology.rs Outdated Show resolved Hide resolved
scylla/src/transport/topology.rs Outdated Show resolved Hide resolved
Comment on lines 12 to 14
cassandra1:
image: cassandra
image: cassandra:5.0-beta1
healthcheck:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if we are OK with testing only with newest beta cassandra instead of stable. WDYT @Lorak-mmk?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use this in addition to stable, but not instead of stable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved cassandra-5.0-beta1 into a separate compose file.

scylla/src/transport/session_test.rs Outdated Show resolved Hide resolved
@rukai rukai force-pushed the parse_vector_type branch 3 times, most recently from 9e4d642 to ff6ce03 Compare June 25, 2024 23:19
scylla/src/transport/session_test.rs Outdated Show resolved Hide resolved
Comment on lines 189 to 191
/// as per <https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html>
/// vectors are limited to a size of 8192
size: u16,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I can see that this is the size limit of a Vector, but what interests me is how Cassandra represents Vector in the system tables. It surely does not use unsigned int, because AFAIK unsigned numbers are not supported by neither ScyllaDB nor Cassandra.
The idea of PreCqlType in topology.rs is to represent the raw type definition read from system tables. Only later can we convert it to more convenient representation, involving unsigned ints.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it makes more sense to store size in the same type as Cassandra does.

Copy link
Contributor Author

@rukai rukai Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@rukai rukai Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adjusted the code to use an i32 to match the above findings.
I also renamed the size to dimensions as that is a more accurate name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolving this comment because we talked about that with @wprzytula
I'm not aware of any case were size of vector is stored in some table (as a cql type). IIUC we always parse type names from strings - so the argument about system tables doesn't apply.
Is it possible to create vector with negative length? I'd be very surprised if it was possible.
If it is not we can and should use unsigned integer for the size (and simplify the parsing function thanks to that).

Should it be u16? I'm not sure. If documentation says that size is limited to 8192 then yes, but I can't find this information is the link in the comment. Could you link the source?

Copy link
Contributor Author

@rukai rukai Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit of 2^13 = 8192 is taken from https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html

However it can be observed that the documentation is incorrect.

From functionally testing, the highest value accepted when creating a table is 2**31 -1 = 2147483647.
The reason would be that its the highest value representable by a signed 32 bit integer
From functional testing it can also be seen that negative numbers and 0 are rejected.

Do you want me to swap the type from an i32 to a u32?

@rukai rukai mentioned this pull request Jun 26, 2024
8 tasks
@rukai rukai force-pushed the parse_vector_type branch 2 times, most recently from a0906fd to 4be5e19 Compare June 26, 2024 22:30
/// * The first character is not a digit or '-'
/// * The the integer is larger than i32
pub(crate) fn parse_i32(self) -> ParseResult<(i32, Self)> {
let (digits, p) = self.take_while(|c| c.is_ascii_digit() || c == '-');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has a minor flaw of accepting strings such as 00-21-37, but such strings will be rejected below anyway, so no problem here.

scylla/src/utils/parse.rs Outdated Show resolved Hide resolved
scylla/src/transport/session_test.rs Show resolved Hide resolved
wprzytula
wprzytula previously approved these changes Jun 27, 2024
@wprzytula wprzytula self-assigned this Jun 27, 2024
@wprzytula wprzytula requested a review from Lorak-mmk June 27, 2024 07:46
@Lorak-mmk Lorak-mmk self-assigned this Jul 9, 2024
Copy link
Collaborator

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! It looks good, I left a few minor comments (and responded to a past discussion about a type of vector size).

I believe that the PR title and description are a bit imprecise.

The PR does not add CREATE or INSERT support for Vector type. CREATE is and always was possible, so was INSERT with hardcoded values.

The PR enables the driver to fetch and represent metadata containing Vector type, which is required to fully connect to cluster (because if the driver can't fetch metadata then it also won't fetch peer list).

Comment on lines 2891 to 3272
async fn test_vector_type() {
setup_tracing();
let session = create_new_session_builder().build().await.unwrap();
let ks = unique_keyspace_name();

session.query(format!("CREATE KEYSPACE IF NOT EXISTS {} WITH REPLICATION = {{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1}}", ks), &[]).await.unwrap();
session
.query(
format!(
"CREATE TABLE IF NOT EXISTS {}.t (a int PRIMARY KEY, b vector<int, 4>, c vector<text, 2>)",
ks
),
&[],
)
.await
.unwrap();

session
.query(
format!(
"INSERT INTO {}.t (a, b, c) VALUES (1, [1, 2, 3, 4], ['foo', 'bar'])",
ks
),
&[],
)
.await
.unwrap();

let prepared_statement = session
.prepare(format!(
"INSERT INTO {}.t (a, b, c) VALUES (?, [11, 12, 13, 14], ['afoo', 'abar'])",
ks
))
.await
.unwrap();
session.execute(&prepared_statement, &(2,)).await.unwrap();
let metadata = session.get_cluster_data();
let columns = &metadata.keyspaces[&ks].tables["t"].columns;
assert_eq!(
columns["b"].type_,
CqlType::Vector {
type_: Box::new(CqlType::Native(NativeType::Int)),
dimensions: 4,
},
);
assert_eq!(
columns["c"].type_,
CqlType::Vector {
type_: Box::new(CqlType::Native(NativeType::Text)),
dimensions: 2,
},
);

// TODO: Implement and test SELECT statements and bind values (`?`)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should test 1 thing, this tests 3:

  • unprepared insert query with vector
  • prepared insert query with vector
  • cluster metadata containing vector type

It should be split into 3 separate tests.
Also I don't think this is the best place for this test, I don't see tests for other types in this file. @wprzytula where do you think those tests should be placed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have split up the tests as requested.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I don't think this is the best place for this test, I don't see tests for other types in this file. @wprzytula where do you think those tests should be placed?

I think cql_types_test.rs suits this case best.

Comment on lines 189 to 191
/// as per <https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html>
/// vectors are limited to a size of 8192
size: u16,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolving this comment because we talked about that with @wprzytula
I'm not aware of any case were size of vector is stored in some table (as a cql type). IIUC we always parse type names from strings - so the argument about system tables doesn't apply.
Is it possible to create vector with negative length? I'd be very surprised if it was possible.
If it is not we can and should use unsigned integer for the size (and simplify the parsing function thanks to that).

Should it be u16? I'm not sure. If documentation says that size is limited to 8192 then yes, but I can't find this information is the link in the comment. Could you link the source?

Comment on lines 43 to 58

# TODO: delete the below and move RUSTFLAGS into the above test run when cassandra 5.0.0 releases.
- name: Setup 3-node Cassandra cluster
run: |
docker compose -f test/cluster/cassandra5/docker-compose.yml up -d --wait
- name: Run tests on cassandra 5 beta
run: |
CDC='disabled' RUSTFLAGS="--cfg cassandra_tests" RUST_LOG=trace SCYLLA_URI=172.42.0.2:9042 SCYLLA_URI2=172.42.0.3:9042 SCYLLA_URI3=172.42.0.4:9042 cargo test --verbose --features "full-serialization" -- --skip test_views_in_schema_info --skip test_large_batch_statements
- name: Stop the cluster
if: ${{ always() }}
run: docker compose -f test/cluster/cassandra5/docker-compose.yml stop
- name: Print the cluster logs
if: ${{ always() }}
run: docker compose -f test/cluster/cassandra5/docker-compose.yml logs
- name: Remove cluster
run: docker compose -f test/cluster/cassandra5/docker-compose.yml down
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be done sequentially - first we perform tests on normal cassandra, then on beta.
Cassandra tests are already longest part of our CI. Maybe it would be better to create a separate workflow for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved it into a separate workflow as requested.

@rukai
Copy link
Contributor Author

rukai commented Jul 15, 2024

I believe that the PR title and description are a bit imprecise.

The PR does not add CREATE or INSERT support for Vector type. CREATE is and always was possible, so was INSERT with hardcoded values.

The PR enables the driver to fetch and represent metadata containing Vector type, which is required to fully connect to cluster (because if the driver can't fetch metadata then it also won't fetch peer list).

I'm not quite following what you want.
To save both of us the trouble, can you please provide the exact commit message that you want.

@rukai
Copy link
Contributor Author

rukai commented Jul 28, 2024

Hi, any news on this one?

@wprzytula
Copy link
Collaborator

Hi, any news on this one?

Sorry for the delay, both @Lorak-mmk and I have been on vacation. Now let @Lorak-mmk take a fresh look.

@wprzytula wprzytula requested a review from Lorak-mmk July 29, 2024 08:37
@Lorak-mmk Lorak-mmk removed their assignment Jul 31, 2024
@wprzytula wprzytula removed their assignment Aug 1, 2024
Copy link
Collaborator

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks much better now. I am however reluctant to merge it before Cassandra version with Vector support is officially released. Mostly because:

  • implementation could change before release
  • cql specification seems incomplete w.r.t vectors. They are mentioned in section 5.2 of protocol v5 docs, so we know their serialized form. I don't however see anything about them in section 4.2.5.2 which describes prepared metadata.

Regarding the above: I totally forgot that prepared metadata is a place where we parse the type not from string but from binary form, sorry about that :(
This is also a reason why I favor a bit larger PRs - this would come up when trying to add support for selecting vectors using a prepared statement.

@wprzytula because spec is incomplete, and type used for size depends on it, I think we should wait until C* releases the feature. wdyt?

@wprzytula
Copy link
Collaborator

@wprzytula because spec is incomplete, and type used for size depends on it, I think we should wait until C* releases the feature. wdyt?

Agreed.
@rukai, please consider this PR useful and generally accepted work, yet delayed. OK?

@rukai
Copy link
Contributor Author

rukai commented Oct 14, 2024

cassandra 5 is released, any interest in landing this?

@rukai
Copy link
Contributor Author

rukai commented Oct 14, 2024

oh right, theres some TODO's in here for when cassandra 5 releases. I'll address those

@rukai
Copy link
Contributor Author

rukai commented Oct 14, 2024

The TODO's have been addressed:
the cassandra 5 specific workflow and docker-compose.yaml have been removed, instead we just use the regular cassandra workflow (which is automatically version 5 now since it does not specify a version)

@rukai
Copy link
Contributor Author

rukai commented Oct 14, 2024

oh right, a recent rust release has disallowed that kind of feature check, I'll look into alternatives

@rukai rukai force-pushed the parse_vector_type branch 3 times, most recently from c22e512 to 49e68e7 Compare October 14, 2024 23:29
@rukai
Copy link
Contributor Author

rukai commented Oct 15, 2024

I applied the same fix as #1049 and CI is green now

wprzytula
wprzytula previously approved these changes Oct 30, 2024
Copy link
Collaborator

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one minor comment - after you fix it we can merge I think, and fix potential issue when introducing ser/deser

scylla/src/utils/parse.rs Outdated Show resolved Hide resolved
makes progress towards: scylladb#1014

The vector type is introduced by the currently in beta cassandra 5.
See: https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html
Scylla does not support vector types and so the tests are setup to only
compile/run with a new cassandra_tests config.

This commit does not add support for retrieving the data via a SELECT.
That was omitted to reduce scope and will be implemented in follow up
work.
@Lorak-mmk Lorak-mmk merged commit 12ceb62 into scylladb:main Nov 19, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants