Return an iterator from Connection::load #2799
Conversation
Force-pushed from 3ce47f3 to 12fd769
If I understand correctly, the PG backend is currently using the entire-result mode of libpq, while for this to resolve the original issue we'd have to use row-by-row mode, and that is not yet implemented as part of this draft.
```diff
@@ -1404,7 +1404,7 @@ pub trait RunQueryDsl<Conn>: Sized {
     where
         Self: LoadQuery<Conn, U>,
     {
-        self.internal_load(conn)
+        self.internal_load(conn)?.collect()
```
Considering that this changes the overall interaction with libpq to row-by-row mode, using `load_iter().collect()` may be (esp. depending on the backend) less performant than the current version: while it improves data locality, it may increase the number of allocations/deallocations. Given this, I feel like we should leave it to each specific backend to choose whether to implement `.load()` as `.load_iter().collect()`, which unless I'm mistaken is not what is done here. Once we have benchmarked the row-by-row implementation for PG we can decide what to do for its specific case.
The point is that even for implementations which fetch everything at once, we must iterate over all elements anyway, as we want to map every row to a tuple of Rust values. Returning an iterator internally may allow future users to decide to collect into some container other than a vector. At least in my own personal projects I sometimes build a `HashMap` out of the database result for random lookup. That would be easier if I could just take an `Iterator<Item = (Key, Value)>` and collect it into the `HashMap` directly.
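For illustration, a minimal standalone sketch of that pattern; the `rows` iterator here merely stands in for what a streaming load would yield and is not diesel's actual API:

```rust
use std::collections::HashMap;

fn main() {
    // Stand-in for an iterator of fallible rows, each a (key, value) pair.
    let rows = vec![
        Ok::<(i32, String), String>((1, "alice".into())),
        Ok((2, "bob".into())),
    ]
    .into_iter();

    // `collect` over `Result` items short-circuits on the first error and
    // otherwise builds the HashMap directly, with no intermediate Vec.
    let map: Result<HashMap<i32, String>, String> = rows.collect();
    println!("{:?}", map);
}
```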
diesel/src/connection/mod.rs (Outdated)

```rust
where
    Self: for<'a> IterableConnection<'a, <Self as Connection>::Backend>,
```
I'm not sure there's much point in providing a `load_iter` version when the backend does not support partially loading the results: this may reduce the memory usage by no more than a factor of ~2. I'd have imagined:

```rust
trait Connection: SimpleConnection + Sized + Send {}
trait IterableConnection: Connection {}
```

with `IterableConnection` only implemented when row-by-row is supported. That being said, having an additional `load_deserialize_iter` with an implementation similar to this one, doing just the deserialization on the fly, may be nice to have as well eventually.
As written above, at least for me that's not only about reducing memory usage, but also about optimizing how often I need to iterate over a result set. See the following example.

Diesel 1.4.x:

```rust
users::table.select((users::name, users::id)).load::<(String, i32)>(&conn)?.into_iter().collect::<HashMap<_, _>>();
```

Diesel with this change:

```rust
users::table.select((users::name, users::id)).internal_load::<(String, i32)>(&conn)?.collect::<Result<HashMap<_, _>, _>>();
```

This is not perfect API-wise yet, but it removes at least one iteration over the result set. We may want to introduce a more formal API here that is not named `internal_load` 😉
diesel/src/pg/connection/cursor.rs (Outdated)

```diff
@@ -24,13 +28,13 @@ impl<'a> ExactSizeIterator for Cursor<'a> {
 }

 impl<'a> Iterator for Cursor<'a> {
-    type Item = PgRow<'a>;
+    type Item = crate::QueryResult<PgRow<'a>>;
```
It looks like we don't really need an `Iterator` that returns `Row`s: all we ever do with it is call `.map(FromSqlRow::build_from_row)`. If we instead make an iterator that both creates the `Row<'a>` and calls `FromSqlRow` on it as part of a single iterator, we don't need to add `Rc`s everywhere. (#2707 (reply in thread))

That being said, in row-by-row mode there's no need to keep a reference to a global result object: the `PGresult` returned by each call to `PQgetResult` is enough, so we could just store it in `Row`. (doc)
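A standalone sketch of that fused shape; all types here are mock stand-ins for diesel's internals, not its actual code:

```rust
// The cursor deserializes each row inside `next()`, so no borrowed `Row`
// ever escapes the iterator and no `Rc` bookkeeping is needed.
struct RawResult {
    rows: Vec<Vec<String>>, // pretend raw wire data
}

struct DeserializingCursor<'a, U> {
    result: &'a RawResult,
    idx: usize,
    // Stand-in for `FromSqlRow::build_from_row`.
    build: fn(&[String]) -> Result<U, String>,
}

impl<'a, U> Iterator for DeserializingCursor<'a, U> {
    type Item = Result<U, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let raw = self.result.rows.get(self.idx)?;
        self.idx += 1;
        // The borrow of `raw` ends inside this call; the caller only
        // ever sees an owned `U`.
        Some((self.build)(raw))
    }
}
```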
> It looks like we don't really need an `Iterator` that returns `Row`s: all we ever do with it is call `.map(FromSqlRow::build_from_row)`. If we instead make an iterator that both creates the `Row<'a>` and calls `FromSqlRow` on it as part of a single iterator we don't need to add `Rc`s everywhere.
For diesel itself it's true that we never do anything other than call `FromSqlRow::build_from_row` on the returned row. The idea behind this change is that it becomes possible to skip diesel's deserialization layer entirely, so that one can interact with the raw result set. This is useful for cases where you don't know the schema at compile time and want to build, for example, select clauses at runtime. In the end this just tries to separate diesel's layers more strictly, so that in the future someone can decide to use diesel's `Connection` but skip the query builder entirely.

That all written, I'm not sure if that's even worth it and whether it will work with all backends.
diesel/src/pg/connection/result.rs (Outdated)

```rust
internal_result: RawResult,
column_count: usize,
row_count: usize,
column_name_map: OnceCell<Vec<Option<String>>>,
_marker: PhantomData<&'a super::PgConnection>,
```
This marker seems to indicate that the connection must not be dropped or reused before the `PgResult` is dropped, but I haven't seen a reference to this in the Postgres docs. Did I miss it, and was the previous impl not quite safe because this was missing? Otherwise, why is this here?
It's just there to make really sure that nothing goes wrong 🙈
That seems to prevent, e.g., zipping two results and deserializing on the fly, no? Wouldn't adding a few tests be enough to make really sure nothing depends on the connection being dropped or on no other query being made in the meantime?
It's actually documented that it's not needed: https://www.postgresql.org/docs/13/libpq-exec.html#LIBPQ-PQCLEAR

After having looked at the row-by-row mode, I'm not sure I really want to use it as part of the libpq-based connection. In general it seems to be a shortcoming of libpq that there is no way to fetch single rows or blocks of rows while reusing existing buffers, as is possible in other client implementations. We may want to keep the fetch-all solution for …
Force-pushed from 0d39aa3 to 51a677a
I've updated all implementations now. Let's see how the benchmarks look. A more in-depth review would be welcome.
Force-pushed from f9100c7 to 28c1f3c
@diesel-rs/reviewers This is now ready for a final review. Benchmark results: the largest regression seems to be for Postgres.

Sqlite:

Mysql:
* Add tests for backend-specific iterator behavior
* Optimize the sqlite implementation
* Fix the mysql implementation to work correctly for cases where someone stores the row until after the next call to `Iterator::next`
* Add a `RunQueryDsl::load_iter` method
* Remove unneeded lifetimes from `PgResult` and `MysqlRow`
* Add tests to show that it is possible to have more than one iterator for `PgConnection`, but not for `SqliteConnection` and `MysqlConnection`
* Fix various broken lifetime bounds for unsafe code on `PgResult` to be bound to the result itself
* Copy `MysqlType` by value instead of using a reference
Our normal deserialization workflow does not need the type OID to work, so we can change the code in such a way that the OID is only fetched from the result if the user actually requests it. This should speed up various benchmarks for `PgConnection` measurably.
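A minimal sketch of that lazy-lookup idea using std's `OnceCell`; `lookup_type_oid` is a hypothetical stand-in for reading the libpq result metadata, and none of this is diesel's actual code:

```rust
use std::cell::OnceCell;

struct Field {
    column: usize,
    // Filled in at most once, and only on demand.
    type_oid: OnceCell<u32>,
}

impl Field {
    fn type_oid(&self) -> u32 {
        // Deserialization paths that never ask for the OID pay nothing.
        *self.type_oid.get_or_init(|| lookup_type_oid(self.column))
    }
}

fn lookup_type_oid(_column: usize) -> u32 {
    // Stand-in for querying the result metadata; a real implementation
    // would ask libpq for the column's type here.
    23 // e.g. the OID of int4
}
```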
Force-pushed from e74d742 to 2dedaa4
This makes it much easier to understand what the return type is, at a minimal performance cost. I've opted to apply this change only to this method and not to `LoadDsl::internal_load` and `Connection::load`, as both are not really user-facing. (I would count them as advanced methods most users never see and never use directly.)
This PR provides a prototypical implementation of `Connection::load` so that an iterator of `Row`s is returned. This is currently only implemented for the postgres backend, to check out the performance impact of this change.

@diesel-rs/reviewers Feel free to have a first look at this and comment on the general approach. I'm well aware that there are many missing pieces like documentation, implementations for sqlite and mysql, cleaning up some artifacts from refactoring, …