-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat]Copy-free save and load for cuckoo hashtable #243
base: master
Are you sure you want to change the base?
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). For more information, open the CLA check for this pull request. |
3fdfd57
to
1d4fe07
Compare
@@ -304,6 +304,18 @@ class CuckooHashTableOfTensors final : public LookupInterface { | |||
return table_->export_values(ctx, value_dim); | |||
} | |||
|
|||
Status Save(OpKernelContext* ctx, const string filepath, | |||
const size_t buffer_size) { | |||
int64 value_dim = value_shape_.dim_size(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tensorflow::int64 is returned from dim_size
and also used in table_->save()
|
||
Status Load(OpKernelContext* ctx, const string filepath, | ||
const size_t buffer_size) { | ||
int64 value_dim = value_shape_.dim_size(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t
@@ -304,6 +304,18 @@ class CuckooHashTableOfTensors final : public LookupInterface { | |||
return table_->export_values(ctx, value_dim); | |||
} | |||
|
|||
Status Save(OpKernelContext* ctx, const string filepath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SaveToFile might be better for possible extending in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accept.
Status Save(OpKernelContext* ctx, const string filepath, | ||
const size_t buffer_size) { | ||
int64 value_dim = value_shape_.dim_size(0); | ||
return table_->save(ctx, value_dim, filepath, buffer_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SaveToFile might be better for possible extending in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accept
return table_->save(ctx, value_dim, filepath, buffer_size); | ||
} | ||
|
||
Status Load(OpKernelContext* ctx, const string filepath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoadFromFile might be better for possible extending in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accept
Status Load(OpKernelContext* ctx, const string filepath, | ||
const size_t buffer_size) { | ||
int64 value_dim = value_shape_.dim_size(0); | ||
return table_->load(ctx, value_dim, filepath, buffer_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoadFromFile
@@ -49,6 +49,84 @@ struct ValueArray : public ValueArrayBase<V> { | |||
template <class T> | |||
using ValueType = ValueArrayBase<T>; | |||
|
|||
template <typename T> | |||
class HostFileBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like repeated code with HostFileBuffer
in lookup_table_op_cpu.h, if yes, recommending you move them to tensorflow_recommenders_addons/dynamic_embedding/core/utils/host_file_buffer.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accept
tensorflow_recommenders_addons/dynamic_embedding/core/ops/cuckoo_hashtable_ops.cc
Show resolved
Hide resolved
@@ -338,6 +338,53 @@ def export(self, name=None): | |||
self.resource_handle, self._key_dtype, self._value_dtype) | |||
return keys, values | |||
|
|||
def save(self, filepath, buffer_size=4194304, name=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
save_to_file
would be better
Becausesave_to_hdfs
is possible in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accept
value_dtype=self._value_dtype, | ||
buffer_size=buffer_size) | ||
|
||
def load(self, filepath, buffer_size=4194304, name=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
load_from_file or load_from_localfile
This feature is very useful. Looking forward to it. May I ask how to use it? In our case, we will use estimator and save checkpoints per epoch. So should we customize a saver to save the tables manually? But if so, how can we get all the tables to save? Currently it only support usage like |
size_t new_max_size = max_size_; | ||
size_t capacity = table_->get_capacity(); | ||
|
||
size_t cur_size = table_->get_size(stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from our profiling result, this kernel is expensive. I suggest that we can cache last size, and check if the table need to expand with expecting keys to be added. If not, we could add expect to last size, and just return. In this way, we could reduce the number of calls to get_size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cached size makes latency between record to real-time size. Save and load are usually low-frequency operations, which may make large gap between recorded and real-time size.
|
||
~HostFileBuffer() { Close(); } | ||
|
||
void Put(const T value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I ask if T is tstring
, will it work? I remember that sometimes the key is tstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't used string key type. I think it can be solved by abstracting the host buffer class and template specialization.
size_t key_buffer_size = buffer_size; | ||
string key_tmpfile = filepath + ".keys.tmp"; | ||
string key_file = filepath + ".keys"; | ||
auto key_buffer = HostFileBuffer<K>(ctx, key_tmpfile, key_buffer_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to abstract the file out of the HostFileBuffer
class? We are heavily depending on hdfs, and would like to add hdfs support based on your work. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accept
size_t total_keys = 0; | ||
size_t total_values = 0; | ||
while (nkeys > 0) { | ||
nkeys = key_buffer.Fill(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible that nkeys is 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nkeys is 0, then it leaves an empty file.
dfc5667
to
cf6c24f
Compare
- Also include the horovod compile fail on macOS(They was caused by the same reason)
…t reference with error address when high concurrency and server disconnection.
@@ -172,8 +172,6 @@ class RedisWrapper<RedisInstance, K, V, | |||
} catch (const std::exception &err) { | |||
LOG(ERROR) << "RedisHandler error in PipeExecRead for slices " | |||
<< hkey.data() << " -- " << err.what(); | |||
error_ptr = std::current_exception(); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This is a good solution, we want to solve this problem too, When can it be merged? |
Description
Brief Description of the PR:
Since dynamic embedding could be super large for memory limit. save and load with traditional TensorFlow checkpoint mechanism will use a lot of memory when saving or loading.
This PR provides a method to save or load files for dynamic embedding tables, without full volume copying.
Type of change
Checklist:
How Has This Been Tested?
Yes