
Improving performance with large number of compute_and_apply_vocabulary transforms #180

Closed

cyc opened this issue Jun 11, 2020 · 6 comments

@cyc commented Jun 11, 2020

I have a dataset with a relatively large number (dozens) of string/int features that need vocabularies computed. Is there any way to do that more efficiently? Right now my preprocessing_fn just has a separate tft.compute_and_apply_vocabulary call for each feature, but this blows up my Dataflow graph size, and I suspect overall performance suffers as a result.
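For concreteness, my preprocessing_fn currently looks roughly like this (feature names are made up; there are dozens in reality):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # One vocabulary analyzer per feature: every call below adds another
    # analyzer node to the Dataflow job graph.
    outputs = {}
    for name in ['feat_a', 'feat_b', 'feat_c']:  # made-up feature names
        outputs[name + '_int'] = tft.compute_and_apply_vocabulary(
            inputs[name], vocab_filename=name + '_vocab')
    return outputs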

Ordinarily, if I were applying a numeric transform like tft.bucketize or tft.scale_to_z_score, I would just concatenate my numeric features together and apply a single analyzer op elementwise, which is much more efficient. However, for computing vocabularies there seems to be no way to do an optimization like this.
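For example, for numeric features something like this works (names made up, reusing the imports above):

def preprocessing_fn(inputs):
    # Pack the numeric features into one tensor so a single analyzer
    # covers all of them elementwise.
    numeric = tf.stack([inputs['x1'], inputs['x2'], inputs['x3']], axis=1)
    scaled = tft.scale_to_z_score(numeric, elementwise=True)
    return {
        'x1_scaled': scaled[:, 0],
        'x2_scaled': scaled[:, 1],
        'x3_scaled': scaled[:, 2],
    }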

Also, it is worth noting that if I add just 30-50 more of these string/int features that need a vocabulary computed, I believe I will quickly run into the "job graph size too large" error on Dataflow. Is there a way to get around that?

@rmothukuru rmothukuru self-assigned this Jun 12, 2020
@rmothukuru rmothukuru added the type:performance Performance Issue label Jun 12, 2020
@rmothukuru rmothukuru assigned zoyahav and unassigned rmothukuru Jun 12, 2020
@zoyahav (Member) commented Jun 12, 2020

Unfortunately, you're right: there's no straightforward way to pack vocabulary computations as you would with other TFT analyzers.

Depending on the size of your vocabularies, you could join some of them (though with this method the vocabulary index range for each feature will not be contiguous, and you will need to be careful with frequency_threshold/top_k to make sure specific feature vocabularies don't get completely filtered out):

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

raw_data = [
    {'A': 'hello', 'B': 'world'},
    {'A': 'world', 'B': 'hello'},
    {'A': '!', 'B': '!'},
]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        'A': tf.io.FixedLenFeature([], tf.string),
        'B': tf.io.FixedLenFeature([], tf.string),
    }))

def preprocessing_fn(inputs):
    # Prefix each value with its feature name so entries from different
    # features cannot collide in the shared vocabulary.
    a = tf.strings.join(['A', inputs['A']])
    b = tf.strings.join(['B', inputs['B']])
    # Concatenate both features and compute a single shared vocabulary.
    a_b = tf.concat((a, b), axis=-1)
    vocab = tft.vocabulary(a_b, vocab_filename='a_b_vocab')
    # Map each feature back through the shared vocabulary.
    return {
        'a_int': tft.apply_vocabulary(a, vocab),
        'b_int': tft.apply_vocabulary(b, vocab),
    }

...
... = tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
...

tft_output = tft.TFTransformOutput(transform_fn_dir)
print(tft_output.vocabulary_by_name('a_b_vocab'))

The vocabulary contents are:
[b'Bworld', b'Bhello', b'B!', b'Aworld', b'Ahello', b'A!']

And the transformed data is:
{'a_int': 4, 'b_int': 0}
{'a_int': 3, 'b_int': 1}
{'a_int': 5, 'b_int': 2}
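For completeness, the elided part can be run in-memory with the direct runner roughly like this (a real job would construct a Beam pipeline and write the transform_fn out, e.g. with tft_beam.WriteTransformFn):

import tempfile

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset
for row in transformed_data:
    print(row)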

@cyc (Author) commented Jun 13, 2020

@zoyahav, thanks for the tips! I think that could help in certain cases for some vocabulary features.

I currently have a dataset with 102 string features and have definitely run into the issue of not even being able to start the Dataflow job because the graph is too large. In the short term I will try my best to reduce the number of tft.vocabulary analyzers (e.g. by packing them per your suggestion), but in the long term, what are the plans for making the analyzers more scalable on Dataflow?

@rcrowe-google commented

@cyc, following up on this issue: Beam 2.24 has been released and should hopefully help. Could you try it and let us know? Also, for Dataflow, Runner V2 may very likely help. To try it, add --experiments=use_runner_v2 to your pipeline options.
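For example, with the Python SDK the flag can go through PipelineOptions (project, region, and bucket names are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    experiments=['use_runner_v2'],       # opt in to Dataflow Runner V2
)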

@rcrowe-google commented

I'm following up since the thread went quiet to make sure that this was resolved for you.

  1. Were you able to try Beam >= 2.24, and was it an improvement?
  2. Were you able to try Dataflow Runner V2, together with Dataflow Shuffle, and was it an improvement?

@arghyaganguly commented

Closing this due to inactivity. Please feel free to reopen. Thanks.

@meowcakes commented

I have encountered the same issue, but with Flink. I am also using tft.compute_and_apply_vocabulary on dozens of features, and the Flink DAG produced by TFT is enormous. Unless --execution_mode_for_batch=BATCH_FORCED is used, the pipeline just hangs, and even with that option it takes an excessively long time to run. Increasing the number of task managers and the parallelism actually makes it take longer to finish.
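For reference, I am passing the flag through Beam's pipeline options, roughly like this (the Flink master address and parallelism are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=FlinkRunner',
    '--flink_master=localhost:8081',            # placeholder
    '--execution_mode_for_batch=BATCH_FORCED',  # keeps the pipeline from hanging
    '--parallelism=4',                          # placeholder
])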
