create helper function #26225

Open
wants to merge 50 commits into
base: master
Changes from 45 commits
Commits
c1eb39d create helper function (smeet07, Apr 11, 2023)
333349e formatting (smeet07, Apr 11, 2023)
a6805e4 formatting again (smeet07, Apr 11, 2023)
5886181 re assigning it back to the feature (smeet07, Apr 12, 2023)
3d0ded9 trailing whitespace (smeet07, Apr 12, 2023)
b22e3a7 whitespace changes (smeet07, Apr 12, 2023)
bbe227a changes (smeet07, Apr 12, 2023)
b28d624 replacing whitespace by tabs (smeet07, Apr 12, 2023)
5b04990 Update criteo.py (smeet07, Apr 12, 2023)
a4dddc9 sparse_to_dense syntax has been changed in tf2.0 (smeet07, Apr 14, 2023)
3595e06 write unit test to ensure fill_in_missing (smeet07, May 1, 2023)
af4789a indentation (smeet07, May 1, 2023)
1989c5c Create criteo_test.py (smeet07, May 11, 2023)
db855bf Update criteo.py (smeet07, May 11, 2023)
b54a881 add license (smeet07, May 11, 2023)
b6217e9 import statements (smeet07, May 11, 2023)
6bcdded skip unit test (smeet07, May 17, 2023)
c2f0f38 Update criteo_test.py (smeet07, May 17, 2023)
92eee6f Update criteo_test.py (smeet07, May 17, 2023)
b445beb Update criteo_test.py (smeet07, May 18, 2023)
015b12e Update criteo_test.py (smeet07, May 18, 2023)
a0a8d27 skipif syntax changes (smeet07, May 18, 2023)
c1ca08c Update criteo_test.py (smeet07, May 18, 2023)
5b34941 Update criteo.py (smeet07, May 19, 2023)
2be9b63 absolute import (smeet07, May 30, 2023)
a31a0a4 whitespace changes (smeet07, Jun 25, 2023)
9d3550d linter changes (smeet07, Jun 25, 2023)
01d7dca indentation (smeet07, Jun 25, 2023)
e018e16 linter (smeet07, Jun 25, 2023)
223e783 spacing (smeet07, Jun 25, 2023)
dff0689 Update criteo.py (smeet07, Jun 25, 2023)
084541a Update criteo_test.py (smeet07, Jun 25, 2023)
aaedd64 Update criteo.py (smeet07, Jun 25, 2023)
15be29c Update criteo_test.py (smeet07, Jun 25, 2023)
38ba741 Update criteo.py (smeet07, Jun 25, 2023)
e5238ec fix import issue (smeet07, Jul 5, 2023)
3496d59 indentation (smeet07, Jul 5, 2023)
8ca9a5d spacing issues (smeet07, Jul 5, 2023)
9268b74 lint issues (smeet07, Jul 5, 2023)
ff25761 add space (smeet07, Jul 7, 2023)
e186f89 add (smeet07, Jul 7, 2023)
1af5bf8 fix call (smeet07, Jul 8, 2023)
e65c46a Update criteo_test.py (smeet07, Jul 8, 2023)
cf828d5 Update criteo_test.py (smeet07, Jul 9, 2023)
e8756c0 Update criteo_test.py (smeet07, Jul 11, 2023)
8f9bbe9 Update sdks/python/apache_beam/testing/benchmarks/cloudml/criteo_tft/… (smeet07, Jul 18, 2023)
596f424 remove try block (smeet07, Jul 18, 2023)
8216b0d remove double except block (smeet07, Jul 18, 2023)
b595fcb remove imports and type assignment (smeet07, Jul 18, 2023)
f5b25b5 remove trailing whitespace (smeet07, Jul 18, 2023)
@@ -20,8 +20,11 @@
 from __future__ import division
 from __future__ import print_function

-import tensorflow as tf
-import tensorflow_transform as tft
+try:
+  import tensorflow as tf
+  import tensorflow_transform as tft
+except ImportError as e:
Contributor

You don't need this try/except here. It is needed only in the tests.

Contributor Author

Leaving it out caused an import error, which is why I added it. I'll remove it.

Contributor

You should import criteo.py in the tests under a try/except block, since criteo.py depends on tft, which might not be installed in every test environment.

Contributor Author

Yes, I've imported criteo.py in a try block.
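
The pattern discussed in this thread can be sketched as follows. This is a minimal illustration rather than the code from the PR: the test class name and its trivial test body are hypothetical, and the fallback deliberately uses a plain `None` value. In a real test module the import of the module under test would sit inside the same `try` block.

```python
import unittest

# Optional dependency: tensorflow_transform may be absent in some test
# environments, so the import is guarded in the test module only.
try:
  import tensorflow_transform as tft
except ImportError:
  tft = None


# Hypothetical test case, named for illustration only.
@unittest.skipIf(tft is None, 'tensorflow_transform is not installed.')
class OptionalDependencyTest(unittest.TestCase):
  def test_dependency_is_importable(self):
    # Runs only when the guarded import succeeded.
    self.assertIsNotNone(tft)
```

With this layout the production module keeps its plain imports, and environments without the dependency report the test as skipped instead of failing at import time.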

+  tf = None


 def _get_raw_categorical_column_name(column_idx):
@@ -110,6 +113,20 @@ def make_input_feature_spec(include_label=True):
   return result


+def fill_in_missing(feature, default_value=-1):
+  if tf is not None:
Contributor

Remove this if condition as well.

+    feature = tf.sparse.SparseTensor(
+        indices=feature.indices,
+        values=feature.values,
+        dense_shape=[feature.dense_shape[0], 1])
+    feature = tf.sparse.to_dense(feature, default_value=default_value)
+    # Reshaping from a batch of vectors of size 1 to a batch of
+    # scalar and adding a bucketized version.
+    feature = tf.squeeze(feature, axis=1)
+
+  return feature
+
+
 def make_preprocessing_fn(frequency_threshold):
   """Creates a preprocessing function for criteo.

@@ -132,15 +149,7 @@ def preprocessing_fn(inputs):
     result = {'clicked': inputs['clicked']}
     for name in _INTEGER_COLUMN_NAMES:
       feature = inputs[name]
-      # TODO(https://github.com/apache/beam/issues/24902):
-      # Replace this boilerplate with a helper function.
-      # This is a SparseTensor because it is optional. Here we fill in a
-      # default value when it is missing.
-      feature = tft.sparse_tensor_to_dense_with_shape(
-          feature, [None, 1], default_value=-1)
-      # Reshaping from a batch of vectors of size 1 to a batch of scalars and
-      # adding a bucketized version.
-      feature = tf.squeeze(feature, axis=1)
+      feature = fill_in_missing(feature)
       result[name] = feature
       result[name + '_bucketized'] = tft.bucketize(feature, _NUM_BUCKETS)
     for name in _CATEGORICAL_COLUMN_NAMES:
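
To make the helper's behavior concrete without requiring TensorFlow, here is a pure-Python sketch of the same fill-and-squeeze logic. The function name `fill_in_missing_reference` and its list-based arguments are hypothetical, and the sketch assumes, as the `[batch_size, 1]` dense shape in `fill_in_missing` implies, that each row holds at most one value, stored at column 0.

```python
def fill_in_missing_reference(indices, values, batch_size, default_value=-1):
  """Emulates densifying a [batch_size, 1] sparse feature, then squeezing.

  indices: list of (row, col) pairs; col must be 0 for a width-1 feature.
  values: list of values, one per index pair.
  """
  dense = [default_value] * batch_size
  for (row, col), value in zip(indices, values):
    if col != 0 or not 0 <= row < batch_size:
      raise ValueError(f'index {(row, col)} is outside shape [{batch_size}, 1]')
    dense[row] = value
  return dense


# Rows 0 and 2 carry a value; row 1 is missing and gets the default.
print(fill_in_missing_reference([(0, 0), (2, 0)], [1, 4], batch_size=3))
# prints [1, -1, 4]
```

Under this assumption the result has one scalar per row, so an entry in any other column (for example column 2) falls outside the reshaped `[batch_size, 1]` tensor.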
@@ -0,0 +1,57 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import unittest

import numpy as np
import pytest

from typing import Any, Callable, Optional
Contributor

Remove Any and Callable from the imports since they are not used anywhere. That should resolve the lint errors.


try:
  import tensorflow_transform as tft
  import tensorflow as tf
  from apache_beam.testing.benchmarks.cloudml.criteo_tft.criteo import fill_in_missing
except ImportError:
  tft = None
  tf = None
  fill_in_missing: Optional[Callable[[tf.sparse.SparseTensor, int], tf.Tensor]] = None


@pytest.mark.uses_tft
@unittest.skipIf(tft is None or tf is None, 'Missing dependencies. ')
class FillInMissingTest(unittest.TestCase):
  def test_fill_in_missing(self):
    # Create a rank 2 sparse tensor with missing values
    indices = np.array([[0, 0], [0, 2], [1, 1], [2, 0]])
    values = np.array([1, 2, 3, 4])
    dense_shape = np.array([3, 3])
    sparse_tensor = tf.sparse.SparseTensor(indices, values, dense_shape)

    # Fill in missing values with -1
    filled_tensor = tf.Tensor()
    if fill_in_missing is not None:
      filled_tensor = fill_in_missing(sparse_tensor, -1)

    # Convert to a dense tensor and check the values
    expected_output = np.array([1, -1, 2, -1, -1, -1, 4, -1, -1])
    actual_output = filled_tensor.numpy()
    self.assertEqual(expected_output, actual_output)


if __name__ == '__main__':
  unittest.main()
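
One detail worth checking in a test like this is how NumPy arrays are compared. As a hedged sketch (the array values and class name below are illustrative and are not taken from a run of the PR's test), `assertEqual` on multi-element arrays raises a `ValueError` because the element-wise comparison has no single truth value, whereas `np.testing.assert_array_equal` compares the arrays element by element.

```python
import unittest

import numpy as np


# Hypothetical test case, named for illustration only.
class ArrayComparisonExample(unittest.TestCase):
  def test_assert_equal_is_ambiguous_for_arrays(self):
    expected = np.array([1, -1, 4])
    actual = np.array([1, -1, 4])
    # `expected == actual` yields an array of booleans, so unittest cannot
    # reduce it to a single True/False and NumPy raises ValueError.
    with self.assertRaises(ValueError):
      self.assertEqual(expected, actual)

  def test_assert_array_equal(self):
    expected = np.array([1, -1, 4])
    actual = np.array([1, -1, 4])
    # Passes when all elements match; raises AssertionError on a mismatch.
    np.testing.assert_array_equal(expected, actual)
```

The three-element arrays here mirror the one-scalar-per-row output that a `[batch_size, 1]` feature produces after `tf.squeeze(feature, axis=1)`.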