Keras-Preprocessing Redesign #10

Merged (11 commits) on Sep 23, 2019 · rfcs/20190729-keras-preprocessing-redesign.md
# Keras Preprocessing API

| Status        | Proposed |
|:--------------|:---------------------------------------------------- |
| **Author(s)** | Francois Chollet ([email protected]), Frederic Branchaud-Charron ([email protected]) |
| **Updated**   | 2019-08-21 |


## Context

`tf.data.Dataset` is the main API for data loading and preprocessing in TensorFlow. It has two advantages:

- It supports GPU prefetching
- It supports distribution via the Distribution Strategies API

Meanwhile, `keras.preprocessing` is a major API for data loading and preprocessing in Keras. It is based
on Numpy and Scipy, and it produces instances of the `keras.utils.Sequence` class, which are finite-length,
resettable Python generators that yield batches of data.
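
For reference, a `keras.utils.Sequence` is consumed batch by batch through `__len__` and `__getitem__`. A minimal sketch (illustrative only; the array-backed class below is hypothetical, not part of `keras.preprocessing`):

```python
import math
import numpy as np
from tensorflow import keras


class ArrayBatchSequence(keras.utils.Sequence):
    """Minimal Sequence that yields (batch_x, batch_y) pairs from arrays."""

    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Return one batch; index-based access is what makes it resettable.
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]
```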

Some features of `keras.preprocessing` are highly useful and don't have straightforward equivalents in `tf.data`
(in particular image data augmentation and dynamic time series iteration).

Ideally, the utilities in `keras.preprocessing` should be made compatible with `tf.data`.
This presents the opportunity to improve on the existing API. In particular we don't have good support
for image segmentation use cases today.

Some features are also being supplanted by [preprocessing layers](https://github.com/keras-team/governance/blob/master/rfcs/20190502-preprocessing-layers.md), in particular text processing.
As a result, we may want to move the current API to an API similar to Layers.


## Goals

- Unify "keras.preprocessing" and the recently-introduced [Preprocessing Layers API](https://github.com/keras-team/governance/blob/master/rfcs/20190502-preprocessing-layers.md).
- Make all features of `keras.preprocessing` compatible with `tf.data`.
- As a by-product, add required ops to TensorFlow (`tf.image`).


## Proposed changes at a high level


- Deprecate `ImageDataGenerator` in favor of a new `ImagePipeline` class similar to a `Sequential` model.
    - Inherits from `keras.layers.PreprocessingLayer`; used for all image transformations.
- Deprecate `Tokenizer` class in favor of `TextVectorization` preprocessing layer.
- Replace `TimeseriesGenerator` with a function-based API.


## Detailed API changes


### ImagePipeline

#### Constructor

`ImagePipeline` inherits from `PreprocessingLayer` (or alternatively from `keras.models.Sequential`, which behaves similarly) and takes a list of layers as input. In the future it will inherit from `PreprocessingStage`.

`ImagePipeline` is a preprocessing layer that encapsulates a series of image transformations. Since some of these transformations may be trained (e.g. featurewise normalization), it exposes the method `adapt`, like all other preprocessing layers.
> **Review comment:** It's not a preprocessing layer, right? A preprocessing stage? Do layers implement `adapt`?
>
> **Review comment:** Had the same question; I think @fchollet is referring to a preprocessing stage here. Layers do not implement `adapt`; this was a concept introduced in the preprocessing-layers design and is similar to `fit`. `adapt` is the API used to train preprocessing layers. The name `fit` was not reused since the data used for training is different between this API and `fit`. Also, I think `adapt` was originally called `update` in the preprocessing-layers design.
>
> **Author reply:** Yes, it is a preprocessing stage (so by extension it is a preprocessing layer, since `PreprocessingStage` will subclass `PreprocessingLayer`). I describe it as a preprocessing layer specifically because it is likely that `PreprocessingStage` will not yet exist when we ship the initial version of this API, hence `ImagePipeline` would subclass `PreprocessingLayer` in its first iteration. The method name `adapt` was the consensus result of the preprocessing-layer API design review meeting (not great, but we have to settle on something).



```python

class ImagePipeline(Sequential):

    def __init__(self, layers: List[Layer]):
        ...
```

#### Example usage

```python
preprocessor = ImagePipeline([
    RandomFlip(horizontal=True),
    RandomRotation(0.2, fill_mode='constant'),
    RandomZoom(0.2, fill_mode='constant'),
    RandomTranslation(0.2, fill_mode='constant'),
    Normalization(),  # This is the same Normalization introduced in preprocessing layers
])
preprocessor.adapt(sample_data) # optional step in case the object needs to be trained

dataset = preprocessor.from_directory(dir_name, image_size=(512, 512))
model.fit(dataset, epochs=10)
```
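
Because `from_directory` returns a `tf.data.Dataset`, the result composes with standard `tf.data` transformations before being passed to `fit`. A minimal sketch, continuing the example above and assuming `tensorflow` is imported as `tf`:

```python
# The returned object is a regular tf.data.Dataset, so the usual
# performance knobs (prefetching, caching, ...) apply.
dataset = preprocessor.from_directory(dir_name, image_size=(512, 512))
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
model.fit(dataset, epochs=10)
```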

#### Methods

> **Review comment:** To clarify -- these are now methods of an instantiated `ImagePipeline`, not standalone functions?
>
> **Author reply:** Yes, these are instance methods, not standalone functions.

```python
def from_directory(
self,
directory,
targets='inferred',
target_mode='categorical',
class_names='inferred',
color_mode='rgb',
batch_size=32,
image_size=(255, 255),
shuffle=True,
seed=None,
follow_links=False,
validation_split=None,
subset=None):
"""Generates a Dataset from files in a directory.

# Arguments:
directory: Directory where the data is located.
If `targets` is "inferred", it should contain
subdirectories, each containing images for a class.
Otherwise, the directory structure is ignored.
targets: Either
"inferred" (targets are generated from the directory structure),
None (no targets),
or a list of integer labels of the same size as the number of image
files found in the directory.
target_mode:
- 'categorical' means that the inferred labels are
encoded as a categorical vector (e.g. for categorical_crossentropy).
- 'binary' means that the inferred labels (there can be only 2)
are encoded as binary scalars (e.g. for binary_crossentropy).
class_names: Only valid if `targets` is "inferred". This is the explicit
list of class names (must match names of subdirectories). Used
to control the order of the classes (otherwise alphanumerical order is used).
color_mode: One of "grayscale", "rgb", "rgba". Default: "rgb".
Whether the images will be converted to
have 1, 3, or 4 channels.
batch_size: Size of the batches of data (default: 32).
image_size: Size to resize images to after they are read from disk.
Since the pipeline processes batches of images that must all have the same size,
this must be provided.
shuffle: Whether to shuffle the data (default: True).
If set to False, the data is sorted in alphanumeric order.
seed: Optional random seed for shuffling and transformations.
follow_links: Whether to follow links inside
subdirectories (default: False).
validation_split: Optional float between 0 and 1,
fraction of data to reserve for validation.
subset: One of "training" or "validation". Only used if `validation_split` is set.
"""

def from_dataframe(
self,
dataframe,
directory=None,
data_column='filename',
target_column='class',
target_mode='categorical',
weight_column=None,
color_mode='rgb',
batch_size=32,
image_size=(255, 255),
shuffle=True,
seed=None,
validation_split=None,
subset=None):
"""Generates a Dataset from a Pandas dataframe.

# Arguments:
dataframe: Pandas dataframe instance.
directory: The directory that image paths refer to.
data_column: Name of column with the paths for the input images.
target_column: Name of column with the class information.
target_mode:
- 'categorical' means that the inferred labels are
encoded as a categorical vector (e.g. for categorical_crossentropy).
- 'binary' means that the inferred labels (there can be only 2)
are encoded as binary scalars (e.g. for binary_crossentropy).
weight_column: Name of column with sample weight information.
color_mode: One of "grayscale", "rgb", "rgba". Default: "rgb".
Whether the images will be converted to
have 1, 3, or 4 channels.
batch_size: Size of the batches of data (default: 32).
image_size: Size to resize images to after they are read from disk.
Since the pipeline processes batches of images that must all have the same size,
this must be provided.
shuffle: Whether to shuffle the data (default: True).
If set to False, the data is sorted in alphanumeric order.
seed: Optional random seed for shuffling and transformations.
validation_split: Optional float between 0 and 1,
fraction of data to reserve for validation.
subset: One of "training" or "validation". Only used if `validation_split` is set.
"""

def preview(self, data, save_to_directory=None, save_prefix=None, save_format='png'):
"""Enables users to preview the image augmentation configuration.

# Arguments
data: Image data. Can be a list of image paths (strings), a list of PIL Image instances,
a list of arrays, or a list of eager tensors.
save_to_directory: Directory to save transformed images. Mandatory if not in a notebook.
If in a notebook and this is not specified, images are displayed in-line.
save_prefix: String, filename prefix for saved images.
save_format: String, extension for saved images.
"""
```
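
As an illustration, the `validation_split`/`subset` arguments could be used to derive paired training and validation datasets from one directory. A sketch reusing the `preprocessor` object from the earlier example (directory name and seed are made up):

```python
# Sketch: carving a 20% validation set out of the same directory.
train_ds = preprocessor.from_directory(
    'images/', targets='inferred', target_mode='categorical',
    image_size=(256, 256), batch_size=32,
    validation_split=0.2, subset='training', seed=1337)
val_ds = preprocessor.from_directory(
    'images/', targets='inferred', target_mode='categorical',
    image_size=(256, 256), batch_size=32,
    validation_split=0.2, subset='validation', seed=1337)

model.fit(train_ds, validation_data=val_ds, epochs=10)
```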

**Note:** `from_arrays` is not included since it is possible to transform Numpy data simply by calling the `ImagePipeline` object (like a layer).
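
For example, calling the pipeline directly on a batch of arrays might look like the following sketch (the exact eager-call semantics, including the `training` argument, are an assumption based on the layer examples later in this document):

```python
import numpy as np

# A batch of 8 RGB images as a NumPy array.
images = np.random.uniform(0, 255, size=(8, 256, 256, 3)).astype('float32')

# Calling the ImagePipeline like a layer applies the whole chain of
# transformations and returns the augmented batch.
augmented = preprocessor(images, training=True)
```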


### Layers

The new data augmentation layers will inherit from `keras.layers.Layer` and work in a similar way.

```python
Resizing(height, width)  # Resize while distorting aspect ratio
CenterCrop(height, width)  # Resize without distorting aspect ratio
RandomCrop(height, width, seed=None)  # Return a (height, width) crop from a random location
Rescaling(value)  # Divide by `value`
RandomFlip(horizontal=False, vertical=False, seed=None)
RandomTranslation(amplitude=0., fill_mode='constant', fill_value=0., seed=None)
RandomRotation(amplitude=0., fill_mode='constant', fill_value=0., seed=None)
RandomZoom(amplitude=0., fill_mode='constant', fill_value=0., seed=None)
RandomBrightness(amplitude=0., seed=None)
RandomContrast(amplitude=0., seed=None)
RandomSaturation(amplitude=0., seed=None)
RandomWidth(amplitude=0., seed=None)  # Expand / shrink width while distorting aspect ratio
RandomHeight(amplitude=0., seed=None)  # Expand / shrink height while distorting aspect ratio
```

> **Review comment:** Is there value in a `RandomCrop`? Or just `Crop`, with center vs. random parameterized? IIRC, random cropping is part of some ImageNet pipelines.
>
> **Author reply:** I think adding `RandomCrop` is a good idea. It is technically equivalent to a combination of `RandomTranslation` and `CenterCrop`, but it is a useful shortcut.

The `amplitude` argument may be:
- a positive float: it is understood as a "fraction of total" (the total being the current width or height, or 180 degrees in the case of `RandomRotation`). E.g. `0.2` results in variations in the [-20%, +20%] range. If larger than 1, it is rounded to one for the lower boundary (but not the higher boundary).
- a tuple of 2 positive floats: understood as a fractional range, e.g. `(0.2, 0.4)` is interpreted as the [-20%, +40%] range. The first float may not be larger than 1.

> **Review comment:** Nit: `amplitude` is a weird parameter name on some of these -- e.g., what is the amplitude of a rotation, or of a width resize? As these are separate layers whose parameters will diverge over time, does it make sense to use the "right" words here rather than biasing towards the same words?
>
> **Author reply:** API consistency is important to reduce cognitive load and minimize surprises / looking up the docs. Is there a better universal word we could use in this context? Note: this is also the reason why we use the keyword "kernel" throughout Keras, even in places where it doesn't exactly apply.
>
> **Review comment:** This is nice and easy to use, but I am a little concerned that there's no apparent way to specify these in absolute units: pixels, radians, etc. Exposing some way to skip the relative units would make it easier to build pipelines specified in absolute units without spending time questioning things like:
> - What the relative units are for each layer, and how they are interpreted?
> - How does multi-scale training, or images with a different input range, affect the pipeline?
>
> Regarding "If larger than 1, it is rounded to one for the lower boundary (but not the higher boundary)": for random zoom this comes out a little strange. If I want a uniform scale in [1/2, 2] I can set `amplitude=[1/2, 1]`. But, IIUC, the random part here is linear, so 1/3 of images are shrunk and 2/3 are expanded. A log-scale option for random zoom would be nice to have.

To do a random center crop that zooms in and discards part of the image, you would do:

```python
preprocessor = ImagePipeline([
    RandomZoom([0., 0.2]),
    CenterCrop(height, width),
])
```
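
To make the `amplitude` convention concrete, a small helper could normalize it into a fractional `(lower, upper)` range. This is a sketch only; the helper name and error handling are assumptions based on the description above, not part of the proposed API:

```python
def normalize_amplitude(amplitude):
    """Turn the `amplitude` argument into a (lower, upper) fractional range.

    A scalar `a` becomes (-min(a, 1.0), +a); a pair (a, b) becomes (-a, +b),
    where `a` must not exceed 1. Hypothetical helper illustrating the spec.
    """
    if isinstance(amplitude, (tuple, list)):
        lower, upper = amplitude
        if lower > 1.0:
            raise ValueError('The first value may not be larger than 1.')
        return -lower, upper
    # Scalar case: symmetric range, with the lower bound capped at -100%.
    return -min(float(amplitude), 1.0), float(amplitude)


# Examples following the text above:
# normalize_amplitude(0.2)        -> (-0.2, 0.2)   i.e. [-20%, +20%]
# normalize_amplitude((0.2, 0.4)) -> (-0.2, 0.4)   i.e. [-20%, +40%]
# normalize_amplitude(1.5)        -> (-1.0, 1.5)   lower bound rounded to one
```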

> **Review comment:** Does this allow injecting custom augmentation? For example, suppose I want to apply Gaussian blur or channel-wise contrast; can I inject that into it directly?
>
> **Review comment:** Yes, you just need to create your own layer.
>
> **Review comment:** And a single layer can have multiple transformations inside `call`, which can be defined by other libraries like imgaug or albumentations, right?
>
> **Author reply:** Yes, that's right. Custom layers offer you great flexibility to implement your own transformations. But note that all transformations should be defined using TF ops if you want performance. Otherwise you'd have to step out of graph execution (it's still technically doable, though).
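
A sketch of what such a custom layer might look like, using only TF ops so it stays graph-compatible. The `GaussianBlur` layer below is hypothetical and not part of the proposal; it could be dropped into an `ImagePipeline` alongside the built-in layers:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer


class GaussianBlur(Layer):
    """Hypothetical custom augmentation layer: fixed-sigma Gaussian blur."""

    def __init__(self, kernel_size=5, sigma=1.0, **kwargs):
        super(GaussianBlur, self).__init__(**kwargs)
        self.kernel_size = kernel_size
        self.sigma = sigma

    def build(self, input_shape):
        channels = int(input_shape[-1])
        # Separable 2D Gaussian kernel, replicated per channel for a
        # depthwise convolution (each channel is blurred independently).
        ax = np.arange(self.kernel_size) - (self.kernel_size - 1) / 2.0
        kernel_1d = np.exp(-(ax ** 2) / (2.0 * self.sigma ** 2))
        kernel_2d = np.outer(kernel_1d, kernel_1d)
        kernel_2d /= kernel_2d.sum()
        kernel = np.tile(kernel_2d[:, :, None, None], (1, 1, channels, 1))
        self.kernel = tf.constant(kernel, dtype=tf.float32)
        super(GaussianBlur, self).build(input_shape)

    def call(self, inputs, training=None):
        # Only blur during training; pass inputs through at inference time.
        if not training:
            return inputs
        return tf.nn.depthwise_conv2d(
            inputs, self.kernel, strides=[1, 1, 1, 1], padding='SAME')
```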



#### Notes

- We are dropping support for ZCA whitening as it is no longer popular in the computer vision community.
- We don't have immediate support for random translations along only one axis.
- We only plan on implementing support for `data_format='channels_last'`. As such this argument does not appear in the API.
> **Review comment:** Why? Does this match the expectations of accelerator users?
>
> **Author reply:** My understanding is that NVIDIA is moving towards native support for `channels_last`, removing the need to convert to `channels_first` for performance. I omitted the argument for the sake of simplicity, but we can always add support if the need arises.



#### Example implementation

```python
class RandomFlip(PreprocessingLayer):

    def __init__(self, horizontal=False, vertical=False, seed=None):
        self.horizontal = horizontal
        self.vertical = vertical
        self.seed = seed or random_int()
        self._rng = rng_from_seed(self.seed)

    def call(self, inputs, training=None, seed=None):
        seed = seed or self._rng.sample()
        if training:
            if self.horizontal:
                inputs = tf.image.random_flip_left_right(inputs, seed=seed)
            if self.vertical:
                inputs = tf.image.random_flip_up_down(inputs, seed=seed)
        return inputs
```



#### Question: how to support image segmentation in a simple way?

**Requirements:**
- Image loading and image augmentation should be synced across inputs and targets
- It should be possible to use different standardization preprocessing (outside of augmentation) across inputs and targets

**Proposal:**

```python
# Shared spatial transformations for inputs and targets
augmenter = ImagePipeline([
    RandomRotation(0.5),
    RandomFlip(vertical=True)
])

input_pipeline = ImagePipeline([
    augmenter,
    RandomBrightness(0.2),
    RandomContrast(0.2),
    RandomSaturation(0.2),
])
target_pipeline = ImagePipeline([
    augmenter,
    OneHot(num_classes)
])

input_ds = input_pipeline.from_directory(
    input_dir, targets=None, image_size=(150, 150), batch_size=32,
    seed=123)  # This seed supersedes the per-layer seed in all transformations
target_ds = target_pipeline.from_directory(
    target_dir,  # target_dir should have the same structure as input_dir.
    targets=None, image_size=(150, 150), batch_size=32, seed=123)

ds = tf.data.Dataset.zip((input_ds, target_ds))
model.fit(ds)
```

> **Review comment:** We'll need to take a lot of care with the random augmentation layers to make sure the syncing works correctly. They would need to be deterministic and "re-settable" in some fashion that's built into the API and that the `ImagePipeline`/`from_directory` APIs would take advantage of.
>
> **Review comment (@tomerk, Aug 25, 2019):** And to make matters trickier, now that I think about it, this would also need to work well with parallel processing in the dataset (and possibly with distribution strategies active). It may be worth taking a page out of JAX's book as inspiration: https://github.com/google/jax/blob/master/design_notes/prng.md
> Or using some of the mechanisms from Peng's RFC for random numbers in TF 2.0:
> https://github.com/tensorflow/community/pull/38/files?short_path=b84a5ce#diff-b84a5ce018def5de3e1396b9962feff1
>
> **Author reply:** Right, we should add a public API to control the seeding behavior, besides the `seed` argument in the constructor. Would you have a specific API in mind (in terms of methods and their signatures)?
>
> **Review comment (@tomerk, Aug 25, 2019):** Hmm, I think the safest thing would be to make random augmentation layers only use stateless random ops and support taking a `seed` argument directly in the `call` methods. The constructor-time `seed` argument would then be renamed `initial_seed`. If no `seed` argument is provided at call time, the seed generated from `initial_seed` would be used. Otherwise, the layer's `initial_seed` would be combined with the seed passed in, to make sure different layer objects act differently when passed the same seed (while a layer object shared between different models would act the same way in both when passed the same seed). So the implementation and API for `RandomFlip` would look something like:
>
> ```python
> class RandomFlip(PreprocessingLayer):
>
>     def __init__(self, horizontal=False, vertical=False, initial_seed=None):
>         self.horizontal = horizontal
>         self.vertical = vertical
>         self.initial_seed = initial_seed
>         self._initial_seed_or_random = initial_seed or random_value()
>         self._current_seed = self._initial_seed_or_random
>
>     def call(self, inputs, training=None, seed=None):
>         if seed is None:
>             seed = self._current_seed
>             self._current_seed += 1
>         else:
>             seed = seed + self._initial_seed_or_random
>
>         if training:
>             if self.horizontal:
>                 inputs = tf.image.random_flip_left_right(inputs, seed=seed)
>             if self.vertical:
>                 inputs = tf.image.random_flip_up_down(inputs, seed=seed)
>         return inputs
> ```
>
> We'll run into similar challenges as with the `training` argument of layers and models, where we have to feed it through nested models and layers to avoid bugs. We can use a similar mechanism in `__call__` to solve the problem. This sort of deterministic randomness could be generally useful for random models and layers in Keras beyond just random augmentations and preprocessing layers (e.g. for dropout layers).
>
> The `from_directory` and input methods would continue to take an optional `seed` argument. They could either provide the dataset tuple index as a seed in the `call` method of `ImagePipeline`, or use some sort of tf.function- and distribution-strategy-friendly version of `tf.random.experimental.Generator` to generate the seeds for the layers. We can check with Peng, who has been working on it, to see what the status there is. The random layers should also use `tf.random.experimental.Generator` or something similar to maintain their internal seeds, rather than raw Python in the form of `self.seed = self.seed + 1`; that way they will work correctly with tf.function and distribution strategies. A note on retracing tf.function: if we want to wrap the `call` method in a tf.function, as we do for saved models, we will need to take care to represent the seeds passed into `call` as scalar tensors rather than plain Python objects; otherwise the function will get retraced for each seed.
>
> **Author reply:** +1 to these suggestions, especially having `seed` as a `call` argument. Note that we don't need to change the name of the constructor argument; `seed` is fine.
>
> **Review comment:** So the workflow for an image-segmentation pipeline would be: create two identical ImagePipelines with the same seed, and run them on both inputs? I guess the alternatives would be to make these work on nests of images, or to have a `.reapply` method. And if someone wanted to use this for object detection, they could build a bounding-box equivalent of each layer and write a converter that makes a new Sequential pipeline with all the bounding-box layers substituted in, using the same seed. Would that be a reasonable approach to using this for object detection?

Note that the behavior whereby the `seed` argument in `from_directory` supersedes the per-layer arguments is achieved by using that seed
to sample new random ints (scalar tensors from `tf.random.experimental.Generator`) to serve as the `call` argument to each underlying layer.
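
A rough sketch of that mechanism (hypothetical; it assumes the per-layer `seed` call argument discussed in the review thread above, which is not yet a finalized API):

```python
# Hypothetical sketch of how a dataset-level seed could drive per-layer seeds.
seed_source = tf.random.experimental.Generator.from_seed(123)

def apply_pipeline(pipeline, batch):
    # One scalar seed tensor per batch; passing the same value to every
    # underlying layer keeps their random transformations in sync, and a
    # scalar tensor (rather than a Python int) avoids tf.function retracing.
    seed = seed_source.uniform_full_int([], dtype=tf.int32)
    return pipeline(batch, training=True, seed=seed)
```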


### TimeseriesGenerator

- Deprecate the existing `TimeseriesGenerator` class
- Introduce functional replacement `timeseries_dataset`:

```python
def timeseries_dataset(
data, targets, length,
sampling_rate=1,
stride=1,
start_index=0,
end_index=None,
shuffle=False,
reverse=False,
batch_size=128):
"""Utility function for generating batches of temporal data.

This function takes in a sequence of data-points gathered at
equal intervals, along with time series parameters such as
stride, length of history, etc., to produce batches for
training/validation.

# Arguments
data: Indexable generator (such as list or Numpy array)
containing consecutive data points (timesteps).
The data should be at 2D, and axis 0 is expected
to be the time dimension.
targets: Targets corresponding to timesteps in `data`.
It should have the same length as `data`.
length: Length of the output sequences (in number of timesteps).
sampling_rate: Period between successive individual timesteps
within sequences. For rate `r`, timesteps
`data[i]`, `data[i-r]`, ... `data[i - length]`
are used to create a sample sequence.
stride: Period between successive output sequences.
For stride `s`, consecutive output samples would
be centered around `data[i]`, `data[i+s]`, `data[i+2*s]`, etc.
start_index: Data points earlier than `start_index` will not be used
in the output sequences. This is useful to reserve part of the
data for test or validation.
end_index: Data points later than `end_index` will not be used
in the output sequences. This is useful to reserve part of the
data for test or validation.
shuffle: Whether to shuffle output samples,
or instead draw them in chronological order.
reverse: Boolean: if `True`, timesteps in each output sample will be
in reverse chronological order.
batch_size: Number of timeseries samples in each batch
(except maybe the last one).

# Returns
A Dataset instance.
"""
```
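
For instance, usage might look like the following sketch (the random data and the 10-step-ahead target construction are illustrative assumptions):

```python
import numpy as np

# 10,000 timesteps of a 3-feature series; predict feature 0 ten steps ahead.
data = np.random.randn(10000, 3)
targets = np.roll(data[:, 0], -10)

dataset = timeseries_dataset(
    data, targets, length=100, sampling_rate=1, stride=1,
    shuffle=True, batch_size=128)

model.fit(dataset, epochs=10)
```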