Making workflows reproducible
Neural networks rely on randomization to train effectively, so training will always be a stochastic (rather than deterministic) process, and loss curves will therefore differ from run to run. Metrics such as accuracy may also differ significantly.
However, there are some measures that, taken together, help to ensure consistency in training.
Using TFRecords
One of the motivations for using TFRecords for data is to ensure consistency in which images get allocated to training and which to validation. These images are already randomized, and are not randomized further during training.
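As an illustration, the split can be fixed once by partitioning the TFRecord filenames deterministically. This is only a sketch: the gs:// path, the *.tfrec pattern and the 80/20 split are assumptions, not part of the original workflow.
import tensorflow as tf
# Hypothetical TFRecord location; adjust the pattern to your dataset layout.
filenames = sorted(tf.io.gfile.glob("gs://my-bucket/images/*.tfrec"))
# Fixed 80/20 split: because the file list is sorted, the same files end up
# in the training and validation sets on every run.
split = int(0.8 * len(filenames))
training_filenames = filenames[:split]
validation_filenames = filenames[split:]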
Use deterministic operations
cuDNN does not guarantee reproducibility in some of its GPU routines, so by default it is not deterministic. The operating-system environment variable TF_DETERMINISTIC_OPS can be used to control this behaviour: setting TF_DETERMINISTIC_OPS=1 (or, on older TensorFlow versions, TF_CUDNN_DETERMINISTIC=1) forces the selection of deterministic cuDNN convolution and max-pooling algorithms. When this is enabled, the algorithm selection procedure itself is also deterministic. However, selecting these deterministic options may reduce performance.
import os
os.environ["TF_DETERMINISTIC_OPS"] = "1"  # set before running any TensorFlow GPU ops
When we create a batched dataset, we can set the experimental_deterministic attribute of tf.data.Options() to True:
import tensorflow as tf

options = tf.data.Options()
options.experimental_deterministic = True  # preserve element order across runs
dataset = tf.data.Dataset.list_files(filenames, shuffle=False)  # shuffle=False so the file order is not re-randomized each run
dataset = dataset.with_options(options)
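To actually read the records while keeping that order reproducible, the file-level dataset can be interleaved into a TFRecordDataset with deterministic=True (a sketch only, assuming TensorFlow 2.2 or later; the cycle_length value is an assumption).
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,  # number of files read concurrently (assumed value)
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=True)  # keep element order reproducible across runs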
Use a seed for random number generation
Use a seed value and initialise np.random.seed and tf.random.set_seed with it; the seed will then apply to any subsequent NumPy and TensorFlow operations that involve random numbers.
import numpy as np

SEED = 42
np.random.seed(SEED)  # seed NumPy's global random state
tf.random.set_seed(SEED)  # seed TensorFlow's global random generator
Where possible, use a larger batch size
Larger batch sizes tend to produce more stable validation loss curves. This is usually only possible with relatively large hardware, because larger batches require more GPU memory.
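For example, with tf.distribute the batch size is often scaled by the number of accelerators available. This is only a sketch: the per-replica base size of 16, the choice of MirroredStrategy and drop_remainder=True are assumptions, not prescriptions.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # or another tf.distribute.Strategy, depending on hardware
BATCH_SIZE = 16 * strategy.num_replicas_in_sync  # 16 per replica is an assumed base value

dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # fixed-size batches; partial final batch dropped
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)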