Making workflows reproducible
Neural networks rely on randomization to train effectively, so training will always be a stochastic (rather than deterministic) process, and loss curves will therefore differ from run to run. Metrics such as accuracy may also differ significantly.
However, there are some measures that, taken together, help to ensure consistency in training.
Using TFRecords
One of the motivations for using TFRecords for data is to ensure consistency in which images get allocated to training and which to validation. These images are already randomized, and are not randomized further during training.
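As an illustration, the split can be fixed once by partitioning the TFRecord filenames deterministically. This is only a sketch: the gs:// path, the *.tfrec pattern and the 80/20 split are assumptions, not part of the original workflow.
import tensorflow as tf
# Hypothetical TFRecord location; adjust the pattern to your dataset layout.
filenames = sorted(tf.io.gfile.glob("gs://my-bucket/images/*.tfrec"))
# Fixed 80/20 split: because the file list is sorted, the same files end up
# in the training and validation sets on every run.
split = int(0.8 * len(filenames))
training_filenames = filenames[:split]
validation_filenames = filenames[split:]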
Use deterministic operations
cuDNN does not guarantee reproducibility in some of its GPU routines, so by default it is not deterministic. The operating-system environment variable TF_DETERMINISTIC_OPS can be used to control this behaviour: setting TF_DETERMINISTIC_OPS=1 (or, on older TensorFlow versions, TF_CUDNN_DETERMINISTIC=1) forces the selection of deterministic cuDNN convolution and max-pooling algorithms. When this is enabled, the algorithm selection procedure itself is also deterministic. However, selecting these deterministic options may reduce performance.
import os
os.environ["TF_DETERMINISTIC_OPS"] = "1"  # set before running any TensorFlow GPU ops
When we create a batched dataset, we can set the experimental_deterministic attribute of tf.data.Options() to True:
import tensorflow as tf

options = tf.data.Options()
options.experimental_deterministic = True  # preserve element order across runs
dataset = tf.data.Dataset.list_files(filenames, shuffle=False)  # shuffle=False so the file order is not re-randomized each run
dataset = dataset.with_options(options)
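To actually read the records while keeping that order reproducible, the file-level dataset can be interleaved into a TFRecordDataset with deterministic=True (a sketch only, assuming TensorFlow 2.2 or later; the cycle_length value is an assumption).
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,  # number of files read concurrently (assumed value)
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=True)  # keep element order reproducible across runs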
Use a seed for random number generation
Use a seed value and initialise np.random.seed and tf.random.set_seed with it; the seed will then apply to any subsequent NumPy and TensorFlow operations that involve random numbers.
import numpy as np

SEED = 42
np.random.seed(SEED)  # seed NumPy's global random state
tf.random.set_seed(SEED)  # seed TensorFlow's global random generator
Where possible, use a larger batch size
Larger batch sizes tend to produce more stable validation loss curves. This is usually only possible with relatively large hardware, because larger batches require more GPU memory.
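For example, with tf.distribute the batch size is often scaled by the number of accelerators available. This is only a sketch: the per-replica base size of 16, the choice of MirroredStrategy and drop_remainder=True are assumptions, not prescriptions.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # or another tf.distribute.Strategy, depending on hardware
BATCH_SIZE = 16 * strategy.num_replicas_in_sync  # 16 per replica is an assumed base value

dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # fixed-size batches; partial final batch dropped
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)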