Creating a Tensorflow Dataset for an image segmentation task
Use case
You have a folder called imdir that contains tens to millions of jpeg files, and another, called lab_path, that contains tens to millions of corresponding label images, also as jpeg files. Label images are 8-bit, and encode labels as unique integer pixel values.
For the code to work, the images and label images need to be jpegs. On linux, you can use the convert function from imagemagick to convert a folder of pngs into jpegs
for file in *.png
do
convert "$file" "${file%.png}.jpg"
done
Create a data pre-processing pipeline
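The snippets below assume a few imports and user-defined settings. The values shown here are placeholders (ims_per_shard, TARGET_SIZE, and AUTO are referenced further down); adjust the paths and numbers to your own data.

import os
import numpy as np
import tensorflow as tf

# placeholder paths and settings -- adjust to your own data
imdir = '/data/obx/images'            # folder of input jpeg images
lab_path = '/data/obx/labels'         # folder of corresponding label jpegs
tfrecord_dir = '/data/obx/tfrecords'  # where the TFRecord shards will be written

ims_per_shard = 100      # number of image/label pairs per TFRecord shard
TARGET_SIZE = 512        # square image size written to the TFRecords
AUTO = tf.data.experimental.AUTOTUNE  # lets tf.data tune the number of parallel calls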
Define a directory where you want to save your TFrecord files, called tfrecord_dir
Get a list of the images in imdir, get the number of images nb_images, and compute the size of each shard and the number of shards
images = tf.io.gfile.glob(imdir+os.sep+'*.jpg')
nb_images = len(images)
SHARDS = int(nb_images / ims_per_shard) + (1 if nb_images % ims_per_shard != 0 else 0)
shared_size = int(np.ceil(1.0 * nb_images / SHARDS))
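For example (hypothetical numbers), with nb_images = 1050 and ims_per_shard = 100, SHARDS works out to 11 and shared_size to 96, so the first ten shards each hold 96 image/label pairs and the last shard holds the remaining 90.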
Next, define a pre-processing workflow
dataset = tf.data.Dataset.list_files(imdir+os.sep+'*.jpg', seed=10000) # This also shuffles the images
dataset = dataset.map(read_image_and_label)
dataset = dataset.map(resize_and_crop_image, num_parallel_calls=AUTO)
dataset = dataset.map(recompress_image, num_parallel_calls=AUTO)
dataset = dataset.batch(shared_size)
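At this point each element of the dataset is a batch of jpeg-encoded byte strings (one tensor of images and one of labels). As an optional sanity check, you can peek at the first batch:

# optional: inspect the first batch; both tensors should have shape (shared_size,)
# and dtype tf.string because recompress_image re-encodes everything as jpeg bytes
for image, label in dataset.take(1):
    print(image.shape, image.dtype)
    print(label.shape, label.dtype)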
The following function reads an image from file, finds the corresponding label image, and reads that in too (this example is from the OBX dataset)
#-----------------------------------
def read_image_and_label(img_path):
    """
    "read_image_and_label(img_path)"
    This function reads an image and its corresponding label and decodes both jpegs
    into uint8 tensor arrays.
    This works by parsing out the label image filename from its image pair.
    There are different rules for non-augmented versus augmented imagery
    INPUTS:
        * img_path [tensor string]
    OPTIONAL INPUTS: None
    GLOBAL INPUTS: None
    OUTPUTS:
        * image [tensor array]
        * label [tensor array]
    """
    bits = tf.io.read_file(img_path)
    image = tf.image.decode_jpeg(bits)

    # have to use the tf.strings.regex_replace utility because img_path is a Tensor object
    lab_path = tf.strings.regex_replace(img_path, "images", "labels")
    lab_path = tf.strings.regex_replace(lab_path, ".jpg", "_deep_whitewater_shallow_no_water_label.jpg")
    lab_path = tf.strings.regex_replace(lab_path, "augimage", "auglabel")

    bits = tf.io.read_file(lab_path)
    label = tf.image.decode_jpeg(bits)
    return image, label
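To make the renaming rules concrete, here is how two (hypothetical) image paths would be mapped to their label paths by the three regex_replace calls above:

/data/obx/images/example.jpg
  -> /data/obx/labels/example_deep_whitewater_shallow_no_water_label.jpg
/data/obx/images/augimage_0003.jpg
  -> /data/obx/labels/auglabel_0003_deep_whitewater_shallow_no_water_label.jpg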
The following function crops an image and label pair to square and resizes both to TARGET_SIZE
#-----------------------------------
def resize_and_crop_image(image, label):
    """
    "resize_and_crop_image"
    This function crops to square and resizes an image and label
    INPUTS:
        * image [tensor array]
        * label [tensor array]
    OPTIONAL INPUTS: None
    GLOBAL INPUTS: TARGET_SIZE
    OUTPUTS:
        * image [tensor array]
        * label [tensor array]
    """
    w = tf.shape(image)[0]
    h = tf.shape(image)[1]
    tw = TARGET_SIZE
    th = TARGET_SIZE
    resize_crit = (w * th) / (h * tw)
    image = tf.cond(resize_crit < 1,
                    lambda: tf.image.resize(image, [w*tw/w, h*tw/w]), # if true
                    lambda: tf.image.resize(image, [w*th/h, h*th/h])  # if false
                   )
    nw = tf.shape(image)[0]
    nh = tf.shape(image)[1]
    image = tf.image.crop_to_bounding_box(image, (nw - tw) // 2, (nh - th) // 2, tw, th)

    label = tf.cond(resize_crit < 1,
                    lambda: tf.image.resize(label, [w*tw/w, h*tw/w]), # if true
                    lambda: tf.image.resize(label, [w*th/h, h*th/h])  # if false
                   )
    label = tf.image.crop_to_bounding_box(label, (nw - tw) // 2, (nh - th) // 2, tw, th)
    return image, label
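As a concrete example of the logic above (using hypothetical dimensions): if TARGET_SIZE = 512 and tf.shape gives w = 4912 and h = 7360, then resize_crit = (4912*512)/(7360*512) ≈ 0.67, which is less than 1, so the pair is resized to roughly [512, 767] (the first dimension pinned to TARGET_SIZE, aspect ratio preserved) and then center-cropped to 512 x 512 by crop_to_bounding_box. The label is resized and cropped identically, so the pair stays aligned.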
This function takes an image and label pair as tensor arrays, casts each to 8-bit, and re-encodes them as jpeg byte strings
#-----------------------------------
def recompress_image(image, label):
    """
    "recompress_image"
    This function takes an image and label pair as tensor arrays,
    casts each to 8-bit, and re-encodes them as jpeg byte strings
    INPUTS:
        * image [tensor array]
        * label [tensor array]
    OPTIONAL INPUTS: None
    GLOBAL INPUTS: None
    OUTPUTS:
        * image [bytestring]
        * label [bytestring]
    """
    image = tf.cast(image, tf.uint8)
    image = tf.image.encode_jpeg(image, optimize_size=True, chroma_downsampling=False)
    label = tf.cast(label, tf.uint8)
    label = tf.image.encode_jpeg(label, optimize_size=True, chroma_downsampling=False)
    return image, label
Write TFrecord shards to disk
Finally, write the dataset to tfrecord format files for 'analysis ready data' that is highly compressed and easy to share
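The loop below relies on a to_seg_tfrecord helper (defined elsewhere in this workflow) that packs each image/label pair of byte strings into a tf.train.Example. A minimal sketch of what that helper might look like, assuming two bytestring features named "image" and "label":

def _bytestring_feature(list_of_bytestrings):
    # wrap raw byte strings as a tf.train.Feature
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=list_of_bytestrings))

def to_seg_tfrecord(img_bytes, label_bytes):
    # one Example per image/label pair, both stored as jpeg-encoded bytes
    feature = {
        "image": _bytestring_feature([img_bytes]),
        "label": _bytestring_feature([label_bytes]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))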
for shard, (image, label) in enumerate(dataset):
    shard_size = image.numpy().shape[0]
    filename = tfrecord_dir+os.sep+"obx" + "{:02d}-{}.tfrec".format(shard, shard_size)

    with tf.io.TFRecordWriter(filename) as out_file:
        for i in range(shard_size):
            example = to_seg_tfrecord(image.numpy()[i], label.numpy()[i])
            out_file.write(example.SerializeToString())
        print("Wrote file {} containing {} records".format(filename, shard_size))
Image augmentation
Do you need to augment your imagery (create transformations of the image and label pairs and write them to disk)?
Define image dimensions
NY = 7360
NX = 4912
Define two ImageDataGenerator instances with the same arguments, one for images and the other for labels (masks)
data_gen_args = dict(featurewise_center=False,
                     featurewise_std_normalization=False,
                     rotation_range=5,
                     width_shift_range=0.1,
                     height_shift_range=0.1,
                     zoom_range=0.2)

image_datagen = tf.keras.preprocessing.image.ImageDataGenerator(**data_gen_args)
mask_datagen = tf.keras.preprocessing.image.ImageDataGenerator(**data_gen_args)
You likely can't read all your images into memory, so you'll have to do this in batches. The following loop grabs a new random batch of images, applies the augmentation generators to create alternate versions, then writes those alternate versions of the images and labels to disk
from imageio import imwrite  # one option for writing jpeg files to disk

i = 1
for k in range(14):
    # set a different seed each time to get a new batch of ten
    seed = int(np.random.randint(0, 100, size=1))
    img_generator = image_datagen.flow_from_directory(
            imdir,
            target_size=(NX, NY),
            batch_size=10,
            class_mode=None, seed=seed, shuffle=True)

    # the seed must be the same as for the image generator so images and labels stay paired
    mask_generator = mask_datagen.flow_from_directory(
            lab_path,
            target_size=(NX, NY),
            batch_size=10,
            class_mode=None, seed=seed, shuffle=True)

    # the following merges the two generators (and their flows) together:
    train_generator = (pair for pair in zip(img_generator, mask_generator))

    # grab a batch of 10 images and label images
    x, y = next(train_generator)

    # write them to file and increment the counter
    for im, lab in zip(x, y):
        imwrite(imdir+os.sep+'augimage_000'+str(i)+'.jpg', im)
        imwrite(lab_path+os.sep+'auglabel_000'+str(i)+'_deep_whitewater_shallow_no_water_label.jpg', lab)
        i += 1

    # save memory
    del x, y, im, lab
    # get a new batch
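Because the augmented files are written back into imdir and lab_path using the augimage/auglabel naming convention, the read_image_and_label function above will pick them up automatically (via its augmented-imagery renaming rule) the next time the TFRecord-writing pipeline is run.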