Creating a Tensorflow Dataset for an image segmentation task
Use case
You have a folder called imdir that contains tens to millions of jpeg files, and another, called lab_path, that contains tens to millions of corresponding label images, also as jpeg files. Label images are 8-bit, and encode labels as unique integer pixel values.
For the code to work, the images and label images need to be jpegs. On linux, you can use the convert function from imagemagick to convert a folder of pngs into jpegs
for file in *.png
do
convert "$file" "${file%.png}.jpg"
done
Create a data pre-processing pipeline
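The snippets below assume a few imports and user-defined settings. The values shown here are placeholders (ims_per_shard, TARGET_SIZE, and AUTO are referenced further down); adjust the paths and numbers to your own data.

import os
import numpy as np
import tensorflow as tf

# placeholder paths and settings -- adjust to your own data
imdir = '/data/obx/images'            # folder of input jpeg images
lab_path = '/data/obx/labels'         # folder of corresponding label jpegs
tfrecord_dir = '/data/obx/tfrecords'  # where the TFRecord shards will be written

ims_per_shard = 100      # number of image/label pairs per TFRecord shard
TARGET_SIZE = 512        # square image size written to the TFRecords
AUTO = tf.data.experimental.AUTOTUNE  # lets tf.data tune the number of parallel calls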
Define a directory where you want to save your TFrecord files, called tfrecord_dir
Get a list of the images in imdir, get the number of images nb_images, and compute the size of each shard and the number of shards
images = tf.io.gfile.glob(imdir+os.sep+'*.jpg')
nb_images = len(images)
SHARDS = int(nb_images / ims_per_shard) + (1 if nb_images % ims_per_shard != 0 else 0)
shared_size = int(np.ceil(1.0 * nb_images / SHARDS))
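For example (hypothetical numbers), with nb_images = 1050 and ims_per_shard = 100, SHARDS works out to 11 and shared_size to 96, so the first ten shards each hold 96 image/label pairs and the last shard holds the remaining 90.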
Next, define a pre-processing workflow
dataset = tf.data.Dataset.list_files(imdir+os.sep+'*.jpg', seed=10000) # This also shuffles the images
dataset = dataset.map(read_image_and_label)
dataset = dataset.map(resize_and_crop_image, num_parallel_calls=AUTO)
dataset = dataset.map(recompress_image, num_parallel_calls=AUTO)
dataset = dataset.batch(shared_size)
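At this point each element of the dataset is a batch of jpeg-encoded byte strings (one tensor of images and one of labels). As an optional sanity check, you can peek at the first batch:

# optional: inspect the first batch; both tensors should have shape (shared_size,)
# and dtype tf.string because recompress_image re-encodes everything as jpeg bytes
for image, label in dataset.take(1):
    print(image.shape, image.dtype)
    print(label.shape, label.dtype)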
The following function reads an image from file, finds the corresponding label image, and reads that in too (this example is from the OBX dataset)
#-----------------------------------
def read_image_and_label(img_path):
    """
    "read_image_and_label(img_path)"
    This function reads an image and its corresponding label and decodes both jpegs
    into uint8 tensor arrays.
    This works by parsing out the label image filename from its image pair.
    There are different rules for non-augmented versus augmented imagery
    INPUTS:
        * img_path [tensor string]
    OPTIONAL INPUTS: None
    GLOBAL INPUTS: None
    OUTPUTS:
        * image [tensor array]
        * label [tensor array]
    """
    bits = tf.io.read_file(img_path)
    image = tf.image.decode_jpeg(bits)

    # have to use the tf.strings.regex_replace utility because img_path is a Tensor object
    lab_path = tf.strings.regex_replace(img_path, "images", "labels")
    lab_path = tf.strings.regex_replace(lab_path, ".jpg", "_deep_whitewater_shallow_no_water_label.jpg")
    lab_path = tf.strings.regex_replace(lab_path, "augimage", "auglabel")

    bits = tf.io.read_file(lab_path)
    label = tf.image.decode_jpeg(bits)
    return image, label
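To make the renaming rules concrete, here is how two (hypothetical) image paths would be mapped to their label paths by the three regex_replace calls above:

/data/obx/images/example.jpg
  -> /data/obx/labels/example_deep_whitewater_shallow_no_water_label.jpg
/data/obx/images/augimage_0003.jpg
  -> /data/obx/labels/auglabel_0003_deep_whitewater_shallow_no_water_label.jpg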
The following function crops an image and label pair to square and resizes both to TARGET_SIZE
#-----------------------------------
def resize_and_crop_image(image, label):
    """
    "resize_and_crop_image"
    This function crops to square and resizes an image and label
    INPUTS:
        * image [tensor array]
        * label [tensor array]
    OPTIONAL INPUTS: None
    GLOBAL INPUTS: TARGET_SIZE
    OUTPUTS:
        * image [tensor array]
        * label [tensor array]
    """
    w = tf.shape(image)[0]
    h = tf.shape(image)[1]
    tw = TARGET_SIZE
    th = TARGET_SIZE
    resize_crit = (w * th) / (h * tw)
    image = tf.cond(resize_crit < 1,
                    lambda: tf.image.resize(image, [w*tw/w, h*tw/w]), # if true
                    lambda: tf.image.resize(image, [w*th/h, h*th/h])  # if false
                   )
    nw = tf.shape(image)[0]
    nh = tf.shape(image)[1]
    image = tf.image.crop_to_bounding_box(image, (nw - tw) // 2, (nh - th) // 2, tw, th)

    label = tf.cond(resize_crit < 1,
                    lambda: tf.image.resize(label, [w*tw/w, h*tw/w]), # if true
                    lambda: tf.image.resize(label, [w*th/h, h*th/h])  # if false
                   )
    label = tf.image.crop_to_bounding_box(label, (nw - tw) // 2, (nh - th) // 2, tw, th)
    return image, label
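As a concrete example of the logic above (using hypothetical dimensions): if TARGET_SIZE = 512 and tf.shape gives w = 4912 and h = 7360, then resize_crit = (4912*512)/(7360*512) ≈ 0.67, which is less than 1, so the pair is resized to roughly [512, 767] (the first dimension pinned to TARGET_SIZE, aspect ratio preserved) and then center-cropped to 512 x 512 by crop_to_bounding_box. The label is resized and cropped identically, so the pair stays aligned.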
This function takes an image and label pair as tensor arrays, casts each to 8-bit, and re-encodes them as jpeg byte strings
#-----------------------------------
def recompress_image(image, label):
    """
    "recompress_image"
    This function takes an image and label pair as tensor arrays,
    casts each to 8-bit, and re-encodes them as jpeg byte strings
    INPUTS:
        * image [tensor array]
        * label [tensor array]
    OPTIONAL INPUTS: None
    GLOBAL INPUTS: None
    OUTPUTS:
        * image [bytestring]
        * label [bytestring]
    """
    image = tf.cast(image, tf.uint8)
    image = tf.image.encode_jpeg(image, optimize_size=True, chroma_downsampling=False)
    label = tf.cast(label, tf.uint8)
    label = tf.image.encode_jpeg(label, optimize_size=True, chroma_downsampling=False)
    return image, label
Write TFrecord shards to disk
Finally, write the dataset to tfrecord format files for 'analysis ready data' that is highly compressed and easy to share
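The loop below relies on a to_seg_tfrecord helper (defined elsewhere in this workflow) that packs each image/label pair of byte strings into a tf.train.Example. A minimal sketch of what that helper might look like, assuming two bytestring features named "image" and "label":

def _bytestring_feature(list_of_bytestrings):
    # wrap raw byte strings as a tf.train.Feature
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=list_of_bytestrings))

def to_seg_tfrecord(img_bytes, label_bytes):
    # one Example per image/label pair, both stored as jpeg-encoded bytes
    feature = {
        "image": _bytestring_feature([img_bytes]),
        "label": _bytestring_feature([label_bytes]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))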
for shard, (image, label) in enumerate(dataset):
    shard_size = image.numpy().shape[0]
    filename = tfrecord_dir+os.sep+"obx" + "{:02d}-{}.tfrec".format(shard, shard_size)

    with tf.io.TFRecordWriter(filename) as out_file:
        for i in range(shard_size):
            example = to_seg_tfrecord(image.numpy()[i], label.numpy()[i])
            out_file.write(example.SerializeToString())
        print("Wrote file {} containing {} records".format(filename, shard_size))
Image augmentation
Do you need to augment your imagery (create transformations of the image and label pairs and write them to disk)?
Define image dimensions
NY = 7360
NX = 4912
Define two ImageDataGenerator instances with the same arguments, one for images and the other for labels (masks)
data_gen_args = dict(featurewise_center=False,
                     featurewise_std_normalization=False,
                     rotation_range=5,
                     width_shift_range=0.1,
                     height_shift_range=0.1,
                     zoom_range=0.2)

image_datagen = tf.keras.preprocessing.image.ImageDataGenerator(**data_gen_args)
mask_datagen = tf.keras.preprocessing.image.ImageDataGenerator(**data_gen_args)
You likely can't read all your images into memory, so you'll have to do this in batches. The following loop grabs a new random batch of images, applies the augmentation generators to create alternate versions, then writes those alternate versions of the images and labels to disk
from imageio import imwrite  # one option for writing jpeg files to disk

i = 1
for k in range(14):
    # set a different seed each time to get a new batch of ten
    seed = int(np.random.randint(0, 100, size=1))
    img_generator = image_datagen.flow_from_directory(
            imdir,
            target_size=(NX, NY),
            batch_size=10,
            class_mode=None, seed=seed, shuffle=True)

    # the seed must be the same as for the image generator so images and labels stay paired
    mask_generator = mask_datagen.flow_from_directory(
            lab_path,
            target_size=(NX, NY),
            batch_size=10,
            class_mode=None, seed=seed, shuffle=True)

    # the following merges the two generators (and their flows) together:
    train_generator = (pair for pair in zip(img_generator, mask_generator))

    # grab a batch of 10 images and label images
    x, y = next(train_generator)

    # write them to file and increment the counter
    for im, lab in zip(x, y):
        imwrite(imdir+os.sep+'augimage_000'+str(i)+'.jpg', im)
        imwrite(lab_path+os.sep+'auglabel_000'+str(i)+'_deep_whitewater_shallow_no_water_label.jpg', lab)
        i += 1

    # save memory
    del x, y, im, lab
    # get a new batch
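Because the augmented files are written back into imdir and lab_path using the augimage/auglabel naming convention, the read_image_and_label function above will pick them up automatically (via its augmented-imagery renaming rule) the next time the TFRecord-writing pipeline is run.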