"ML Mondays"

"ML Mondays"

  • Docs
  • Data
  • Models
  • API
  • Help
  • Blog

›Recent Posts

Recent Posts

  • From 2 geotiffs to a trained U-Net: 2D, 3D, and 4D imagery example. Part 2: Models
  • From 2 geotiffs to a trained U-Net: 2D, 3D, and 4D imagery example. Part 1: Data
  • Converting makesense.ai JSON labels to label (mask) imagery for an image segmentation project
  • Preparing a new dataset for an object recognition project
  • ML terminology, demystified

Converting between YOLO and PASCAL-VOC object recognition formats, and creating a Tensorflow Dataset

August 17, 2020

Dan Buscombe

This blog post walks through the (somewhat cumbersome - I won't lie!) process of converting between the YOLO and PASCAL-VOC 'bounding box' annotation data formats for object recognition problems.

The files we created using makesense.ai and downloaded in YOLO format (with the .txt extension) can be converted to the PASCAL-VOC format (with the .xml extension). This blog post shows you how to do that with python. I also show you how to convert to a generic csv format that is sometimes used. Finally, I show you how to convert your PASCAL-VOC format data into Tensorflow TFRecords, which use Protocol buffers, a cross-platform, cross-language library for efficient serialization of structured data.

Resources I used

I used these Tensorflow instructions for how to add a dataset, as well as some more specific Tensorflow object detection workflows, and finally the Tensorflow Model Garden, which you'll use here. This and this gave some outdated advice that was nevertheless useful, even if not used here. This provides more details on TFRecords and their usage.

First, dealing with 'empty' imagery/annotations

We first need to make sure there is a .txt file for every image. A missing .txt file means the image has no annotations (i.e. no people), so we create an empty .txt file with the right name wherever one is missing.

We only need two libraries for this:

import os, glob

And one for-loop that iterates through each folder of images (test, train, and validation in the example below). If a .txt file is missing, it simply creates an empty one:

for cond in ['test', 'train', 'validation']:
    # all the images in this folder
    jpg = glob.glob(cond+'/*.jpg')

    for f in jpg:
        # the label file that should accompany this image
        file_query = f.replace('jpg','txt').replace(cond, cond+'_labels')
        if os.path.isfile(file_query):
            pass
        else:
            print("Creating %s" % (file_query))
            with open(file_query, 'w') as fp:
                pass
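As a quick sanity check (a minimal sketch, assuming the folder naming convention above), you can confirm that every image now has a matching label file:

# every folder should report equal numbers of images and label files
for cond in ['test', 'train', 'validation']:
    n_jpg = len(glob.glob(cond + '/*.jpg'))
    n_txt = len(glob.glob(cond + '_labels/*.txt'))
    print("%s: %d images, %d label files" % (cond, n_jpg, n_txt))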

Second, YOLO to PASCAL-VOC format conversion

PASCAL-VOC is a very common object recognition data format, probably more common than the YOLO format. Many example workflows will use either one of these two formats. Here we convert YOLO (.txt) format to PASCAL-VOC (.xml).
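To make the difference concrete, here is the same (hypothetical) box in both formats. A YOLO record is one row per object, giving the class index followed by the box centre, width, and height, all normalized by the image dimensions; PASCAL-VOC records absolute pixel corner coordinates in XML:

# YOLO (.txt): class x_mid y_mid width height (all relative to image size)
0 0.50 0.50 0.20 0.40

# PASCAL-VOC (.xml): the same box, for a 1000 x 800 pixel image
<object>
  <name>person</name>
  <bndbox>
    <xmin>400</xmin>
    <ymin>240</ymin>
    <xmax>600</xmax>
    <ymax>560</ymax>
  </bndbox>
</object>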

Let's set up the problem. Define an IMG_PATH containing jpg images (in the example below, called test), and a corresponding folder containing the associated .txt files (called test_labels below). This is what my file paths look like on my Linux box:

IMG_PATH = "/media/marda/TWOTB/USGS/SOFTWARE/MLMONDAYS/2_ObjRecog/test"

# txt_folder is the folder of .txt label files exported from makesense.ai (rect boxes)
txt_folder = "/media/marda/TWOTB/USGS/SOFTWARE/MLMONDAYS/2_ObjRecog/test_labels"

We list the image files, set an output path, and define the list of labels. We only have one label, person. The code below also needs the csv, lxml, and PIL libraries, so import those here too:

import csv
from lxml import etree
from PIL import Image

fw = os.listdir(IMG_PATH)
# path to save xml files to; keep it blank to write to the current directory
save_path = ''

labels = ['person']
label = ''

Some utilities:

def csvread(fn):
    with open(fn, 'r') as csvfile:
        list_arr = []
        reader = csv.reader(csvfile, delimiter=' ')
        for row in reader:
            list_arr.append(row)
    return list_arr

def convert_label(txt_file):
    global label
    for i in range(len(labels)):
        if txt_file[0] == str(i):
            label = labels[i]
            return label
    return label
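For example (using a hypothetical label file), csvread returns each space-delimited row as a list of strings, and convert_label maps the class index back to its name:

rows = csvread('test_labels/some_image.txt')  # hypothetical file
print(rows)                       # e.g. [['0', '0.50', '0.50', '0.20', '0.40']]
print(convert_label(rows[0][0]))  # 'person'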

This is the code that extracts the bounding box coordinates from a YOLO record, converting the relative centre/width/height values into absolute pixel min/max values:

def extract_coor(txt_file, img_width, img_height):
    x_rect_mid = float(txt_file[1])
    y_rect_mid = float(txt_file[2])
    width_rect = float(txt_file[3])
    height_rect = float(txt_file[4])

    x_min_rect = ((2 * x_rect_mid * img_width) - (width_rect * img_width)) / 2
    x_max_rect = ((2 * x_rect_mid * img_width) + (width_rect * img_width)) / 2
    y_min_rect = ((2 * y_rect_mid * img_height) -
                  (height_rect * img_height)) / 2
    y_max_rect = ((2 * y_rect_mid * img_height) +
                  (height_rect * img_height)) / 2

    return x_min_rect, x_max_rect, y_min_rect, y_max_rect
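A quick check with hypothetical values: in a 1000 x 800 pixel image, a box centred at (0.5, 0.5) with relative width 0.2 and height 0.4 should map to a 200 x 320 pixel box in the middle of the image:

print(extract_coor(['0', '0.50', '0.50', '0.20', '0.40'], 1000, 800))
# (400.0, 600.0, 240.0, 560.0)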

Loop through each file (in fw) and carry out the conversion, writing one xml format file for each txt file you have:

for line in fw:
    root = etree.Element("annotation")

    # try debug to check your path
    img_style = IMG_PATH.split('/')[-1]
    img_name = line
    image_info = IMG_PATH + "/" + line
    img_txt_root = txt_folder + "/" + line[:-4]
    # print(img_txt_root)
    txt = ".txt"

    txt_path = img_txt_root + txt
    # print(txt_path)
    txt_file = csvread(txt_path)

    # read the image information
    img_size = Image.open(image_info).size

    img_width = img_size[0]
    img_height = img_size[1]
    # .layers gives the number of channels of a JPEG image
    img_depth = Image.open(image_info).layers

    folder = etree.Element("folder")
    folder.text = "%s" % (img_style)

    filename = etree.Element("filename")
    filename.text = "%s" % (img_name)

    path = etree.Element("path")
    path.text = "%s" % (IMG_PATH)

    source = etree.Element("source")

    source_database = etree.SubElement(source, "database")
    source_database.text = "Unknown"

    size = etree.Element("size")
    image_width = etree.SubElement(size, "width")
    image_width.text = "%d" % (img_width)

    image_height = etree.SubElement(size, "height")
    image_height.text = "%d" % (img_height)

    image_depth = etree.SubElement(size, "depth")
    image_depth.text = "%d" % (img_depth)

    segmented = etree.Element("segmented")
    segmented.text = "0"

    root.append(folder)
    root.append(filename)
    root.append(path)
    root.append(source)
    root.append(size)
    root.append(segmented)

    for ii in range(len(txt_file)):
        label = convert_label(txt_file[ii][0])
        x_min_rect, x_max_rect, y_min_rect, y_max_rect = extract_coor(
            txt_file[ii], img_width, img_height)

        object = etree.Element("object")
        name = etree.SubElement(object, "name")
        name.text = "%s" % (label)

        pose = etree.SubElement(object, "pose")
        pose.text = "Unspecified"

        truncated = etree.SubElement(object, "truncated")
        truncated.text = "0"

        difficult = etree.SubElement(object, "difficult")
        difficult.text = "0"

        bndbox = etree.SubElement(object, "bndbox")
        xmin = etree.SubElement(bndbox, "xmin")
        xmin.text = "%d" % (x_min_rect)
        ymin = etree.SubElement(bndbox, "ymin")
        ymin.text = "%d" % (y_min_rect)
        xmax = etree.SubElement(bndbox, "xmax")
        xmax.text = "%d" % (x_max_rect)
        ymax = etree.SubElement(bndbox, "ymax")
        ymax.text = "%d" % (y_max_rect)

        root.append(object)

    file_output = etree.tostring(root, pretty_print=True, encoding='UTF-8')
    # print(file_output.decode('utf-8'))
    with open('%s.xml' % (img_name[:-4]), 'w', encoding="utf-8") as ff:
        ff.write(file_output.decode('utf-8'))
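Because save_path is blank, the xml files are written to the current working directory. You may wish to collect them into their own folder (for example, test_labels_xml, which is used again in the optional csv section below):

mkdir test_labels_xml
mv *.xml test_labels_xml/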

Third, create a TF-RECORD from the PASCAL-VOC data

The preferred way to carry out this procedure seems to change regularly, so it can be tricky to find up-to-date instructions. Let's start by creating a new conda environment for this task, called tf_test_py36, containing a specific version of python (my current go-to at the time of writing is 3.6 rather than the stable 3.7, because of dependency issues that can sometimes arise on Windows):

conda create --name tf_test_py36 python=3.6 tensorflow lxml contextlib2

and activate:

conda activate tf_test_py36

Getting Tensorflow Garden set up

Clone the Tensorflow Garden GitHub repository:

git clone https://github.com/tensorflow/models.git

Add the top-level /models folder to your system Python path.

export PYTHONPATH=$PYTHONPATH:/media/marda/TWOTB/USGS/SOFTWARE/models

Install the other dependencies, then compile the protocol buffer definitions:

pip install --user -r official/requirements.txt
cd research
protoc object_detection/protos/*.proto --python_out=.

Create the tf-record

(Your current directory should be models/research)

This workflow is specific to the POB (People on Beaches) dataset that only has one label, so first, create a file called object_detection/data/pob_label_map.pbtxt and copy the following into it:

item {
  id: 1
  name: 'person'
}
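For reference, the label_map_util.get_label_map_dict function (used in main() below) turns this file into a simple name-to-id mapping:

from object_detection.utils import label_map_util
label_map_dict = label_map_util.get_label_map_dict('object_detection/data/pob_label_map.pbtxt')
print(label_map_dict)  # {'person': 1}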

Second, create a new folder called POB_images, and copy all your jpg files and corresponding xml files into it - all together.

Third, create a new file object_detection/dataset_tools/create_pob_tf_record.py and copy the following code into it. The variable num_shards is the number of pieces the record is split into; the exact number doesn't matter much for a dataset this size, and we use 10.

from glob import glob
import hashlib, io, os, logging, random, re, contextlib2
from lxml import etree
import numpy as np
import PIL.Image
import tensorflow.compat.v1 as tf

from object_detection.dataset_tools import tf_record_creation_util
from object_detection.utils import dataset_util
from object_detection.utils import label_map_util

flags = tf.app.flags
flags.DEFINE_string('data_dir', '', 'Root directory to raw POB dataset.')
flags.DEFINE_string('output_dir', '', 'Path to directory to output TFRecords.')
flags.DEFINE_string('label_map_path', 'data/pob_label_map.pbtxt',
                    'Path to label map proto')
flags.DEFINE_integer('num_shards', 10, 'Number of TFRecord shards')

FLAGS = flags.FLAGS

The following is the main function that gets called to carry out the conversion. It creates a single tf.Example message (or protobuf), which is a flexible message type that represents a {"string": value} mapping:

def dict_to_tf_example(data,
                       label_map_dict,
                       image_subdirectory,
                       ignore_difficult_instances=False):
  """Convert XML derived dict to tf.Example proto.

  Notice that this function normalizes the bounding box coordinates provided
  by the raw data.

  Args:
    data: dict holding PASCAL XML fields for a single image (obtained by
      running dataset_util.recursive_parse_xml_to_dict)
    label_map_dict: A map from string label names to integers ids.
    image_subdirectory: String specifying subdirectory within the
      Pascal dataset directory holding the actual image data.
    ignore_difficult_instances: Whether to skip difficult instances in the
      dataset  (default: False).

  Returns:
    example: The converted tf.Example.

  Raises:
    ValueError: if the image pointed to by data['filename'] is not a valid JPEG
  """
  img_path = os.path.join(image_subdirectory, data['filename'])
  with tf.gfile.GFile(img_path, 'rb') as fid:
    encoded_jpg = fid.read()
  encoded_jpg_io = io.BytesIO(encoded_jpg)
  image = PIL.Image.open(encoded_jpg_io)
  if image.format != 'JPEG':
    raise ValueError('Image format not JPEG')
  key = hashlib.sha256(encoded_jpg).hexdigest()

  width = int(data['size']['width'])
  height = int(data['size']['height'])

  xmins = []
  ymins = []
  xmaxs = []
  ymaxs = []
  classes = []
  classes_text = []
  truncated = []
  poses = []
  difficult_obj = []
  if 'object' in data:
    for obj in data['object']:
      difficult = bool(int(obj['difficult']))
      if ignore_difficult_instances and difficult:
        continue
      difficult_obj.append(int(difficult))

      xmin = float(obj['bndbox']['xmin'])
      xmax = float(obj['bndbox']['xmax'])
      ymin = float(obj['bndbox']['ymin'])
      ymax = float(obj['bndbox']['ymax'])

      xmins.append(xmin / width)
      ymins.append(ymin / height)
      xmaxs.append(xmax / width)
      ymaxs.append(ymax / height)
      class_name = 'person' # the POB dataset has only one class
      classes_text.append(class_name.encode('utf8'))
      classes.append(label_map_dict[class_name])
      truncated.append(int(obj['truncated']))
      poses.append(obj['pose'].encode('utf8'))

  feature_dict = {
      'image/height': dataset_util.int64_feature(height),
      'image/width': dataset_util.int64_feature(width),
      'image/filename': dataset_util.bytes_feature(
          data['filename'].encode('utf8')),
      'image/source_id': dataset_util.bytes_feature(
          data['filename'].encode('utf8')),
      'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
      'image/encoded': dataset_util.bytes_feature(encoded_jpg),
      'image/format': dataset_util.bytes_feature('jpeg'.encode('utf8')),
      'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
      'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
      'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
      'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
      'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
      'image/object/class/label': dataset_util.int64_list_feature(classes),
      'image/object/difficult': dataset_util.int64_list_feature(difficult_obj),
      'image/object/truncated': dataset_util.int64_list_feature(truncated),
      'image/object/view': dataset_util.bytes_list_feature(poses),
  }

  example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
  return example

This portion does the file writing (i.e. creates the .tfrecord files from the collection of tf.Example records):

def create_tf_record(output_filename,
                     num_shards,
                     label_map_dict,
                     annotations_dir,
                     image_dir,
                     examples):
  """Creates a TFRecord file from examples.

  Args:
    output_filename: Path to where output file is saved.
    num_shards: Number of shards for output file.
    label_map_dict: The label map dictionary.
    annotations_dir: Directory where annotation files are stored.
    image_dir: Directory where image files are stored.
    examples: Examples to parse and save to tf record.
  """
  with contextlib2.ExitStack() as tf_record_close_stack:
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_filename, num_shards)
    for idx, example in enumerate(examples):
      if idx % 100 == 0:
        logging.info('On image %d of %d', idx, len(examples))
      # examples are absolute paths from glob, so on a POSIX system
      # os.path.join simply returns the absolute path to the matching .xml file
      xml_path = os.path.join(annotations_dir, 'xmls', example.split('.jpg')[0] + '.xml')

      if not os.path.exists(xml_path):
        logging.warning('Could not find %s, ignoring example.', xml_path)
        continue
      with tf.gfile.GFile(xml_path, 'r') as fid:
        xml_str = fid.read()
      xml = etree.fromstring(xml_str)
      data = dataset_util.recursive_parse_xml_to_dict(xml)['annotation']

      try:
        tf_example = dict_to_tf_example(
            data,
            label_map_dict,
            image_dir)
        if tf_example:
          shard_idx = idx % num_shards
          output_tfrecords[shard_idx].write(tf_example.SerializeToString())
      except ValueError:
        logging.warning('Invalid example: %s, ignoring.', xml_path)

The main function reads all the jpg files in POB_images, as well as all the xml files in the same directory. Note that this could be set up differently, to read the xml files from a separate annotations_dir. The files are randomly shuffled; 70% are used for training and the remaining 30% for validation. It then calls the create_tf_record function twice to create the sharded .tfrecord files.

def main(_):
  data_dir = FLAGS.data_dir
  label_map_dict = label_map_util.get_label_map_dict(FLAGS.label_map_path)

  logging.info('Reading from POB dataset.')
  image_dir = annotations_dir = os.path.join(data_dir, 'POB_images')

  examples_list = glob(image_dir+'/*.jpg')

  # There is no predefined train/validation split, so we perform our own.
  random.seed(42)
  random.shuffle(examples_list)
  num_examples = len(examples_list)
  num_train = int(0.7 * num_examples)
  train_examples = examples_list[:num_train]
  val_examples = examples_list[num_train:]
  logging.info('%d training and %d validation examples.',
               len(train_examples), len(val_examples))

  train_output_path = os.path.join(FLAGS.output_dir, 'pob_train.record')
  val_output_path = os.path.join(FLAGS.output_dir, 'pob_val.record')

  # call to create the training files
  create_tf_record(
      train_output_path,
      FLAGS.num_shards,
      label_map_dict,
      annotations_dir,
      image_dir,
      train_examples)

  # call again to make the validation set
  create_tf_record(
      val_output_path,
      FLAGS.num_shards,
      label_map_dict,
      annotations_dir,
      image_dir,
      val_examples)

if __name__ == '__main__':
  tf.app.run()

And finally, run the script to make your tf-record format data:

python object_detection/dataset_tools/create_pob_tf_record.py --label_map_path=object_detection/data/pob_label_map.pbtxt  --data_dir=`pwd` --output_dir=`pwd`

This will create 10 files for the training data, and 10 for the validation set. You can now use these files to efficiently train a model using Tensorflow/Keras.
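To spot-check the result, you can parse one record back out of a shard (a minimal sketch, assuming Tensorflow 2.x eager execution; the shard filename pattern comes from tf_record_creation_util):

import tensorflow as tf

raw_dataset = tf.data.TFRecordDataset('pob_train.record-00000-of-00010')

# only a few of the features written above are needed for a quick check
feature_description = {
    'image/height': tf.io.FixedLenFeature([], tf.int64),
    'image/width': tf.io.FixedLenFeature([], tf.int64),
    'image/object/class/text': tf.io.VarLenFeature(tf.string),
}

for raw_record in raw_dataset.take(1):
    parsed = tf.io.parse_single_example(raw_record, feature_description)
    print(parsed['image/height'].numpy(), parsed['image/width'].numpy())
    print(tf.sparse.to_dense(parsed['image/object/class/text']).numpy())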

[OPTIONAL] XML to CSV

Sometimes you also see people use object annotations in csv format. Luckily we can use the xml library to help carry out the data parsing, and pandas to easily convert to a dataframe and then write a formatted csv file:

import os, glob
import pandas as pd
import xml.etree.ElementTree as ET

def xml_to_csv(path):
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text)
                     )
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

Then use like this:

image_path = os.path.join(os.getcwd(),'test_labels_xml')
xml_df = xml_to_csv(image_path)
xml_df.to_csv('test_labels.csv', index=None)
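Each row of the resulting test_labels.csv describes one object; with the hypothetical box from earlier, the output would look like:

filename,width,height,class,xmin,ymin,xmax,ymax
some_image.jpg,1000,800,person,400,240,600,560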