Converting between YOLO and PASCAL-VOC object recognition formats, and creating a Tensorflow Dataset
This blog post walks through the (somewhat cumbersome - I won't lie!) process of converting between YOLO and PASCAL-VOC 'bounding box' annotation data formats for image recognition problems.
The files we create using makesense.ai and download in YOLO format (with the .txt extension) can be converted to the PASCAL-VOC format (with the .xml extension). This blog post shows you how to do that with python. I also show you how to convert to a generic csv format that is also sometimes used. Finally, I show you how to convert your PASCAL-VOC format data into a Tensorflow TFRecord, which uses Protocol buffers: a cross-platform, cross-language library for efficient serialization of structured data.
Resources I used
I used these Tensorflow instructions for how to add a dataset, as well as some more specific Tensorflow object detection workflows, and finally the Tensorflow Model Garden, which you'll use here. This and this gave some outdated advice that was nevertheless useful, even if not used here. This provides more details on TFRecords and their usage.
First, dealing with 'empty' imagery/annotations
We first need to make sure there is a txt file for every image. Any missing .txt files are for images with no annotations (i.e. no people). So, we create an empty txt file with the right name if it is missing.
We only need two libraries for this:
import os, glob
And one for-loop that iterates through each folder of images (test, train, and validation in the example below). If a certain .txt file is missing, it simply creates an empty one:
for cond in ['test', 'train','validation']:
    jpg = glob.glob(cond+'/*.jpg')
    for f in jpg:
        file_query = f.replace('jpg','txt').replace(cond, cond+'_labels')
        if os.path.isfile(file_query):
            pass
        else:
            print("Creating %s" % (file_query))
            with open(file_query, 'w') as fp:
                pass
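As a quick sanity check (a minimal sketch, assuming the same folder layout as above), you can confirm that each folder now has as many label files as images:

for cond in ['test', 'train', 'validation']:
    n_jpg = len(glob.glob(cond + '/*.jpg'))
    n_txt = len(glob.glob(cond + '_labels/*.txt'))
    print("%s: %d images, %d label files" % (cond, n_jpg, n_txt))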
Second, YOLO to PASCAL-VOC format conversion
PASCAL-VOC is a very common object recognition data format, probably more common than the YOLO format. Many example workflows will use either one of these two formats. Here we convert YOLO (.txt) format to PASCAL-VOC (.xml).
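For reference, each line of a YOLO .txt file describes one box as a class index followed by the box centre x, centre y, width, and height, all normalized by the image dimensions. A (hypothetical) file containing two people might look like this:

0 0.512 0.634 0.086 0.210
0 0.250 0.480 0.060 0.180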
Let's set up the problem. Define an IMG_PATH containing jpg images (in the example below, called test), and a corresponding folder containing the associated .txt files (called test_labels below). This is what my file paths look like on my Linux box:
IMG_PATH = "/media/marda/TWOTB/USGS/SOFTWARE/MLMONDAYS/2_ObjRecog/test"
# txt_folder is the root folder of the .txt files created using the makesense.ai rectangle tool
txt_folder = "/media/marda/TWOTB/USGS/SOFTWARE/MLMONDAYS/2_ObjRecog/test_labels"
We define a list of labels; we only have one label, person. We also need a few more imports for the conversion code below (csv for reading the YOLO records, lxml for writing XML, and PIL for reading image dimensions):
import csv
from PIL import Image
from lxml import etree

fw = os.listdir(IMG_PATH)
# path of save xml file
save_path = ''  # keep it blank
labels = ['person']
global label
label = ''
Some utilities:
def csvread(fn):
    with open(fn, 'r') as csvfile:
        list_arr = []
        reader = csv.reader(csvfile, delimiter=' ')
        for row in reader:
            list_arr.append(row)
    return list_arr

def convert_label(txt_file):
    global label
    for i in range(len(labels)):
        if txt_file[0] == str(i):
            label = labels[i]
            return label
    return label
This is the code that extracts the info from a YOLO record:
def extract_coor(txt_file, img_width, img_height):
    x_rect_mid = float(txt_file[1])
    y_rect_mid = float(txt_file[2])
    width_rect = float(txt_file[3])
    height_rect = float(txt_file[4])
    x_min_rect = ((2 * x_rect_mid * img_width) - (width_rect * img_width)) / 2
    x_max_rect = ((2 * x_rect_mid * img_width) + (width_rect * img_width)) / 2
    y_min_rect = ((2 * y_rect_mid * img_height) - (height_rect * img_height)) / 2
    y_max_rect = ((2 * y_rect_mid * img_height) + (height_rect * img_height)) / 2
    return x_min_rect, x_max_rect, y_min_rect, y_max_rect
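For example, given a made-up YOLO record for a centred box in a 1280 x 720 image, the function returns pixel coordinates:

# hypothetical record: class 0, centred box covering 25% of the width and 50% of the height
row = ['0', '0.5', '0.5', '0.25', '0.5']
print(extract_coor(row, 1280, 720))
# (480.0, 800.0, 180.0, 540.0) -> x_min, x_max, y_min, y_max in pixels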
Loop through each file (in fw) and carry out the conversion, writing one xml format file for each txt file you have:
for line in fw:
    root = etree.Element("annotation")
    # try debug to check your path
    img_style = IMG_PATH.split('/')[-1]
    img_name = line
    image_info = IMG_PATH + "/" + line
    img_txt_root = txt_folder + "/" + line[:-4]
    # print(img_txt_root)
    txt = ".txt"
    txt_path = img_txt_root + txt
    # print(txt_path)
    txt_file = csvread(txt_path)
    # read the image information
    img_size = Image.open(image_info).size
    img_width = img_size[0]
    img_height = img_size[1]
    img_depth = Image.open(image_info).layers
    folder = etree.Element("folder")
    folder.text = "%s" % (img_style)
    filename = etree.Element("filename")
    filename.text = "%s" % (img_name)
    path = etree.Element("path")
    path.text = "%s" % (IMG_PATH)
    source = etree.Element("source")
    source_database = etree.SubElement(source, "database")
    source_database.text = "Unknown"
    size = etree.Element("size")
    image_width = etree.SubElement(size, "width")
    image_width.text = "%d" % (img_width)
    image_height = etree.SubElement(size, "height")
    image_height.text = "%d" % (img_height)
    image_depth = etree.SubElement(size, "depth")
    image_depth.text = "%d" % (img_depth)
    segmented = etree.Element("segmented")
    segmented.text = "0"
    root.append(folder)
    root.append(filename)
    root.append(path)
    root.append(source)
    root.append(size)
    root.append(segmented)
    # one <object> element per annotation in the .txt file
    for ii in range(len(txt_file)):
        label = convert_label(txt_file[ii][0])
        x_min_rect, x_max_rect, y_min_rect, y_max_rect = extract_coor(
            txt_file[ii], img_width, img_height)
        object = etree.Element("object")
        name = etree.SubElement(object, "name")
        name.text = "%s" % (label)
        pose = etree.SubElement(object, "pose")
        pose.text = "Unspecified"
        truncated = etree.SubElement(object, "truncated")
        truncated.text = "0"
        difficult = etree.SubElement(object, "difficult")
        difficult.text = "0"
        bndbox = etree.SubElement(object, "bndbox")
        xmin = etree.SubElement(bndbox, "xmin")
        xmin.text = "%d" % (x_min_rect)
        ymin = etree.SubElement(bndbox, "ymin")
        ymin.text = "%d" % (y_min_rect)
        xmax = etree.SubElement(bndbox, "xmax")
        xmax.text = "%d" % (x_max_rect)
        ymax = etree.SubElement(bndbox, "ymax")
        ymax.text = "%d" % (y_max_rect)
        root.append(object)
    file_output = etree.tostring(root, pretty_print=True, encoding='UTF-8')
    # print(file_output.decode('utf-8'))
    ff = open('%s.xml' % (img_name[:-4]), 'w', encoding="utf-8")
    ff.write(file_output.decode('utf-8'))
    ff.close()
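For a hypothetical image (example.jpg) containing a single person, the resulting .xml file looks something like this:

<annotation>
  <folder>test</folder>
  <filename>example.jpg</filename>
  <path>/media/marda/TWOTB/USGS/SOFTWARE/MLMONDAYS/2_ObjRecog/test</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>1280</width>
    <height>720</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>person</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>480</xmin>
      <ymin>180</ymin>
      <xmax>800</xmax>
      <ymax>540</ymax>
    </bndbox>
  </object>
</annotation>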
Third, create a TF-RECORD from the PASCAL-VOC data
The preferred way to carry out this procedure seems to change regularly, so it can be tricky to find current information. Let's start by creating a new conda environment for this task, called tf_test_py36, containing a specific version of python (my current go-to at the time of writing is 3.6 rather than the stable 3.7, because of dependency issues that can sometimes arise on Windows OS):
conda create --name tf_test_py36 python=3.6 tensorflow lxml contextlib2
and activate:
conda activate tf_test_py36
Getting the Tensorflow Model Garden set up
Clone the Tensorflow Model Garden GitHub repository:
git clone https://github.com/tensorflow/models.git
Add the top-level /models folder to your system Python path.
export PYTHONPATH=$PYTHONPATH:/media/marda/TWOTB/USGS/SOFTWARE/models
Install other dependencies (from inside the cloned models folder):
pip install --user -r official/requirements.txt
Then move into the research directory and compile the protocol buffers:
cd research
protoc object_detection/protos/*.proto --python_out=.
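As an optional sanity check that the object detection utilities are now importable (run from the models/research directory; these are the same modules imported by the script below):
python -c "from object_detection.utils import dataset_util, label_map_util; print('ok')"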
Create the tf-record
(Your current directory should be models/research)
This workflow is specific to the POB (People on Beaches) dataset that only has one label, so first, create a file called object_detection/data/pob_label_map.pbtxt and copy the following into it:
item {
  id: 1
  name: 'person'
}
Second, create a new folder called POB_images, and copy all your jpg files and corresponding xml files into it - all together.
Third, create a new file object_detection/dataset_tools/create_pob_tf_record.py and copy the following code into it. The variable num_shards is the number of pieces the record will be split into; the choice doesn't matter much for a dataset this size, so we use 10.
from glob import glob
import hashlib, io, os, logging, random, re, contextlib2
from lxml import etree
import numpy as np
import PIL.Image
import tensorflow.compat.v1 as tf
from object_detection.dataset_tools import tf_record_creation_util
from object_detection.utils import dataset_util
from object_detection.utils import label_map_util
flags = tf.app.flags
flags.DEFINE_string('data_dir', '', 'Root directory to the raw POB dataset.')
flags.DEFINE_string('output_dir', '', 'Path to directory to output TFRecords.')
flags.DEFINE_string('label_map_path', 'data/pob_label_map.pbtxt',
                    'Path to label map proto')
flags.DEFINE_integer('num_shards', 10, 'Number of TFRecord shards')
FLAGS = flags.FLAGS
The following function carries out the conversion for a single image. It creates a single tf.Example message (or protobuf), which is a flexible message type that represents a {"string": value} mapping:
def dict_to_tf_example(data,
                       label_map_dict,
                       image_subdirectory,
                       ignore_difficult_instances=False):
  """Convert XML derived dict to tf.Example proto.

  Notice that this function normalizes the bounding box coordinates provided
  by the raw data.

  Args:
    data: dict holding PASCAL XML fields for a single image (obtained by
      running dataset_util.recursive_parse_xml_to_dict)
    label_map_dict: A map from string label names to integers ids.
    image_subdirectory: String specifying subdirectory within the
      Pascal dataset directory holding the actual image data.
    ignore_difficult_instances: Whether to skip difficult instances in the
      dataset (default: False).

  Returns:
    example: The converted tf.Example.

  Raises:
    ValueError: if the image pointed to by data['filename'] is not a valid JPEG
  """
  img_path = os.path.join(image_subdirectory, data['filename'])
  with tf.gfile.GFile(img_path, 'rb') as fid:
    encoded_jpg = fid.read()
  encoded_jpg_io = io.BytesIO(encoded_jpg)
  image = PIL.Image.open(encoded_jpg_io)
  if image.format != 'JPEG':
    raise ValueError('Image format not JPEG')
  key = hashlib.sha256(encoded_jpg).hexdigest()

  width = int(data['size']['width'])
  height = int(data['size']['height'])

  xmins = []
  ymins = []
  xmaxs = []
  ymaxs = []
  classes = []
  classes_text = []
  truncated = []
  poses = []
  difficult_obj = []
  if 'object' in data:
    for obj in data['object']:
      difficult = bool(int(obj['difficult']))
      if ignore_difficult_instances and difficult:
        continue
      difficult_obj.append(int(difficult))

      xmin = float(obj['bndbox']['xmin'])
      xmax = float(obj['bndbox']['xmax'])
      ymin = float(obj['bndbox']['ymin'])
      ymax = float(obj['bndbox']['ymax'])

      xmins.append(xmin / width)
      ymins.append(ymin / height)
      xmaxs.append(xmax / width)
      ymaxs.append(ymax / height)
      class_name = 'person'  # get_class_name_from_filename(data['filename'])
      classes_text.append(class_name.encode('utf8'))
      classes.append(label_map_dict[class_name])
      truncated.append(int(obj['truncated']))
      poses.append(obj['pose'].encode('utf8'))

  feature_dict = {
      'image/height': dataset_util.int64_feature(height),
      'image/width': dataset_util.int64_feature(width),
      'image/filename': dataset_util.bytes_feature(
          data['filename'].encode('utf8')),
      'image/source_id': dataset_util.bytes_feature(
          data['filename'].encode('utf8')),
      'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
      'image/encoded': dataset_util.bytes_feature(encoded_jpg),
      'image/format': dataset_util.bytes_feature('jpeg'.encode('utf8')),
      'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
      'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
      'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
      'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
      'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
      'image/object/class/label': dataset_util.int64_list_feature(classes),
      'image/object/difficult': dataset_util.int64_list_feature(difficult_obj),
      'image/object/truncated': dataset_util.int64_list_feature(truncated),
      'image/object/view': dataset_util.bytes_list_feature(poses),
  }
  example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
  return example
This portion does the file writing (i.e. creates the .tfrecord files from the collection of tf.Example records):
def create_tf_record(output_filename,
                     num_shards,
                     label_map_dict,
                     annotations_dir,
                     image_dir,
                     examples):
  """Creates a TFRecord file from examples.

  Args:
    output_filename: Path to where output file is saved.
    num_shards: Number of shards for output file.
    label_map_dict: The label map dictionary.
    annotations_dir: Directory where annotation files are stored.
    image_dir: Directory where image files are stored.
    examples: Examples to parse and save to tf record.
  """
  with contextlib2.ExitStack() as tf_record_close_stack:
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_filename, num_shards)
    for idx, example in enumerate(examples):
      if idx % 100 == 0:
        logging.info('On image %d of %d', idx, len(examples))
      xml_path = os.path.join(annotations_dir, 'xmls', example.split('.jpg')[0] + '.xml')
      if not os.path.exists(xml_path):
        logging.warning('Could not find %s, ignoring example.', xml_path)
        continue
      with tf.gfile.GFile(xml_path, 'r') as fid:
        xml_str = fid.read()
      xml = etree.fromstring(xml_str)
      data = dataset_util.recursive_parse_xml_to_dict(xml)['annotation']

      try:
        tf_example = dict_to_tf_example(
            data,
            label_map_dict,
            image_dir)
        if tf_example:
          shard_idx = idx % num_shards
          output_tfrecords[shard_idx].write(tf_example.SerializeToString())
      except ValueError:
        logging.warning('Invalid example: %s, ignoring.', xml_path)
The main function reads all the jpg files in POB_images, as well as all the xml files in the same directory. Note that this could be set up differently, to read the xml files from a separate annotations_dir. The files are randomly shuffled; 70% of them become training examples and the remaining 30% are used for validation. It then calls the create_tf_record function twice, creating a set of 10 .tfrecord files for training and another 10 for validation.
def main(_):
  data_dir = FLAGS.data_dir
  label_map_dict = label_map_util.get_label_map_dict(FLAGS.label_map_path)

  logging.info('Reading from POB dataset.')
  image_dir = annotations_dir = os.path.join(data_dir, 'POB_images')
  examples_list = glob(image_dir+'/*.jpg')

  # perform our own 70/30 train/validation split
  random.seed(42)
  random.shuffle(examples_list)
  num_examples = len(examples_list)
  num_train = int(0.7 * num_examples)
  train_examples = examples_list[:num_train]
  val_examples = examples_list[num_train:]
  logging.info('%d training and %d validation examples.',
               len(train_examples), len(val_examples))

  train_output_path = os.path.join(FLAGS.output_dir, 'pob_train.record')
  val_output_path = os.path.join(FLAGS.output_dir, 'pob_val.record')
  # call to create the training files
  create_tf_record(
      train_output_path,
      FLAGS.num_shards,
      label_map_dict,
      annotations_dir,
      image_dir,
      train_examples)
  # call again to make the validation set
  create_tf_record(
      val_output_path,
      FLAGS.num_shards,
      label_map_dict,
      annotations_dir,
      image_dir,
      val_examples)


if __name__ == '__main__':
  tf.app.run()
And finally, run the script and make your tf-record format data ...
python object_detection/dataset_tools/create_pob_tf_record.py --label_map_path=object_detection/data/pob_label_map.pbtxt --data_dir=`pwd` --output_dir=`pwd`
This will create 10 files for the training data, and 10 for the validation set. You can now use these files to efficiently train a model using Tensorflow/Keras.
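If you want to verify the contents, here is a minimal sketch (assuming the default shard names created above and a TF 2.x-style API) that reads the records back with tf.data and parses one example:

import tensorflow as tf

# the sharded files are named pob_train.record-00000-of-00010, etc.
filenames = tf.io.gfile.glob('pob_train.record-*')
raw_ds = tf.data.TFRecordDataset(filenames)

# only a few of the features written above are parsed here, for illustration
feature_description = {
    'image/filename': tf.io.FixedLenFeature([], tf.string),
    'image/object/bbox/xmin': tf.io.VarLenFeature(tf.float32),
    'image/object/class/text': tf.io.VarLenFeature(tf.string),
}

for raw in raw_ds.take(1):
    parsed = tf.io.parse_single_example(raw, feature_description)
    print(parsed['image/filename'].numpy(), parsed['image/object/class/text'].values.numpy())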
[OPTIONAL] XML to CSV
Sometimes you also see people use object annotations in csv format. Luckily we can use the xml library to help carry out the data parsing, and pandas to easily convert to a dataframe, and then to a formatted csv file:
import os, glob
import pandas as pd
import xml.etree.ElementTree as ET
def xml_to_csv(path):
    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            # member[0] is <name>; member[4] is <bndbox>, whose children are xmin, ymin, xmax, ymax
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text)
                     )
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df
Then use like this:
image_path = os.path.join(os.getcwd(),'test_labels_xml')
xml_df = xml_to_csv(image_path)
xml_df.to_csv('test_labels.csv', index=None)
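And, to check the result (a minimal example):

df = pd.read_csv('test_labels.csv')
print(df.head())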