Using Rename and Replace in Python To Clean Image Data

Over the years, I've made more silly mistakes than I can count when it comes to organizing my tabular data. At this point, I could probably manage tabular data with my eyes closed. However, this was my first time really working with image data, and I made a bunch of mistakes, like saving the annotations in the wrong file format when they were supposed to be .xml. That's what we're going to dive into in this article.

For anyone who has never played with image data for object detection: you have the image file (I had .jpeg) and, for each image, an annotation file that describes the bounding box of the object in the image and what that object is. I was using .xml for the annotation files; another popular format is JSON.
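
To make that concrete, here is a minimal sketch of what a Pascal VOC style annotation tends to look like and how you might read it with Python's built-in xml module. The file name, label, and box coordinates below are made up for illustration; your own files may carry extra tags.

```python
import xml.etree.ElementTree as ET

# A made-up annotation in the usual Pascal VOC layout (all values are illustrative)
sample_annotation = """<annotation>
    <folder>images</folder>
    <filename>bus0001.jpg</filename>
    <path>datafolder/train/images/bus0001.jpg</path>
    <size><width>640</width><height>480</height><depth>3</depth></size>
    <object>
        <name>bus</name>
        <bndbox>
            <xmin>120</xmin><ymin>200</ymin><xmax>380</xmax><ymax>330</ymax>
        </bndbox>
    </object>
</annotation>"""

root = ET.fromstring(sample_annotation)

# Collect (label, (xmin, ymin, xmax, ymax)) for every labeled object in the image
labels = []
for obj in root.findall("object"):
    label = obj.find("name").text
    box = obj.find("bndbox")
    coords = tuple(int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax"))
    labels.append((label, coords))

print(root.find("filename").text, labels)
```

The path and name tags are the kind of text the scripts in this article end up fixing.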

Not only did I need to change the file format to .xml, I had also made a couple of other mistakes that I needed to fix in the annotation files. I thought these little code snippets would be valuable to someone, since I can't be the only person to make these mistakes. Hopefully this helps someone out there. The little scripts we're going to look at in this article are:

  • Changing the file path of the annotations

  • Changing "school bus" to "bus" in annotations, because I hadn't been consistent when labeling

  • Changing the names of images

  • Changing the file type to .xml

All of the bullets above basically make this a mini tutorial on for loops, replace, and rename in Python.

Before we get started: if you've tried Coursera or other MOOCs to learn Python and you're still looking for a course that'll take you much further (working in VS Code, setting up your environment, and learning through realistic projects), this is the course I used: Python Course.

The problem I'm solving

I just want to quickly provide some context, so it makes more sense why I'm talking about school buses. The data issues I'm fixing came up while working on a computer vision project to detect the school bus driving by my home. My family is lucky because the school bus has to drive past my house and turn around before picking up my daughter at the end of the driveway. We've set up a text alert for when the bus drives by, and it gives us the perfect amount of time to put our shoes on and head out the door. I came up with this project as a way to try out the CometML software that handles experiment tracking.

Changing the file path and making object names consistent

This first little snippet of code loops through all of the annotation files and changes each one. It opens an annotation file, reads it into the contents variable, uses the replace function twice to fix two different mistakes I made in the .xml file, and then writes those changes back to the file.

Since the level of this blog article is supposed to be very friendly: once you import "os", you can use os.getcwd() to get your current working directory. My file path navigates to my data from the current working directory; this is called a "relative" file path. If you were to start your file path with "C:\Users\Kristen", etc., that's an "absolute" file path. An absolute file path will absolutely work here. However, relative file paths are nice because if the folder with your project ever moves, an absolute path absolutely won't work anymore.
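
If you want to see the difference for yourself, here is a small sketch. It reuses the folder names from this article; the folder doesn't need to exist for this to run, because nothing is opened.

```python
import os

# A relative path: it only makes sense starting from the current working directory
relative_path = os.path.join("datafolder", "train", "annotation")

# Where Python is currently "standing"
print(os.getcwd())

# The same location written out as an absolute path
absolute_path = os.path.abspath(relative_path)
print(absolute_path)

# os.path.isabs() tells you which kind of path you are holding
print(os.path.isabs(relative_path), os.path.isabs(absolute_path))
```

If a script can't find your files, printing os.getcwd() is usually the quickest way to check whether you're running it from the folder you think you are.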

import os

# Where the annotation files live (relative to the current working directory)
annotation_dir = "datafolder/train/annotation/"

for file in os.listdir(annotation_dir):
    file_path = os.path.join(annotation_dir, file)

    # Read the whole annotation file into a string
    with open(file_path, "r") as f:
        contents = f.read()

    # To make multiple replacements, save the output of replace() back into the same variable each time.
    # I was updating the file path inside each annotation, but you could replace anything.
    contents = contents.replace("[content you need replaced]", "[new content]")

    # I had labeled some images with "bus" and some with "school bus"; make them all "bus"
    contents = contents.replace("school bus", "bus")

    # Write the updated contents back to the same file
    with open(file_path, "w") as f:
        f.write(contents)
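
One thing worth noticing about the label fix in the loop above: the direction of the replacement matters. Here is a small sketch on a made-up string (the name tags are an assumption based on the usual Pascal VOC layout) showing why collapsing to the shorter label is the safe direction, and one way to go the other way if you ever need to.

```python
# A made-up snippet with both labels in it
mixed_labels = "<name>bus</name> <name>school bus</name>"

# Collapsing to the shorter label is safe: "school bus" becomes "bus", plain "bus" is untouched
to_bus = mixed_labels.replace("school bus", "bus")
print(to_bus)

# Going the other way naively also rewrites the "bus" inside "school bus"
naive = mixed_labels.replace("bus", "school bus")
print(naive)

# Replacing the whole tag avoids that, because "<name>bus</name>" never
# matches inside "<name>school bus</name>"
by_tag = mixed_labels.replace("<name>bus</name>", "<name>school bus</name>")
print(by_tag)
```

Because replace() changes every occurrence it finds, it's worth testing on one file's contents before looping over the whole folder.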

Changing the names of images

Next up, I had created image datasets from video multiple times. I wrote an article about how I created my image dataset from video in R here. I'll also be writing an article about how I did it in Python; once that's written, I'll be sure to link it in this article.

If you're brand new to computer vision, the easiest way I found to get targeted image data for my model was to create the dataset myself. I did this by taking a video of the bus driving by my house, then using a script to pull frames from that video and turn them into images.

Each time I created a new set of image data (I had converted multiple videos), the names of the images started over at "image_00001". As I'm sure you know, this meant that I couldn't put all of my images in a single folder, because files with the same name would collide. It's a similar problem and we're still using the replace function, but before we were using it to change information inside a file, and now we're changing the name of the file itself. Let's dive into changing the names of the image files.

Again, this is a simple for loop. I'm looping through each image in the directory, creating a new file name for the image using "replace", then renaming the whole file path plus image name so that our image is in the directory with the appropriate name. Rename is an operation on the file: "Change the file name from x to y". Replace is a string operation: "Replace any occurrence of 'foo' in string X with 'bar'".

import os

image_dir = "datafolder/train/images/"

for file in os.listdir(image_dir):
    # replace() builds the new name as a string; nothing on disk has changed yet
    new_file = file.replace("image_0", "bus")
    # Now put the folder path together with the old and new names and rename the file on disk
    os.rename(os.path.join(image_dir, file), os.path.join(image_dir, new_file))
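
If you have several videos' worth of images, one option is to give each batch its own prefix so the names can never clash once they share a folder. This is only a sketch: the swap_prefix helper and the "video2_" prefix are names I made up for illustration, and the demo runs on a throwaway temporary folder so no real data is touched.

```python
import os
import tempfile

def swap_prefix(folder, old_prefix, new_prefix):
    """Rename every file in folder whose name starts with old_prefix."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        if not name.startswith(old_prefix):
            continue  # leave unrelated files alone
        # Keep everything after the old prefix (the frame number and extension)
        new_name = new_prefix + name[len(old_prefix):]
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        renamed.append(new_name)
    return renamed

# Demo on a temporary folder with two fake images and one unrelated file
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("image_00001.jpg", "image_00002.jpg", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    renamed_files = swap_prefix(demo_dir, "image_", "video2_")
    remaining = sorted(os.listdir(demo_dir))

print(renamed_files)
print(remaining)
```

Checking startswith() first means a stray file in the folder is skipped instead of being passed through rename with an unchanged name.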

Changing the file type to .xml

My biggest faux pas was spending the time to manually label the images but saving the annotations in the wrong file type. The documentation for labelImg was quite clear that they needed to be in Pascal VOC format, which is an .xml file. Since I had manually labeled the data, I wasn't looking to redo any of my work there. Again, we're looping through each file, this time the annotation files, and using the rename function to give each one a new name ending in .xml.

import os

path = 'datafolder/train/annotation/'

# enumerate() supplies the counter; sorted() makes the numbering repeatable
for i, filename in enumerate(sorted(os.listdir(path))):
    os.rename(os.path.join(path, filename), os.path.join(path, 'captured' + str(i) + '.xml'))
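
The loop above gives every annotation a brand new numbered name. If you'd rather keep each annotation's base name (so it still lines up with its image) and only swap the extension, os.path.splitext can help. This is a sketch under a couple of assumptions: the change_extension helper is my own name, and ".txt" stands in for whatever the wrong extension was. Keep in mind that renaming only changes the name, so it helps when the contents are already valid XML.

```python
import os
import tempfile

def change_extension(folder, old_ext, new_ext):
    """Swap old_ext for new_ext on every matching file, keeping the base name."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        base, ext = os.path.splitext(name)  # "bus0001.txt" -> ("bus0001", ".txt")
        if ext.lower() != old_ext:
            continue  # skip files that don't have the extension we're fixing
        new_name = base + new_ext
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        renamed.append(new_name)
    return renamed

# Demo on a temporary folder so no real annotations are touched
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("bus0001.txt", "bus0002.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    converted = change_extension(demo_dir, ".txt", ".xml")

print(converted)
```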


Summary

If you're playing with computer vision, I highly suggest checking out the comet_ml library.  With just a couple lines of code it'll store a snapshot of your dependencies, code and anything else you need for your model runs to be reproducible.  This is an absolute life saver when you later run into a bug and you're not sure if it's a problem with your dependencies, etc. You'll also get a bunch of metrics and graphics to help you assess your model and compare it to other model runs right out of the box.

Hopefully you feel more comfortable using the rename and replace functions in Python. I thought it was super fun to demonstrate them using image data. There are so many more pieces to the computer vision project I'm working on, and I can't wait to share them all.

