Introduction:

One of the most important tasks for a self-driving vehicle is understanding the world around it. We firmly believe that cameras and computer vision can achieve this, without relying on traditional radar or lidar. Once the vehicle has that understanding, we can program it to make decisions.

The algorithm behind this is semantic segmentation with Convolutional Neural Networks. This technology allows us to overlay masks on top of the image to understand where the road is, where the vehicles are, etc. In short, we are classifying every individual pixel of our input image.
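To make "classifying every individual pixel" concrete, here is a minimal sketch using NumPy. It assumes the network outputs one score per class for each pixel, as an array of shape (height, width, classes). The class list and the toy scores are made up for illustration and are not taken from our model.

```python
import numpy as np

# Hypothetical class list for illustration; not the exact labels used in the project.
CLASS_NAMES = ["road", "vehicle", "human", "plant"]


def scores_to_mask(scores: np.ndarray) -> np.ndarray:
    """Turn per-pixel class scores (H, W, C) into a class-index mask (H, W)."""
    if scores.ndim != 3:
        raise ValueError("expected an array of shape (height, width, classes)")
    # Each pixel is assigned the index of its highest-scoring class.
    return np.argmax(scores, axis=-1)


# Toy 2x2 "image" with made-up scores for the four classes above.
scores = np.array([
    [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1]],
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.2, 0.6, 0.1]],
])
mask = scores_to_mask(scores)
print(mask)
```

Each entry of `mask` is an index into `CLASS_NAMES`, so the mask has the same height and width as the input image.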

Our approach is also not end-to-end. In the autonomous steering system, the image input is mapped directly to a steering output; in semantic segmentation, we don’t map the camera image directly to any decision.

Figure: An inference result from the segmentation network

The goal is to use the segmentation mask returned by the network to make some decision about the vehicle’s behavior.

About the Network:

In the years since FCN (fully convolutional network) was first proposed, many different network architectures for image segmentation have appeared, notably SegNet and DeepLab. Despite their outstanding performance, those networks are not suited to our application, which has a very limited amount of processing power.

We are using ENet, a real-time image segmentation network, proposed by researchers at Purdue University and the University of Warsaw, Poland. Our implementation is done in Keras, as usual.

Figure: Taken directly from the ENet paper.

The main benefit of ENet is that it can run at ~10fps on the Jetson, which matters a great deal in an application such as a self-driving cart. The network has only about 300k parameters, yet it performed very well in training.
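A frame rate like this can be measured with a small timing helper. The sketch below is one way to do it, not necessarily how our number was obtained; `measure_fps` and its parameters are hypothetical names, and `infer` stands for any function that runs one forward pass.

```python
import time


def measure_fps(infer, frame, warmup=3, runs=20):
    """Return the average number of calls to infer(frame) per second."""
    # Warm-up calls are excluded so one-time setup cost does not skew the result.
    for _ in range(warmup):
        infer(frame)
    start = time.perf_counter()
    for _ in range(runs):
        infer(frame)
    elapsed = time.perf_counter() - start
    return runs / elapsed


# Stand-in for a real forward pass; with a Keras model this could be
# something like `lambda x: model.predict(x)`.
def fake_inference(frame):
    time.sleep(0.001)


fps = measure_fps(fake_inference, frame=None, warmup=1, runs=5)
print(f"{fps:.1f} fps")
```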

Development & Training

We used the CityScapes dataset for training. Specifically, we used their extra_training data with extra_coarse labels. During training we found, unsurprisingly, that more data points tremendously improved performance.

However, we hit a couple of bumps during development.

  1. We decided to generate a CSV file with all the image paths in order to train the model. This way, the training generator stays simple.
  2. We also implemented helper methods to turn the ground truth (label) image (shape: width * height * 3) into an array of shape (width * height * number of labels). In other words, each category, e.g. humans, cars, or road, has its own layer in the output matrix.
  3. We also took out some of the unnecessary labels, which made training and inference a lot faster.
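The three steps above can be sketched roughly as follows. This is an illustration under assumptions rather than our actual helper code: the function names, the CSV column names, and the colour values in `KEPT_COLORS` are all hypothetical. Note that NumPy stores images as (height, width, channels).

```python
import csv

import numpy as np

# Hypothetical palette: one (R, G, B) label colour per class we keep.
# Any label colour not listed here counts as a dropped label.
KEPT_COLORS = [
    (128, 64, 128),  # road
    (0, 0, 142),     # vehicle
    (220, 20, 60),   # human
    (107, 142, 35),  # plant
]


def write_path_csv(pairs, csv_path):
    """Step 1: write (image_path, label_path) rows to a CSV file."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "label"])
        writer.writerows(pairs)


def read_path_csv(csv_path):
    """Read the rows back, so a training generator only has to loop over them."""
    with open(csv_path, newline="") as f:
        return [(row["image"], row["label"]) for row in csv.DictReader(f)]


def rgb_to_one_hot(label_rgb, colors=KEPT_COLORS):
    """Steps 2 and 3: (H, W, 3) colour label -> (H, W, len(colors)) one-hot array.

    Pixels whose colour is not in `colors` stay all-zero.
    """
    label_rgb = np.asarray(label_rgb)
    one_hot = np.zeros(label_rgb.shape[:2] + (len(colors),), dtype=np.float32)
    for channel, color in enumerate(colors):
        # A pixel belongs to this class when all three colour channels match.
        one_hot[..., channel] = np.all(label_rgb == np.array(color), axis=-1)
    return one_hot
```

With fewer entries in `KEPT_COLORS`, the output array has fewer layers, which is where the speed-up from dropping labels comes from.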

Figure Legend

Green: plants

Blue: vehicles/obstacles

Purple: road

Red: human

Figure: Visualization on the CityScapes dataset

As you can see, after only a couple of hours of training, the network performed very well on the validation dataset.
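A visualization like this can be produced by colouring a class-index mask with the legend colours and blending it over the camera image. The sketch below assumes a mask of class indices; the class order and the exact RGB values are assumptions chosen to match the legend, and `colorize` and `overlay` are hypothetical helper names.

```python
import numpy as np

# Colours follow the figure legend; the class indices and RGB values are assumptions.
LEGEND = {
    0: (128, 0, 128),  # purple: road
    1: (0, 0, 255),    # blue: vehicles/obstacles
    2: (255, 0, 0),    # red: human
    3: (0, 255, 0),    # green: plants
}


def colorize(mask):
    """Map an (H, W) class-index mask to an (H, W, 3) uint8 colour image."""
    mask = np.asarray(mask)
    color = np.zeros(mask.shape + (3,), dtype=np.uint8)
    for class_id, rgb in LEGEND.items():
        color[mask == class_id] = rgb
    return color


def overlay(image, mask, alpha=0.5):
    """Blend the coloured mask over an (H, W, 3) uint8 image."""
    image = np.asarray(image, dtype=np.float32)
    blended = (1.0 - alpha) * image + alpha * colorize(mask).astype(np.float32)
    return blended.astype(np.uint8)
```

Here `alpha` controls how strongly the mask shows through: 0 gives the original image and 1 gives only the coloured mask.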

Figure: An example of a training label from the CityScapes dataset. Note that this is a fine annotation, not a coarse one.

Conclusion:

We are eager to implement this system on the vehicle. We will also program some rules for the vehicle to follow, once it understands its surroundings. If you have any questions or comments, please contact me at contact@neilnie.com. Thank you!
