Recognizing and Segmenting Objects in the Presence of Occlusion and Clutter
One of the fundamental problems of computer vision is to detect and localize objects such as humans and faces in images. Object detection is a building block for a wide range of applications, including self-driving cars, robotics, and face recognition. Though significant progress has been achieved in these tasks, it is still challenging to obtain robust results on unconstrained images. Real-world scenes usually contain more than one object, and it is very likely that some parts of an object are occluded by other objects in the scene. To tackle occlusion, image features generated by occlusion should be explicitly modeled rather than treated as noise. In this thesis, we introduce a deformable part model for detection and keypoint localization that explicitly models part occlusion. The proposed model structure makes it possible to augment positive training data with large numbers of synthetically occluded instances. This allows us to easily incorporate the statistics of occlusion patterns in a discriminatively trained model. To exploit bottom-up cues such as occluding contours and image segments, we extend the proposed model to use bottom-up class-specific segmentation in order to jointly detect the object and segment out the foreground pixels belonging to it. In these approaches, a detector for a single object category is trained and operates independently of other detections in the scene. An appealing alternative for detection in cluttered images is to move from single-object detection to whole-image parsing, where the presence of occlusion can be “explained away” by the presence of an occluding object. We model multi-object detection by classifying each pixel of the image (semantic segmentation) using a convolutional neural network (CNN). CNN architectures achieve excellent recognition performance but rely on spatial pooling, which makes it difficult to adapt them to tasks that require dense, pixel-accurate labeling.
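The idea of augmenting positive training data with synthetically occluded instances can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: each training instance is assumed to be an axis-aligned box with keypoints and per-keypoint visibility flags, and each occluder is assumed to be a random rectangle placed inside the box. `Instance`, `inside`, `synthesize_occlusions`, and the occluder size range (`min_frac`, `max_frac`) are hypothetical names and parameters chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Instance:
    # Hypothetical training instance: bounding box, keypoints, visibility flags.
    box: Box
    keypoints: List[Tuple[float, float]]
    visible: List[bool]


def inside(pt: Tuple[float, float], rect: Box) -> bool:
    """True if point pt lies within rectangle rect (borders included)."""
    x, y = pt
    x0, y0, x1, y1 = rect
    return x0 <= x <= x1 and y0 <= y <= y1


def synthesize_occlusions(inst: Instance, n: int, min_frac: float = 0.2,
                          max_frac: float = 0.6, seed: int = 0):
    """Return n (occluder, occluded copy) pairs for one positive instance.

    Assumption: occluders are random rectangles whose width and height are a
    fraction in [min_frac, max_frac] of the box size; any keypoint covered by
    the occluder is marked invisible, and already-invisible ones stay so.
    """
    rng = random.Random(seed)
    x0, y0, x1, y1 = inst.box
    w, h = x1 - x0, y1 - y0
    copies = []
    for _ in range(n):
        ow = rng.uniform(min_frac, max_frac) * w
        oh = rng.uniform(min_frac, max_frac) * h
        # Place the occluder so that it stays within the object box.
        ox = rng.uniform(x0, x1 - ow)
        oy = rng.uniform(y0, y1 - oh)
        occ = (ox, oy, ox + ow, oy + oh)
        vis = [v and not inside(p, occ)
               for p, v in zip(inst.keypoints, inst.visible)]
        copies.append((occ, Instance(inst.box, list(inst.keypoints), vis)))
    return copies


if __name__ == "__main__":
    face = Instance(box=(0, 0, 100, 100),
                    keypoints=[(30, 40), (70, 40), (50, 80)],
                    visible=[True, True, True])
    for occ, aug in synthesize_occlusions(face, n=3):
        print(occ, aug.visible)
```

Because each synthetic copy only changes visibility labels (and, in a real pipeline, the image pixels under the occluder), many occlusion patterns can be generated cheaply from a single annotated instance.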
We demonstrate that while the apparent spatial resolution of convolutional feature maps is low, the high-dimensional feature representation contains significant sub-pixel localization information. We describe a multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher-resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps. We show that this approach yields state-of-the-art semantic segmentation results without resorting to more complex random-field inference or instance-detection-driven architectures.
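One refinement step of the Laplacian-pyramid reconstruction can be illustrated on a 1-D toy signal. This is a minimal sketch, not the thesis architecture: linear 2x upsampling stands in for learned upsampling, `residual` stands in for the correction predicted from a higher-resolution feature map through a skip connection, and the multiplicative gate is assumed to be a boundary mask computed as dilation minus erosion (max-pooling minus min-pooling) of the upsampled coarse prediction. `upsample2`, `maxpool`, `boundary_mask`, and `refine` are hypothetical names.

```python
from typing import List


def upsample2(x: List[float]) -> List[float]:
    """Linear 2x upsampling of a 1-D signal (stand-in for learned upsampling)."""
    out = []
    for i, v in enumerate(x):
        nxt = x[i + 1] if i + 1 < len(x) else v  # repeat the last sample
        out.extend([v, 0.5 * (v + nxt)])
    return out


def maxpool(x: List[float], r: int = 1) -> List[float]:
    """Stride-1 max-pooling with window radius r (i.e. 1-D dilation)."""
    return [max(x[max(0, i - r): i + r + 1]) for i in range(len(x))]


def boundary_mask(p: List[float], r: int = 1) -> List[float]:
    """Dilation minus erosion: zero in flat regions, large near segment edges."""
    dil = maxpool(p, r)
    ero = [-v for v in maxpool([-v for v in p], r)]  # min-pool via negation
    return [d - e for d, e in zip(dil, ero)]


def refine(coarse: List[float], residual: List[float], r: int = 1) -> List[float]:
    """Upsample the coarse prediction and add a boundary-gated residual."""
    up = upsample2(coarse)
    if len(residual) != len(up):
        raise ValueError("residual must have twice the length of coarse")
    mask = boundary_mask(up, r)
    return [u + m * res for u, m, res in zip(up, mask, residual)]


if __name__ == "__main__":
    coarse = [0.0, 0.0, 1.0, 1.0]        # low-resolution foreground probability
    residual = [0.3] * 4 + [-0.3] * 4    # hypothetical high-resolution correction
    print(refine(coarse, residual))
```

The gate is zero wherever the coarse prediction is locally constant, so the high-resolution correction only alters predictions near segment boundaries; stacking such steps from coarse to fine levels gives the successive refinement described above.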
Text Reference: Golnaz Ghiasi. Recognizing and Segmenting Objects in the Presence of Occlusion and Clutter. PhD thesis, University of California, Irvine, November 2016.
author = "Ghiasi, Golnaz",
title = "Recognizing and Segmenting Objects in the Presence of Occlusion and Clutter",
booktitle = "PhD Thesis",
school = "University of California, Irvine",
year = "2016",
month = "11"