| Time | Name | Affiliation | Title | Authors | Abstract |
| 10:15-10:30 | Arrival and coffee | | | | |
| 10:30-10:55 | Piotr Dollár | Caltech | Evaluation of State-of-the-Art Pedestrian Detection | P. Dollár, C. Wojek, B. Schiele and P. Perona | Pedestrian detection is a key problem in computer vision, with several applications including robotics, surveillance and automotive safety. We introduce a new, more realistic dataset two orders of magnitude larger than existing datasets. The dataset contains richly annotated video, recorded from a moving vehicle, with challenging images of low-resolution and frequently occluded people. We propose improved evaluation metrics, demonstrating that commonly used per-window measures are flawed and can fail to predict system performance on full images. We also benchmark several promising detection systems, providing an overview of state-of-the-art performance and a direct, unbiased comparison of existing methods. Finally, by analyzing common failure cases, we help identify future research directions for the field. |
| 11:00-11:25 | Jan Prokaj | USC | 3-D Model Based Vehicle Recognition | Jan Prokaj and Gerard Medioni | We present a method for recognizing a vehicle’s make and model in a video clip taken from an arbitrary viewpoint. This improves on existing methods, which require a front view. In addition, we present a Bayesian approach for establishing accurate correspondences in multiple-view geometry. We take a model-based, top-down approach to classify vehicles. First, the vehicle pose is estimated in every frame by calculating its 3-D motion on a plane using a structure-from-motion algorithm. Then, exemplars from a database of 3-D models are rotated to the same pose as the vehicle in the video and projected into the image. Features in the model images and the vehicle image are matched, and a model matching score is computed. The model with the best score is identified as the model of the vehicle in the video. Results on real video sequences are presented. |
| 11:30-11:55 | Hamed Pirsiavash | UCI | Bilinear classifiers for visual recognition | Hamed Pirsiavash, Deva Ramanan, Charless Fowlkes | We describe an algorithm for learning bilinear SVMs. Bilinear classifiers are a discriminative variant of bilinear models, which capture the dependence of data on multiple factors. Such models are particularly appropriate for visual data that is better represented as a matrix or tensor than as a vector. Matrix encodings allow for more natural regularization through rank restriction. For example, a rank-one scanning-window classifier yields a separable filter. Low-rank models have fewer parameters and so are easier to regularize and faster to score at run-time. We learn low-rank models with bilinear classifiers. We also use bilinear classifiers for transfer learning by sharing linear factors between different classification tasks. Bilinear classifiers are trained with biconvex programs. Such programs are optimized with coordinate descent, where each coordinate step requires solving a convex program; in our case, we use a standard off-the-shelf SVM solver. We demonstrate bilinear SVMs on the difficult problems of people detection in video sequences and action classification of video sequences, achieving state-of-the-art results on both. |
| 12:00-12:25 | Larry Matthies | JPL | Real-time pedestrian detection and tracking for mobile robots | | Safe operation of mobile robots around people is a paramount concern, which has led DoD sponsors of mobile robot research to shift the focus of robot perception research from terrain understanding for obstacle detection to classifying which potential obstacles are people. Unlike much research on pedestrian detection, which uses monocular imagery, this application can take the availability of 3-D sensors as a given, so it makes most sense to incorporate 3-D perception in the detection and tracking process. I will present an update on our work in this area, which uses real-time stereo vision to create a “polar-perspective” map, segments candidate blobs from this map, applies a classifier to image-based and 3-D features of these blobs, and tracks map blobs over time to suppress false alarms. I will also summarize recently started extensions to this work that detect and track cars and estimate the head pose of detected pedestrians as an aid to robot path planning. |
| 12:30-1:30 | Lunch on 6th floor balcony | | | | |
| 1:30-1:55 | Oscar Beijbom | UCSD | Single image focus level assessment using SVM | Oscar Beijbom | A differential white blood cell count is the process of counting and classifying white blood cells in blood smears. It is one of the most common clinical tests and is used together with medical examinations to make diagnoses. These tests can indicate diseases such as infections, allergies, and blood cancer, and approximately 200-300 million are performed yearly around the world. Cellavision AB has developed machines that automate this work and is the global leader in this market. The method developed in this thesis will replace and improve the autofocus routine in these machines. It makes it possible to capture a focused image in only two steps instead of using an iterative multi-step algorithm like those used today in most autofocus systems, including the one currently used at Cellavision. In the proposed method, a support vector machine (SVM) is trained to assess quantitatively, from a single image, both the level and the direction of defocus for that image. The SVM is trained on features that measure both the image contrast and the image content. High precision is made possible by extracting features from different parts of the image as well as from the image as a whole. This requires the image to be segmented, and a method for doing this is proposed. Using this method, the distance to focus was classified with an error of at most 5 µm for 99.5% of the test images, and with no error for over 85% of them. A 5 µm defocus is at the borderline of what the human eye perceives as defocused. |
| 2:00-2:25 | Chaitanya Desai | UCI | Discriminative models for multi-class object layout | Chaitanya Desai, Deva Ramanan, Charless Fowlkes | Many state-of-the-art approaches for object recognition reduce the problem to a 0-1 classification task. Such reductions allow one to leverage sophisticated classifiers for learning. These models are typically trained independently for each class using positive and negative examples cropped from images. At test time, various post-processing heuristics such as non-maxima suppression (NMS) are required to reconcile multiple detections within and between different classes for each image. Though crucial to good performance on benchmarks, this post-processing is usually defined heuristically. We introduce a unified model for multi-class object recognition that casts the problem as a structured prediction task. Rather than predicting a binary label for each image window independently, our model simultaneously predicts a structured labeling of the entire image. Our model learns statistics that capture the spatial arrangements of various object classes in real images, both in terms of which arrangements to suppress through NMS and which arrangements to favor through spatial co-occurrence statistics. We formulate parameter estimation in our model as a max-margin learning problem. Given training images with ground-truth object locations, we show how to formulate learning as a convex optimization problem. We employ a cutting-plane algorithm to efficiently learn a model from thousands of training images. We show state-of-the-art results on the PASCAL VOC benchmark that indicate the benefits of learning a global model encapsulating the spatial layout of multiple object classes. |
| 2:30-2:55 | Baback Moghaddam | JPL | Low-Level Vision for Planetary Change Detection | | I will present a prototype automatic vision system for planetary image change detection. Applications include finding new craters and "gullies" on Mars from current orbiting platforms, data-mining multi-mission legacy image databases for undiscovered geologic phenomena, and automating the search for "lost" spacecraft. |
| 3:00-3:25 | Hartmut Neven | Google | Which capabilities are missing when trying to design a comprehensive visual search engine? | | Computer vision has made significant advances during the last decade. Many capabilities, such as the detection of faces or the recognition of rigid textured objects such as landmarks, now work at very satisfying levels. Across the various products and services offered by Google, we are interested in analyzing all aspects of an image crawled from the web. When designing such a comprehensive system, however, it becomes obvious that important abilities are still lacking. One example is object class recognition that scales to thousands or even millions of classes. Another area where we still face obstacles is the reliable recognition of objects that have little surface texture and are largely defined by their contours. Even a seemingly simple task such as reading text in a photo still lacks the accuracy we need. The talk describes our efforts in designing a large-scale image recognition system that can analyze any given image on the web along many dimensions. We report on the recognition disciplines in which we have made good progress but, more importantly, call out areas that still require additional work to reach production-ready solutions. |
| 3:30-4:00 | Coffee Break | | | | |
| 4:00-4:25 | Luis Goncalves | Evolution Robotics Retail | Computer Vision for Retail Fraud | TBA | A brief description of the work done at Evolution Robotics Retail using computer vision to prevent retail fraud. |
| 4:30-4:55 | Ricky Sethi | UCR | Activity Recognition Using a Physics-Based Data Driven Hamiltonian Monte Carlo | Ricky J. Sethi, Amit K. Roy-Chowdhury and Brian E. Moore | Motion and image analysis are both important for activity recognition in video. We present a new approach that extends Hamiltonian Monte Carlo (HMC) to search simultaneously over the combined motion and image space in a concerted manner using well-known Markov Chain Monte Carlo (MCMC) techniques. For motion analysis in video, we use tracks generated from the video to calculate the Hamiltonian equations of motion for the systems under study, thus using analytical Hamiltonian dynamics to derive a physically significant HMC algorithm for activity analysis. We then use image analysis to help explore both the motion energy space and the image space by integrating the Hamiltonian energy-based approach with an image-based, data-driven proposal to drive the HMC, yielding a Data Driven HMC (DDHMC). In this DDHMC, driving the Hamiltonian dynamics-based MCMC with image data greatly reduces the search space. We also develop the reverse algorithm, which uses motion energy proposals to search the image space. Experimental validation of the theory is provided on the well-known USF Gait and Weizmann datasets. While HMC has been used in other contexts, this is possibly the first paper to show how it can be used for activity recognition in video, taking into account both the image analysis results and the physical motion information of the system. In addition, the DDHMC framework has potential applications in other domains where statistical sampling techniques are useful, as we outline in the section on future work. |
| 5:00-5:25 | Anup Doshi | UCSD | Vision-based Driver Attention and Behavior Inference | Anup Doshi and Mohan M. Trivedi | We introduce a new approach to analyzing the attentive and behavioral state of a human subject, given cameras focused on the subject and their environment. In particular, we are primarily concerned with analyzing a human driver's focus of attention. Up to 80% of crashes are related to driver inattention, so it is important for an Intelligent Driver Assistance System (IDAS) to be aware of the driver's state. We present a new Bayesian paradigm for estimating human attention that specifically addresses the problems arising in live driving situations. We will then discuss several novel findings about driver behavior and how they can affect the design of a vision-based human-computer interface for driver assistance. |
| 5:30-5:55 | Pietro Perona | Caltech | Towards the visual analysis of animal behavior | TBA | TBA |
| 6:00-8:00 | Dinner | | | | |