
Each entry below lists the time, speaker, affiliation, title, authors, and abstract.


10:30  Carolina Galleguillos (UCSD)
Title: Multi-Class Object Localization by Combining Local Contextual Interactions
Authors: Carolina Galleguillos, Brian McFee, Serge Belongie and Gert Lanckriet
Abstract: Recent work in object localization has shown that the use of contextual cues can greatly improve accuracy over models that use appearance features alone. Although many of these models have successfully explored different types of contextual sources, they only consider one type of contextual interaction (e.g., pixel, region or object level interactions), leaving open questions about the true potential contribution of context. Furthermore, contributions across object classes and over appearance features remain unknown. In this work, we introduce a novel model for multi-class object localization that incorporates different levels of contextual interactions. We study contextual interactions at the pixel, region and object level using three different sources of context: semantic context, boundary support and contextual neighborhoods. Our framework learns a single similarity metric from multiple kernels, combining pixel and region interactions with appearance features, and then uses a conditional random field to incorporate object level interactions. We perform experiments on two challenging image databases, MSRC and PASCAL VOC 2007. Experimental results show that our model outperforms current state-of-the-art contextual frameworks and reveals the individual contribution of each contextual interaction level, as well as the importance of each type of feature in object localization.
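The multiple-kernel step lends itself to a small illustration. Below is a minimal sketch, not the authors' implementation, of combining several precomputed kernel (similarity) matrices, here named appearance, pixel-level context and region-level context, into a single similarity with nonnegative weights; the binary labels, the alignment-based weighting and the suggested precomputed-kernel SVM usage are all assumptions for illustration.

    # Minimal sketch: combine precomputed kernels into one similarity matrix.
    # The alignment-based weighting is illustrative, not the paper's scheme.
    import numpy as np

    def combine_kernels(kernels, labels):
        """Weight each kernel by its normalized alignment with the label kernel."""
        y = np.where(np.asarray(labels) > 0, 1.0, -1.0)
        Kyy = np.outer(y, y)                      # ideal "target" kernel (binary case)
        weights = []
        for K in kernels:
            align = np.sum(K * Kyy) / (np.linalg.norm(K) * np.linalg.norm(Kyy))
            weights.append(max(align, 0.0))
        weights = np.array(weights)
        weights /= weights.sum() + 1e-12
        return sum(w * K for w, K in zip(weights, kernels)), weights

    # Hypothetical usage with (n x n) Gram matrices K_appearance, K_pixel_ctx,
    # K_region_ctx and labels y_train:
    #   combined, w = combine_kernels([K_appearance, K_pixel_ctx, K_region_ctx], y_train)
    #   from sklearn.svm import SVC
    #   clf = SVC(kernel="precomputed").fit(combined, y_train)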


10:50  Dennis Park (UC Irvine)
Title: Multiresolution models for object detection
Authors: Dennis Park, Deva Ramanan, and Charless Fowlkes
Abstract: Most current approaches to recognition aim to be scale-invariant. However, the cues available for recognizing a 300 pixel tall object are qualitatively different from those for recognizing a 3 pixel tall object. We argue that for sensors with finite resolution, one should instead use scale-variant, or multiresolution, representations that adapt in complexity to the size of a putative detection window. We describe a multiresolution model that acts as a deformable part-based model when scoring large instances and as a rigid template when scoring small instances. We also examine the interplay of resolution and context, and demonstrate that context is most helpful for detecting low-resolution instances, when local models are limited in discriminative power. We demonstrate impressive results on the Caltech Pedestrian benchmark, which contains object instances at a wide range of scales. Whereas recent state-of-the-art methods report missed detection rates of 86%-37% at 1 false positive per image, our multiresolution model reduces the rate to 29%.
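The core idea, switching representation with window size, can be shown with a toy scoring function. The 80-pixel threshold, the two linear templates and the feature inputs below are placeholders for illustration, not the paper's actual model (no real part deformation is modeled here).

    import numpy as np

    # Toy scale-variant scorer: fine-resolution template for large windows,
    # coarse rigid template for small windows. Illustrative only.
    def score_window(feat_fine, feat_coarse, w_fine, w_coarse, window_height,
                     min_height_for_parts=80):
        if window_height >= min_height_for_parts:
            return float(np.dot(w_fine, feat_fine))    # high-res (part-level) score
        return float(np.dot(w_coarse, feat_coarse))    # low-res rigid-template score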


11:10  Ricky Sethi (UCLA and UCR)
Title: The Human Action Image and its Application to Motion Recognition
Authors: Ricky J. Sethi and Amit K. Roy-Chowdhury
Abstract: Recognizing a person's motion is intuitive for humans but represents a challenging problem in machine vision. In this paper, we present a multi-disciplinary framework for recognizing human actions. We develop a novel descriptor, the Human Action Image (HAI), a physically significant, compact representation for the motion of a person, which we derive from Hamilton's Action. We prove the additivity of Hamilton's Action in order to formulate the HAI and then embed the HAI as the Motion Energy Pathway of the neurobiological model of motion recognition. The Form Pathway is modelled using existing low-level feature descriptors based on shape and appearance. Finally, we propose a Weighted Integration (WI) methodology to combine the two pathways via statistical hypothesis testing, using the bootstrap to do the final recognition. Experimental validation of the theory is provided on the well-known Weizmann and USF Gait datasets.
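For readers unfamiliar with the physics being borrowed, Hamilton's Action is the time integral of the Lagrangian; the standard textbook form is shown below. This is only the generic definition, not the paper's image-based formulation of the HAI.

    % Hamilton's Action (standard form; the HAI's discretization is not shown):
    S[q] = \int_{t_1}^{t_2} L\bigl(q(t), \dot{q}(t), t\bigr)\, dt ,
    \qquad L = T - V \ \text{(kinetic minus potential energy)}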


11:30  Piotr Dollar (Caltech)
Title: The Fastest Pedestrian Detector in the West
Authors: Piotr Dollar, Serge Belongie, Pietro Perona
Abstract: We demonstrate a multiscale pedestrian detector operating in near real time (~6 fps on 640x480 images) with state-of-the-art detection performance. The computational bottleneck of many modern detectors is the construction of an image pyramid, typically sampled at 8-16 scales per octave, and associated feature computations at each scale. We propose a technique to avoid constructing such a finely sampled image pyramid without sacrificing performance: our key insight is that for a broad family of features, including gradient histograms, the feature responses computed at a single scale can be used to approximate feature responses at nearby scales. The approximation is accurate within an entire scale octave. This allows us to decouple the sampling of the image pyramid from the sampling of detection scales. Overall, our approximation yields a speedup of 10-100 times over competing methods with only a minor loss in detection accuracy of about 1-2% on the Caltech Pedestrian dataset across a wide range of evaluation settings. The results are confirmed on three additional datasets (INRIA, ETH, and TUD-Brussels) where our method always scores within a few percent of the state-of-the-art while being 1-2 orders of magnitude faster. The approach is general and should be widely applicable.
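The key approximation can be made concrete in a few lines. This is a simplified sketch under the assumption that an aggregated feature response follows a power law in scale, so the response at a nearby scale can be predicted from one computed response; the exponent value 1.0 is a placeholder, not a constant fitted in the paper.

    def approx_response_at_scale(response_at_1, scale, lam=1.0):
        """Approximate an aggregated feature response (e.g., mean gradient
        magnitude in a window) at a nearby scale from the response computed
        at the original scale, assuming a power-law dependence on scale.
        The exponent `lam` is feature-specific and would be estimated
        empirically; 1.0 here is only a placeholder.
        """
        return response_at_1 * scale ** (-lam)

    # Example: predict the response half an octave down without rebuilding
    # the pyramid and recomputing features at that scale.
    # r_half_octave = approx_response_at_scale(r_original, scale=2 ** 0.5)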


11:50  Kris Kitani (UCSD)
Title: Learning Action Categories for First Person Vision
Authors: Kris Kitani
Abstract: This work explores the use of ego-motion features obtained from a head-mounted camera to learn ego-centric action categories without supervision. We show that a non-parametric Bayesian mixture model can be used to efficiently learn action categories from an unlabeled continuous video database.
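As a rough, generic illustration of the unsupervised clustering step (not the paper's specific model), a Dirichlet-process-style mixture can be fit to per-frame ego-motion feature vectors with scikit-learn's BayesianGaussianMixture. The synthetic features, the truncation level and the use of this particular library are assumptions for illustration.

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # Stand-in for real per-frame ego-motion features (n_frames x d),
    # e.g., optical-flow statistics; synthetic data used here so the
    # snippet runs on its own.
    X = np.random.default_rng(0).normal(size=(500, 10))

    # Truncated Dirichlet-process mixture: unused components get near-zero
    # weight, so the effective number of "action categories" is inferred.
    dpgmm = BayesianGaussianMixture(
        n_components=20,                              # truncation level (assumed)
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
    )
    labels = dpgmm.fit_predict(X)                     # per-frame category assignments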


12:10  Lunch Break


1:30  Yoav Schechner (Caltech)
Title: Audio-Visual Association: Look at Sparse Events
Authors: Zohar Barzilay and Yoav Y. Schechner
Abstract: In complex scenes, a camcorder having a single microphone captures several audio-associated visual objects (AVOs), i.e., moving visual objects that emit sounds. At the same time, the camera may also view silent objects, and the microphone may sense sounds unrelated to the scene in view. We seek to spatially localize independent AVOs, isolate the audio component corresponding to each, and ignore the other sources. Achieving this using just a single microphone is challenging. We describe computational approaches to these problems. They are based on correlating events that are sparse in both the auditory and visual domains. In particular, audio onsets and instances of high visual spatial acceleration are temporally sparse. However, their coincidence helps establish audio-visual correspondence continuously throughout videos. In addition, the visual spatial sparsity of AVOs is a strong cue for localization.
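The temporal-coincidence cue can be illustrated with a toy matching score: count how often audio onsets fall within a short window of a visual feature's high-acceleration instants. The window length and the inputs are assumptions for illustration, not the paper's actual scoring function.

    import numpy as np

    def coincidence_score(audio_onsets, accel_peaks, window=0.1):
        """Fraction of audio onsets that have a high-visual-acceleration event
        within `window` seconds. Both inputs are 1-D arrays of event times (s).
        """
        if len(audio_onsets) == 0:
            return 0.0
        accel_peaks = np.sort(accel_peaks)
        hits = 0
        for t in audio_onsets:
            i = np.searchsorted(accel_peaks, t)
            nearby = []
            if i < len(accel_peaks):
                nearby.append(abs(accel_peaks[i] - t))
            if i > 0:
                nearby.append(abs(accel_peaks[i - 1] - t))
            if nearby and min(nearby) <= window:
                hits += 1
        return hits / len(audio_onsets)

    # The visual feature with the highest score is the best candidate AVO.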


1:50  Luis Goncalves (EvoRetail, MetaModal)
Title: Human Vision and Real Intelligence - Seeing with Sound
Authors: Luis Goncalves
Abstract: I will describe the "Seeing-with-Sound" project, where we have built a prototype sensory substitution device that transforms images into sound. With training, blind participants are able to locate, track, walk to, and grab objects. The challenges, obstacles, promise, and opportunities of this neuroplasticity-inspired field will be discussed.


2:10  Alper Ayvaci (UC Los Angeles)
Title: Occlusion Detection and Motion Estimation with Occlusions
Authors: Alper Ayvaci, Michalis Raptis, Stefano Soatto
Abstract: We tackle the problem of simultaneously detecting occlusions and estimating optical flow. We show that, under standard assumptions of Lambertian reflection and static illumination, the task can be posed as a convex optimization problem. Therefore, the solution, computed using efficient algorithms, is guaranteed to be unique and globally optimal, for any number of independently moving objects, and any number of occlusion layers. We test the proposed algorithm on benchmark datasets, expanded to enable evaluation of occlusion detection performance, in addition to motion estimation. We also discuss the shortcomings and limitations of our approach.
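One convex formulation in this spirit (a sketch with assumed notation, not necessarily the exact functional used in the talk) couples a linearized brightness-constancy residual with an explicit occlusion term that is encouraged to be sparse:

    % v is the flow field, e the occlusion residual, I_t the temporal derivative;
    % TV denotes a total-variation regularizer. Notation assumed for illustration.
    \min_{v,\, e}\;
    \big\| \nabla I \cdot v + I_t - e \big\|_{1}
    \;+\; \lambda \,\mathrm{TV}(v)
    \;+\; \mu \,\| e \|_{1}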


2:30  Julian Yarkony (UCI)
Title: Planar Cycle Covering Graphs
Authors: Julian Yarkony, Charless Fowlkes, Alexander Ihler
Abstract: We describe a new variational lower bound on the minimum energy configuration of a planar binary MRF. Our method is based on adding auxiliary nodes to every face of a planar embedding of the graph in order to capture the effect of unary potentials. A ground state of the resulting approximation can be computed efficiently by reduction to minimum-weight perfect matching. We show that optimization of the variational parameters achieves as tight a bound as dual-decomposition approaches that use the set of all cycles or all outer-planar subproblems. We demonstrate that our variational optimization converges quickly and provides high-quality solutions to hard combinatorial problems.
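For context, the quantity being bounded is the minimum of a standard binary MRF energy of the generic form below (notation assumed here, not taken from the paper); the perfect-matching reduction handles the pairwise terms, while the auxiliary nodes added to each face absorb the unary terms.

    % Generic planar binary MRF energy (assumed notation):
    E(x) \;=\; \sum_{i \in V} \theta_i\, x_i
    \;+\; \sum_{(i,j) \in E} \theta_{ij}\, \mathbf{1}[x_i \neq x_j],
    \qquad x_i \in \{0, 1\}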


2:50  Piotr Slomka (Cedars-Sinai Medical Center)
Title: Computer analysis of modern cardiac imaging data
Authors: Piotr Slomka
Abstract: The goal of our research is to automate quantitative analysis of complex cardiac imaging data. The images are obtained by Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and Single Photon Emission Computed Tomography (SPECT) modalities. New advances in these technologies allow 3D acquisition of the beating heart with high temporal and spatial resolution. Our work involves segmentation of the heart, detection and analysis of lesions in the coronary arteries and in the myocardium (heart muscle), automatic registration of multimodality data, inter-scan comparison of image changes, and quantification of physiologically important parameters. We have demonstrated that our methods result in more reproducible testing, and in some cases, a more accurate diagnosis than that provided by a visual analysis performed by experienced readers. I will provide a brief overview of the cardiac imaging methods, our latest results, and computing challenges associated with automated analysis of cardiac data.


3:10  Tali Treibitz (UCSD)
Title: Resolution Loss without Blur: Recovery Limits in Pointwise Degradation
Authors: Tali Treibitz and Yoav Schechner
Abstract: Pointwise image formation models appear in a variety of photography and computer vision problems, such as specular/diffuse reflection, irradiance falloff from point sources, attenuation and veiling in haze, direct/indirect illumination, dirty windows, semireflections, vignetting and more. An expanding array of methods has been devised to handle such problems. However, what is the recovery limit? Are there object features that cannot be effectively recovered under pointwise degradation, despite the best efforts of the recovery method? Can we quantitatively assess the recoverability of an object of a certain size and contrast? If some objects are not recovered, is there a point in trying to develop a new method to salvage them, or is their loss fundamental? We derive bounds on recovery from pointwise degradation, even if the parameters of an algorithm are perfectly set. The analysis uses a physical model for the acquired signal and noise, and also accounts for potential post-acquisition noise filtering. The analysis yields an effective cutoff frequency which is induced by noise, despite there being no optical blur in the imaging model.


The talk is based on works from ICCP'09 and CVPR'09.


3:30  Coffee Break


4:00Marco AndreettiCalifornia Institute of TechnologyUnsupervised Learning of Categorical Segments in Image CollectionsMarco Andreetto
Lihi Zelnik-Manor
Pietro Perona
Which one comes first: segmentation or recognition? We propose a unified framework for carrying out the two simultaneously and without supervision. The framework combines a flexible probabilistic model, for representing the shape and appearance of each segment, with the popular ``bag of visual words'' model for recognition.
If applied to a collection of images, our framework can simultaneously discover the segments of each image, and the correspondence between such segments, without supervision.Such recurring segments may be thought of as the `parts' of corresponding
objects that appear multiple times in the image collection. Thus, the model may be used for learning new categories, detecting/classifying objects, and segmenting images, without
using expensive human annotation.


4:20  Mohsen Hejrati (University of California at Irvine)
Title: Every Picture Tells a Story: Generating Sentences from Images
Authors: Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
Abstract: Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.
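To make the image-sentence score concrete, here is a toy sketch in which each "estimate of meaning" is reduced to a small discrete representation and the link score is simply their agreement. The (object, action, scene) triple structure and the agreement-counting score are assumptions for illustration and greatly simplify whatever the actual system computes.

    # Toy link score between an image and a sentence, assuming both have been
    # mapped to a discrete "meaning" representation. Illustrative only.
    from collections import namedtuple

    Meaning = namedtuple("Meaning", ["obj", "action", "scene"])

    def link_score(image_meaning: Meaning, sentence_meaning: Meaning) -> int:
        """Count how many of the three slots agree (0..3)."""
        return sum(a == b for a, b in zip(image_meaning, sentence_meaning))

    # Example: rank candidate sentences for one image by this score.
    img = Meaning("dog", "run", "park")
    candidates = [Meaning("dog", "run", "park"), Meaning("cat", "sit", "sofa")]
    best = max(candidates, key=lambda m: link_score(img, m))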


4:40  Steven Branson (UCSD)
Title: Visual Recognition with Humans in the Loop
Authors: Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie
Abstract: We present an interactive, hybrid human-computer method for object classification. The method applies to classes of objects that are recognizable by people with appropriate expertise (e.g., animal species or airplane model), but not (in general) by people without such expertise. It can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate our methods on Birds-200, a difficult dataset of 200 tightly-related bird species, and on the Animals With Attributes dataset. Our results demonstrate that incorporating user input drives up recognition accuracy to levels that are good enough for practical applications, while at the same time, computer vision reduces the amount of human interaction required.
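A minimal sketch of the interactive loop follows: maintain a posterior over classes (seeded, for example, by vision scores), ask the attribute question with the highest expected information gain, and update with the user's answer. The binary, noise-free answers and the probability tables are simplifying assumptions, not the paper's user-response model.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def pick_question(posterior, p_yes_given_class, asked):
        """Choose the attribute whose answer is expected to reduce entropy most.
        posterior: (C,) class probabilities.
        p_yes_given_class: (Q, C) probability each class answers "yes".
        """
        best_q, best_gain = None, -1.0
        for q in range(p_yes_given_class.shape[0]):
            if q in asked:
                continue
            p_yes = float(p_yes_given_class[q] @ posterior)
            post_yes = posterior * p_yes_given_class[q]
            post_no = posterior * (1.0 - p_yes_given_class[q])
            post_yes /= post_yes.sum() + 1e-12
            post_no /= post_no.sum() + 1e-12
            expected_h = p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no)
            gain = entropy(posterior) - expected_h
            if gain > best_gain:
                best_q, best_gain = q, gain
        return best_q

    def update(posterior, p_yes_given_class, q, answer_is_yes):
        lik = p_yes_given_class[q] if answer_is_yes else 1.0 - p_yes_given_class[q]
        posterior = posterior * lik
        return posterior / (posterior.sum() + 1e-12)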


4:10  Carl Vondrick (UC Irvine)
Title: Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces
Authors: Carl Vondrick, Deva Ramanan, Donald Patterson
Abstract: Accurately annotating entities in video is labor intensive and expensive. As the quantity of online video grows, traditional solutions to this task are unable to scale to meet the needs of researchers with limited budgets. Current practice provides a temporary solution by paying dedicated workers to label a fraction of the total frames and otherwise settling for linear interpolation. As budgets and scale require sparser key frames, the assumption of linearity fails and labels become inaccurate. To address this problem we have created a public framework for dividing the work of labeling video data into micro-tasks that can be completed by huge labor pools available through crowdsourced marketplaces. By extracting pixel-based features from manually labeled entities, we are able to leverage more sophisticated interpolation between key frames to maximize performance given a budget. Finally, by validating the power of our framework on difficult, real-world data sets we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling.
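The baseline the paper improves upon, linear interpolation of boxes between labeled key frames, takes only a few lines. The sketch below is that baseline (not the feature-based interpolation the paper proposes), with an assumed box format of (x, y, w, h).

    import numpy as np

    def interpolate_boxes(key_frames, key_boxes, query_frames):
        """Linearly interpolate (x, y, w, h) boxes between labeled key frames.
        key_frames: sorted 1-D array of frame indices with manual labels.
        key_boxes:  (K, 4) array of boxes at those frames.
        query_frames: frame indices to fill in.
        This is the simple baseline that breaks down when key frames are too
        sparse; it is not the paper's method.
        """
        key_frames = np.asarray(key_frames, dtype=float)
        key_boxes = np.asarray(key_boxes, dtype=float)
        out = np.empty((len(query_frames), 4))
        for c in range(4):
            out[:, c] = np.interp(query_frames, key_frames, key_boxes[:, c])
        return out

    # boxes = interpolate_boxes([0, 30], [[10, 20, 50, 80], [40, 22, 52, 78]],
    #                           query_frames=range(0, 31))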


5:30  Peter Welinder (Caltech)
Title: The Multidimensional Wisdom of Crowds
Authors: Peter Welinder, Steve Branson, Serge Belongie, Pietro Perona
Abstract: Distributing labeling tasks among hundreds or thousands of annotators is an increasingly important method for annotating large datasets. We present a model for the annotation process that can be used to estimate annotator skill, bias and data difficulty. Both annotator skill and data difficulty are modeled as multidimensional quantities, which allows the model to discover and represent groups of annotators that have different sets of skills and knowledge, and data that differ qualitatively. Focusing on binary image labels, we demonstrate that the model predicts ground truth labels on both synthetic and real data better than the current state-of-the-art methods. Furthermore, we show that the model is able to discriminate between different groups of annotators based only on the labels they provide.
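A much-simplified generative view of such a model is sketched below: each image has a latent signal vector, each annotator has a preferred direction and a threshold, and a binary label is produced by a noisy linear rule. The dimensions, noise level and the absence of an inference procedure are all simplifications for illustration; the paper's actual model and its estimation are richer.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_labels(n_images=100, n_annotators=8, d=2, noise=0.3):
        """Simulate binary labels under a toy multidimensional annotator model:
        image i has signal x_i in R^d, annotator j has direction w_j and bias t_j,
        and reports 1 if w_j . x_i + noise > t_j. Illustrative only.
        """
        x = rng.normal(size=(n_images, d))            # latent image signal/difficulty
        w = rng.normal(size=(n_annotators, d))        # annotator skill directions
        w /= np.linalg.norm(w, axis=1, keepdims=True)
        t = rng.normal(scale=0.5, size=n_annotators)  # annotator biases
        scores = x @ w.T + rng.normal(scale=noise, size=(n_images, n_annotators))
        return (scores > t).astype(int)               # labels[i, j]

    labels = simulate_labels()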
