| Time | Name | Affiliation | Title | Authors | Abstract |
| 10:30 | Carolina Galleguillos | UCSD | Multi-Class Object Localization by Combining Local Contextual Interactions | Carolina Galleguillos, Brian McFee, Serge Belongie and Gert Lanckriet | Recent work in object localization has shown that the use of contextual cues can greatly improve accuracy over models that use appearance features alone. Although many of these models have successfully explored different types of contextual sources, they only consider one type of contextual interaction (e.g., pixel, region or object level interactions), leaving open questions about the true potential contribution of context. Furthermore, contributions across object classes and over appearance features still remain unknown. In this work, we introduce a novel model for multiclass object localization that incorporates different levels of contextual interactions. We study contextual interactions at pixel, region and object level by using three different sources of context: semantic, boundary support and contextual neighborhoods. Our framework learns a single similarity metric from multiple kernels, combining pixel and region interactions with appearance features, and then uses a conditional random field to incorporate object level interactions. We perform experiments on two challenging image databases: MSRC and PASCAL VOC 2007. Experimental results show that our model outperforms current state-of-the-art contextual frameworks and reveals individual contributions for each contextual interaction level, as well as the importance of each type of feature in object localization. |
| 10:50 | Dennis Park | UC Irvine | Multiresolution models for object detection | Dennis Park, Deva Ramanan, and Charless Fowlkes | Most current approaches to recognition aim to be scale-invariant. However, the cues available for recognizing a 300 pixel tall object are qualitatively different from those for recognizing a 3 pixel tall object. We argue that for sensors with finite resolution, one should instead use scale-variant, or multiresolution representations that adapt in complexity to the size of a putative detection window. We describe a multiresolution model that acts as a deformable part-based model when scoring large instances and a rigid template when scoring small instances. We also examine the interplay of resolution and context, and demonstrate that context is most helpful for detecting low-resolution instances when local models are limited in discriminative power. We demonstrate impressive results on the Caltech Pedestrian benchmark, which contains object instances at a wide range of scales. Whereas recent state-of-the-art methods demonstrate missed detection rates of 86%-37% at 1 false positive per image, our multiresolution model reduces the rate to 29%. |
| 11:10 | Ricky Sethi | UCLA and UCR | The Human Action Image and its Application to Motion Recognition | Ricky J. Sethi, Amit K. Roy-Chowdhury | Recognizing a person's motion is intuitive for humans but represents a challenging problem in machine vision. In this paper, we present a multi-disciplinary framework for recognizing human actions. We develop a novel descriptor, the Human Action Image (HAI), a physically significant, compact representation for the motion of a person, which we derive from Hamilton's Action. We prove the additivity of Hamilton's Action in order to formulate the HAI and then embed the HAI as the Motion Energy Pathway of the Neurobiological model of motion recognition. The Form Pathway is modelled using existing low-level feature descriptors based on shape and appearance. Finally, we propose a Weighted Integration (WI) methodology to combine the two pathways via statistical Hypothesis Testing using the bootstrap to do the final recognition. Experimental validation of the theory is provided on the well-known Weizmann and USF Gait datasets. |
| 11:30 | Piotr Dollar | Caltech | The Fastest Pedestrian Detector in the West | Piotr Dollar, Serge Belongie, Pietro Perona | We demonstrate a multiscale pedestrian detector operating in near real time (~6 fps on 640x480 images) with state-of-the-art detection performance. The computational bottleneck of many modern detectors is the construction of an image pyramid, typically sampled at 8-16 scales per octave, and associated feature computations at each scale. We propose a technique to avoid constructing such a finely sampled image pyramid without sacrificing performance: our key insight is that for a broad family of features, including gradient histograms, the feature responses computed at a single scale can be used to approximate feature responses at nearby scales. The approximation is accurate within an entire scale octave. This allows us to decouple the sampling of the image pyramid from the sampling of detection scales. Overall, our approximation yields a speedup of 10-100 times over competing methods with only a minor loss in detection accuracy of about 1-2% on the Caltech Pedestrian dataset across a wide range of evaluation settings. The results are confirmed on three additional datasets (INRIA, ETH, and TUD-Brussels) where our method always scores within a few percent of the state-of-the-art while being 1-2 orders of magnitude faster. The approach is general and should be widely applicable. |
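The scale-approximation trick in the abstract above can be illustrated with a small sketch: compute a channel (here, gradient magnitude) once at full resolution, fit a power-law exponent empirically from one scale pair, and reuse it to approximate the channel at another scale instead of recomputing it. The image, channel, and fitting procedure below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np
from scipy import ndimage

def grad_mag(img):
    """Gradient-magnitude channel of a grayscale image."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.hypot(gx, gy)

def channel_at_scale(img, s):
    """Slow path: recompute the channel on the image downsampled by factor s."""
    return grad_mag(ndimage.zoom(img, 1.0 / s, order=1))

def approx_channel_at_scale(base_channel, s, lam):
    """Fast path: resample the channel computed once at full resolution and
    apply a power-law correction s**lam, where lam is fit empirically."""
    return ndimage.zoom(base_channel, 1.0 / s, order=1) * (s ** lam)

# Illustration: fit the exponent from one scale pair, then reuse it at another scale.
rng = np.random.default_rng(0)
img = ndimage.gaussian_filter(rng.random((240, 320)), 2.0)  # stand-in image
base = grad_mag(img)

s_fit = 2.0
lam = np.log(channel_at_scale(img, s_fit).mean() / base.mean()) / np.log(s_fit)

s_test = 3.0
exact = channel_at_scale(img, s_test).mean()
approx = approx_channel_at_scale(base, s_test, lam).mean()
print(f"lam={lam:.2f}  exact mean={exact:.4f}  approx mean={approx:.4f}")
```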
| 11:50 | Kris Kitani | UCSD | Learning Action Categories for First Person Vision | Kris Kitani | This work explores the use of ego-motion features obtained from a head-mounted camera to learn ego-centric action categories without supervision. We show that a non-parametric Bayesian mixture model can be used to efficiently learn action categories from an unlabeled continuous video database. |
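A rough sketch of the kind of unsupervised, nonparametric clustering the abstract describes, using scikit-learn's Dirichlet-process Gaussian mixture as a stand-in; the ego-motion features, dimensions, and parameter choices are assumptions, not the talk's actual pipeline.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical ego-motion features: one row per frame, e.g. a small histogram
# of optical-flow directions/magnitudes (dimensions here are made up).
rng = np.random.default_rng(0)
ego_motion_feats = rng.random((500, 16))

# A Dirichlet-process Gaussian mixture lets the data decide how many
# components stay active, so the number of action categories is not fixed.
dpgmm = BayesianGaussianMixture(
    n_components=20,                  # upper bound on discovered categories
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
    random_state=0,
)
frame_labels = dpgmm.fit_predict(ego_motion_feats)
print("active categories:", np.unique(frame_labels).size)
```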
| 12:10 | Lunch Break | | | | |
| 1:30 | Yoav Schechner | Caltech | Audio-Visual Association: Look at Sparse Events | Zohar Barzilay and Yoav Y. Schechner | In complex scenes, a camcorder having a single microphone captures several audio-associated visual objects (AVOs), i.e., moving visual objects that emit sounds. At the same time, the camera may also view silent objects, and the microphone may sense sounds unrelated to the scene in view. We seek to: spatially localize independent AVOs; isolate the audio component corresponding to each; and ignore the other sources. Achieving this using just a single microphone is challenging. We describe computational approaches to these problems. They are based on correlating events that are sparse in both auditory and visual domains. In particular, audio onsets and instances of visual high spatial acceleration are temporally sparse. However, their coincidence helps establish audio-visual correspondence continuously throughout videos. In addition, visual spatial sparsity of AVOs is a strong cue for localization. |
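A toy illustration of the sparse-coincidence cue: detect audio onsets and visual motion spikes as sparse events, then score a candidate object by how often its motion events coincide with audio onsets. All signals, thresholds, and names below are synthetic assumptions, not the authors' method.

```python
import numpy as np

def event_times(signal, thresh):
    """Indices where the positive temporal derivative of a 1-D signal
    exceeds thresh (a crude onset/peak detector for illustration)."""
    return np.where(np.diff(signal) > thresh)[0]

def coincidence_score(audio_events, visual_events, tol=2):
    """Fraction of visual events that fall within tol samples of some
    audio onset; a higher score suggests the object emits the sound."""
    if len(visual_events) == 0:
        return 0.0
    hits = sum(np.any(np.abs(audio_events - v) <= tol) for v in visual_events)
    return hits / len(visual_events)

# Synthetic data: an object whose motion spikes coincide with audio onsets
# should score higher than an unrelated, silent object.
rng = np.random.default_rng(0)
T = 400
onsets = rng.choice(T - 10, size=12, replace=False)
audio_env = np.zeros(T)
audio_env[onsets] = 1.0
speed_a = np.zeros(T)
speed_a[onsets + 1] = 1.0                       # moves when sound occurs
speed_b = np.zeros(T)
speed_b[rng.choice(T - 10, 12)] = 1.0           # unrelated motion

ae = event_times(audio_env, 0.5)
print("AVO candidate A:", coincidence_score(ae, event_times(speed_a, 0.5)))
print("AVO candidate B:", coincidence_score(ae, event_times(speed_b, 0.5)))
```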
| 1:50 | Luis Goncalves | EvoRetail, MetaModal | Human Vision and Real Intelligence - Seeing with Sound | Luis Goncalves | I will describe the "Seeing-with-Sound" project, where we have built a prototype sensory substitution device that transforms images into sound. With training, blind participants are able to locate, track, walk to, and grab objects. The challenges, obstacles, promise, and opportunities of this neuroplasticity-inspired field will be discussed. |
| 2:10 | Alper Ayvaci | UC Los Angeles | Occlusion Detection and Motion Estimation with Occlusions | Alper Ayvaci, Michalis Raptis, Stefano Soatto | We tackle the problem of simultaneously detecting occlusions and estimating optical flow. We show that, under standard assumptions of Lambertian reflection and static illumination, the task can be posed as a convex optimization problem. Therefore, the solution, computed using efficient algorithms, is guaranteed to be unique and globally optimal, for any number of independently moving objects, and any number of occlusion layers. We test the proposed algorithm on benchmark datasets, expanded to enable evaluation of occlusion detection performance, in addition to motion estimation. We also discuss the shortcomings and limitations of our approach. |
| 2:30 | Julian Yarkony | UCI | Planar Cycle Covering Graphs | Julian Yarkony, Charless Fowlkes, Alexander Ihler | We describe a new variational lower bound on the minimum energy configuration of a planar binary MRF. Our method is based on adding auxiliary nodes to every face of a planar embedding of the graph in order to capture the effect of unary potentials. A ground state of the resulting approximation can be computed efficiently by reduction to minimum-weight perfect matching. We show that optimization of variational parameters achieves as tight a bound as dual-decomposition approaches that use the set of all cycles or all outer-planar subproblems. We demonstrate that our variational optimization converges quickly and provides high-quality solutions to hard combinatorial problems. |
| 2:50 | Piotr Slomka | Cedars-Sinai Medical Center | Computer analysis of modern cardiac imaging data | Piotr Slomka | The goal of our research is to automate quantitative analysis of complex cardiac imaging data. The images are obtained by Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and Single Photon Emission Computed Tomography (SPECT) modalities. New advances in these technologies allow 3D acquisition of the beating heart with high temporal and spatial resolution. Our work involves segmentation of the heart, detection and analysis of lesions in the coronary arteries and in the myocardium (heart muscle), automatic registration of multimodality data, inter-scan comparison of image changes, and quantification of physiologically important parameters. We have demonstrated that our methods result in more reproducible testing, and in some cases, a more accurate diagnosis than that provided by a visual analysis performed by experienced readers. I will provide a brief overview of the cardiac imaging methods, our latest results, and computing challenges associated with automated analysis of cardiac data. |
| 3:10 | Tali Treibitz | UCSD | Resolution Loss without Blur: Recovery Limits in Pointwise Degradation | Tali Treibitz and Yoav Schechner | Pointwise image formation models appear in a variety of photography and computer vision problems, such as specular/diffuse reflection, irradiance falloff from point sources, attenuation and veiling in haze, direct/indirect illumination, dirty windows, semireflections, vignetting and more. An expanding array of methods has been devised to handle such problems. However, what is the recovery limit? Are there object features that cannot be effectively recovered under the pointwise degradation, despite the best efforts made by the recovery method? Can we quantitatively assess the recoverability of an object of a certain size and contrast? If some objects are not recovered, is there a point in trying to develop a new method to salvage them, or is their loss fundamental? We derive bounds to recovery from pointwise degradation, even if the parameters of an algorithm are perfectly set. The analysis uses a physical model for the acquired signal and noise, and also accounts for potential post-acquisition noise filtering. The analysis yields an effective cutoff frequency, which is induced by noise, despite there being no optical blur in the imaging model. The talk is based on work from ICCP'09 and CVPR'09. |
| 3:30 | Coffee Break | | | | |
| 4:00 | Marco Andreetto | California Institute of Technology | Unsupervised Learning of Categorical Segments in Image Collections | Marco Andreetto, Lihi Zelnik-Manor, Pietro Perona | Which one comes first: segmentation or recognition? We propose a unified framework for carrying out the two simultaneously and without supervision. The framework combines a flexible probabilistic model, for representing the shape and appearance of each segment, with the popular "bag of visual words" model for recognition. If applied to a collection of images, our framework can simultaneously discover the segments of each image, and the correspondence between such segments, without supervision. Such recurring segments may be thought of as the "parts" of corresponding objects that appear multiple times in the image collection. Thus, the model may be used for learning new categories, detecting/classifying objects, and segmenting images, without using expensive human annotation. |
| 4:20 | Mohsen Hejrati | University of California at Irvine | Every Picture Tells a Story: Generating Sentences from Images | Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth | Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche. |
| 4:40 | Steven Branson | UCSD | Visual Recognition with Humans in the Loop | Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie | We present an interactive, hybrid human-computer method for object classification. The method applies to classes of objects that are recognizable by people with appropriate expertise (e.g., animal species or airplane model), but not (in general) by people without such expertise. It can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate our methods on Birds-200, a difficult dataset of 200 tightly-related bird species, and on the Animals With Attributes dataset. Our results demonstrate that incorporating user input drives up recognition accuracy to levels that are good enough for practical applications, while at the same time, computer vision reduces the amount of human interaction required. |
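A minimal sketch of the greedy question-selection step in a visual 20-questions game: ask the attribute question with the highest expected information gain under the current class posterior. The response probabilities and class counts below are made-up placeholders; the actual system also incorporates computer-vision scores and models of noisy user answers.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_question(class_posterior, p_yes_given_class):
    """Return the index of the binary attribute question with the highest
    expected information gain. p_yes_given_class[q, c] is the (assumed known,
    strictly between 0 and 1) probability of a 'yes' answer to question q
    when the true class is c."""
    h0 = entropy(class_posterior)
    gains = []
    for p_yes_c in p_yes_given_class:
        p_yes = float(p_yes_c @ class_posterior)
        post_yes = p_yes_c * class_posterior / p_yes
        post_no = (1 - p_yes_c) * class_posterior / (1 - p_yes)
        gains.append(h0 - (p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no)))
    return int(np.argmax(gains))

# Usage with made-up numbers: 200 classes, 40 candidate attribute questions.
rng = np.random.default_rng(0)
prior = np.full(200, 1 / 200)
attr_probs = rng.uniform(0.05, 0.95, size=(40, 200))
print("ask question", select_question(prior, attr_probs))
```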
| 4:10 | Carl Vondrick | UC Irvine | Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces | Carl Vondrick, Deva Ramanan, Donald Patterson | Accurately annotating entities in video is labor intensive and expensive. As the quantity of online video grows, traditional solutions to this task are unable to scale to meet the needs of researchers with limited budgets. Current practice provides a temporary solution by paying dedicated workers to label a fraction of the total frames and otherwise settling for linear interpolation. As budgets and scale require sparser key frames, the assumption of linearity fails and labels become inaccurate. To address this problem we have created a public framework for dividing the work of labeling video data into micro-tasks that can be completed by huge labor pools available through crowdsourced marketplaces. By extracting pixel-based features from manually labeled entities, we are able to leverage more sophisticated interpolation between key frames to maximize performance given a budget. Finally, by validating the power of our framework on difficult, real-world data sets we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling. |
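For reference, the linear-interpolation baseline that the abstract argues breaks down with sparse key frames amounts to blending bounding boxes between labeled frames; a minimal sketch, with an illustrative box format and frame indices:

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float

def interpolate_track(keyframes):
    """Densify a track by linearly blending boxes between labeled key frames.
    keyframes maps frame index -> Box; frame indices are illustrative."""
    frames = sorted(keyframes)
    dense = {}
    for f0, f1 in zip(frames, frames[1:]):
        b0, b1 = keyframes[f0], keyframes[f1]
        for t in range(f0, f1 + 1):
            a = (t - f0) / (f1 - f0)
            dense[t] = Box(
                (1 - a) * b0.x + a * b1.x,
                (1 - a) * b0.y + a * b1.y,
                (1 - a) * b0.w + a * b1.w,
                (1 - a) * b0.h + a * b1.h,
            )
    return dense

# Two key frames 30 frames apart; the box at frame 15 is the midpoint blend.
track = interpolate_track({0: Box(10, 20, 50, 80), 30: Box(40, 25, 55, 85)})
print(track[15])
```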
| 5:30 | Peter Welinder | Caltech | The Multidimensional Wisdom of Crowds | Peter Welinder, Steve Branson, Serge Belongie, Pietro Perona | Distributing labeling tasks among hundreds or thousands of annotators is an increasingly important method for annotating large datasets. We present a model for the annotation process that can be used to estimate annotator skill, bias and data difficulty. Both annotator skill and data difficulty are modeled as multidimensional quantities, which allows the model to discover and represent groups of annotators that have different sets of skills and knowledge, and data that differ qualitatively. Focusing on binary image labels, we demonstrate that the model predicts ground truth labels on both synthetic and real data better than the current state-of-the-art methods. Furthermore, we show that the model is able to discriminate between different groups of annotators based only on the labels they provide. |
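A simplified sketch of label aggregation with per-annotator skill, using a one-dimensional Dawid-Skene-style EM as a stand-in for the talk's richer multidimensional model; the synthetic data and numerical guards are assumptions made for illustration.

```python
import numpy as np

def em_binary_labels(labels, n_iter=50):
    """Aggregate binary labels from many annotators with a simple EM
    (Dawid-Skene style). This is a deliberately one-dimensional stand-in
    for a multidimensional skill/difficulty model. labels[i, j] in {0, 1}
    is annotator j's label for item i (no missing entries)."""
    z = labels.mean(axis=1)                        # soft guess of the true label
    for _ in range(n_iter):
        # M-step: per-annotator sensitivity/specificity and the class prior.
        alpha = (z[:, None] * labels).sum(0) / max(z.sum(), 1e-9)
        beta = ((1 - z)[:, None] * (1 - labels)).sum(0) / max((1 - z).sum(), 1e-9)
        prior = float(np.clip(z.mean(), 1e-6, 1 - 1e-6))
        # E-step: posterior probability that each item's true label is 1.
        log_p1 = np.log(prior) + (labels * np.log(alpha + 1e-9)
                                  + (1 - labels) * np.log(1 - alpha + 1e-9)).sum(1)
        log_p0 = np.log(1 - prior) + ((1 - labels) * np.log(beta + 1e-9)
                                      + labels * np.log(1 - beta + 1e-9)).sum(1)
        z = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
    return z, alpha, beta

# Synthetic check: 200 items, 8 annotators whose reliability varies.
rng = np.random.default_rng(0)
truth = rng.random(200) < 0.5
skill = rng.uniform(0.6, 0.95, size=8)
votes = (rng.random((200, 8)) < np.where(truth[:, None], skill, 1 - skill)).astype(float)
z, alpha, beta = em_binary_labels(votes)
print("label accuracy:", float(((z > 0.5) == truth).mean()))
```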