Johannes Burge, Charless Fowlkes, Martin Banks
Journal of Neuroscience, 2010, 30(21), 7269-7280.
Both grouping and figure-ground are thought to be important in reducing the visual complexity of a scene to a small number of cohesive, non-accidental units. This has clear implications for computer vision: algorithms for recognizing particular objects in natural, cluttered scenes cannot afford to test every possible subset of pixels against a stored appearance model. Extracting coherent image structures in a feed-forward fashion is necessary to prune and guide this combinatorial search.
However, it is not obvious how to translate the phenomenological, qualitative descriptions of cues found in the Gestalt and psychophysics literature into concrete algorithms that can cope with the dazzling diversity of real images. Furthermore, there is seldom an explanation of how multiple conflicting cues might be fused to yield a cohesive percept, or of why a given cue is used by the visual system.
For example, the visual phenomenon of figure-ground is easily demonstrated by Rubin's well-known face-vase illusion. One has the introspective sense that attending to particular convexities causes the percept to flip between two faces in profile and the silhouette of a vase. While psychologists have quantified the tendency to see convex parts as figural, the basic question "Why does convexity affect what we see?" is left unanswered.
Our work on the ecological statistics of perceptual organization provides a fresh outlook on these difficulties by connecting the literature on human vision to the statistics of the natural world. Although convex fragments of a shape can occur on either the figure or the ground side, the figure side is more often convex, an observation we have quantified with empirical measurements on a large collection of natural images annotated by human subjects. Similarly, we have shown that for horizontal contours the lower region is more often figure, consistent with the common experience of seeing the tops of foreground objects set against a distant landscape or the sky. In a parallel study of grouping cues, we have also systematically measured the relative information content of proximity and of brightness, color, and texture similarity in predicting which pixels in an image belong to the same surface or object.
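As a rough illustration of what such an empirical measurement can look like, the sketch below counts how often the side favored by a cue coincides with the side a human annotator marked as figure. It is not the procedure used in the paper: the `BoundarySample` record, its field names, and the `cue_agreement` helper are hypothetical, and the five samples are made-up toy values rather than measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundarySample:
    """One sampled boundary point (hypothetical record layout)."""
    convex_side: str  # side ("left" or "right") judged more convex
    lower_side: str   # side that lies lower in the image
    figure_side: str  # side a human annotator labeled as figure


def cue_agreement(samples, cue):
    """Fraction of samples where the side picked by `cue` is the figure side."""
    if not samples:
        raise ValueError("need at least one boundary sample")
    hits = sum(1 for s in samples if getattr(s, cue) == s.figure_side)
    return hits / len(samples)


# Toy values for illustration only; these are not measurements.
samples = [
    BoundarySample("left", "left", "left"),
    BoundarySample("right", "left", "right"),
    BoundarySample("left", "right", "left"),
    BoundarySample("right", "right", "right"),
    BoundarySample("left", "right", "right"),
]

convexity_rate = cue_agreement(samples, "convex_side")
lower_region_rate = cue_agreement(samples, "lower_side")
print(f"convexity agrees with figure label: {convexity_rate:.2f}")
print(f"lower region agrees with figure label: {lower_region_rate:.2f}")
```

An agreement rate near 0.5 would mean the cue carries no information about which side is figure, while a rate well above 0.5 on real annotated images is the kind of regularity described above.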
These statistical regularities explain why particular cues are used by the visual system. An organism that exploits convexity or lower region as a cue for figure-ground inference would have an obvious evolutionary advantage, more often correctly grasping nearby objects and navigating through gaps rather than colliding with obstacles.