Learning top-down scene context for visual attention modeling in natural images
Abstract
Top-down image semantics play a major role in predicting where people look in images. Current state-of-the-art approaches to modeling human visual attention incorporate high-level object detections, which capture top-down image semantics, as a separate channel alongside bottom-up saliency channels. However, multiple objects in a scene compete to attract our attention, and this interaction is ignored by current models. To overcome this limitation, we propose a novel object-context-based visual attention model that incorporates the co-occurrence of multiple objects in a scene. The proposed regression-based algorithm uses several high-level object detectors (faces, people, cars, and text) and learns how their joint presence affects visual attention. Experimental results on the MIT eye-tracking dataset demonstrate that the proposed method outperforms other state-of-the-art visual attention models.
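To illustrate the core idea, the sketch below shows one plausible way co-occurrence could enter a regression-based attention model: per-pixel detector response maps are augmented with pairwise interaction terms before fitting a regressor to fixation data. This is not the authors' implementation; the channel names, map shapes, the use of pairwise products, and the ridge regressor are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical per-pixel channel maps for one image (H x W):
# a bottom-up saliency channel plus high-level detector responses.
H, W = 48, 64
rng = np.random.default_rng(0)
channels = {
    "saliency": rng.random((H, W)),  # bottom-up channel
    "face":     rng.random((H, W)),  # object-detector response maps
    "person":   rng.random((H, W)),
    "car":      rng.random((H, W)),
    "text":     rng.random((H, W)),
}

def features(chans):
    """Stack single-channel responses plus pairwise products, so the
    regressor can weight object co-occurrence, not just each detector
    in isolation."""
    names = sorted(chans)
    feats = [chans[n].ravel() for n in names]
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            feats.append((chans[names[i]] * chans[names[j]]).ravel())
    return np.stack(feats, axis=1)  # shape (H*W, n_features)

X = features(channels)
y = rng.random(H * W)  # stand-in for a ground-truth fixation density map

model = Ridge(alpha=1.0).fit(X, y)     # learn channel + interaction weights
pred = model.predict(X).reshape(H, W)  # predicted attention map
```

Under this formulation, a large positive weight on an interaction term (e.g., face x text) would indicate that the joint presence of those objects draws attention beyond what either detector predicts alone, which is the kind of competition effect the abstract argues existing models miss.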