Prompt-guided and multimodal landscape scenicness assessments with vision-language models

Alex Levering; Diego Marcos; Nathan Jacobs; Devis Tuia

doi:10.1371/journal.pone.0307083

PLoS One. 2024 Sep 30;19(9):e0307083. doi: 10.1371/journal.pone.0307083. eCollection 2024.

Authors

Alex Levering¹ ², Diego Marcos³, Nathan Jacobs⁴, Devis Tuia⁵

Affiliations

¹ Laboratory of Geo-Information Science and Remote Sensing, Wageningen University, Wageningen, the Netherlands.
² Instituut voor Milieuvraagstukken, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.
³ Inria, Université de Montpellier, Montpellier, France.
⁴ McKelvey School of Engineering, Washington University in St. Louis, St. Louis, MO, United States of America.
⁵ Ecole Polytechnique Fédérale de Lausanne, Environmental Computational Science and Earth Observation Laboratory, Sion, Switzerland.

Abstract

Recent advances in deep learning and Vision-Language Models (VLMs) have enabled efficient transfer to downstream tasks even when little labelled training data is available, and allow text to be compared directly with image content. These properties of VLMs open new opportunities for the annotation and analysis of images. We test the potential of VLMs for predicting landscape scenicness, i.e., the aesthetic quality of a landscape, using zero- and few-shot methods. We experiment with few-shot learning by fine-tuning a single linear layer on a pre-trained VLM representation. We find that a model fitted to just a few hundred samples performs favourably compared to a model trained on hundreds of thousands of examples in a fully supervised way. We also explore the zero-shot prediction potential of contrastive prompting using positive and negative landscape aesthetic concepts. Our results show that this method outperforms a linear probe with few-shot learning when using a small number of samples to tune the prompt configuration. We introduce Landscape Prompt Ensembling (LPE), an annotation method for acquiring landscape scenicness ratings through rated text descriptions without needing an image dataset during annotation. We demonstrate that LPE can provide landscape scenicness assessments that are concordant with a dataset of image ratings. The success of zero- and few-shot methods, combined with their ability to use text-based annotations, highlights the potential for VLMs to provide efficient landscape scenicness assessments with greater flexibility.
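
The contrastive prompting idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes image and prompt embeddings have already been produced by some VLM encoder, and the function names, the softmax formulation, and the `temperature` value are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def contrastive_score(image_emb, pos_embs, neg_embs, temperature=0.01):
    """Return a score in [0, 1] for one image.

    image_emb: 1-D embedding of the image (assumed to come from a VLM).
    pos_embs / neg_embs: 2-D arrays with one row per positive or negative
    text prompt embedding (also assumed to come from the same VLM).
    The score is the softmax probability mass assigned to the positive
    prompts. The temperature is an illustrative assumption.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    prompts = l2_normalize(np.vstack([pos_embs, neg_embs]).astype(float))

    # Cosine similarity between the image and every prompt, scaled.
    logits = prompts @ img / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability

    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[: len(pos_embs)].sum())
```

Under these assumptions, an image whose embedding lies closer to the positive prompts than to the negative ones receives a score near 1, and the reverse gives a score near 0.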

Copyright: © 2024 Levering et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Deep Learning*
Esthetics
Humans
Language

Grants and funding

The author(s) received no specific funding for this work.