This page covers vision and language tasks other than visual question answering (which deals only with question answering) and grounded language learning (which includes a learning component in the task).
See also this bibliography.
See also Awesome Vision & Language Pretraining Papers.