
From recognition to understanding: enriching visual models through multi-modal semantic integration

S. Sharifi Noorian

Research output: Thesis › Dissertation (TU Delft)


Abstract

This thesis addresses the semantic gap in visual understanding by equipping visual models with semantic reasoning capabilities, so that they can handle tasks such as image captioning, question answering, and scene understanding. The main focus is on integrating visual and textual data, leveraging insights from human cognition, and developing a robust multi-modal foundation model. The research starts by exploring multi-modal data integration to enhance semantic and contextual reasoning in fine-grained scene recognition. The proposed multi-modal models, which combine visual and textual inputs, outperform traditional models that rely on visual input alone, particularly in complex urban environments where visual ambiguities are common. This result underscores the significance of semantic enrichment through multi-modal integration, which helps resolve visual ambiguities and improve scene understanding.
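The multi-modal integration described above can be sketched, in highly simplified form, as "late fusion": a visual embedding and a textual embedding are concatenated into a joint representation that a classifier then scores. The function names, dimensions, and weights below are purely illustrative assumptions, not the thesis's actual architecture.

```python
# Hypothetical late-fusion sketch: combine a visual embedding and a
# textual embedding into one joint feature vector, then apply a toy
# linear classifier head. All names and values are illustrative.

def fuse(visual_embedding, text_embedding):
    """Concatenate the two modality embeddings into a joint vector."""
    return visual_embedding + text_embedding  # list concatenation

def score(joint, weights, bias=0.0):
    """Linear score over the fused features (stand-in for a trained head)."""
    return sum(f * w for f, w in zip(joint, weights)) + bias

# Toy example: a 3-d visual embedding fused with a 2-d textual embedding.
visual = [0.2, 0.5, 0.1]
text = [0.9, 0.3]
joint = fuse(visual, text)          # 5-d joint representation
s = score(joint, [1.0, 1.0, 1.0, 1.0, 1.0])
```

In practice, the fused representation would feed a learned classifier rather than fixed weights; the point of the sketch is only that the textual signal adds features the visual signal alone lacks, which is how such models can disambiguate visually similar scenes.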
Original language: English
Qualification: Doctor of Philosophy
Awarding institution:
  • Delft University of Technology
Supervisors/Advisors:
  • Houben, G.J.P.M., Promotor
  • Bozzon, A., Promotor
  • Yang, J., Copromotor
Award date: 10 Feb 2025
DOIs
Publication status: Published - 2025

