Multimodal Hierarchical Models for Visually Grounded Concept Learning from Cartoon Videos

Kyung-Min Kim, Jung-Woo Ha, Beom-Jin Lee, and Byoung-Tak Zhang


According to the grounded theory, human conceptual systems are grounded in perception, action, and internal states, including external environments. Specifically, concepts are grounded in modality-specific regions of the brain. Based on this theory, we propose a hierarchical model for learning visually grounded concepts from cartoon videos, the multimodal concept hierarchy (MuCH). MuCH consists of two layers: a conceptual layer and a modality-dependent layer. The conceptual layer includes the variables for the concepts involved in the video stories, and the modality-dependent layer contains a large population of sparse codes encoding the associations between visual and textual information. This structure allows each concept to be represented as a probability distribution over visual and textual variables. While observing new videos, MuCH learns generalized concepts from many concept instances, each represented by scene image patches and subtitle words, using Monte Carlo simulation, and thus robustly traces concept drift as the stories progress. This process is formulated as sequential Bayesian inference. Given scene images, the model uses the learned concepts to generate subtitles that reflect the scene contents. We evaluate our model on 268 minutes of the cartoon video series 'Pororo' and report results on concept-based scene-to-subtitle generation, comparing with other multimodal learning models. Experimental results show that, given the same text query, our model generates more accurate and more character-specific subtitles than the other models. Furthermore, we discuss applications of the proposed model and some technical issues in concept learning from large-scale video data for practical use.
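
To make the idea of sequential Bayesian inference under concept drift concrete, the following is a minimal sketch and not the MuCH model itself. It assumes a single concept, the textual modality only, and a conjugate Dirichlet-multinomial update in which the posterior after one episode serves as the prior for the next. The class name `ConceptWordModel`, the `decay` forgetting factor, and the toy vocabulary are all hypothetical choices made for illustration; the abstract does not specify them, and the discounting of old counts is a simple heuristic standing in for the Monte Carlo procedure mentioned above.

```python
class ConceptWordModel:
    """Toy word distribution for one concept, updated episode by episode.

    Hypothetical illustration: Dirichlet-multinomial updating with an
    exponential forgetting factor, so recent subtitles weigh more.
    """

    def __init__(self, vocab, prior=1.0, decay=0.5):
        self.vocab = list(vocab)
        self.prior = prior      # symmetric Dirichlet pseudo-count (assumed)
        self.decay = decay      # forgetting factor in (0, 1] (assumed)
        self.counts = {w: 0.0 for w in self.vocab}

    def update(self, words):
        """Fold one episode's subtitle words into the running counts."""
        # Discount old evidence so the distribution can follow drift.
        for w in self.vocab:
            self.counts[w] *= self.decay
        # Add the new observations; out-of-vocabulary words are ignored.
        for w in words:
            if w in self.counts:
                self.counts[w] += 1.0

    def posterior_mean(self):
        """Posterior mean probability of each word given the evidence so far."""
        total = sum(self.counts.values()) + self.prior * len(self.vocab)
        return {w: (self.counts[w] + self.prior) / total for w in self.vocab}


model = ConceptWordModel(vocab=["sled", "snow", "fish"])

# Early episodes: the concept co-occurs mostly with "sled".
model.update(["sled", "snow", "sled"])
early = model.posterior_mean()

# Later episodes: the same concept now co-occurs with "fish".
for _ in range(3):
    model.update(["fish", "fish"])
late = model.posterior_mean()
```

With `decay=1.0` this reduces to ordinary conjugate updating with no forgetting; smaller values let the estimated distribution shift faster as the story changes, at the cost of noisier estimates.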