Multi-Modal Model - Next Step towards Artificial General Intelligence
  • Reporter Kim San
  • Published 2023.06.15 08:47

▲ Selected 1024 × 1024 samples / Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents.

  Consider for a moment how humans perceive and interact with the world. Imagine how difficult it would be to make out what someone is saying not by listening to the audio but merely by reading their lips. This is a daunting task even for a human, let alone for a computer. There is a limited amount of information that can be encoded in the shape of one’s lips, far less than what is required to convey the full complexity of human language. Once visual and auditory information is combined, however, one can quite confidently decipher the message. What is ambiguous from the shape of the lips alone can be resolved by listening to the sound. Going beyond this rather simple example of audio-visual speech recognition, there is a plethora of real-world tasks that require knowledge across multiple modalities of input. The motivation behind multi-modal models is to leverage several different forms of data that encode potentially complementary pieces of information (hence the name multi-modal) to improve overall performance on a given task.
  The idea of multi-modality is further expanded by tasks that inherently involve two or more modalities of input. These tasks include, but are not restricted to, the vision-language problems that have recently been taking the world by storm. A highly relatable illustration of this is Google's Image Search, where the goal is to retrieve images corresponding to a given search prompt. This requires an understanding of both natural language and visual data, and the ability to draw correspondences between them. The search engine must model the general knowledge landscape of the world in such a way that if, for example, a user searches for a “green school bus,” it returns not just any bus but specifically a school bus, and not just any school bus but specifically a green one. Already, from this simple example, the model needs a categorical understanding of automobiles and colors.
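  In practice, this kind of text-to-image matching is often approximated by scoring candidate images against the text query in a shared embedding space. The snippet below is a minimal sketch of that idea using the open-source Hugging Face transformers library and a publicly released CLIP checkpoint; the checkpoint name, image files, and query string are illustrative assumptions, not a description of Google's actual system.

```python
# Minimal sketch: text-to-image retrieval via a shared embedding space.
# Assumes: pip install torch transformers pillow, plus a few local image files.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate pool; a real search engine would use an indexed corpus.
image_paths = ["bus_red.jpg", "school_bus_yellow.jpg", "school_bus_green.jpg"]
images = [Image.open(p) for p in image_paths]

query = "a green school bus"
inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of each candidate image to the query.
scores = outputs.logits_per_image.squeeze(-1)
best = scores.argmax().item()
print(f"Best match for '{query}': {image_paths[best]}")
```

In a real search engine, the image embeddings would be precomputed and indexed, so that only the text query needs to be embedded at search time.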
  A similar but even more mesmerizing example is the vision-language generative model, which creates high-resolution, photorealistic images that accurately depict a given text prompt. OpenAI’s DALL-E and Midjourney are the most widely known. These models are like image search engines in that they need cross-modal knowledge linking language and images, but unlike search engines, they generate completely new, unseen images from scratch, hence the name AI art. What is stunning about these generative models is their ability to produce plausible images from surrealistic prompts such as “a photo of flying elephants” or “a picture of an avocado armchair.” Certain prompt phrasings tend to produce better results, which has given rise to a new practice called prompt engineering. DALL-E and its variants have fundamentally changed the ways in which humans engage in creative work.
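  DALL-E and Midjourney are proprietary, but the same text-to-image workflow can be tried with open-source diffusion models. The following rough sketch uses the Hugging Face diffusers library with a Stable Diffusion checkpoint as a stand-in; the checkpoint name, prompt, and hardware assumption (a CUDA GPU) are illustrative.

```python
# Rough sketch: text-to-image generation with an open-source diffusion model,
# used here as a stand-in for proprietary systems such as DALL-E or Midjourney.
# Assumes: pip install torch diffusers transformers accelerate, and a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint choice
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of flying elephants"
image = pipe(prompt).images[0]  # run the denoising loop conditioned on the prompt
image.save("flying_elephants.png")
```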
  There are a number of downstream tasks, such as Visual Question Answering, Referring Image Segmentation, and Image Captioning, that require cross-modal interaction. Generally, these models encode text and images into high-dimensional representations using modality-specific encoders, after which the representations are fused (or projected) into a common latent space. The fusion can be done with various techniques, and researchers are actively developing more effective methods. Multi-modality is a big leap toward realizing Artificial General Intelligence. To achieve AGI, AI systems must be able to process information from multiple sources, much like humans do, and integrate it in meaningful ways. By enabling AI systems to process multiple types of data, multi-modal models by definition overcome one of the major limitations of current AI systems: the lack of generalization across domains and tasks.
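  To make the encode-then-fuse pattern concrete, here is a minimal PyTorch sketch of the general recipe described above: two modality-specific encoders, projections onto a common latent space, and a simple concatenation-based fusion. The module sizes, toy encoders, and classification head are made-up illustrations; real systems use large pretrained encoders and more elaborate fusion mechanisms such as cross-attention.

```python
# Minimal sketch of the "modality-specific encoders + fusion" pattern.
# Encoder choices, dimensions, and the concatenation fusion are illustrative.
import torch
import torch.nn as nn

class ToyMultiModalModel(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=256, image_dim=512, latent_dim=128):
        super().__init__()
        # Modality-specific encoders (real systems use pretrained transformers/CNNs).
        self.text_encoder = nn.Embedding(vocab_size, text_dim)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, image_dim),
        )
        # Projections onto a common latent space.
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.image_proj = nn.Linear(image_dim, latent_dim)
        # Simple fusion: concatenate the two latent vectors and mix them.
        self.fusion = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.ReLU(),
        )
        self.head = nn.Linear(latent_dim, 2)  # e.g. a toy VQA-style answer head

    def forward(self, token_ids, image):
        text_feat = self.text_encoder(token_ids).mean(dim=1)  # average over tokens
        image_feat = self.image_encoder(image)
        z_text = self.text_proj(text_feat)
        z_image = self.image_proj(image_feat)
        fused = self.fusion(torch.cat([z_text, z_image], dim=-1))
        return self.head(fused)

# Usage with random toy inputs.
model = ToyMultiModalModel()
tokens = torch.randint(0, 10000, (4, 16))  # batch of 4 "sentences", 16 tokens each
images = torch.randn(4, 3, 64, 64)         # batch of 4 RGB images
logits = model(tokens, images)
print(logits.shape)  # torch.Size([4, 2])
```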