To facilitate multi-modal interactions, the system combines 22 different visual foundation models (VFMs) with OpenAI’s ChatGPT.
Microsoft Research recently released Visual ChatGPT, a chatbot system that can generate and edit images in response to users’ text commands; a paper posted on arXiv describes the system. Users interact with the bot by sending text messages or uploading images, and the bot can create new images from scratch given a text prompt or modify existing images from the conversation history. The bot’s core component is the Prompt Manager, which converts user input into a “chain of thought” prompt that helps ChatGPT decide whether a VFM tool is required to complete an image task.
Although ChatGPT and other large language models (LLMs) have demonstrated impressive natural-language processing abilities, they are trained to handle only text input. Rather than training a new model to accommodate multimodal input, the Microsoft team created a Prompt Manager that generates text inputs to ChatGPT whose outputs can invoke VFMs such as CLIP or Stable Diffusion to carry out computer-vision tasks.
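The key idea is that the tool-calling behavior lives entirely in the prompt: the prefix enumerates the available tools so that ChatGPT's text output can name one. The sketch below illustrates that idea with an invented prefix; the actual Visual ChatGPT prefix wording and tool names differ, and the descriptions here are purely illustrative.

```python
# Illustrative tool registry: names and descriptions are assumptions,
# not the actual Visual ChatGPT tool set.
TOOL_DESCRIPTIONS = {
    "Generate Image From Text": "creates an image from a text description",
    "Get Image Description": "returns a caption for an image filename",
}

def build_prefix(tools: dict) -> str:
    # Enumerate the tools in the prompt prefix so the LLM can pick one
    # by name in its text output.
    listing = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return (
        "You have access to the following tools:\n"
        f"{listing}\n"
        "When answering, first state: 'Do I need to use a tool?' "
        "If yes, name the tool and give its input.\n"
    )

print(build_prefix(TOOL_DESCRIPTIONS))
```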
The Prompt Manager is built on a LangChain Agent, with the VFMs registered as LangChain agent Tools. The agent applies prompt prefixes and suffixes to the user’s prompt and conversation history, which includes image filenames, to determine whether a tool is needed.
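In LangChain terms, each VFM is wrapped as a Tool carrying a name, a natural-language description the agent uses to decide when the tool applies, and a function that does the work. The following is a simplified sketch of that registration pattern using plain-Python stand-ins rather than the real LangChain classes; the tool names, descriptions, and placeholder functions are illustrative, not the actual Visual ChatGPT tools.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    # Mirrors the shape of a LangChain Tool: the description tells the
    # LLM when the tool applies; func performs the actual VFM call.
    name: str
    description: str
    func: Callable[[str], str]

def generate_image(prompt: str) -> str:
    # Placeholder for a text-to-image VFM such as Stable Diffusion;
    # a real implementation would write an image file and return its name.
    return "image/generated_0001.png"

def caption_image(filename: str) -> str:
    # Placeholder for an image-captioning VFM.
    return f"a caption describing {filename}"

TOOLS = [
    Tool("Generate Image From Text",
         "Useful when you need to create an image from a text description. "
         "Input: a text prompt. Output: the generated image's filename.",
         generate_image),
    Tool("Get Image Description",
         "Useful when you need to know what an image contains. "
         "Input: an image filename. Output: a text caption.",
         caption_image),
]

def lookup(name: str) -> Tool:
    # The agent selects a tool by the name ChatGPT emits in its reply.
    return next(t for t in TOOLS if t.name == name)
```

The description strings matter as much as the code: they are what ChatGPT sees when deciding which tool, if any, matches the user's request.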
Additional language in the prefix guides ChatGPT in handling the user’s requested task: it asks, “Do I need to use a tool?”, and if so, outputs the name of the tool together with any required inputs, such as an image filename or a text description of the image to generate. The agent repeatedly runs VFM tools until no further tool is required, delivering any generated images to the chat along with the final text output.
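The resulting control flow is a loop: send the prompt to the LLM, parse the reply for a tool request, run the requested tool, append the result as an observation, and repeat until the model answers without asking for a tool. The sketch below shows that loop with a stubbed-out LLM; the "Do I need to use a tool?" / "Action:" reply format and all function names are assumptions for illustration, not the exact Visual ChatGPT wiring.

```python
def run_agent(llm, tools, user_prompt, history):
    # Iterate until the LLM replies that no tool is needed.
    prompt = user_prompt
    for _ in range(10):  # cap iterations defensively
        reply = llm(history + [prompt])
        if reply.startswith("Do I need to use a tool? No"):
            # Final answer: return the text after the decision line.
            return reply.split("\n", 1)[1]
        # Assumed reply format when a tool is needed:
        # "Do I need to use a tool? Yes\nAction: <tool>\nAction Input: <input>"
        fields = dict(l.split(": ", 1) for l in reply.splitlines() if ": " in l)
        observation = tools[fields["Action"]](fields["Action Input"])
        history.append(prompt)
        prompt = f"Observation: {observation}"
    return "stopped: too many tool calls"

# Stubbed LLM: requests one tool call, then finishes.
def fake_llm(messages):
    if any("Observation:" in m for m in messages):
        return "Do I need to use a tool? No\nHere is your image: image/cat.png"
    return ("Do I need to use a tool? Yes\n"
            "Action: Generate Image From Text\n"
            "Action Input: a cat wearing a hat")

tools = {"Generate Image From Text": lambda prompt: "image/cat.png"}
print(run_agent(fake_llm, tools, "draw a cat wearing a hat", []))
# → Here is your image: image/cat.png
```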