During The ChatGPT Boom, SenseTime Releases A Large Multimodal Model


The new model has been available since its release date through OpenGVLab, a general-vision open-source platform in which SenseTime participates.

Amid the latest AI wave set off by ChatGPT, Chinese artificial intelligence pioneer SenseTime unveiled what it called the “biggest multimodal open-source large-language model” on March 14. According to SenseTime, the model, known as Intern 2.5, is the largest and most accurate on ImageNet among the world’s open-source models, and it exceeds 65.0 mAP on the object recognition benchmark dataset COCO.

Intern 2.5’s cross-modal task processing capability can provide efficient and precise perception and comprehension support for robotics and automated driving. SenseTime, Shanghai Artificial Intelligence Laboratory, Tsinghua University, the Chinese University of Hong Kong, and Shanghai Jiao Tong University jointly developed the model.

With the rapid growth of applications, conventional computer vision has proven incapable of handling many of the specific tasks that arise in the real world. Intern 2.5 is a higher-level visual system with universal scene perception and complex problem-solving abilities. Because tasks are defined in text, the work requirements of different scenarios can be specified flexibly. The model shows advanced perception and complex problem-solving skills in general scenarios including image description, visual question-answering, visual reasoning, and text recognition. Given visual images and task prompts, it can deliver instructions or responses.

In automated driving, for instance, it can significantly enhance perception and comprehension, help accurately determine the status of traffic lights, road signs, and other information, and supply useful input for a vehicle’s decision-making and overall planning.

It can also generate content with artificial intelligence. Using a diffusion-model generation algorithm, it can produce high-quality, realistic images based on users’ requirements. The technology performs exceptionally well on cross-modal image and text tasks thanks to its efficient integration of vision, language, and multitask modelling capabilities.

Using only publicly available data, the model achieves 90.1% accuracy on ImageNet, a large visual database created for research on visual object recognition software. Apart from models from Google and Microsoft, it is the only model to exceed 90% accuracy.

