
📝 New from FAIR: An Introduction to Vision-Language Modeling.
Vision-language models (VLMs) are an area of research with great potential to change how we interact with technology, but building these kinds of models comes with many challenges. Together with a set of collaborators across academia, we're...
LLaVA v1.5 is a new open-source multimodal model stepping onto the scene as a contender against GPT-4's multimodal capabilities. It uses a simple projection matrix to connect the pre-trained CLIP ViT-L/14 vision encoder with the Vicuna LLM, resulting in a robust model that can handle both images and text. The model is trained in two stages: first, only the projection matrix is updated to align visual features with the language model's embedding space; then the projection matrix and the LLM are fine-tuned end-to-end for visual instruction following.
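The recipe is simple enough to sketch in a few lines. Below is a minimal, illustrative sketch of that projection-based design using Hugging Face transformers; the class name, layer choices, and which parts train in which stage follow the description above, but this is a toy under those assumptions, not LLaVA's actual code.

```python
# Minimal sketch of a LLaVA-style design: CLIP vision features are projected
# into the LLM's embedding space and prepended to the text token embeddings.
# Model names and shapes are illustrative, not LLaVA's actual implementation.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

class TinyVLM(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",
                 llm_name="lmsys/vicuna-7b-v1.5"):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)   # frozen vision encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)    # language model
        # Stage 1 trains only this projection (feature alignment);
        # Stage 2 fine-tunes the projection together with the LLM.
        self.proj = nn.Linear(self.vision.config.hidden_size,
                              self.llm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        patch_feats = self.vision(pixel_values).last_hidden_state    # (B, N_patches, D_vis)
        image_tokens = self.proj(patch_feats)                        # (B, N_patches, D_llm)
        text_embeds = self.llm.get_input_embeddings()(input_ids)     # (B, T, D_llm)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)  # prepend image tokens
        return self.llm(inputs_embeds=inputs_embeds)
```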
This AI newsletter is all you need #68

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
Finds that LLMs augmented with LENS perform highly competitively with much bigger and more sophisticated systems, without any multimodal training whatsoever.
https://t.co/GxxW9iOW8z
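To make the LENS idea concrete: frozen, off-the-shelf vision models describe the image in plain text (captions, tags), and a frozen text-only LLM reasons over that description, so no multimodal training is needed. A rough sketch under those assumptions, with illustrative model choices rather than the ones used in the paper:

```python
# Rough sketch of a LENS-style pipeline: frozen vision models turn the image
# into natural-language descriptions, and a frozen, text-only LLM answers the
# question. The specific models below are stand-ins, not the paper's choices.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
tagger = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
llm = pipeline("text-generation", model="gpt2")  # stand-in for a stronger frozen LLM

def answer_visual_question(image_path: str, question: str, candidate_tags: list[str]) -> str:
    caption = captioner(image_path)[0]["generated_text"]
    tags = [r["label"] for r in tagger(image_path, candidate_labels=candidate_tags)[:3]]
    # The LLM only ever sees text about the image -- no multimodal training.
    prompt = (
        f"Image caption: {caption}\n"
        f"Image tags: {', '.join(tags)}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return llm(prompt, max_new_tokens=20)[0]["generated_text"]
```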