Tesla Transitions to Full Autopilot 🏎️
PLUS: Open-Source Multimodal Language Model, Stable LM 2 beats Microsoft Phi, One-click animation with Runway Gen-2
Today’s top AI Highlights:
01.AI Releases World’s First Open-Source 34B VLM
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Stable LM 2 1.6B Beats Microsoft Phi and Even Bigger Models
Tesla Transitions to Full Autopilot
Animate 5 Different Objects in One Image with GEN-2
& so much more!
Read time: 3 mins
Latest Developments 🌍
Open-Source Bilingual Multimodal Model
01.AI’s new vision-language model Yi-VL has garnered significant attention as the world’s first open-source 34B vision-language model. It outperforms all open-source VLMs, including Alibaba’s Qwen and Meta’s Emu, on the MMMU and CMMMU benchmarks, and closely trails GPT-4V. The model is capable of content comprehension, recognition, and multi-round conversations about images.
Key Highlights:
Yi-VL is equipped to handle complex interactions involving both text and images. It's particularly skilled in visual QA, and supports English and Chinese, including the ability to recognize text in images. Its high-resolution image comprehension at 448×448 pixels allows for detailed analysis of visual content.
Yi-VL’s structure is composed of three main parts. The Vision Transformer (ViT), based on the CLIP ViT-H/14 model, turns images into features the model can understand. The Projection Module then aligns these image features with the text feature space using a specialized two-layer network. The Yi LLM, available in 34B and 6B variants, processes and generates both English and Chinese text (a minimal sketch of how these parts connect follows this list).
Yi-VL has some limitations: the model can process only one image at a time, it sometimes struggles to generate accurate content for complex scenes with multiple objects, and it resizes input images to a fixed resolution (448×448), which can result in loss of detail for low-resolution images.
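To make the three-part description above concrete, here is a minimal, hypothetical PyTorch sketch of how the pieces connect. The layer sizes, activation, and token counts are assumptions for illustration only and are not Yi-VL’s actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only -- not Yi-VL's actual configuration.
VIT_DIM = 1280   # assumed hidden size of the CLIP ViT-H/14 vision encoder
LLM_DIM = 4096   # assumed hidden size of the Yi LLM

class ProjectionModule(nn.Module):
    """Two-layer network that maps ViT image features into the LLM's embedding space."""
    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),                    # activation choice is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.net(image_features)

# Wiring the three parts together (random tensors stand in for the real encoders):
image_patches = torch.randn(1, 256, VIT_DIM)                # ViT output: [batch, patch tokens, dim]
image_tokens = ProjectionModule(VIT_DIM, LLM_DIM)(image_patches)
text_tokens = torch.randn(1, 32, LLM_DIM)                   # embedded text prompt
llm_input = torch.cat([image_tokens, text_tokens], dim=1)   # sequence fed to the Yi LLM decoder
```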
Scaling Monocular Depth Estimation to New Heights
Depth Anything by TikTok is a simple yet powerful foundation model for robust monocular depth estimation. It shows exceptional zero-shot capabilities across six public datasets and real-world photos, demonstrating impressive generalization ability. When fine-tuned with metric depth information from the NYUv2 and KITTI datasets, it establishes new state-of-the-art benchmarks.
Key Highlights:
By developing a data engine for automatic annotation, the project has amassed a staggering dataset of approximately 62 million unlabeled images, vastly improving data coverage and reducing generalization error (a rough sketch of this self-training loop follows this list).
Utilizing advanced data augmentation tools, a more challenging optimization target is set, pushing the model to seek additional visual knowledge and develop stronger representations.
The incorporation of an auxiliary supervision mechanism allows the model to inherit rich semantic understanding from pre-trained encoders, enhancing its depth perception capabilities.
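A rough, hypothetical sketch of the self-training loop these highlights describe: a teacher depth model auto-annotates unlabeled images, and a student is trained on the combined data under strong augmentation. All function and class names here are placeholders, not the project’s actual API.

```python
import torch

def pseudo_label(teacher, unlabeled_loader, device="cuda"):
    """The 'data engine': use a teacher depth model to auto-annotate unlabeled images."""
    teacher.eval()
    batches = []
    with torch.no_grad():
        for images in unlabeled_loader:
            depth = teacher(images.to(device))        # predicted (pseudo) depth maps
            batches.append((images, depth.cpu()))
    return batches

def train_student(student, labeled_batches, pseudo_batches, strong_augment,
                  loss_fn, optimizer, device="cuda"):
    """Train the student on labeled + pseudo-labeled data.
    Strong augmentation on the inputs makes the optimization target harder,
    pushing the student toward more robust representations. An auxiliary
    feature-alignment loss against a frozen pre-trained encoder (the semantic
    supervision mentioned above) could be folded into loss_fn."""
    student.train()
    for images, depth in list(labeled_batches) + list(pseudo_batches):
        images, depth = images.to(device), depth.to(device)
        preds = student(strong_augment(images))
        loss = loss_fn(preds, depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```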
Stability AI Enters the Small LMs World 🫰
Stability AI has released Stable LM 2 1.6B, a new addition to its language model offerings. This 1.6-billion-parameter model stands out for its multilingual capabilities, covering English, Spanish, German, Italian, French, Portuguese, and Dutch. What sets it apart is its blend of compact size and efficiency: trained on approximately 2 trillion tokens, it is accessible to a wider range of developers in the generative AI space.
Key Highlights:
Stable LM 2 1.6B is distinguished by its multilingual training and balance between speed and performance, facilitated by recent algorithmic advancements. The model's reduced hardware requirements broaden its accessibility. It comes in both pre-trained and instruction-tuned versions, with a final checkpoint release aiding in fine-tuning.
The architecture of Stable LM 2 1.6B is a decoder-only transformer, mirroring the LLaMA model with specific modifications for enhanced performance. The training incorporated diverse datasets like Falcon RefinedWeb extract and The Pile, supplemented with multilingual data from CulturaX.
The model outperforms other small language models like Microsoft’s Phi-1.5 and Phi-2, TinyLlama 1.1B, and Falcon 1B on most benchmark tasks, punching above its weight for a model under 2 billion parameters. Its efficiency in multilingual tasks is notable, with significant gains on translated benchmarks such as the ARC Challenge and LAMBADA (see the snippet after this list for a quick way to try it).
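If you want to try the model, a minimal Hugging Face transformers snippet along these lines should work; the checkpoint name and the need for trust_remote_code are my assumptions, so check the official model card first.

```python
# Minimal text-generation sketch with Hugging Face transformers.
# The checkpoint name and trust_remote_code flag are assumptions --
# verify them against the Stable LM 2 1.6B model card before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-2-1_6b"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Small language models are useful because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```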
Tesla's Leap to Full Autopilot 🚘
In a move eagerly anticipated by Tesla enthusiasts, the company has rolled out its latest Full Self-Driving Beta Version 12 (FSD Beta V12) to select customers, marking a significant update to its self-driving technology. The new version stands out for its shift to an end-to-end approach to autonomous driving, with roughly 300,000 lines of hand-written code replaced by a system that processes video inputs directly to control the vehicle.
The core of V12 lies in its departure from Tesla's traditional approach, which involved distinct modules for perception, planning, and control. Instead, V12 employs a single, expansive neural network that directly takes raw camera data and converts it into driving commands. This change is expected to enhance the efficiency and speed of decision-making, as the network is designed to learn the intricate connections between sensory input and driving actions. Additionally, the network's capability to continuously adapt to new driving scenarios hints at improved performance in diverse environments.
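To make the architectural contrast concrete, here is a purely illustrative sketch (not Tesla’s code or architecture) of the end-to-end interface: one network mapping stacked camera frames straight to steering and acceleration commands, with no separate hand-engineered perception, planning, and control modules.

```python
import torch
import torch.nn as nn

# Purely illustrative -- not Tesla's architecture. The point is the interface:
# raw camera frames in, driving commands out, learned end to end.
class EndToEndDrivingPolicy(nn.Module):
    def __init__(self, num_cameras: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(                 # toy image/video feature extractor
            nn.Conv2d(3 * num_cameras, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(64, 2)                   # outputs: [steering, acceleration]

    def forward(self, camera_frames: torch.Tensor) -> torch.Tensor:
        # camera_frames: [batch, 3 * num_cameras, height, width] stacked camera inputs
        return self.head(self.backbone(camera_frames))

controls = EndToEndDrivingPolicy()(torch.randn(1, 24, 480, 640))   # -> tensor of [steer, accel]
```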
Tools of the Trade ⚒️
Multi Motion Brush in GEN-2: Animate and control the movement of up to five objects or areas within a single image using Runway’s GEN-2. You can independently control the motion of each selected subject or area.
Weave: Use generative AI for your use case without coding, be it interactive characters for games, chatbots for customer engagement, or automated content creation. It features user-friendly dataflow management, templates, app-ready implementation, and cost-effective AI models for various automation tasks.
Canvas: An open-source, native macOS app that allows you to explore OpenAI's DALL·E capabilities like image generation, editing, and variations.
Open Source AI Video Search Engine: A search engine that indexes video content so you can find specific answers within large volumes of footage. It also lets you interact with videos through a chat interface to learn more about their content (a rough sketch of the general recipe is below).
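For context, the general recipe behind tools like this (transcribe, chunk, embed, and search) can be sketched roughly as below. This is not the project’s actual code, and the Whisper and sentence-transformers models named here are common defaults I’m assuming, not confirmed choices.

```python
# Rough sketch of the transcribe -> embed -> search recipe behind video Q&A tools.
# Not this project's actual code; model choices are assumptions.
import whisper                                         # pip install openai-whisper
from sentence_transformers import SentenceTransformer, util

def index_video(path: str, embedder, chunk_seconds: float = 30.0):
    """Transcribe a video and embed timestamped transcript chunks."""
    result = whisper.load_model("base").transcribe(path)
    chunks, current, start = [], [], 0.0
    for seg in result["segments"]:
        current.append(seg["text"])
        if seg["end"] - start >= chunk_seconds:
            chunks.append({"start": start, "text": " ".join(current)})
            current, start = [], seg["end"]
    if current:
        chunks.append({"start": start, "text": " ".join(current)})
    for chunk in chunks:
        chunk["embedding"] = embedder.encode(chunk["text"], convert_to_tensor=True)
    return chunks

def search(question: str, chunks, embedder, top_k: int = 3):
    """Return the transcript chunks most relevant to the question."""
    query = embedder.encode(question, convert_to_tensor=True)
    ranked = sorted(chunks, key=lambda c: float(util.cos_sim(query, c["embedding"])), reverse=True)
    return ranked[:top_k]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
hits = search("What pricing is discussed?", index_video("talk.mp4", embedder), embedder)
```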
😍 Enjoying so far, TWEET NOW to share with your friends!
Hot Takes 🔥
Gemini Pro is a quite capable model if you use it directly (using a key from Vertex AI), but most people experience it via Bard, which has additional “safety” features that make it less capable. Google’s marketing telling these two as equivalent is the root of the confusion. ~ Delip Rao
You have ChatGPT, Copilot, Bard, Gemini, Claude, Poe, Open source LLMs and so on...
Midjourney, Dall-e 3, 100 stable diffusion sites, Pika, Runway, Adobe...
For the general public - I feel - AI is even more hard and overwhelming to get into now, than 1 year ago. ~ Borriss
Meme of the Day 🤡
When everyone praises you for your PowerPoint presentation, but you used ChatGPT to generate it.
That’s all for today!
See you tomorrow with more such AI-filled content. Don’t forget to subscribe and give your feedback below 👇
Real-time AI Updates 🚨
⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!!
PS: I curate this AI newsletter every day for FREE, your support is what keeps me going. If you find value in what you read, share it with your friends by clicking the share button below!