AI Learns and Operates Smartphones Like Humans
PLUS: Apple's Model to Analyze Anything Anywhere, OpenAI's New Funding at $100 Billion Valuation
Today's top AI Highlights:
Ferret: Refer and Ground Anything Anywhere at Any Granularity
AppAgent: Multimodal Agents as Smartphone Users
InsightPilot: An LLM-Empowered Automated Data Exploration System
OpenAI Might Become a $100 Billion Startup
First-ever reliable browser automation Powered by GPT-4 Vision
& so much more!
Read time: 3 mins
Latest Developments
Analyze Inputs of Any Kind and Granularity
Researchers at Columbia University and Apple have released Ferret, a multimodal LLM that integrates spatial referring of any shape or granularity within an image. This model stands out for its advanced skills in linking words to specific areas within images, a task that blends visual understanding with language processing.
Key Highlights:
Unlike other models, Ferret accepts a wide array of input types, from specific points to boxes and even free-form shapes. This versatility allows it to interact with images at a level of detail that previous models could not, making it a leader in spatial understanding.
To train and refine Ferret, researchers developed the GRIT dataset, a large collection of about 1.1 million samples. This dataset is comprehensive, containing detailed information about spatial relationships and various levels of visual detail.
Tested on a new benchmark, Ferret-Bench, Ferret has demonstrated its effectiveness in tasks such as detailed image description and complex spatial reasoning, outperforming other models in these areas. This proficiency translates to real-world applications like advanced image search, interactive education, and assistive technologies, where precise image understanding is crucial.
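To make the "points, boxes, and free-form shapes" idea concrete, here is a minimal Python sketch of how three kinds of region input could be normalized into one format before being attached to a question. The class names, the bounding-box reduction, and the prompt format are illustrative assumptions, not Ferret's actual API or internal representation.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Coord = Tuple[float, float]


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class FreeForm:
    vertices: Tuple[Coord, ...]  # outline of a hand-drawn shape


Region = Union[Point, Box, FreeForm]


def bounding_box(region: Region) -> Tuple[float, float, float, float]:
    """Reduce any region type to (x_min, y_min, x_max, y_max)."""
    if isinstance(region, Point):
        return (region.x, region.y, region.x, region.y)
    if isinstance(region, Box):
        return (
            min(region.x1, region.x2),
            min(region.y1, region.y2),
            max(region.x1, region.x2),
            max(region.y1, region.y2),
        )
    if isinstance(region, FreeForm):
        xs = [x for x, _ in region.vertices]
        ys = [y for _, y in region.vertices]
        return (min(xs), min(ys), max(xs), max(ys))
    raise TypeError(f"unsupported region type: {type(region).__name__}")


def refer_prompt(question: str, region: Region) -> str:
    """Attach a region reference to a question (hypothetical format)."""
    x1, y1, x2, y2 = bounding_box(region)
    return f"{question} [region: {x1:g}, {y1:g}, {x2:g}, {y2:g}]"
```

Reducing a free-form shape to its bounding box throws away the outline, so a real system would need a richer representation; the sketch only shows how different input types can share one interface.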
AI That Learns and Operates Smartphones like Humans
AppAgent is a cutting-edge multimodal agent framework that enables AI to interact with smartphone applications in a manner akin to human users. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. AppAgent demonstrates an ability to perform a diverse array of tasks across various applications.
Key Highlights:
AppAgent learns to use smartphone apps either by watching humans or through its own exploration. It has been effectively tested on 50 tasks across 10 different apps, including social media, email, and image editing. This dual approach to learning builds a knowledge base that helps the agent adapt to various software settings.
Unlike typical phone assistants that need access to a phone's system backend, AppAgent works at the level of the graphical user interface (GUI), performing simple actions like tapping and swiping. This method makes the agent usable across many different apps and enhances security and privacy.
AppAgent integrates LLMs like GPT-4 Vision, which allows the agent to interpret and respond to visual cues within the GUI of smartphone apps, moving beyond the limitations of text-only information processing. This capability enables a more holistic understanding of the app's functionality and operational logic.
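As a rough illustration of what working "at the level of the GUI" can look like, the Python sketch below parses a model's text reply into a small, fixed set of actions (tap and swipe) on numbered UI elements. The reply syntax, the `Action` class, and the `parse_action` helper are assumptions made for this example, not AppAgent's actual interface.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    name: str                        # "tap" or "swipe"
    element: int                     # numeric label of a UI element on the screenshot
    direction: Optional[str] = None  # only used by "swipe"


# Accepts replies such as: tap(5)   or   swipe(2, "up")
_PATTERN = re.compile(r'^(tap|swipe)\((\d+)(?:,\s*"(up|down|left|right)")?\)$')


def parse_action(reply: str) -> Action:
    """Turn a model reply into a validated GUI action (hypothetical syntax)."""
    match = _PATTERN.match(reply.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {reply!r}")
    name, element, direction = match.groups()
    if name == "swipe" and direction is None:
        raise ValueError("swipe needs a direction")
    if name == "tap" and direction is not None:
        raise ValueError("tap does not take a direction")
    return Action(name, int(element), direction)
```

Restricting the agent to a handful of validated actions is one way to keep it app-agnostic: anything the model says that does not parse as a tap or swipe is rejected instead of being executed.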
Automate Data Exploration with LLMs
InsightPilot by Microsoft is a new data exploration tool that leverages LLMs to simplify the complexities of data analysis. It is designed to make data exploration more accessible, especially for users who lack extensive expertise in data analysis or find traditional methods too cumbersome and time-intensive. By automating key aspects of data exploration and integrating advanced insight discovery tools, InsightPilot makes data analysis more efficient and user-friendly.
Key Highlights:
InsightPilot simplifies data analysis by automatically choosing the best ways to understand, summarize, and explain data. It uses something called intentional queries (IQueries) to guide users through their data, making it easier to find valuable insights without getting overwhelmed.
InsightPilot is built to understand what users are looking for. Users can ask high-level questions, and the system uses special methods to analyze the data. These methods, like 'Understand', 'Summarize', 'Compare', and 'Explain', turn complex data into easy-to-understand insights. This makes it suitable even for users who aren't experts in data analysis.
InsightPilot's design allows it to be effectively applied across a wide range of data sets. Whether it's a dataset with simple patterns or one with complex, multi-dimensional data, the system's integrated approach ensures that it can adapt and provide meaningful insights regardless of the dataset's nature or complexity.
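For a feel of how the four named query types could map onto concrete operations, here is a toy Python sketch that dispatches 'Understand', 'Summarize', 'Compare', and 'Explain' over a tiny in-memory table. The function bodies, the `run_iquery` dispatcher, and the sample data are all hypothetical stand-ins; InsightPilot's real IQueries are backed by dedicated insight engines that are not described here.

```python
from collections import defaultdict

# Hypothetical sample data for the sketch.
ROWS = [
    {"region": "North", "sales": 120},
    {"region": "North", "sales": 80},
    {"region": "South", "sales": 300},
]


def understand(rows):
    """Describe the shape of the data."""
    return {"rows": len(rows), "columns": sorted(rows[0])}


def summarize(rows, measure):
    """Aggregate one numeric column."""
    values = [row[measure] for row in rows]
    return {"total": sum(values), "max": max(values), "min": min(values)}


def compare(rows, measure, by):
    """Total of `measure` for each value of the `by` column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[by]] += row[measure]
    return dict(totals)


def explain(rows, measure, by):
    """Name the group that contributes most to the overall total."""
    totals = compare(rows, measure, by)
    top = max(totals, key=totals.get)
    return {"top": top, "share": totals[top] / sum(totals.values())}


IQUERIES = {
    "Understand": understand,
    "Summarize": summarize,
    "Compare": compare,
    "Explain": explain,
}


def run_iquery(kind, rows, **kwargs):
    """Dispatch a named query type to its handler."""
    return IQUERIES[kind](rows, **kwargs)
```

In this sketch an LLM's job would be limited to picking the query type and its arguments, for example `run_iquery("Explain", ROWS, measure="sales", by="region")`, while the numbers themselves come from ordinary code.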
OpenAI in Funding Discussions at $100B Valuation
OpenAI is reportedly in discussions for a new funding round that could value the company above $100 billion, making it one of the world's most valuable startups. This round may include an $8 billion to $10 billion investment from Abu Dhabi's G42 for a chipmaking venture, aimed at developing cost-effective alternatives to Nvidia's GPUs for AI workloads. Concurrently, OpenAI is finalizing a separate tender offer led by Thrive Capital, with an estimated company valuation of around $86 billion.
Tools of the Trade
AI Employe: Powered by GPT-4 Vision, it automates complex browser-based tasks with human-like intelligence, handling data transfers and interpretation tasks. You can create workflows by simply describing tasks in the browser, similar to how you would instruct a human. It can also execute intricate tasks, interpret data, and provide insights from complex sources like graphs, tables, and image-based OCR.
Algorum Mystic GPT: Specializes in analyzing NIFTY 500 stocks using technical indicators, candle patterns, support/resistance levels, and fundamental data. It can analyze multiple stocks at once, presents its analysis with graphical representations of the data, and handles data securely and privately.
LocGuessr: It can guess the location where a given picture was taken, using GPT-4 Vision. The tool analyzes uploaded images and predicts their geographical origin, accurately identifying the country 95% of the time, though it is less accurate with states, being correct 70% of the time.
GPT Router: Open-source API gateway offering a universal API for over 30 LLMs, featuring smart fallbacks, automatic retries, and reduced latencies for seamless model management and continuous service. It addresses the challenges of using generative AI models in production environments.
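The "fallbacks and automatic retries" pattern that gateways like GPT Router advertise can be sketched in a few lines of Python. This is a generic illustration, not GPT Router's code or API: `complete_with_fallback` and the provider callables are hypothetical names, and a production gateway would add timeouts, backoff, and health checks.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    retries_per_provider: int = 2,
) -> Tuple[str, str]:
    """Try each provider in order, retrying before falling back to the next.

    Returns (provider_name, completion). Raises RuntimeError if every
    attempt on every provider fails.
    """
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # sketch only: real code would catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The caller passes providers in priority order, so the first healthy model in the list answers and the application code never needs to know which one it was.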
Enjoying so far, TWEET NOW to share with your friends!
Hot Takes
Each of us will soon be walking around with a powerful local hyper-personalized private LLM running on our iPhones ~ Bindu Reddy
I haven't seen a "super prompt" that can boost all tasks. It's clear to me that a task optimized prompt can make current models perform way above a basic baseline prompt. I think we will look back in < 2 years and realize GPT-4 level models were even more capable than we realized ~ anton
Meme of the Day
That's all for today!
See you tomorrow with more such AI-filled content. Don't forget to subscribe and give your feedback below
Real-time AI Updates
Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what's trending!!
PS: I curate this AI newsletter every day for FREE, and your support is what keeps me going. If you find value in what you read, share it with your friends by clicking the share button below!