🔓 OpenAI was Leaked, Again?
HuggingFace 9 FREE Courses + AI Cookbook

Read time: 5 minutes
AI could grade itself and tell you why it’s going to fail before it even tries. Microsoft just dropped 18 scales that might change how you trust and evaluate AI. Plus, OpenAI’s latest challenge with $250,000 in cash up for grabs.
What's on FIRE 🔥
IN PARTNERSHIP WITH HUBSPOT
HubSpot offers an intuitive customer relationship management platform tailored for small businesses. Manage leads, track sales performance, and understand your customers with ease. Best of all, it’s completely free, with no limits on users or data, allowing you to store and manage up to 1,000,000 contacts.
AI INSIGHTS
After the LMArena controversy around benchmark hacking, the AI/ML community has been searching for better evaluation methods. Just in time, Microsoft researchers have introduced a creative new approach to evaluate AI models — ADeLe. It breaks down why a model succeeds or fails using 18 cognitive and knowledge-based “rubrics”, just like how a teacher grades a student’s skills.
ADeLe can predict if a model will succeed on a new task with 88% accuracy — before the model even tries it.
🧱 How It Works: ADeLe framework has 2 main parts:
Task Process: Uses rubrics to rate how demanding a task is across the 18 abilities.
System Process: Tests an AI model against ADeLe’s benchmark suite to profile its abilities.
The study covered 63 tasks across 20 benchmarks using 16,000 examples and analyzed 15 models.
📊 The 18 Measurement Scales Include:
Cognitive Abilities: e.g., Attention (AT), Reasoning, Abstraction, Metacognition.
Knowledge Areas: e.g., Social Knowledge (KNs), Natural Sciences (KNn), Formal Knowledge (KNf).
Other Factors: like how common a task is on the internet (prevalence) or question clarity.
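The core idea of rubric-based evaluation can be sketched in a few lines. This is an illustrative toy, not ADeLe's actual method (which fits ability curves per rubric): each task gets a demand level per rubric, each model gets an ability level per rubric, and success is predicted when ability meets demand everywhere. All names and values here are hypothetical.

```python
# Toy sketch of rubric-based success prediction (not ADeLe's real code).
# Rubric levels run 0-5: a task's demand profile vs. a model's ability profile.

TASK_DEMANDS = {"attention": 3, "reasoning": 4, "formal_knowledge": 2}
MODEL_ABILITIES = {"attention": 4, "reasoning": 3, "formal_knowledge": 5}

def predict_success(demands: dict, abilities: dict) -> bool:
    """Predict success only if the model meets every rubric's demand level."""
    return all(abilities.get(rubric, 0) >= level for rubric, level in demands.items())

# The model falls short on reasoning (3 < 4), so the prediction is failure.
print(predict_success(TASK_DEMANDS, MODEL_ABILITIES))  # -> False
```

The appeal of this style of evaluation is that a failed prediction comes with a reason attached: you can see exactly which rubric the model fell short on.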
🧪 Three Big Findings from Testing ADeLe on 20 AI Benchmarks
1. Current benchmarks aren’t as reliable as we think
Many benchmarks are marketed as testing specific skills, but ADeLe reveals they often require more or different abilities than claimed.
Example: TimeQA claims to measure time-based reasoning, but its questions are all medium difficulty. With no easy or hard ones, it doesn't truly test a model’s range.
2. Every model has a unique personality
Using ADeLe, researchers created detailed profiles for 15 large language models — like GPT-4, Babbage-002, and LLaMA-3.1-405B — and found:
Each model has its own strengths and weaknesses, shown through radar charts.
Newer models tend to perform better but not always in every skill area.
Bigger models aren’t always better.
3. ADeLe predicts success or failure with 88% accuracy
Beyond analysis, ADeLe can also predict whether a model will succeed on a task.
It hit ~88% accuracy in forecasting how models like GPT-4o or LLaMA-3.1-405B would perform — a huge improvement over traditional methods.
Why It Matters: ADeLe feels like the start of an AI “report card” era, where we evaluate systems not just by results but by the thinking behind them. I think this could soon become a standard framework in multimodal/robotic systems.
🎁 Today's Trivia - Vote, Learn & Win!
Get a 3-month membership at AI Fire Academy (500+ AI Workflows, AI Tutorials, AI Case Studies) just by answering the poll.
Which is not one of the 18 rubrics ADeLe uses to evaluate AI models?
TODAY IN AI
AI HIGHLIGHTS
💬 Sam Altman’s goal for ChatGPT is to remember ‘your whole life’. He envisions it as an “AI-first” operating system: a "tiny reasoning model" with a trillion-token context window using all internal data.
🔥 xAI blamed Grok’s obsession with white genocide on an ‘unauthorized edit'. Over 52% of AI Fire readers also say it reflects Musk's personal beliefs embedded into Grok.
✨ A breakthrough brain-computer interface from UC Davis lets a man with ALS (the same disease Stephen Hawking had) "speak" again with 97% accuracy. Could AI finally restore lost abilities for people with disabilities?
🔗 'OpenAI to Z Challenge' offers cash prizes for the top 5, including $250,000 for 1st place and $100,000 for 2nd. You need to use AI to research, analyze, and report on potential lost settlements. Join here.
📱 Google adds built-in accessibility tools to Chromebooks for education. Chrome now supports OCR for scanned PDFs and offers custom zoom settings directly.
🌩️ The AI revolution is real but moving slowly, similar to a historical "phony war" period before major conflict erupts. Don’t be fooled: this is the calm before the AI storm.
💰 AI Daily Fundraising: Pathos AI has raised $365 million, increasing its valuation to $1.6 billion. The oncology-focused AI startup secured investment from industry giant AstraZeneca, highlighting strong industry backing.
AI SOURCES FROM AI FIRE
NEW EMPOWERED AI TOOLS
🛠️ ScoutDB is the world’s first agentic Mongo GUI. Find data 90% faster, query in natural language, and visualize schema relationships on an infinite canvas. No more hours manually writing queries.*
🔗 Tolt is an all-in-one affiliate marketing software for individuals and startups.
📅 HubSpot Meeting Scheduler to book 10x more demos, sessions, and support calls.
🤖 Mukh.1 lets you drag‑drop AI agents, RAG, tools and multi‑agent workflows, no-code.
✍️ 1Stroke generates context-aware responses for emails, chats, socials... 100% customizable.
*indicates a promoted tool, if any

AI QUICK HITS
🚀 Windsurf launched in-house AI models - cheaper than Claude 3.5 Sonnet.
📥 A leak suggests ChatGPT will soon record, transcribe & summarize meetings and integrate MCP.
🔴 New Claude Neptune model undergoes red team review at Anthropic.
📜 Meta released new science research and the Open Molecules 2025 dataset to speed up drug discovery.
🙏 Anthropic’s lawyer was forced to apologize after Claude hallucinated a legal citation.
AI CHART
ByteDance has quietly open-sourced a powerful AI research assistant named DeerFlow, designed to automate complex research tasks using a multi-agent architecture.
Unlike traditional LLM wrappers, DeerFlow acts like a collaborative research team in code. It can also turn a research report into a podcast-style audio file — instantly.
⚙️ Core Features at a Glance
DeerFlow is packed with advanced capabilities:
Multi-modal search via Tavily, DuckDuckGo, Brave, Arxiv
Python code execution for data processing and analysis
OpenAI-compatible LLMs + Qwen for flexible reasoning
Notion-style block editing with sentence suggestions
Podcast/audio report generation via Volcengine TTS
Visual workflow design & debugging in LangGraph Studio
PowerPoint generation for presentations
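The multi-agent pattern behind a tool like DeerFlow can be sketched with a toy pipeline. This is a hypothetical illustration, not DeerFlow's real API (which builds on LangGraph and external search tools): a planner splits the query into research steps, a researcher gathers notes per step, and a reporter composes the final write-up. All function names are invented for this sketch.

```python
# Hypothetical sketch of a planner -> researcher -> reporter agent pipeline.
# Each "agent" is a plain function here; real systems would back each one
# with an LLM call and tools (web search, code execution, TTS, etc.).

def planner(query: str) -> list[str]:
    """Break a research query into concrete steps."""
    return [f"Background on {query}", f"Recent developments in {query}"]

def researcher(step: str) -> str:
    """Gather notes for one step (a real agent would call search APIs here)."""
    return f"notes for: {step}"

def reporter(query: str, notes: list[str]) -> str:
    """Assemble collected notes into a markdown report."""
    body = "\n".join(f"- {n}" for n in notes)
    return f"# Report: {query}\n{body}"

def run_pipeline(query: str) -> str:
    steps = planner(query)
    notes = [researcher(step) for step in steps]
    return reporter(query, notes)

print(run_pipeline("Bitcoin market trends"))
```

The design choice worth noting is the separation of roles: each agent has a narrow job and a typed handoff, which is what makes multi-agent orchestration easier to debug than a single monolithic prompt.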
DeerFlow is not just theoretical; it comes with working examples, like:
Bitcoin market trends – regulatory insights, historical charts
AI model analysis – e.g., OpenAI’s Sora and Claude workflows
Scientific concept exploration – like MCP in chemistry and electronics
Sports data analysis – Cristiano Ronaldo's career breakdown
All outputs are editable, convertible to audio, and shareable in multiple formats. ByteDance is aiming to democratize deep research automation.
This reflects the shift from mono-agent GPT wrappers to multi-agent, task-specific orchestration. DeerFlow’s extensibility hints at long-term community-driven innovation — similar to how HuggingFace grew.
We’ll likely see integrations with notebooks, data pipelines, or CMS platforms soon. More similar open-source multi-agent systems will rise as researchers realize that orchestration > raw model power.
AI JOBS
We read your emails, comments, and poll replies daily
How would you rate today’s newsletter? Your feedback helps us create the best newsletter possible.
Hit reply and say Hello – we'd love to hear from you!
Like what you're reading? Forward it to friends, and they can sign up here.
Cheers,
The AI Fire Team