AI for Anything Daily Brief: Monday, 29 June 2026

The AI news you can actually use — decoded daily.

☕ The 60-second version

Ford is rehiring the experienced engineers it replaced with AI — a real-world signal that AI-only dev teams have hard limits in complex, safety-critical domains.
A developer used Claude Code + Opus to analyze their own MRI scan, showing how agentic AI tools are moving into high-stakes personal decision support.
HackerRank open-sourced its ATS resume scorer — then the internet discovered the same resume scored 90, 74, and 88 on three consecutive runs, exposing reliability gaps in AI hiring tools.

🔥 Today's big story

Ford Rehires 'Gray Beard' Engineers After AI Falls Short — What It Means for AI Skill-Builders

Ford's experiment confirms what benchmark debates obscure: AI coding tools underperform on complex, domain-specific, safety-critical engineering work without deep human oversight.
The rehire wave is a direct market signal — organizations are realizing they need humans who know how to direct AI, not just how to prompt it, especially in regulated industries.
For learners, this validates the practical mastery framing: the most durable AI skill is knowing when AI is wrong, not just when it's fast.

💡 Use this as a calibration exercise today: pick a complex task in your domain and run it through your AI tool of choice, then spend 15 minutes auditing the output for domain-specific errors a generalist model would miss. Document the failure modes: that's your real AI skill gap list.

TechCrunch: Ford rehires 'gray beard' engineers after AI falls short

📰 Also today

Developer Uses Claude Code + Opus to Analyze Their Own MRI Scan

The workflow — feeding raw MRI data into Claude Code with Opus — produced a structured second-opinion breakdown that the author found genuinely useful alongside their doctor's read.
This is a live demonstration of agentic AI (Claude Code as orchestrator, Opus as reasoner) applied to unstructured medical imaging data — a workflow most people haven't tried yet.

💡 Try the 'second opinion' pattern on a complex document in your field: feed a dense report, contract, or diagnostic output to Claude Opus via Claude Code and ask it to surface assumptions, flag gaps, and list what it cannot verify. Treat the output as a checklist, not a verdict.

Antoine.fi: I used Claude Code to get a second opinion on my MRI

HackerRank Open-Sources Its ATS — Same Resume Scores 90, Then 74, Then 88

HackerRank's open-source ATS tool is now publicly auditable, but early testers found the same resume scoring wildly differently across runs — a nondeterminism problem baked into LLM-based evaluation.
This is a practical lesson in AI reliability: a nonzero sampling temperature, loosely specified prompts, and differences in what ends up in the context window all make LLM scoring tools unreliable unless consistency controls are set explicitly.

💡 If you use any AI scoring or evaluation tool (for resumes, essays, code reviews), run the same input 3 times and compare outputs. If scores vary by more than 10%, set temperature to 0 and add an explicit scoring rubric to the system prompt. Those two controls typically remove much of the run-to-run variance, though they do not guarantee identical scores on every run.

Dan Unparsed: HackerRank open-sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88
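
The three-run check above can be sketched in a few lines of Python. This is a minimal illustration, not HackerRank's method: the function names are hypothetical, and the 10% threshold is read here as (max - min) / mean, which is only one reasonable interpretation.

```python
from statistics import mean

def relative_spread(scores):
    """Return (max - min) / mean for a list of numeric scores."""
    if not scores:
        raise ValueError("need at least one score")
    avg = mean(scores)
    if avg == 0:
        raise ValueError("mean score is zero; relative spread is undefined")
    return (max(scores) - min(scores)) / avg

def is_consistent(scores, threshold=0.10):
    """True when the spread across repeated runs stays within the threshold."""
    return relative_spread(scores) <= threshold

# The three scores reported for the same resume.
runs = [90, 74, 88]
print(f"spread: {relative_spread(runs):.1%}")
print("consistent" if is_consistent(runs) else "inconsistent: fix temperature and rubric")
```

For the reported scores the spread is 16 points on a mean of 84, roughly 19%, so this check would flag the tool as inconsistent.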

GLM 5.2 Beats Claude in Semgrep's Cybersecurity Benchmarks

Semgrep's internal benchmarks show GLM 5.2 outperforming Claude on their cyber-specific eval suite, a reminder that the 'best model' is always relative to task and domain, not universal.
For AI practitioners, this is a signal to run your own domain benchmarks rather than trusting general leaderboards when choosing models for specialized workflows.

💡 Build a 10-question benchmark for your specific use case (your industry, your task type, your edge cases) and test the top 3 models you use. A personal eval suite is more valuable than any public leaderboard for your actual workflow.

Semgrep Blog: GLM 5.2 beats Claude in our cyber benchmarks
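
A personal eval suite can start as a very small harness. The sketch below is an assumption-heavy illustration: `run_benchmark`, `model_a`, and `model_b` are hypothetical names, the two models are canned stand-ins rather than real API calls, and the two security questions are made-up examples of what a 10-question set might contain.

```python
from typing import Callable, Dict, List, Tuple

# Each case pairs a prompt with a checker that decides whether an answer passes.
Case = Tuple[str, Callable[[str], bool]]

def run_benchmark(
    models: Dict[str, Callable[[str], str]], cases: List[Case]
) -> Dict[str, float]:
    """Return the fraction of cases each model passes."""
    if not cases:
        raise ValueError("benchmark needs at least one case")
    results = {}
    for name, ask in models.items():
        passed = sum(1 for prompt, check in cases if check(ask(prompt)))
        results[name] = passed / len(cases)
    return results

# Hypothetical stand-ins: swap these for real calls to the models you use.
def model_a(prompt: str) -> str:
    return "Use parameterized queries."

def model_b(prompt: str) -> str:
    return "Escape the quotes by hand."

# Made-up example cases; a real suite would hold your own edge cases.
cases: List[Case] = [
    ("How should I prevent SQL injection?", lambda a: "parameterized" in a.lower()),
    ("Is MD5 acceptable for password hashing?", lambda a: a.lower().startswith("no")),
]

scores = run_benchmark({"model_a": model_a, "model_b": model_b}, cases)
for name, score in scores.items():
    print(f"{name}: {score:.0%} of cases passed")
```

Keeping the checkers as plain functions makes each pass/fail rule explicit, which matters more than the harness itself: the value of the suite is in the cases you write.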

🛠️ Use this today — The 'Domain Failure Audit' Prompt

Paste this into Claude Opus or GPT-5: 'I'm going to give you a task from my field: [describe a real complex task]. Complete it, then give me a self-critique listing: (1) three assumptions you made that a domain expert might challenge, (2) two things you could not verify without specialized knowledge, (3) one place where a novice might trust your output but shouldn't. Be specific.' Run this weekly on any high-stakes AI output in your work. It does not retrain the model, but it pushes the model to expose its weak spots and sharpens your own critical review instincts, the core skill Ford now realizes it should have kept.
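
If you run the audit weekly, a small helper keeps the wording identical from run to run. This is only a convenience sketch; `AUDIT_TEMPLATE` and `build_audit_prompt` are hypothetical names, and the template text is the prompt above with the bracketed placeholder replaced by a format field.

```python
AUDIT_TEMPLATE = (
    "I'm going to give you a task from my field: {task}. "
    "Complete it, then give me a self-critique listing: "
    "(1) three assumptions you made that a domain expert might challenge, "
    "(2) two things you could not verify without specialized knowledge, "
    "(3) one place where a novice might trust your output but shouldn't. "
    "Be specific."
)

def build_audit_prompt(task: str) -> str:
    """Fill the audit template with a task description."""
    # Drop surrounding whitespace and a trailing period so the sentence reads cleanly.
    task = task.strip().rstrip(".")
    if not task:
        raise ValueError("describe a real task before building the prompt")
    return AUDIT_TEMPLATE.format(task=task)

print(build_audit_prompt("review a brake-controller firmware patch"))
```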

⚡ The feed

Other

A Brown University professor is denouncing what they describe as mass AI-assisted exam fraud — raising urgent questions about assessment design in an AI-native learning environment.

📈 Tip of the day

Set temperature to 0 whenever you need consistent, auditable AI output: scoring, evaluation, extraction, or classification tasks. Temperature > 0 is for creativity; temperature = 0 is for repeatability. Some hosted models can still return slightly different outputs at 0, so spot-check with repeated runs. Knowing when to switch is one of the highest-leverage prompt engineering decisions you can make, and it's free.
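
To see why lower temperature means more repeatable output, it helps to look at how temperature reshapes the probabilities a model samples from. The sketch below uses the standard temperature-scaled softmax; the function name and the three logits are made up for illustration, and real APIs handle a temperature of exactly 0 as a special case (typically by always picking the top-scoring token).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 means 'always pick the top score'")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

As the temperature drops, the top candidate's share of the probability climbs toward 1, so repeated runs pick the same token far more often. At higher temperatures the alternatives keep a meaningful share, which is where run-to-run variation comes from.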

❓ FAQ

Why did Ford rehire engineers after using AI for coding?

Ford's AI coding tools underperformed on complex, safety-critical automotive engineering tasks that require deep domain knowledge. The company found that experienced engineers — particularly those who understood legacy systems and regulatory constraints — were necessary to catch errors AI models consistently made in specialized contexts.

Can Claude Code actually analyze medical images like MRI scans?

Claude Code can process and reason about MRI data when given structured inputs, acting as an orchestration layer for Claude Opus's reasoning. It cannot replace radiologist diagnosis, but it can surface structured observations and flag areas for attention. Users should treat outputs as a research aid, not clinical guidance, and always consult licensed medical professionals.

Why did HackerRank's AI resume scorer give three different scores for the same resume?

LLM-based scoring tools are nondeterministic by default. With a nonzero sampling temperature, the model draws each token at random from a probability distribution, so identical inputs can produce different outputs on each run, and any change in the prompt or surrounding context adds further variation. Without a fixed temperature (ideally 0) and a rigid scoring rubric, AI evaluators can vary significantly even on identical inputs. This is a known limitation of LLM-as-judge systems.

What does GLM 5.2 beating Claude on cybersecurity benchmarks mean for practitioners?

It means no single model is universally best across all domains. Semgrep's internal cyber-specific benchmark showed GLM 5.2 outperforming Claude — but these results are task-specific. Practitioners should build their own domain benchmarks and test top models on their actual use cases rather than relying on general public leaderboards.
