
Cloud outages don’t have to be a mystery or a recurring fire drill. Host Dr. Darren interviews Dr. Helen Gu, professor at North Carolina State University and founder/CEO of InsightFinder, about how AI for cloud operations can detect, predict, and automatically fix outages before users feel the impact.

## Key Takeaways

- AI can move beyond simple alerting to **predictive outage prevention**, spotting early warning signs before they become incidents.
- **Unsupervised machine learning** helps discover hidden patterns in noisy machine data without requiring large sets of labeled examples.
- Real-world cloud environments are complex, with thousands of parameters, dynamic workloads, and interacting microservices that make manual troubleshooting difficult.
- A **closed-loop feedback system** lets teams review AI predictions, correct mistakes, and continuously improve model accuracy.
- InsightFinder’s **composite AI** approach combines predictive AI, causal inference, behavior learning, and small language models for more reliable operations.
- The same data-driven approach can support **cloud monitoring, edge environments, critical infrastructure, and other machine-generated data streams**.

## Chapters

- 00:00 Introduction to AI that prevents cloud outages
- 01:05 Helen Gu’s origin story in NASA-funded Mars research
- 04:10 From video streaming on Mars to machine learning for reliability
- 07:00 Why machine data is harder than it looks
- 09:20 Unsupervised learning vs. supervised learning
- 12:10 From research to Google Cloud anomaly detection
- 14:40 Detection, prediction, and automatic remediation
- 17:10 Why cloud systems are so complex
- 19:45 The future of AI agents, models, and infrastructure monitoring
- 23:10 Hallucinations, false positives, and feedback loops
- 26:00 Composite AI and online learning in production
- 29:10 Adapting AI models to different environments

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.