Minghua Ma is a Senior Researcher at Microsoft M365 Research. His work focuses on building AI systems that autonomously detect, diagnose, and resolve failures in large-scale cloud infrastructure, covering the full incident lifecycle from triage (Triangle) to repair (SysCraft), with multiple systems deployed in production at Microsoft. He received his Ph.D. from Tsinghua University in 2021, advised by Prof. Dan Pei in the Netman Group. He has published 60+ papers at venues including ICSE, FSE, EuroSys, KDD, VLDB, and MLSys. He is a Senior Member of CCF.
Internship Opportunities: I am looking for self-motivated undergraduate, master's, and Ph.D. students for internships. If you are interested in working with me, please email me your CV.
Featured Research Highlights
Triangle
AI-powered incident triage research: from LLM-based interpretable triage (ISSRE'24), to the multi-agent Triangle system (ASE'25 & FSE'26), plus a comprehensive survey of triage in SE practice. Highlighted by the Azure CTO in the Advancing Reliability blog.
AIOpsLab
Holistic framework for evaluating AI agents for autonomous cloud operations. Open-sourced by Microsoft. From design principles (SoCC'24) to the evaluation framework (MLSys'25), with adoption in multi-modal failure localization (TSC'26).
SysCraft
AI agents that go beyond coding to build and repair real-world software systems. Benchmarked LLM system-building capabilities (TOSEM'25) and developed an evidence-preserving framework for automated package repair (ISSTA'26).
SKILLGen
Transforms playbooks, historical incidents, and domain knowledge into actionable skills for AI agents. Starting from automated TSG generation (FSE'26), extending to broad skill synthesis. Deployed at Microsoft.
News
Experience
Microsoft
2021 – Present
Tsinghua University
2016 – 2021
Georgia Tech
2019 – 2020
Awards
- 🏆 IEEE ISSRE 2025 Best Research Paper Candidate
"Too Many Cooks: Assessing the Need for Multi-Source Data in Microservice Failure Diagnosis"
- 🏆 IEEE ISSRE 2024 Best Industry Paper Candidate
"Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction"
- 🏆 IEEE ISSRE 2018 Best Research Paper
"Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection"
- 🎓 Outstanding Graduate
Department of Computer Science and Technology, Tsinghua University, 2021
Teaching
- Mentor
  - 2026 Spring: Boston University – EC-528 Cloud Computing
- TA
  - 2017 Fall: Tsinghua University – Software Engineering
  - 2017 Spring: Tsinghua University – Advanced Network Management
Services
- Organizer
  - SANER 2027: Tool Demo Track Chair
  - EASE 2026: Industry Track Chair
  - The 4th CCF AIOps Challenge: Technical Chair
- PC Member
  - 2026: FSE Industry, TheWebConf, COLM, AIOps Workshop
  - 2025: FSE, FSE Industry, ASE, ISSRE, KDD, COLM
  - 2024: FSE Industry, ISSRE, APSEC, KDD, TheWebConf, MILETS
  - 2023: ASE, KDD, MILETS
- Journal Reviewer
  - ACM Transactions on Intelligent Systems and Technology (TIST)
  - ACM Transactions on Software Engineering and Methodology (TOSEM)
  - IEEE Transactions on Knowledge and Data Engineering (TKDE)
  - IEEE Transactions on Services Computing (TSC)
  - IEEE Transactions on Cloud Computing (TCC)
  - Neurocomputing
- Talks
  - "LLM-based Root Cause Analysis for Cloud Incidents" (Keynote), CCF AIOps Challenge, Beijing, 2023.
  - "Improving Cloud Reliability at Scale using Generative AI" (Invited), University of Michigan, Online, 2025.
Publications
* corresponding author, + equal contribution, 📝 paper, code, 📦 dataset, and 📎 BibTeX.