Minghua Ma is a Senior Researcher at Microsoft M365 Research. His work focuses on building AI systems that autonomously detect, diagnose, and resolve failures in large-scale cloud infrastructure. It covers the full incident lifecycle, from triage (Triangle) to repair (SysCraft), and multiple systems are deployed in production at Microsoft. He received his Ph.D. from Tsinghua University in 2021, advised by Prof. Dan Pei in the Netman Group. He has published 60+ papers at venues including ICSE, FSE, EuroSys, KDD, VLDB, and MLSys. He is a Senior Member of CCF.

Internship Opportunities: I am seeking self-motivated undergraduate, master's, and Ph.D. students for internships. If you are interested in working with me, please email me your CV.

News

2026
Service: I will serve as Tool Demo Track Co-Chair for SANER 2027.
2026
Teaching: I will serve as a Mentor for EC-528 Cloud Computing at Boston University, at the invitation of Prof. Yigong Hu.
2025
Service: I will serve as Industry Track Co-Chair for EASE 2026.
2025
Impact: Microsoft Azure CTO Mark Russinovich highlighted our agent-based triage system, Triangle, in the Advancing Reliability blog.
2024
Release: Our team released the AIOpsLab framework to enable the design, development, and evaluation of autonomous AIOps agents.

Experience

Microsoft

2021 – Present
Senior Researcher, M365 Research, Redmond, 2024–Present
Senior Researcher, DKI, Beijing, 2023–2024
Researcher, MSRA, Beijing, 2021–2023

Tsinghua University

2016 – 2021
Ph.D. Student, Beijing

Georgia Tech

2019 – 2020
Visiting Scholar, Atlanta

Awards

  • 🏆 IEEE ISSRE 2025 Best Research Paper Candidate
    "Too Many Cooks: Assessing the Need for Multi-Source Data in Microservice Failure Diagnosis"
  • 🏆 IEEE ISSRE 2024 Best Industry Paper Candidate
    "Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction"
  • 🏆 IEEE ISSRE 2018 Best Research Paper
    "Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection"
  • 🎓 Outstanding Graduate
    Department of Computer Science and Technology, Tsinghua University, 2021

Teaching

Services

  • Organizer
  • PC Member
    • 2026: FSE Industry, TheWebConf, COLM, AIOps Workshop
    • 2025: FSE, FSE Industry, ASE, ISSRE, KDD, COLM
    • 2024: FSE Industry, ISSRE, APSEC, KDD, TheWebConf, MILETS
    • 2023: ASE, KDD, MILETS
  • Journal Reviewer
    • ACM Transactions on Intelligent Systems and Technology (TIST)
    • ACM Transactions on Software Engineering and Methodology (TOSEM)
    • IEEE Transactions on Knowledge and Data Engineering (TKDE)
    • IEEE Transactions on Services Computing (TSC)
    • IEEE Transactions on Cloud Computing (TCC)
    • Neurocomputing
  • Talks
    • "LLM-based Root Cause Analysis for Cloud Incidents" (Keynote), CCF AIOps Challenge, Beijing, 2023.
    • "Improving Cloud Reliability at Scale using Generative AI" (Invited), University of Michigan, Online, 2025.

Publications

* corresponding author, + equal contribution, 📝 paper, code, 📦 dataset, and 📎 BibTeX.
