Hangzhou Diagens Biotechnology Co., Ltd. (2526.HK, “Diagens”) today officially launched DoctorBench, a medical AI evaluation platform, and unveiled its inaugural global medical foundation model leaderboard in Hong Kong. WiseDiag Technology’s WiseDiag-v2, Google’s Gemini-3.1-Pro-Preview, and OpenAI’s GPT-5.4 secured the top three positions.
For the first time, the evaluation framework places “real-world clinical performance” at the center, constructing a multi-dimensional benchmarking system that closely mirrors authentic diagnostic and treatment scenarios.
As medical foundation models accelerate their transition from laboratory research to clinical application worldwide, the industry has long lacked a metric that genuinely measures a model’s “clinical competence.” Existing evaluations predominantly focus on medical knowledge recall and fail to capture a model’s comprehensive performance in complex clinical contexts. This gap between benchmarking and clinical reality has become a global obstacle to the deployment of medical AI.
OpenAI previously launched HealthBench, signaling that leading players are beginning to take this challenge seriously. However, medicine is inherently localized: diagnostic and treatment guidelines, language conventions, and patient populations vary significantly across countries and regions, so no single evaluation system can apply universally.
Driven by a profound understanding of this global challenge, Diagens developed the DoctorBench platform. The platform’s creation is rooted in nearly a decade of deep collaboration by a cross-disciplinary team. Diagens brought together experts in basic medicine, clinical medicine, artificial intelligence, and the healthcare industry, tightly integrating rigorous clinical logic with cutting-edge deep learning algorithms. This allows the team to understand both the boundaries of AI technology and the intricate demands of clinical practice, and to construct the DoctorBench evaluation framework on that basis.
The core philosophy of DoctorBench is to assess a model’s clinical communication and decision-making ability, its capacity to “think like a doctor,” rather than to test its “knowledge base.” The platform features three leaderboard tracks: the Medical Leaderboard (LLM), the Multimodal Leaderboard (VLM), and the Agent Leaderboard. These evaluate, respectively, textual diagnostic ability, multimodal understanding, and multi-turn decision-making with tool use inside a simulated clinical environment.
On the evaluation mechanism, DoctorBench pioneers a multi-dimensional architecture combining “2 Core Dimensions (Safety and Accuracy) + 3 General Dimensions (Interaction Quality, Information Prioritization, Proactive Inquiry) + 5 Specialized Modules (Evidence & Citation, Explainable Reasoning, Actionability, Personalized Adaptation, Emotional Support).” It also applies “Scenario-Adaptive Weighting,” which dynamically adjusts the weight of each dimension according to the risk level of each clinical scenario, so that the scoring logic aligns closely with real-world diagnostic decision-making.
Crucially, the platform designates “Medical Factual Accuracy” and “Safety and Risk Control” as inviolable red lines with a “one-vote veto” power. Any model that exhibits critical deviations on issues affecting patient safety will be unable to achieve a high score, regardless of outstanding performance in other dimensions. This design stems from the team’s deep understanding of the essence of medicine: in a field where lives are at stake, safety is always the paramount principle and leaves no room for compromise.
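To make the interplay between scenario-adaptive weighting and the one-vote veto concrete, the following is a minimal Python sketch. It is not DoctorBench's actual scoring formula, which the announcement does not disclose. The scenario names, the weight values, the 0-to-1 score scale, the veto threshold, and the score cap are all hypothetical, and the five specialized modules are omitted for brevity.

```python
# Illustrative sketch only. Every name, weight, and threshold below is an
# assumption made for this example, not a published DoctorBench value.

# The two core dimensions that carry veto power.
CORE_DIMENSIONS = ("safety", "accuracy")

# Hypothetical per-scenario weights: higher-risk scenarios shift weight
# toward the core dimensions.
SCENARIO_WEIGHTS = {
    "routine_consultation": {
        "safety": 0.2,
        "accuracy": 0.2,
        "interaction_quality": 0.2,
        "information_prioritization": 0.2,
        "proactive_inquiry": 0.2,
    },
    "emergency_triage": {
        "safety": 0.4,
        "accuracy": 0.3,
        "interaction_quality": 0.1,
        "information_prioritization": 0.1,
        "proactive_inquiry": 0.1,
    },
}

VETO_THRESHOLD = 0.5  # assumed: a core score below this triggers the veto
VETO_CAP = 0.3        # assumed: maximum overall score once the veto applies


def overall_score(scores: dict, scenario: str) -> float:
    """Combine per-dimension scores (each in [0, 1]) into one overall score."""
    weights = SCENARIO_WEIGHTS[scenario]
    total_weight = sum(weights.values())
    # Weighted average using the weights for this scenario.
    weighted = sum(weights[d] * scores[d] for d in weights) / total_weight
    # One-vote veto: a critical failure on any core dimension caps the
    # result, however strong the other dimensions are.
    if any(scores[d] < VETO_THRESHOLD for d in CORE_DIMENSIONS):
        return min(weighted, VETO_CAP)
    return weighted
```

Under these assumed values, a response that scores 0.2 on safety is capped at 0.3 in every scenario even with perfect scores elsewhere, while a response with a mediocre but passing safety score loses more points in the higher-risk scenario than in the routine one.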
“The advancement of medical AI is a long-distance race concerning the health and well-being of all humanity. It demands not only disruptive technological innovation and deep cross-disciplinary, cross-regional collaboration, but also an absolute reverence for and unwavering commitment to life and health,” said Dr. Song Ning, Founder of Diagens. He expressed the hope of joining hands with more global research institutions, clinical centers, and industry partners, so that truly capable technologies can be recognized, trusted, and ultimately used to benefit every patient.