I studied at Sichuan University (SCU) from 2020 to 2024,
where I majored in Computer Science & Technology. My major GPA (CS courses) was 3.79/4
(89.39/100), and my overall GPA was 3.78/4 (89.25/100).
During my time at Sichuan University, I worked as a research assistant at MachineILab from 2022 to
2024, advised by Prof. JiZhe Zhou. I participated in one National Natural Science Foundation of
China project and one National Key R&D Program of China project.
My research interests lie in large language models (LLMs), retrieval-augmented generation (RAG),
and recommender systems (RecSys). I focus on both theoretical foundations and practical
applications of LLM-based systems.
My previous research was primarily focused on topics within computer vision, such as tampering
detection and object recognition tasks. I have contributed to the design of high-impact benchmarks,
such as HiBench, and comprehensive surveys like WebAgents.
My work has been published in top-tier conferences, including
NeurIPS 2024 (spotlight) and AAAI 2025, and has accumulated
80+ citations, with an h-index of 4.
News
🎤 2025-08-07 - Invited to give a talk "Understanding Hierarchical Data with
Large Language Models: RAG, Structural Reasoning, and Future Directions" at KDD 2025
Reasoning Day in Toronto, Canada! 🎙️
🏅 2025-06-18 - Achieved 12th place globally in KDD Cup 2025 -
Meta CRAG-MM Multimodal Retrieval Challenge among hundreds of international teams! 🌍
🏆 2025-05-16 - Our benchmark paper HiBench was accepted to the KDD 2025 Benchmark Track! 🎉
📚 2023-08-01 - Participated in the NUS Summer School research program at the National University
of Singapore and completed a face recognition project! 🇸🇬
Selected Publications
[KDD'25] A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with
Large Foundation Models
Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei,
Shanru Lin, Hui Liu, Philip S. Yu, Qing Li
Advances in web techniques have significantly revolutionized various aspects of
people's lives. Despite the importance of the web, many tasks performed on it are repetitive and
time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious
daily tasks, one of the most promising approaches is to advance autonomous agents based on
Artificial Intelligence (AI) techniques, referred to as AI Agents, as they can operate continuously
without fatigue or performance degradation. In the context of the web, leveraging AI Agents --
termed WebAgents -- to automatically assist people in handling tedious daily tasks can dramatically
enhance productivity and efficiency. Recently, Large Foundation Models (LFMs) containing billions of
parameters have exhibited human-like language understanding and reasoning capabilities, showing
proficiency in performing various complex tasks. This naturally raises the question: "Can LFMs be
utilized to develop powerful AI Agents that automatically handle web tasks, providing significant
convenience to users?" To fully explore the potential of LFMs, extensive research has emerged on
WebAgents designed to complete daily web tasks according to user instructions, significantly
enhancing the convenience of daily human life. In this survey, we comprehensively review existing
research studies on WebAgents across three key aspects: architectures, training, and
trustworthiness. Additionally, several promising directions for future research are explored to
provide deeper insights.
[KDD'25] HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
Zhuohang Jiang*, Pangjing Wu*, Ziran Liang*, Peter Q. Chen*, Xu Yuan*, Ye Jia*, Jiancheng Tu*,
Chen Li, Peter H.F. Ng, Qing Li
Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to
reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for
structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs),
overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial
for human cognition, particularly in memory organization and problem-solving. It also plays a key
role in various real-world tasks, such as information extraction and decision-making. To address
this gap, we propose HiBench, the first framework spanning from initial structure generation to
final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs
systematically. HiBench encompasses six representative scenarios, covering both fundamental and
practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519
queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict
different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs
from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing
LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more
complex structures and implicit hierarchical representations, especially in structural modification
and textual reasoning. Based on these findings, we create a small yet well-designed instruction
dataset, which enhances LLMs' performance on HiBench by an average of 88.84% (Llama-3.1-8B) and
31.38% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are publicly available to
encourage evaluation.
[AAAI'25] Mesoscopic Insights: Orchestrating Multi-Scale & Hybrid Architecture for Image
Manipulation Localization
Xuekang Zhu, Xiaochen Ma, Lei Su, Zhuohang Jiang, Bo Du, Xiwen Wang, Zeyu Lei, Wentao Feng,
Chi-Man Pun, Jizhe Zhou
The mesoscopic level serves as a bridge between the macroscopic and microscopic worlds, addressing
gaps overlooked by both. Image manipulation localization (IML), a crucial technique to pursue truth
from fake images, has long relied on low-level (microscopic-level) traces. However, in practice,
most tampering aims to deceive the audience by altering image semantics. As a result, manipulation
commonly occurs at the object level (macroscopic level), which is equally important as microscopic
traces. Therefore, integrating these two levels into the mesoscopic level presents a new perspective
for IML research. Inspired by this, our paper explores how to simultaneously construct mesoscopic
representations of micro and macro information for IML and introduces the Mesorch architecture to
orchestrate both. Specifically, this architecture i) combines Transformers and CNNs in parallel,
with Transformers extracting macro information and CNNs capturing micro details, and ii) explores
across different scales, assessing micro and macro information seamlessly. Additionally, based on
the Mesorch architecture, the paper introduces two baseline models aimed at solving IML tasks
through mesoscopic representation. Extensive experiments across four datasets have demonstrated that
our models surpass the current state-of-the-art in terms of performance, computational complexity,
and robustness.
[NeurIPS'24] IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation
Detection & Localization
Xiaochen Ma*, Xuekang Zhu*, Lei Su*, Bo Du*, Zhuohang Jiang*, Bingkui Tong*, Zeyu Lei*,
Xinyu Yang*, Chi-Man Pun, Jiancheng Lv, Jizhe Zhou
A comprehensive benchmark is yet to be established in the Image Manipulation Detection &
Localization (IMDL) field. The absence of such a benchmark leads to insufficient and misleading
model evaluations, severely undermining the development of this field. However, the scarcity of
open-sourced baseline models and inconsistent training and evaluation protocols make conducting
rigorous experiments and faithful comparisons among IMDL models challenging. To address these
challenges, we introduce IMDL-BenCo, the first comprehensive IMDL benchmark and modular codebase.
IMDL-BenCo: i) decomposes the IMDL framework into standardized, reusable components and
revises the model construction pipeline, improving coding efficiency and customization
flexibility; ii) fully implements or incorporates training code for state-of-the-art models
to establish a comprehensive IMDL benchmark; and iii) conducts deep analysis based on the
established benchmark and codebase, offering new insights into IMDL model architecture, dataset
characteristics, and evaluation standards. Specifically, IMDL-BenCo includes common processing
algorithms, 8 state-of-the-art IMDL models (1 of which is reproduced from scratch), 2 sets of
standard training and evaluation protocols, 15 GPU-accelerated evaluation metrics, and 3 kinds of
robustness evaluation. This benchmark and codebase represent a significant leap forward in
calibrating the current progress in the IMDL field and inspiring future breakthroughs.
Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning
Zhuohang Jiang*, Bingkui Tong*, Xia Du, Ahmed Alhammadi, Jizhe Zhou
To explicitly derive each object's privacy class from the scene context, we interpret the
privacy-sensitive object identification (POI) task as a visual reasoning task aimed at the privacy
of each object in the scene. Following
this interpretation, we propose the PrivacyGuard framework for POI. PrivacyGuard contains three
stages. i) Structuring: an unstructured image is first converted into a structured, heterogeneous
scene graph that embeds rich scene contexts. ii) Data Augmentation: a contextual perturbation
oversampling strategy is proposed to create slightly perturbed privacy-sensitive objects in a scene
graph, thereby balancing the skewed distribution of privacy classes. iii) Hybrid Graph Generation &
Reasoning: the balanced, heterogeneous scene graph is then transformed into a hybrid graph by
endowing it with extra "node-node" and "edge-edge" homogeneous paths. These homogeneous paths allow
direct message passing between nodes or edges, thereby accelerating reasoning and facilitating the
capturing of subtle context changes.
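As a rough illustration of stage iii, the sketch below adds homogeneous paths to a toy scene graph. It assumes a triple-based graph representation and one possible linking rule (two objects are linked when a relation joins them; two relation instances are linked when they share an object). The function name `add_homogeneous_paths` and the toy scene are hypothetical and are not taken from the PrivacyGuard implementation.

```python
from itertools import combinations


def add_homogeneous_paths(triples):
    """Return (node_node, edge_edge) path sets for a toy scene graph.

    `triples` is a list of (subject, relation, object) tuples; the index of
    a triple identifies one relation (edge) instance.
    Assumed linking rule, not the paper's confirmed construction:
      - node-node path: two distinct objects joined by at least one relation
      - edge-edge path: two relation instances that share an object
    """
    node_node = set()
    edge_edge = set()
    # Node-node paths: link the two endpoints of every relation directly.
    for subj, _rel, obj in triples:
        if subj != obj:
            node_node.add(frozenset((subj, obj)))
    # Edge-edge paths: link relation instances whose endpoints overlap.
    for (i, a), (j, b) in combinations(enumerate(triples), 2):
        if {a[0], a[2]} & {b[0], b[2]}:
            edge_edge.add((i, j))
    return node_node, edge_edge


# Hypothetical toy scene used only for illustration.
scene = [
    ("person", "holds", "phone"),
    ("phone", "shows", "screen"),
    ("person", "sits_on", "chair"),
]
node_node, edge_edge = add_homogeneous_paths(scene)
```

With these extra paths, an object can exchange messages with a neighboring object in one hop instead of routing through the relation between them, which is the shortcut effect the abstract attributes to the hybrid graph.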
[ICONIP'23] TPTGAN: Two-Path Transformer-Based Generative Adversarial Network Using
Joint Magnitude Masking and Complex Spectral Mapping for Speech Enhancement
Zhaoyi Liu, Zhuohang Jiang, Wendian Luo, Zhuoyao Fan, Haoda Di, Yufan Long, Haizhou Wang
In this paper, we propose a two-path transformer-based metric generative adversarial network
(TPTGAN) for speech enhancement in the time-frequency domain. The generator consists of an
encoder, a two-stage transformer module, a magnitude mask decoder and a complex spectrum decoder.
Published in ICONIP 2023.
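The two decoders named in the TPTGAN abstract can be pictured with a minimal sketch for a single time-frequency bin. The combination rule used here (apply the mask to the noisy magnitude, keep the noisy phase, then add the complex decoder output as a correction) is an assumption for illustration and is not confirmed by the text above; the function name `combine_decoder_outputs` and the sample values are hypothetical.

```python
import cmath


def combine_decoder_outputs(noisy_bin, mask, complex_residual):
    """Combine the two decoder outputs for one time-frequency bin.

    noisy_bin:        complex STFT coefficient of the noisy speech
    mask:             non-negative gain from the magnitude mask decoder
    complex_residual: complex correction from the complex spectrum decoder
    Assumed rule: masked magnitude with the noisy phase, plus the residual.
    """
    magnitude, phase = cmath.polar(noisy_bin)
    # Scale the magnitude by the mask while keeping the noisy phase.
    masked = cmath.rect(mask * magnitude, phase)
    # The complex branch can adjust both magnitude and phase.
    return masked + complex_residual


# Hypothetical values for a single bin.
enhanced = combine_decoder_outputs(3 + 4j, 0.5, 0.5 - 1j)
```

Masking alone cannot change the phase of the noisy signal, which is why a complex-valued branch is a natural complement in this kind of design.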
IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer
Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y. Al Hammadi, Jizhe Zhou
Due to limited datasets, there is currently no pure ViT-based approach for IML to serve as a
benchmark, and CNNs dominate the entire task. Nevertheless, CNNs suffer from weak long-range and
non-semantic modeling. To bridge this gap, based on the fact that artifacts are sensitive to image
resolution, amplified under multi-scale features, and massive at the manipulation border, we
propose building a ViT with high-resolution capacity,
multi-scale feature extraction capability, and manipulation edge supervision that could converge
with a small amount of data. We term this simple but effective ViT paradigm IML-ViT, which has
significant potential to become a new benchmark for IML. Extensive experiments on five benchmark
datasets verified that our model outperforms the state-of-the-art manipulation localization methods.
Perceptual MAE for Image Manipulation Localization: A High-level Vision Learner Focusing
on Low-level Features
Xiaochen Ma, Zhuohang Jiang, Xiong Xu, Chi-Man Pun, Jizhe Zhou
This necessitates IML models to carry out a semantic understanding of the entire image. In this
paper, we reformulate the IML task as a high-level vision task that greatly benefits from low-level
features. We propose a method to enhance the Masked Autoencoder (MAE) by incorporating
high-resolution inputs and a perceptual loss supervision module, which we term Perceptual MAE
(PMAE). While MAE has demonstrated an impressive understanding of object semantics, PMAE can also
comprehend low-level semantics with our proposed enhancements. This paradigm effectively unites the
low-level and high-level features of the IML task and outperforms state-of-the-art tampering
localization methods on five publicly available datasets, as evidenced by extensive experiments.
Selected Projects
HiBench: Benchmark for Hierarchical Reasoning
First Author & Team Leader • KDD 2025 Benchmark Track • 2025
Designed and developed the first comprehensive benchmark for evaluating LLMs' capability on
hierarchical structure reasoning. The benchmark encompasses six representative scenarios with 39,519
queries across varying hierarchical complexity.
Key contributions: (1) Led the architectural design and implementation of the evaluation framework,
(2) Coordinated a multi-institutional team across different time zones, (3) Open-sourced the
complete toolkit including dataset, evaluation metrics, and baseline implementations. The benchmark
has been accepted as an oral presentation at KDD 2025 and is being adopted by multiple research
groups for hierarchical reasoning evaluation.
Meta CRAG-MM: Multimodal Retrieval Challenge
Team Leader • KDD Cup 2025 • 2025
Led a team to achieve 12th place globally among hundreds of international teams in the Meta CRAG-MM
Multimodal Retrieval Challenge.
Key contributions: (1) Designed novel multimodal fusion architectures combining vision and language
understanding, (2) Implemented efficient retrieval-augmented generation pipelines, (3) Coordinated
team efforts in model development, hyperparameter optimization, and submission strategies.
The challenge focused on developing AI systems capable of understanding and retrieving information
from multimodal content, which aligns with current trends in large multimodal models.
IMDL-BenCo: Benchmark for Image Manipulation Detection & Localization
Co-First Author • NeurIPS 2024 Benchmark Track (Spotlight) • 2024
Developed the first comprehensive benchmark and codebase for Image Manipulation Detection &
Localization (IMDL).
Key contributions: (1) Implemented GPU-accelerated evaluation metrics for fair and efficient
comparison, (2) Designed modular codebase architecture enabling easy customization and extension,
(3) Co-authored the manuscript, which was selected as a Spotlight at NeurIPS 2024. The benchmark
includes 8 state-of-the-art models, 15 evaluation metrics, and comprehensive robustness evaluation
protocols, significantly advancing the field's standardization and reproducibility.
Invited Talks
Understanding Hierarchical Data with Large Language Models: RAG, Structural Reasoning, and
Future Directions
Invited Talk • Reasoning Day @ KDD 2025 • Toronto, ON, Canada • Aug 2025
Invited to deliver a presentation at the prestigious KDD 2025 Reasoning Day workshop.
The talk will explore cutting-edge developments in leveraging Large Language Models for hierarchical
data understanding, with particular focus on Retrieval-Augmented Generation (RAG) systems and
structural reasoning capabilities. Key topics:
(1) Novel approaches to hierarchical data representation in LLM contexts,
(2) Integration of structural reasoning with retrieval-augmented generation,
(3) Future research directions in reasoning-enhanced AI systems,
(4) Practical applications and deployment considerations for hierarchical reasoning in real-world
scenarios.
This invitation recognizes the impact of our HiBench work and positions our research at the
forefront of LLM reasoning capabilities.
Education
Sichuan University, Chengdu, Sichuan, China
B.E. in Computer Science and Technology • Sep. 2020 to Jun. 2024
Hong Kong Polytechnic University, Hong Kong, China
Ph.D. in Computer Science and Technology • Sep. 2024 to Present
Experience
National University of Singapore (NUS) Summer School Participant • Aug. 2023
• Participated in intensive research program at School of Computing
• Completed face recognition project using CNN-based feature extraction and similarity matching
• Gained international research experience and cross-cultural collaboration skills
DICALab, Sichuan University
Research Assistant • Sep. 2022 to Jun. 2024
Advisor: Prof. JiZhe Zhou
• Developed graph-based frameworks for privacy-sensitive object detection
• Participated in National Natural Science Foundation of China project
• Contributed to National Key R&D Program of China
• Co-authored multiple publications in top-tier conferences and journals