My research interests mainly lie in robotics, specifically robotic manipulation (including contact-rich manipulation, dexterous manipulation, long-horizon manipulation, etc.), robot learning (including imitation learning, multimodal learning, data collection methods, in-context learning, reinforcement learning, etc.), and grasping. I am currently a member of the SJTU Machine Vision and Intelligence Group (MVIG). My ultimate goal is to enable robots to perform various tasks in the real world under any circumstances, improving the quality of human life.
Photo @ İstanbul, Türkiye 🇹🇷. Credit to Jingjing Chen.
We introduce a physically grounded interaction frame that decouples motion and force control axes from demonstrations. By combining a global vision policy and a high-frequency local policy with hybrid force-position control, Force Policy improves contact stability, force regulation, and generalization in real-world contact-rich tasks.
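A minimal sketch of the axis-wise hybrid force-position idea behind this decoupling (the selection matrix, gains, and wiping example below are illustrative assumptions, not the actual Force Policy controller):

```python
import numpy as np

def hybrid_force_position_command(S, x_des, x_cur, f_des, f_cur,
                                  kp_pos=1.0, kp_force=0.002):
    """Axis-wise hybrid control in an interaction frame (illustrative).

    S is a diagonal 0/1 selection: 1 = position-controlled axis,
    0 = force-controlled axis. Position error drives the selected axes;
    force error drives the complementary ones.
    """
    S = np.diag(S)
    pos_cmd = kp_pos * (x_des - x_cur)        # motion control term
    force_cmd = kp_force * (f_des - f_cur)    # force regulation term
    # Complementary projection: each axis obeys exactly one objective.
    return S @ pos_cmd + (np.eye(len(x_des)) - S) @ force_cmd

# Example: wiping a table -- track x/y position, regulate contact force along z.
delta = hybrid_force_position_command(
    S=[1, 1, 0],
    x_des=np.array([0.40, 0.10, 0.0]), x_cur=np.array([0.38, 0.12, 0.02]),
    f_des=np.array([0.0, 0.0, -5.0]), f_cur=np.array([0.0, 0.0, -2.0]),
)
```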
LIDEA transfers human demonstrations through dual-stage 2D feature distillation and embodiment-agnostic 3D geometry alignment. This cross-embodiment design makes human-to-robot imitation more reliable and improves generalization to new setups.
We introduce an object-centric history representation built upon point tracks, compressing long-horizon observations into task-relevant object memory for diverse visuomotor policies. This efficient design consistently outperforms both Markovian and prior history-based baselines, improving decision quality and task success.
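A toy sketch of the compression idea, assuming point tracks are already available (the hand-crafted pooled statistics below are illustrative; the actual object memory is learned):

```python
import torch

def compress_point_track_history(tracks: torch.Tensor) -> torch.Tensor:
    """Summarize a long observation history into a compact object memory.

    tracks: (T, N, 2) -- 2D positions of N tracked object points over T steps.
    Returns a fixed-size memory vector regardless of the horizon length T.
    """
    displacement = tracks[-1] - tracks[0]      # net per-point motion (N, 2)
    velocity = tracks[1:] - tracks[:-1]        # per-step motion (T-1, N, 2)
    return torch.cat([
        displacement.flatten(),                # where each point ended up
        velocity.mean(dim=0).flatten(),        # average motion direction
        velocity.std(dim=0).flatten(),         # how erratic the motion was
    ])

memory = compress_point_track_history(torch.randn(120, 16, 2))
print(memory.shape)  # torch.Size([96]) -- same size for any history length
```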
DQ-RISE quantizes dexterous hand states and couples them with arm diffusion through a continuous relaxation for structured arm-hand learning. This balances the action space and yields more efficient learning in dexterous manipulation.
We introduce AnyDexGrasp, a data-efficient dexterous grasping method that transfers across different robotic hands, built upon intermediate contact-centric grasp representations. It achieves high real-world success rates in cluttered scenes with over 150 novel objects, demonstrating scalable cross-hand grasp generalization.
We develop AirExo-2 for low-cost, large-scale in-the-wild demonstration collection, and convert the collected human demonstrations into pseudo-robot data. Together with the generalizable visuomotor policy RISE-2 that integrates 3D perception and 2D visual foundation models, this pipeline reaches strong performance without teleoperated data.
We formulate object-centric knowledge as a semantic keypoint graph template and use a coarse-to-fine matching strategy to inject it into policy learning. This design improves category-level abstraction and boosts generalization across objects.
We propose modal-level exploration to generate diverse multi-modal interaction data, then learn from the most informative trials and segments. This self-improvement loop raises data efficiency and steadily strengthens policy capability over time.
We propose a bidirectionally expanded action head that unfolds action sequences in a coarse-to-fine manner. This design preserves the capability of the policy backbone while enabling logarithmic-time inference for faster manipulation control.
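A toy sketch of the coarse-to-fine unfolding, which needs only O(log T) refinement rounds for a length-T sequence (midpoint interpolation stands in for the learned expansion head):

```python
import numpy as np

def expand_coarse_to_fine(start, goal, rounds):
    """Unfold an action sequence by repeated midpoint insertion.

    Each round doubles the resolution, so a length-(2^rounds + 1)
    trajectory takes only `rounds` passes -- logarithmic in length.
    A learned head would predict each inserted action; here we interpolate.
    """
    seq = [np.asarray(start, float), np.asarray(goal, float)]
    for _ in range(rounds):
        refined = []
        for a, b in zip(seq[:-1], seq[1:]):
            refined += [a, (a + b) / 2.0]   # insert a midpoint action
        refined.append(seq[-1])
        seq = refined
    return np.stack(seq)

traj = expand_coarse_to_fine(start=[0.0, 0.0], goal=[1.0, 0.5], rounds=4)
print(len(traj))  # 17 actions from just 4 refinement rounds
```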
FoAR is a force-aware policy that fuses vision with high-frequency force/torque sensing using a future-contact-guided gating module. This enables phase-adaptive control and delivers more accurate, robust contact-rich manipulation.
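A minimal sketch of such a gating module (the architecture, dimensions, and sigmoid gate are assumptions for illustration, not FoAR's exact design):

```python
import torch
import torch.nn as nn

class ContactGatedFusion(nn.Module):
    """Gate force/torque features by a predicted future-contact probability."""

    def __init__(self, vis_dim=256, ft_dim=64):
        super().__init__()
        self.ft_encoder = nn.Linear(6, ft_dim)     # 6-axis F/T reading
        self.contact_head = nn.Sequential(         # predicts upcoming contact
            nn.Linear(vis_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
        self.fuse = nn.Linear(vis_dim + ft_dim, vis_dim)

    def forward(self, vis_feat, ft_signal):
        p_contact = self.contact_head(vis_feat)    # in [0, 1]
        ft_feat = self.ft_encoder(ft_signal)
        # Force features only matter when contact is imminent or ongoing.
        gated_ft = p_contact * ft_feat
        return self.fuse(torch.cat([vis_feat, gated_ft], dim=-1))

fusion = ContactGatedFusion()
out = fusion(torch.randn(8, 256), torch.randn(8, 6))
print(out.shape)  # torch.Size([8, 256])
```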
MBA is a plug-and-play module that cascades two diffusion stages: object motion generation followed by motion-guided robot action generation. Integrated into existing policies, it consistently improves manipulation performance across various tasks.
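A rough picture of this cascade, with plain feed-forward heads standing in for the two diffusion stages (module names and dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

class MotionThenAction(nn.Module):
    """Cascade: predict future object motion first, then condition actions on it."""

    def __init__(self, obs_dim=128, motion_dim=32, act_dim=7):
        super().__init__()
        # Stage 1: object motion head (a diffusion model in the paper).
        self.motion_head = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, motion_dim))
        # Stage 2: robot action head conditioned on the predicted motion.
        self.action_head = nn.Sequential(
            nn.Linear(obs_dim + motion_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim))

    def forward(self, obs):
        motion = self.motion_head(obs)             # where the object should go
        action = self.action_head(torch.cat([obs, motion], dim=-1))
        return motion, action

model = MotionThenAction()
motion, action = model(torch.randn(4, 128))
print(motion.shape, action.shape)  # torch.Size([4, 32]) torch.Size([4, 7])
```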
CAGE is a data-efficient generalizable policy utilizing visual foundation models and causal attention. With 50 demonstrations in a single domain, it generalizes to unseen backgrounds, objects, and viewpoints while outperforming prior methods.
S2I is a segment-level selection and optimization framework for mixed-quality demonstrations that plugs into existing policies. Using only a few expert references, it improves downstream performance and makes suboptimal data more usable.
RISE is an end-to-end imitation policy that predicts continuous actions directly from single-view point clouds. With only 50 demonstrations per task, it outperforms representative 2D and 3D baselines in accuracy, efficiency, and generalization.
We contribute Open X-Embodiment, a 1M+ trajectory real-robot dataset spanning 22 embodiments, plus large RT-X models trained at scale. This breadth enables strong cross-embodiment co-training gains and advances robotic foundation models.
AirExo is a low-cost portable dual-arm exoskeleton for joint-level teleoperation and in-the-wild demonstration collection. Pre-training with scalable in-the-wild data improves sample efficiency and robustness.
RH20T is a real-world dataset of 110k+ sequences across diverse skills, robots, viewpoints, and contexts, with synchronized visual, force, audio, tactile, and action signals. Its scale and multimodal quality make it a strong training source for one-shot and generalizable manipulation.
We propose a flexible handover framework with real-time robust grasp-trajectory generation and future grasp prediction. This improves adaptability to dynamic handover scenes and raises success rates on moving-object grasps.
We reformulate reactive grasping around target-referenced semantic consistency rather than only temporal smoothness. Tracking in generated grasp spaces improves grasp reliability for dynamic objects.
AnyGrasp is a unified model for static and dynamic general grasping that predicts accurate dense full-DoF grasps efficiently. It remains robust under severe depth noise, improving real-world deployment reliability.
TransCG is a large-scale real-world benchmark for transparent object depth completion. We also propose DFNet, a lightweight baseline for this task. Together they close a key sensing gap and improve perception of transparent objects.
We propose graspness, a geometry-driven grasp quality measure for identifying graspable regions in clutter via look-ahead search. A learned graspness predictor enables fast, accurate grasp detection in practice.
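A toy sketch of the graspness intuition (the antipodal sampling rule below is an illustrative stand-in; the actual measure uses a look-ahead search over grasp poses plus a learned predictor):

```python
import numpy as np

def graspness_scores(points, normals, gripper_width=0.08, n_samples=32):
    """Toy graspness: fraction of sampled antipodal checks that pass per point."""
    scores = np.zeros(len(points))
    for i, (p, n) in enumerate(zip(points, normals)):
        hits = 0
        for _ in range(n_samples):
            j = np.random.randint(len(points))   # candidate opposing contact
            d = points[j] - p
            dist = np.linalg.norm(d)
            if 1e-6 < dist < gripper_width:
                # Antipodal check: outward normals roughly opposed along the axis.
                axis = d / dist
                if np.dot(n, axis) < -0.7 and np.dot(normals[j], axis) > 0.7:
                    hits += 1
        scores[i] = hits / n_samples
    return scores

# Example: a 3 cm-radius sphere, whose outward normals are the unit positions.
pts = np.random.randn(200, 3)
pts = 0.03 * pts / np.linalg.norm(pts, axis=1, keepdims=True)
scores = graspness_scores(pts, pts / np.linalg.norm(pts, axis=1, keepdims=True))
```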
We propose learning "jargons" such as "ResNet" and "YOLO" from academic paper citation information, where a citation can be regarded as the search result for the corresponding jargon. For example, a search for "ResNet" should return "Deep Residual Learning for Image Recognition", rather than papers that merely contain the word "ResNet" in their titles, as current scholar search engines commonly do.
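A minimal sketch of this idea, assuming citation records are available as (jargon, cited paper) pairs (the data and helper names here are made up for illustration):

```python
from collections import Counter, defaultdict

# Toy index: each citing mention of a jargon term votes for the paper
# it cites; searching a jargon returns the top-voted paper.
citations = [
    ("ResNet", "Deep Residual Learning for Image Recognition"),
    ("ResNet", "Deep Residual Learning for Image Recognition"),
    ("YOLO", "You Only Look Once: Unified, Real-Time Object Detection"),
]

index = defaultdict(Counter)
for jargon, cited_paper in citations:
    index[jargon][cited_paper] += 1

def search(jargon: str) -> str:
    return index[jargon].most_common(1)[0][0]

print(search("ResNet"))  # -> "Deep Residual Learning for Image Recognition"
```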
Academic Services
Reviewer for Conferences:
IEEE International Conference on Robotics and Automation (ICRA), 2023, 2024, 2025, 2026
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, 2024, 2025, 2026
Conference on Robot Learning (CoRL), 2025, 2026
International Conference on Learning Representations (ICLR), 2025
Advances in Neural Information Processing Systems (NeurIPS), 2025, 2026
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Reviewer for Journals:
IEEE Robotics and Automation Letters (RA-L)
IEEE Transactions on Cybernetics (T-CYB)
IEEE/ASME Transactions on Mechatronics (T-MECH)
IEEE Transactions on Automation Science and Engineering (T-ASE)
Talks
[Mar. 2024] Echo AI Talk. Towards Efficient Robot Imitation Learning from Human Demonstrations. Thanks to Zhenfei Yin for the invitation. [Replay]
[Nov. 2024] Zhixingxing (智猩猩). Towards Efficient Robot Imitation Learning from Human Demonstrations. [Replay]
[Mar. 2025] THU Yang Gao Group. Towards Generalizable Imitation Learning from Human Demonstrations. Thanks to Chuan Wen for the invitation.
[Aug. 2025] 3DCVer (3D视觉工坊). Towards Generalizable Imitation Learning from Human Demonstrations. [Replay]
[Aug. 2025] Sharpa. Towards Generalizable Imitation via Scalable Data and Robust Policy. Thanks to Kaifeng Zhang for the invitation.
[Sept. 2025] Galbot. Building Capable and Generalizable Policies for Robotic Manipulation. Thanks to Caowei Meng for the invitation.
[Sept. 2025] Shenlanxueyuan (深蓝学院). Towards Generalizable Imitation via Scalable Data and Robust Policy.
On this page, I share some of my notes from courses I took in graduate school. More notes from my undergraduate studies can be found in this repository.