Rajeev Jain

Principal Software Engineer · ML Infrastructure · HPC · Scientific Computing

At Argonne National Laboratory, with a joint appointment at the University of Chicago, I design and engineer production software systems at the intersection of scientific computing, ML infrastructure, and high-performance computing. My focus is the gap between prototype and production: parallel training on new accelerator hardware, I/O that doesn’t bottleneck at scale, and Python platforms that teams can actually maintain.

Work

  • UXarray · Lead developer · open-source climate analysis
    Python library for analyzing climate data on unstructured grids, a standard tool for DOE labs, NCAR, and universities working with MPAS, ICON, SAM, and next-generation meshes. Conservative zonal averaging via Gauss-Legendre quadrature; grid I/O for ESMF, MPAS, SCRIP, and HEALPix; an MCP server that lets AI agents explore datasets on both local machines and HPC systems. Docs · GitHub · MCP article
  • Pangu-Weather on Aurora · 60,000+ Intel GPUs · Argonne Leadership Computing Facility
    PyTorch reimplementation of Pangu-Weather using the Spherical Fourier Neural Operator (SFNO) for DOE exascale Earth system modeling. First stable, portable DDP baseline on Aurora: PMIX/PALS environment mapping, XPU/CUDA device branching, and device-aware mixed precision (gradient scaling on CUDA, bf16 on Intel XPU). Article
  • CANDLE / IMPROVE · Core contributor · R&D 100 Award 2023
    Hyperparameter optimization (HPO) and benchmarking infrastructure for cancer drug response models, supporting 15+ researchers across Argonne, LLNL, and ORNL. 10,000+ training experiments on Summit, Theta, and Cori, orchestrated with Parsl and Swift/T. Published in Briefings in Bioinformatics, 2025.
  • FLASH-X · I/O and compression lead · R&D 100 Award 2022
    Checkpoint and restart redesign for a million-line multiphysics engine. Async HDF5 with Argobots plus SZ3/ZFP compression: 40–70% checkpoint overhead reduction and 50%+ storage savings on Summit. Restart across AMReX and Paramesh checkpoints, removing a hard constraint that forced a run to start over from scratch when switching between the two AMR mesh packages. SC24 paper
  • MeshKit · PI and software lead · DOE NEAMS · 2009–2016
    Open-source C++ toolkit for automated nuclear reactor core mesh generation. Parallel CoreGen: 712 processors, 101 million hexahedral elements, 14 GB MONJU reactor mesh in under 7 minutes — a job the serial path couldn’t run at all. Blog post · Source
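The UXarray entry mentions conservative zonal averaging via Gauss-Legendre quadrature. As a rough illustration of the underlying idea only (not UXarray's implementation, which works on the faces of an unstructured mesh), the sketch below computes an area-weighted mean of an analytic field f(lat, lon) over a latitude band: Gauss-Legendre nodes in latitude, each weighted by the spherical area factor cos(lat), plus a uniform midpoint rule in longitude. The function names `gauss_legendre` and `band_mean` and their parameters are hypothetical.

```python
import math


def gauss_legendre(n):
    """Return Gauss-Legendre nodes and weights on [-1, 1] via Newton iteration on P_n."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess for the i-th root
        for _ in range(100):
            # Three-term recurrence: after the loop, p1 = P_n(x) and p0 = P_{n-1}(x)
            p0, p1 = 1.0, x
            for k in range(2, n + 1):
                p0, p1 = p1, ((2 * k - 1) * x * p1 - (k - 1) * p0) / k
            dp = n * (x * p1 - p0) / (x * x - 1.0)  # derivative P_n'(x)
            step = p1 / dp
            x -= step
            if abs(step) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def band_mean(f, lat_min, lat_max, n_lat=16, n_lon=64):
    """Area-weighted mean of f(lat, lon) over lat_min <= lat <= lat_max (radians)."""
    nodes, weights = gauss_legendre(n_lat)
    half = 0.5 * (lat_max - lat_min)
    mid = 0.5 * (lat_max + lat_min)
    dlon = 2.0 * math.pi / n_lon
    num = den = 0.0
    for x, w in zip(nodes, weights):
        lat = mid + half * x                # map the node from [-1, 1] into the band
        area_w = w * half * math.cos(lat)   # quadrature weight times spherical area factor
        # Zonal (longitude) average at this latitude, midpoint rule on a periodic domain
        zonal = sum(f(lat, (j + 0.5) * dlon) for j in range(n_lon)) / n_lon
        num += area_w * zonal
        den += area_w
    return num / den
```

For example, `band_mean(lambda lat, lon: math.sin(lat), 0.0, math.pi / 2)` comes out at approximately 0.5, the exact value of the integral of sin(lat)cos(lat) divided by the integral of cos(lat) over the northern hemisphere. Because every band is weighted by its true area, band means recombine into the global mean, which is the property "conservative" refers to here.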
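The Pangu-Weather entry lists PMIX/PALS environment mapping and device-aware mixed precision. A minimal sketch of both ideas follows, kept free of PyTorch so it runs anywhere. The launcher variable names tried in order (`PMIX_RANK`, `PALS_RANKID`, `PALS_LOCAL_RANKID`, `PMI_RANK`, `PMI_SIZE`, `MPI_LOCALRANKID`) are assumptions to check against your site's documentation, and the fp16-on-CUDA choice is an assumption inferred from "gradient scaling on CUDA"; `ddp_env` and `precision_policy` are hypothetical helpers.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumed launcher variable names, tried in order; verify against your system's docs.
RANK_KEYS = ("RANK", "PMIX_RANK", "PALS_RANKID", "PMI_RANK")
LOCAL_RANK_KEYS = ("LOCAL_RANK", "PALS_LOCAL_RANKID", "MPI_LOCALRANKID")
WORLD_SIZE_KEYS = ("WORLD_SIZE", "PMI_SIZE")


def _first_int(env: Mapping[str, str], keys: Tuple[str, ...], default: int) -> int:
    """Return the first variable in `keys` that is set, converted to int."""
    for key in keys:
        if key in env:
            return int(env[key])
    return default


def ddp_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Map launcher-specific variables onto the RANK / LOCAL_RANK / WORLD_SIZE
    names that torchrun-style PyTorch scripts expect."""
    env = os.environ if env is None else env
    return {
        "RANK": str(_first_int(env, RANK_KEYS, 0)),
        "LOCAL_RANK": str(_first_int(env, LOCAL_RANK_KEYS, 0)),
        "WORLD_SIZE": str(_first_int(env, WORLD_SIZE_KEYS, 1)),
    }


def precision_policy(device_type: str) -> Tuple[str, bool]:
    """Return (autocast dtype name, whether to use gradient scaling) for a device type."""
    if device_type == "cuda":
        return "float16", True    # fp16 has a narrow exponent range: scale the loss
    if device_type == "xpu":
        return "bfloat16", False  # bf16 shares fp32's exponent range: no scaler needed
    return "float32", False       # CPU or unknown device: stay in full precision
```

A training script would then call `os.environ.update(ddp_env())`, set `MASTER_ADDR` and `MASTER_PORT`, and select `xpu:{LOCAL_RANK}` or `cuda:{LOCAL_RANK}` before `torch.distributed.init_process_group`, whose `env://` initialization reads `RANK` and `WORLD_SIZE`. Which collective backend to use on XPU depends on the installed PyTorch build and is deliberately not shown.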
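The CANDLE / IMPROVE entry describes HPO sweeps run as many independent training experiments with Parsl and Swift/T. Neither tool is used here; as a stand-in, the sketch below runs a seeded random search with Python's standard `concurrent.futures`, treating each trial as an independent task. The search space, the `train_and_score` objective, and the other function names are hypothetical placeholders for a real drug-response model run.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search space for a drug-response model.
LR_RANGE = (1e-4, 1e-1)          # learning rate, sampled log-uniformly
BATCH_SIZES = (32, 64, 128, 256)
DROPOUT_RANGE = (0.0, 0.5)


def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the search space."""
    lo, hi = LR_RANGE
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "batch_size": rng.choice(BATCH_SIZES),
        "dropout": rng.uniform(*DROPOUT_RANGE),
    }


def train_and_score(config: dict) -> float:
    """Hypothetical stand-in for one training run; returns a validation loss."""
    return ((math.log10(config["learning_rate"]) + 2.5) ** 2
            + (config["dropout"] - 0.2) ** 2
            + 0.01 * math.log2(config["batch_size"] / 32))


def random_search(n_trials: int, max_workers: int = 4, seed: int = 0) -> list:
    """Run n_trials independent trials concurrently; return them sorted by loss."""
    rng = random.Random(seed)
    # Sample every config before dispatch so results do not depend on completion order.
    configs = [sample_config(rng) for _ in range(n_trials)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(train_and_score, configs))
    trials = [{"config": cfg, "loss": loss} for cfg, loss in zip(configs, losses)]
    return sorted(trials, key=lambda t: t["loss"])
```

On a cluster, the `pool.map` call is the piece a workflow system would replace, dispatching each trial to a compute node rather than a local thread; the sampling and ranking logic around it stays the same.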

Selected papers

Full list on Google Scholar · 22+ publications

Recent talks

Writing

Recognition

  • R&D 100, 2023 CANDLE — cancer AI infrastructure across Argonne, LLNL, and ORNL
  • R&D 100, 2022 FLASH-X — multiphysics simulation engine
  • IMR 2010 Best Paper — reactor core mesh generation with lattice hierarchy encoding
  • ATPESC 2015 Scholar — Argonne training program on extreme-scale computing

Funding

  • Active DOE SEATS — Software Ecosystem for Advancing Climate Tools and Services
  • Active NSF Raijin — collaborative research in climate model analysis
  • 2017–2023 DOE ECP CANDLE — core contributor
  • 2009–2016 DOE NEAMS — principal investigator, MeshKit

Roles

  • 2009–present Argonne National Laboratory — Principal Specialist in Research Software Engineering
  • 2023–present University of Chicago — Staff At-Large, cancer pharmacogenomics and Earth system science
  • 2007–2009 Arizona State University — Research and teaching assistant, structural and computational mechanics

Education

  • 2020 M.S. Computer Science — University of Chicago
  • 2009 M.S. Structural Engineering — Arizona State University
  • 2006 B.Tech. Mechanical Engineering — IIT ISM Dhanbad