Shaurya Rohatgi

Applied Scientist | LLM & IR Researcher | PhD in Informatics

I specialize in building state-of-the-art retrieval systems and large language models for scientific applications.

AI & LLM Research Expertise ✨

I work at the intersection of language models and scientific discovery, developing systems that make AI more useful, reliable, and accessible.

🔧 LLM Systems

Expertise in fine-tuning, optimizing, and deploying large language models with a focus on efficiency and performance.

🔍 Retrieval Systems

Specialized in building advanced search and retrieval systems that integrate with language models to enhance factual accuracy.

🤖 AI Agents

Designing AI systems that autonomously plan, reason, and execute complex tasks by combining tools with language models.

🧪 Scientific AI

Building AI systems that accelerate research through literature analysis, hypothesis generation, and experimental assistance.

Experience 💼

2023 - Present
AllSci

Applied Scientist

Leading the development of advanced LLM systems and RAG pipelines for scientific applications, with a focus on parameter-efficient fine-tuning and improving factual accuracy in domain-specific tasks.

2023
University of Chicago

Computational Scientist

Designed smaRT, an AI system using transformer models for automated ticket classification and resolution with 92% accuracy in intent recognition.

2021 - 2022
Allen Institute

Research Intern

Developed S2AMP, a dataset for academic mentorship prediction, and engineered a large-scale paper clustering system using BERT-based embeddings. Contributed to the Semantic Scholar platform's ML pipelines.

2018 - 2023
Penn State University

Research Assistant

Led the MathSeer project implementing transformer models for mathematical formula search, developed neural ranking models for scientific document retrieval, and enhanced the CiteSeerX digital library with ML-based document classification.

2014 - 2017
Tata Research

Researcher

Built dialogue-based natural language understanding systems and developed clustering algorithms for email categorization with 87% classification accuracy.

Education 🎓

Pennsylvania State University

PhD - Information Sciences and Technology

August 2017 - May 2023

Thesis: Design and Data Mining Techniques for Large-Scale Scholarly Digital Libraries and Search Engines

Indian Institute of Information Technology and Management

Integrated Post Graduate - Information Technology

July 2009 - June 2014

Selected Publications 📚

For the full list of publications, please visit Google Scholar or Semantic Scholar.

Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023

Hsu, T. Y. E., Hsu, Y. L., Rohatgi, S., Huang, C. Y., Ng, H. Y. S., Rossi, R., Kim, S., Yu, T., et al. (2025)

arXiv preprint arXiv:2501.19353

Fighting fire with fire: The dual role of LLMs in crafting and detecting elusive disinformation

Lucas, J., Uchendu, A., Yamashita, M., Lee, J., Rohatgi, S., Lee, D. (2023)

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The ACL OCL Corpus: Advancing Open Science in Computational Linguistics

Rohatgi, S., Qin, Y., Aw, B., Unnithan, N., Kan, M. Y. (2023)

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)

S2AMP: A high-coverage dataset of scholarly mentorship inferred from publications

Rohatgi, S., Downey, D., King, D., Feldman, S. (2022)

Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries

Accelerating substructure similarity search for formula retrieval

Zhong, W., Rohatgi, S., Wu, J., Giles, C. L., Zanibbi, R. (2020)

Advances in Information Retrieval: 42nd European Conference on IR Research (ECIR)

Featured Projects 🚀

Llama-hub GitHub repository interface

Llama-index

Engineered a plugin-based tool for integrating custom data sources with large language models (LLMs). The project has garnered over 2.3k stars on GitHub.

Refstudio

Contributed to a specialized text editor for reference-heavy writing, with built-in language model support to streamline the academic writing workflow.

S2QA: Research Question Answering

Built a research Q&A tool that uses Semantic Scholar and GPT-4 to answer questions with citations drawn from published research papers.

Skills 💻

⚡ CUDA driver installation ninja: I do it right on the first try, no explosions! ⚡

Technologies:

PyTorch, TensorFlow, Elasticsearch, Docker, Kubernetes, Apache Airflow, SageMaker

LLM Tools:

Hugging Face, vLLM, LangChain, LlamaIndex, PEFT, RAG pipelines, LLMOps

Programming:

Python, Java, C++, SQL, PySpark

Systems:

Linux, AWS, Git, CI/CD

Awards and Certifications 🏆

Academic Service 🤝

Program Committee member for conferences including TheWebConf, JCDL, CLEF, and AI4SG. Regularly review papers for top-tier NLP and IR venues.