
Shaurya Rohatgi
Applied Scientist | LLM & IR Researcher | PhD in Informatics
I specialize in building state-of-the-art retrieval systems and large language models for scientific applications.
AI & LLM Research Expertise
I work at the intersection of language models and scientific discovery, developing systems that make AI more useful, reliable, and accessible.
LLM Systems
Expertise in fine-tuning, optimizing, and deploying large language models with a focus on efficiency and performance.
Retrieval Systems
Specialized in building advanced search and retrieval systems that integrate with language models to enhance factual accuracy.
AI Agents
Designing AI systems that autonomously plan, reason, and execute complex tasks using a combination of tools and language models.
Scientific AI
Building AI systems that accelerate research through literature analysis, hypothesis generation, and experimental assistance.
Experience

Applied Scientist
Leading the development of advanced LLM systems and RAG pipelines for scientific applications, with a focus on parameter-efficient fine-tuning and improving factual accuracy in domain-specific tasks.

Computational Scientist
Designed smaRT, an AI system using transformer models for automated ticket classification and resolution with 92% accuracy in intent recognition.

Research Intern
Developed S2AMP, a dataset for academic mentorship prediction, and engineered a large-scale paper clustering system using BERT-based embeddings. Contributed to the Semantic Scholar platform's ML pipelines.

Research Assistant
Led the MathSeer project implementing transformer models for mathematical formula search, developed neural ranking models for scientific document retrieval, and enhanced the CiteSeerX digital library with ML-based document classification.

Researcher
Built dialogue-based natural language understanding systems and developed clustering algorithms for email categorization with 87% classification accuracy.
Education

Pennsylvania State University
PhD - Information Sciences and Technology
August 2017 - May 2023
Thesis: Design and Data Mining Techniques for Large-Scale Scholarly Digital Libraries and Search Engines

Indian Institute of Information Technology and Management
Integrated Post Graduate - Information Technology
July 2009 - June 2014
Selected Publications
For the full list of publications, please visit Google Scholar or Semantic Scholar.
Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023
Hsu, T. Y. E., Hsu, Y. L., Rohatgi, S., Huang, C. Y., Ng, H. Y. S., Rossi, R., Kim, S., Yu, T., et al. (2025)
arXiv preprint arXiv:2501.19353
Fighting fire with fire: The dual role of LLMs in crafting and detecting elusive disinformation
Lucas, J., Uchendu, A., Yamashita, M., Lee, J., Rohatgi, S., Lee, D. (2023)
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)
The ACL OCL Corpus: Advancing Open Science in Computational Linguistics
Rohatgi, S., Qin, Y., Aw, B., Unnithan, N., Kan, M. Y. (2023)
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)
S2AMP: A high-coverage dataset of scholarly mentorship inferred from publications
Rohatgi, S., Downey, D., King, D., Feldman, S. (2022)
Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
Accelerating substructure similarity search for formula retrieval
Zhong, W., Rohatgi, S., Wu, J., Giles, C. L., Zanibbi, R. (2020)
Advances in Information Retrieval: 42nd European Conference on IR Research (ECIR)
Featured Projects

Llama-index
Engineered a plugin-based tool for integrating custom data sources with large language models (LLMs). The project has garnered over 2.3k stars on GitHub.

Refstudio
Contributed to a specialized text editor for reference-heavy writing that incorporates language model support to streamline the academic writing workflow.

S2QA: Research Question Answering
Pioneered a research Q&A tool employing Semantic Scholar and GPT-4 to provide authoritative answers drawn from top-tier research papers.
Skills
CUDA driver installation ninja: I do it right on the first try, no explosions!
Technologies:
PyTorch, TensorFlow, Elasticsearch, Docker, Kubernetes, Apache Airflow, SageMaker
LLM Tools:
Hugging Face, vLLM, LangChain, LlamaIndex, PEFT, RAG pipelines, LLMOps
Programming:
Python, Java, C++, SQL, PySpark
Systems:
Linux, AWS, Git, CI/CD
Awards and Certifications
Winner, Nittany AI Challenge 2018 - ProFound: A Professor Search Engine
Team Lead. The project received $17,500 in funding at The Pennsylvania State University.
Academic Service
Program Committee member for conferences including TheWebConf, JCDL, CLEF, and AI4SG. Regularly review papers for top-tier NLP and IR venues.