Yogesh Kulkarni

I am a Computer Science Ph.D. student at the School of Computing and Augmented Intelligence (SCAI), Arizona State University, advised by Dr. Pooyan Fazli and part of the People and Robots Laboratory (PeRL). My research revolves around extending Multimodal Large Language Models (MLLMs) for Video Understanding and aligning them with human preferences using Reinforcement Learning.

Previously, I graduated from the University of Southern California (USC) with a Master's in Computer Science. At USC, I was a Graduate Research Assistant at the USC Institute for Creative Technologies (ICT), where I worked with 3D Point Clouds, particularly at the intersection of GANs, Diffusion Models, and Gaussian Splatting for style transfer. In Summer 2023, I interned at Nokia Bell Labs, where I contributed to efficient geo-distributed LLM training across heterogeneous clusters.

My journey began with a Bachelor's in Computer Engineering from the Pune Institute of Computer Technology. I grew up in New Delhi, India.

Email  /  CV  /  LinkedIn  /  Scholar  

Research

My research aims to enhance the reasoning capabilities of large video-language models, particularly in understanding complex spatial relationships and temporal sequences. A central theme of my work is post-training refinement of existing state-of-the-art (SOTA) models through simple on-policy Reinforcement Learning (e.g., Direct Preference Optimization (DPO)).

VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni, Pooyan Fazli
arXiv, 2025
project page / arXiv

This paper introduces VideoPASTA, which improves video models by training them with specially crafted "bad examples" (adversarial preference pairs) that target common errors in spatial, temporal, and cross-frame understanding. It shows this targeted approach is highly efficient, achieving significant performance gains using only 7k preference pairs.

VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni, Pooyan Fazli
arXiv, 2025
project page / arXiv

This paper presents VideoSAVi, a method to teach video models better spatial and temporal reasoning without needing human supervision. It works by having the model critique its own reasoning errors to automatically create preference data for training, achieving strong results on benchmarks efficiently.

EnsembleNTLDetect: An Intelligent Framework for Electricity Theft Detection in Smart Grid
Yogesh Kulkarni, Sayf Hussain Z, Krithi Ramamritham, Nivethitha Somu
ICDM (workshops), 2021
DOI

This paper introduces "EnsembleNTLDetect," a robust framework designed to detect electricity theft in smart grids using consumer energy data. It combines techniques for handling missing data, data imbalance, and high dimensionality with an ensemble machine learning model that identifies theft patterns with high accuracy relative to existing methods.

Kryptonite: An Adversarial Attack Using Regional Focus
Yogesh Kulkarni, Krisha Bhambani
ACNS, 2021
Springer

This paper proposes "Kryptonite," an efficient adversarial attack that fools image classifiers (especially for medical images) by adding tiny, hard-to-see noise mainly to the most important part of the image (Region of Interest). It causes significant misclassification with less image distortion compared to other methods.

Intensive Image Malware Analysis and Least Significant Bit Matching Steganalysis
Yogesh Kulkarni, Anurag Gorkar
IEEE Big Data, 2020
DOI

This paper introduces "AnImAYoung," a framework to detect malware hidden within images, whether the code is embedded in metadata (EXIF) or concealed through steganography (LSB Matching). It analyzes image data and metadata for suspicious code and uses an efficient machine learning ensemble to detect hidden data in image pixels. The system is designed to be fast and accurate, making it suitable for analyzing large volumes of image data.

Miscellanea

Academic Service


Reviewer, ARR (ACL Rolling Review) 2025
Reviewer, ICCV 2025
Reviewer, CVPR 2025

Teaching

Graduate Teaching Associate, Arizona State University
CSE 485: Computer Science Capstone I, Spring 2025
CSE 240: Intro to Programming Languages, Fall 2024, Spring 2025
CSE 220: Programming for Computer Engineering, Fall 2024

Website's original source code.