Umar Khalid

PhD in Computer Engineering

AI Research Scientist specializing in Computer Vision, Generative AI, and Diffusion Models

San Francisco, CA · Axon Enterprises
Nov 15, 2024: "EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing" accepted at AAAI Creative AI for Live Interactive Performances 2025.

About Me

I am an AI Research Scientist II at Axon Enterprises, where I specialize in computer vision and applied machine learning with a strong focus on AI security, Federated Learning, and Generative AI. I earned my Ph.D. in Computer Engineering from the University of Central Florida in 2024, where my research focused on the effective and efficient use of diffusion models for editing in computer vision.

My research has advanced AI-driven 3D and 4D content generation, including work with Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS). I have developed cutting-edge AI solutions including Image-to-Video (I2V), Text-to-Video (T2V), Text-to-Image (T2I), and LipSync technologies, enabling rapid, high-quality content creation for AR/VR applications.

Previously, I worked as a Generative AI Researcher at Samsung Research America, completed internships at Meta and Microsoft, and worked as a Computer Vision Algorithm Developer in Shanghai. I have published multiple papers at top-tier conferences including ECCV, CVPR, MICCAI, and ICRA.

Research Interests

Generative AI: Diffusion models, T2I, I2V, T2V generation

3D/4D Vision: NeRFs, 3D Gaussian Splatting

Scene Editing: Text-driven 3D editing

AR/VR: Immersive visualization

AI Security: Federated Learning, OOD detection

Multimodal AI: LLMs with vision & speech

Industry Experience

Dec 2024 - Present

AI Research Scientist II

Axon Enterprises Inc.

Palo Alto, California

Leading AI research initiatives in computer vision and multimodal AI systems for public safety applications.

May 2024 - Aug 2024

Machine Learning Intern

Meta Inc.

Menlo Park, California

Developed advanced generative AI models for content creation and editing.

Sep 2023 - May 2024

Generative AI Researcher

Samsung Research America

Cambridge, Massachusetts

Research on diffusion models and generative AI for mobile applications.

May 2023 - Aug 2023

Machine Learning Intern

Microsoft Inc.

Redmond, Washington

Worked on AI-powered features for enterprise applications.

May 2022 - Aug 2022

Machine Learning Intern

CHEP Inc.

Orlando, Florida

2019 - 2021

Computer Vision Algorithm Developer

MengBao Inc.

Shanghai, China

2015 - 2016

Machine Learning Software Engineer

PacSquare Inc.

Islamabad, Pakistan

Education

2019 - 2024

PhD in Computer Engineering

University of Central Florida

Orlando, Florida

Dissertation: "Effective and Efficient Use of Diffusion Models for Editing in Computer Vision"

2019 - 2022

MS in Computer Engineering

University of Central Florida

Orlando, Florida

2016 - 2018

MS in Computer Science

Shanghai Jiao Tong University

Shanghai, China

2011 - 2015

BE in Electrical Engineering

National University of Sciences and Technology (NUST)

Islamabad, Pakistan

Publications

AAAI 2025

EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing

Umar Khalid, Kashif Munir, Hasan Iqbal, Azib Farooq, Jing Hua, Nazanin Rahnavard, Chen Chen, Victor Zhu, Zhengping Ji

AAAI Creative AI for Live Interactive Performances 2025

A system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts, using a reflective reasoning framework combined with Chain-of-Thought reasoning.

Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to infer the underlying intent within a reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM's key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision.
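
As a rough illustration of the alignment idea (not EVLM's implementation; the function name, shapes, and loss form below are all assumptions), one can push a model's distribution over candidate edit rationales toward a soft target built from graded human ratings by minimizing a KL divergence:

```python
# Hypothetical sketch of a KL-divergence target-alignment step in the
# spirit of RKTO as summarized above; names, shapes, and the exact loss
# form are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def kl_alignment_loss(model_logits: torch.Tensor,
                      human_ratings: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """model_logits: (batch, k) model scores for k candidate edit
    rationales; human_ratings: (batch, k) graded human scores."""
    # Soft target distribution from graded (non-binary) human ratings.
    target = F.softmax(human_ratings / temperature, dim=-1)
    log_probs = F.log_softmax(model_logits, dim=-1)
    # KL(target || model): nudges the model's ranking of rationales
    # toward the human-rated one without 0/1 supervision.
    return F.kl_div(log_probs, target, reduction="batchmean")

# Toy example: 2 instructions, 3 candidate rationales each.
logits = torch.randn(2, 3, requires_grad=True)
ratings = torch.tensor([[4.0, 2.0, 1.0], [3.0, 3.0, 5.0]])
kl_alignment_loss(logits, ratings).backward()
```

Using a soft target rather than a hard label is what lets graded, fine-grained preferences shape the model, matching the "without relying on binary supervision" point above.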

ECCV 2024

LatentEditor: Text Driven Local Editing of 3D Scenes

Umar Khalid*, Hasan Iqbal*, Nazmul Karim, Muhammad Tayyab, Jing Hua, Chen Chen

European Conference on Computer Vision (ECCV) 2024

An innovative framework for precise and locally controlled editing of neural fields using text prompts, leveraging denoising diffusion models for faster and more adaptable NeRF editing.

While neural fields have made significant strides in view synthesis and scene reconstruction, editing them poses a formidable challenge due to their implicit encoding of geometry and texture information from multi-view inputs. In this paper, we introduce LatentEditor, an innovative framework designed to empower users with the ability to perform precise and locally controlled editing of neural fields using text prompts. Leveraging denoising diffusion models, we successfully embed real-world scenes into the latent space, resulting in a faster and more adaptable NeRF backbone for editing compared to traditional methods.
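
To make the latent-space idea concrete, here is a minimal sketch (assuming the diffusers library and the stabilityai/sd-vae-ft-mse checkpoint; this is not LatentEditor's pipeline) of embedding a rendered view into a diffusion autoencoder's latent space, where a 256×256 image becomes a 32×32×4 latent that is much cheaper to supervise and edit:

```python
# Minimal sketch: embed views into a latent diffusion autoencoder's space,
# where supervision and editing are far cheaper than in pixel space.
# Checkpoint choice is an assumption; this is not LatentEditor's code.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def to_latent(image: torch.Tensor) -> torch.Tensor:
    # image: (B, 3, H, W) in [-1, 1]  ->  latent: (B, 4, H/8, W/8)
    return vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

@torch.no_grad()
def from_latent(latent: torch.Tensor) -> torch.Tensor:
    return vae.decode(latent / vae.config.scaling_factor).sample

view = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in for a rendered view
z = to_latent(view)                          # (1, 4, 32, 32) latent "image"
print(z.shape, from_latent(z).shape)
```

Supervising a NeRF backbone against these compact 4-channel latents, rather than full-resolution pixels, is what makes the latent-space formulation faster to edit, as the abstract above describes.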

ECCV 2024

Free-Editor: Zero-shot Text-driven 3D Scene Editing

Nazmul Karim*, Hasan Iqbal*, Umar Khalid, Chen Chen, Jing Hua

European Conference on Computer Vision (ECCV) 2024

A novel training-free 3D scene editing technique that enables users to edit 3D scenes without model retraining, achieving 20x faster editing than SOTA methods.

Text-to-Image (T2I) diffusion models have recently gained traction for their versatility and user-friendliness in 2D content generation and editing. However, training a diffusion model specifically for 3D scene editing is challenging due to the scarcity of large-scale datasets. In this study, we introduce Free-Editor, a novel, training-free 3D scene editing technique that effectively addresses the issue of multi-view style inconsistency through the implementation of a single-view editing scheme.

ECCV 2024

3DEgo: 3D Editing on the Go!

Umar Khalid*, Hasan Iqbal*, Nazmul Karim, Azib Farooq, Chen Chen, Jing Hua

European Conference on Computer Vision (ECCV) 2024

A streamlined framework for directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts, utilizing 3D Gaussian Splatting.

We introduce 3DEgo to address a novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by overcoming the reliance on COLMAP and eliminating the cost of model initialization.

CVPR 2025

SPF-4D: A Progressive Sampling Framework for View Consistent 4D Editing

Umar Khalid, Nazmul Karim, Hasan Iqbal, Jing Hua, Chen Chen, Nazanin Rahnavard

CVPR 2025 (Under Review)

A progressive sampling framework for view-consistent 4D scene editing using diffusion models.

ICRA 2025

SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-driven Video Editing

Umar Khalid, Nazmul Karim, Mohsen Joneidi, Chen Chen, Nazanin Rahnavard

IEEE International Conference on Robotics and Automation (ICRA) 2025

A novel spectral-shift-aware adaptation framework for fine-tuning diffusion models for video editing with 10x faster training.
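
The title suggests a spectral-shift parameterization; a minimal sketch of that general idea (an SVDiff-style spectral fine-tuning pattern, assumed from the summary above rather than taken from the paper) keeps a pretrained weight's singular vectors frozen and trains only a shift on its singular values:

```python
# Sketch of spectral-shift fine-tuning: U and Vh stay frozen, and only a
# shift on the singular values is learned. Whether SAVE parameterizes
# exactly this way is an assumption based on the summary above.
import torch

W = torch.randn(64, 64)                          # frozen pretrained weight
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
delta = torch.zeros_like(S, requires_grad=True)  # the only trainable params

def adapted_weight() -> torch.Tensor:
    # Shifted singular values, kept non-negative via ReLU.
    return U @ torch.diag(torch.relu(S + delta)) @ Vh

x = torch.randn(8, 64)
y = x @ adapted_weight().T                       # adapted layer in a forward pass
loss = y.pow(2).mean()
loss.backward()
print(delta.grad.shape)                          # (64,): tiny update footprint
```

Training a 64-element shift instead of a 64×64 matrix is the kind of parameter reduction that makes fine-tuning claims like "10x faster training" plausible.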

MICCAI 2023

Unsupervised Anomaly Detection in Medical Images Using Masked Diffusion Model

Hasan Iqbal, Umar Khalid, Chen Chen, Jing Hua

International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2023

CVPR Workshop 2022

RODD: A Self-Supervised Approach for Robust Out-of-Distribution Detection

Umar Khalid, Ashkan Esmaeili, Nazmul Karim, Nazanin Rahnavard

CVPR Workshop on Robust Vision 2022

A self-supervised out-of-distribution (OOD) detection technique that maps in-distribution (ID) class embeddings into a one-dimensional subspace for efficient OOD detection.
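
A minimal sketch of the one-dimensional-subspace scoring idea (my reading of the summary above, not the official RODD code): take the first singular vector of each ID class's feature matrix as that class's 1-D subspace, then score a test sample by its maximum cosine similarity to any class direction.

```python
# Hypothetical first-singular-vector OOD scoring, following the
# one-dimensional-subspace idea summarized above (not the paper's code).
import numpy as np

def class_directions(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """First singular vector of each ID class's (n_i, d) feature matrix."""
    dirs = []
    for c in np.unique(labels):
        X = features[labels == c]
        # Leading right-singular vector spans the class's 1-D subspace.
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        dirs.append(vt[0])
    return np.stack(dirs)                       # (num_classes, d), unit rows

def id_score(x: np.ndarray, dirs: np.ndarray) -> float:
    """Higher = more in-distribution: max |cosine| to any class direction."""
    x = x / np.linalg.norm(x)
    return float(np.max(np.abs(dirs @ x)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
labs = rng.integers(0, 4, 200)
dirs = class_directions(feats, labs)
print(id_score(rng.normal(size=16), dirs))      # threshold this for OOD
```

Because each class is summarized by a single unit vector, scoring a test sample costs one small matrix-vector product, which is where the efficiency claim comes from.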

CVPR Workshop 2022

CNLL: A Semi-supervised Approach For Continual Noisy Label Learning

Nazmul Karim, Umar Khalid, Ashkan Esmaeili, Nazanin Rahnavard

CVPR Workshop on Continual Learning 2022

The first study to investigate semi-supervised learning in the continual-learning setting with noisy labels.

Get In Touch

I'm always interested in discussing research collaborations, speaking opportunities, or new projects in AI and computer vision.