Umar Khalid

AI Research Scientist II
Axon Enterprises Inc.
San Francisco, California, USA

PhD, Computer Engineering
Center for Research in Computer Vision (CRCV)
University of Central Florida

Email: umarkhalidai@outlook.com

Google Scholar  LinkedIn  GitHub  Résumé

I am a computer scientist specializing in computer vision and applied machine learning, with a strong focus on AI security, Federated Learning, and Generative AI. I have developed cutting-edge AI solutions, including Image-to-Video (I2V), Text-to-Video (T2V), Text-to-Image (T2I), and LipSync technologies, enabling rapid, high-quality content creation tailored to client needs. I have also developed multimodal systems that integrate Large Language Models (LLMs) with image, speech, and video data.

Experience

2024 – AI Research Scientist II – Axon Enterprises Inc. – Palo Alto, California
2024 – Machine Learning Intern – Meta Inc. – Menlo Park, California
2023 – 2024 – Generative AI Researcher – Samsung Research America – Cambridge, Massachusetts
2023 – Machine Learning Intern – Microsoft Inc. – Redmond, Washington
2022 – Machine Learning Intern – CHEP Inc. – Orlando, Florida
2021 – Adjunct Professor – Florida Polytechnic University – Lakeland, Florida
2019 – Computer Vision Algorithm Developer – MengBao Inc. – Shanghai, China
2015 – Machine Learning Software Engineer – PacSquare Inc. – Islamabad, Pakistan

Education

2024 – PhD Computer Engineering – University of Central Florida, Orlando
Dissertation Title: Effective and Efficient Use of Diffusion Models for Editing in Computer Vision
Advisors: Dr. Chen Chen, Dr. Nazanin Rahnavard

2022 – MS Computer Engineering – University of Central Florida, Orlando
2018 – MS Electrical Engineering – Shanghai Jiao Tong University, Shanghai
2014 – BE Electrical Engineering – National University of Sciences and Technology, Islamabad

Research Interests

Multimodal Learning, LLMs, Diffusion Models, Video Generation, 3D Scene Generation, Federated Learning, AI Security, Continual Learning

Selected Publications

SPF-4D: A Progressive Sampling Framework for View Consistent 4D Editing

CVPR, 2025 Submission
Umar Khalid, Nazmul Karim, Hasan Iqbal, Jing Hua, Chen Chen

We introduce SPF-4D, a framework designed to maintain both temporal and view consistency while editing dynamic 3D scenes. SPF-4D achieves this by leveraging progressive noise sampling during the forward diffusion phase and refining latents iteratively in the reverse diffusion phase. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. Additionally, to ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, SPF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views.
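
To give a concrete picture of the noise structure described above, here is a minimal sketch (not code from the paper) of how temporally correlated and cross-view noise could be composed; the mixing coefficients rho and gamma are illustrative assumptions.

import torch

def temporally_correlated_noise(num_frames, shape, rho=0.9):
    # AR(1)-style chain: each frame's noise depends on the previous frame's
    # noise while every marginal remains standard Gaussian.
    frames = [torch.randn(shape)]
    for _ in range(num_frames - 1):
        eps = torch.randn(shape)
        frames.append(rho * frames[-1] + (1 - rho ** 2) ** 0.5 * eps)
    return torch.stack(frames)  # (num_frames, *shape)

def cross_view_noise(num_views, shape, gamma=0.7):
    # Shared component captures content common to all views; the independent
    # component keeps view-specific detail.
    shared = torch.randn(shape)
    independent = torch.randn(num_views, *shape)
    return gamma * shared + (1 - gamma ** 2) ** 0.5 * independent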

EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing

CVPR, 2025 Submission
Umar Khalid, Hasan Iqbal, Jing Hua, Nazanin Rahnavard, Chen Chen

Editing complex visual content based on ambiguous instructions remains a challenging problem in vision-language modeling. While existing models can contextualize content, they often struggle to grasp the underlying intent within a reference image or scene, leading to misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system designed to interpret such instructions in conjunction with reference visuals, producing precise and context-aware editing prompts. Leveraging Chain-of-Thought (CoT) reasoning and the Kahneman-Tversky Optimization (KTO) alignment technique, EVLM captures subjective editing preferences without requiring binary labels.

Free-Editor: Zero-shot Text-driven 3D Scene Editing

ECCV, 2024 
Umar Khalid, Nazmul Karim, Hasan Iqbal, Jing Hua, Chen Chen

https://free-editor.github.io/

In our work, we propose Free-Editor, a novel training-free 3D scene editing technique that allows users to edit 3D scenes without re-training the model at test time. Our method avoids the multi-view style inconsistency issue of SOTA methods with the help of a single-view editing scheme. Specifically, we show that a particular 3D scene can be edited by modifying only a single view. To this end, we introduce an Edit Transformer that enforces intra-view consistency and inter-view style transfer through self-view and cross-view attention, respectively. Since it is no longer necessary to re-train the model or edit every view in a scene, both editing time and memory requirements are reduced significantly.
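
As a rough illustration of the self-view/cross-view attention idea, a block could be composed as below; this is a sketch under my own assumptions (layer sizes, residual wiring, and the use of torch.nn.MultiheadAttention are illustrative, not the paper's Edit Transformer implementation).

import torch

class EditBlock(torch.nn.Module):
    # Self-view attention promotes consistency within the view being rendered;
    # cross-view attention transfers style from the single edited reference view.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_tokens, edited_tokens):
        # target_tokens: features of the target view, shape (B, N, dim)
        # edited_tokens: features of the edited reference view, shape (B, M, dim)
        x, _ = self.self_attn(target_tokens, target_tokens, target_tokens)
        x = x + target_tokens
        y, _ = self.cross_attn(x, edited_tokens, edited_tokens)
        return x + y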

3DEgo: 3D Editing on the Go!

ECCV, 2024 
Umar Khalid,  Hasan Iqbal, Azib Farooq, Jing Hua, Chen Chen


Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by removing the reliance on COLMAP and eliminating the cost of model initialization. We apply a diffusion model to edit video frames prior to 3D scene creation, incorporating our noise blender module to enhance multi-view editing consistency, a step that requires no additional training or fine-tuning of T2I diffusion models. 3DEgo then utilizes 3D Gaussian Splatting to create 3D scenes from the multi-view consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data.

LatentEditor: Text Driven Local Editing of 3D Scenes

ECCV, 2024 
Umar Khalid, Nazmul Karim, Hasan Iqbal, Jing Hua, Chen Chen

https://latenteditor.github.io/

In this paper, we introduce LatentEditor, an innovative framework designed to empower users with the ability to perform precise and locally controlled editing of neural fields using text prompts. Leveraging denoising diffusion models, we successfully embed real-world scenes into the latent space, resulting in a faster and more adaptable NeRF backbone for editing compared to traditional methods. To enhance editing precision, we introduce a delta score to calculate the 2D mask in the latent space that serves as a guide for local modifications while preserving irrelevant regions. Our novel pixel-level scoring approach harnesses the power of InstructPix2Pix (IP2P) to discern the disparity between IP2P conditional and unconditional noise predictions in the latent space. The edited latents conditioned on the 2D masks are then iteratively updated in the training set to achieve 3D local editing. Our approach achieves faster editing speeds and superior output quality compared to existing 3D editing models, bridging the gap between textual instructions and high-quality 3D scene editing in latent space.
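
The delta-score masking step can be pictured with the short sketch below; it assumes the two IP2P noise predictions are already available and uses an illustrative normalization and threshold, so it is an approximation rather than the exact scoring rule from the paper.

import torch

def delta_score_mask(eps_cond, eps_uncond, threshold=0.5):
    # eps_cond / eps_uncond: IP2P noise predictions with and without the edit
    # instruction, in latent space, shape (B, C, H, W).
    delta = (eps_cond - eps_uncond).abs().mean(dim=1, keepdim=True)  # per-location edit score
    delta = (delta - delta.amin()) / (delta.amax() - delta.amin() + 1e-8)  # normalize to [0, 1]
    return (delta > threshold).float()  # binary 2D mask guiding local edits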

Unsupervised anomaly detection in medical images using masked diffusion model

MICCAI, 2023 
Umar Khalid,  Hasan Iqbal,  Jing Hua, Chen Chen

https://mddpm.github.io/

In this study, we present a method called masked-DDPM (mDDPM), which introduces masking-based regularization to reframe the generation task of diffusion models. Specifically, we introduce Masked Image Modeling (MIM) and Masked Frequency Modeling (MFM) in our self-supervised approach, enabling models to learn visual representations from unlabeled data. To the best of our knowledge, this is the first attempt to apply MFM in DDPM models for medical applications.
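
The two masking strategies can be sketched roughly as follows; this is an illustrative approximation, with patch size and mask ratios chosen arbitrarily rather than taken from the paper.

import torch
import torch.nn.functional as F

def masked_image(x, patch=8, mask_ratio=0.5):
    # MIM-style corruption: zero out a random subset of image patches.
    b, c, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=x.device) > mask_ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return x * keep

def masked_frequency(x, mask_ratio=0.5):
    # MFM-style corruption: drop a random subset of frequency components.
    freq = torch.fft.fft2(x)
    keep = (torch.rand_like(x) > mask_ratio).float()
    return torch.fft.ifft2(freq * keep).real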

SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-driven Video Editing

ICRA, 2025 
Umar Khalid, Nazmul Karim, Mohsen Joneidi, Chen Chen, Nazanin Rahnavard

We propose SAVE, a novel spectral-shift-aware adaptation framework, in which we fine-tune the spectral shift of the parameter space instead of the parameters themselves. Specifically, we take the spectral decomposition of the pre-trained T2I weights and only update the singular values while freezing the corresponding singular vectors. In addition, we introduce a spectral shift regularizer aimed at placing tighter constraints on larger singular values compared to smaller ones. This form of regularization enables the model to grasp finer details within the video that align with the provided textual descriptions. We also offer theoretical justification for our proposed regularization technique. Since we only adapt spectral shifts, the proposed method reduces adaptation time by roughly a factor of 10 and has fewer resource constraints for training.
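
A minimal sketch of the spectral-shift idea, assuming a single linear layer and an illustrative form of the regularizer (the paper's exact weighting may differ):

import torch

class SpectralShiftLinear(torch.nn.Module):
    # Keep U and V from the pre-trained weight fixed and learn only a small
    # offset (the "spectral shift") on the singular values.
    def __init__(self, pretrained_weight):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.delta = torch.nn.Parameter(torch.zeros_like(S))  # trainable spectral shift

    def weight(self):
        return self.U @ torch.diag(self.S + self.delta) @ self.Vh

    def forward(self, x):
        return x @ self.weight().T

def spectral_shift_reg(module, lam=1e-3):
    # Penalize shifts on larger singular values more heavily than on smaller
    # ones (the weighting here is an illustrative choice).
    weights = module.S / (module.S.max() + 1e-8)
    return lam * (weights * module.delta.pow(2)).sum()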

RODD: A Self-Supervised Approach for Robust Out-of-Distribution Detection

CVPRW, 2022
Umar Khalid, Ashkan Esmaeili, Nazmul Karim, Nazanin Rahnavard

A self-supervised out-of-distribution (OOD) detection technique that maps in-distribution (ID) class embeddings onto a 1-dimensional subspace to perform efficient OOD detection at inference time.

CNLL: A Semi-supervised Approach For Continual Noisy Label Learning

CVPRW 2022
Nazmul Karim, Umar Khalid, Ashkan Esmaeili, Nazanin Rahnavard

The first study to investigate semi-supervised learning for continual learning in the presence of noisy labels.