Hey! I'm Dileep 👋

Thanks for stopping by — I'm really glad you're here.

A Little About My Journey

I grew up in India, fascinated by technology and what it could do. A few years back, I made the leap to move to the United States to study Computer Engineering — and that decision changed everything. Today, I'm a Senior Machine Learning Engineer building production ML systems that serve millions of people in healthcare.

My work spans the full ML lifecycle: feature engineering, model training, MLOps pipelines, HIPAA compliance, and production monitoring on AWS. Outside of healthcare, I build things that genuinely fascinate me — like QuantEdge v6.0, a live institutional quant analytics platform combining 8 ML models.

But there's more to my story than just work.

Who I Am Beyond Work

I believe life gets interesting when you stay curious about everything.

When I'm not working, you'll find me:

📸 Behind my camera — capturing stories through photography and drone shots
🏍️ On long rides — there's something freeing about the open road
🥾 Hiking trails — nature has a way of putting everything in perspective
📈 Analyzing market patterns — I built a whole platform for this
✈️ Traveling — new places, new perspectives, always learning

Each of these passions fuels my creativity and makes me better at what I do.

Let's Connect

📧 dileepkreddy5@gmail.com 💼 LinkedIn 💻 GitHub 📹 YouTube

Thanks for visiting. Feel free to explore and stay in touch.

Cheers,
Dileep

DILEEP KUMAR REDDY KAPU

Senior Machine Learning Engineer | AI Engineer

📍 Albuquerque, NM — Open to Relocation

📧 dileepkreddy5@gmail.com

💼 LinkedIn

💻 github.com/dileepkreddy5

Professional Summary

Six years building and deploying ML systems in enterprise healthcare has taught me that the hardest part isn't the model: it's making it production-grade, trustworthy, and something a team can actually ship. I specialize in the full journey from research to deployment: fine-tuning LLMs with Hugging Face, building agentic workflows with LangChain, and training production models in PyTorch. Whether it's an evaluation framework, a secure inference pipeline, or a multi-agent system, I care about building AI that enterprise teams can rely on in the real world.

Key Skills

ML & AI

PyTorch · TensorFlow · scikit-learn · XGBoost · LightGBM · Hugging Face · MLflow · Weights & Biases

GenAI & LLMs

LangChain · LlamaIndex · RAG Architectures · Vector Databases · OpenAI API · Prompt Engineering · FastAPI

Infrastructure & MLOps

AWS SageMaker · AWS (EC2, Lambda, S3, ECR, ECS) · Azure ML · Snowflake · Docker & Kubernetes · GitHub Actions CI/CD · Terraform · Apache Airflow · Apache Spark

Programming & Data

Python · SQL · Pandas / NumPy · PySpark · REST APIs · Git

Professional Experience

Senior Software Engineer (ML & Data Platforms)

New Mexico Health Care Authority — Santa Fe, NM

Jun 2025 – Present

Senior Machine Learning Engineer (Contract)

Optum (Client: UnitedHealth Group) — Remote

Jun 2024 – May 2025

Machine Learning Engineer

University of New Mexico (UNM Health — IT) — Albuquerque, NM

Jan 2023 – May 2024

Software Engineer (Machine Learning)

Carelon Global Solutions (Elevance Health) — Bangalore, India

Jun 2020 – Dec 2022

Certifications

AWS Machine Learning – Specialty

AWS Solutions Architect – Professional

ML & Deep Learning – IIIT-B

NLP Specialization – DeepLearning.AI

Education

Master of Science in Computer Engineering

University of New Mexico, Albuquerque, NM

Dec 2024

Bachelor of Engineering — Electronics & Communication Engineering

GITAM University, Bangalore, India

May 2021

Projects

📊

QuantEdge v6.0

Institutional quantitative AI platform combining 8 ML models, including LSTM temporal forecasting, an XGBoost/LightGBM ensemble, 5-state HMM regime detection, GJR-GARCH volatility modeling, FinBERT NLP sentiment, and 100K-path Monte Carlo simulation. Deployed live on AWS ECS Fargate with Redis, PostgreSQL, CloudFront CDN, and Terraform IaC.

LSTM · XGBoost · HMM · GJR-GARCH · FinBERT · FastAPI · React · AWS ECS · Terraform

🚀 Live Demo 🔗 GitHub

🕸️

Enterprise Hybrid Graph-RAG System

Hybrid retrieval-augmented generation system combining knowledge graph reasoning with semantic vector search using Amazon Bedrock (Claude 3). Designed weighted graph + vector ranking to improve contextual accuracy and explainability. Implemented evaluation metrics (Precision@K, Recall@K, MRR) and stage-wise latency instrumentation for performance visibility.

Amazon Bedrock · Knowledge Graph · Vector Search · RAG · Claude 3 · LangChain

🔗 GitHub
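
The retrieval metrics named in the Graph-RAG description above have standard definitions. The sketch below illustrates them and is not taken from the project's repository: the function names and document IDs are hypothetical, and it assumes each ranked list holds unique IDs.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(ranked[:k]) & relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant or k <= 0:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(rankings, relevants) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none is found)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical document IDs for two queries.
rankings = [["d3", "d1", "d7", "d2"], ["d8", "d9"]]
relevants = [{"d1", "d2"}, {"d8"}]

p_at_2 = precision_at_k(rankings[0], relevants[0], k=2)
r_at_4 = recall_at_k(rankings[0], relevants[0], k=4)
mrr = mean_reciprocal_rank(rankings, relevants)
```

Precision@K penalizes irrelevant results near the top, Recall@K measures coverage of the relevant set, and MRR rewards placing the first relevant result early, so the three together give a fuller picture than any one alone.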

⚡

Real-Time ML Feature Store

Production-style real-time ML feature store and low-latency inference system. Kafka-compatible streaming ingestion (Redpanda) with dual-store architecture (Redis for online serving, Parquet for offline training). Multi-worker FastAPI inference service with end-to-end training and deployment pipeline.

Kafka / Redpanda · Redis · FastAPI · Parquet · Prometheus · Docker

🔗 GitHub

🌊

Real-Time Agentic Streaming Data Platform

Cloud-native platform architected for streaming-first, agent-driven data ingestion and enrichment. Autonomous agent logic enriches and routes events before persistence — decisions before storage. Designed for both analytical queries and GenAI workloads, addressing real-time decisioning in regulated, cost-sensitive environments where latency, governance, and transactional consistency all matter simultaneously.

AWS MSK / Kafka · Lambda · Spark Streaming · Agentic AI · Apache Iceberg · GenAI Workloads

🔗 GitHub

🔒

Secure LLM Gateway

Enterprise AI security middleware — LangChain + FastAPI gateway with prompt injection detection, PII scrubbing, and rate limiting. Multi-provider LLM routing (OpenAI, Anthropic, Azure OpenAI) with RBAC, token-budget enforcement, and HIPAA-compliant audit logging for enterprise multi-tenant deployments.

LangChain · FastAPI · PII Detection · RBAC · OpenAI · Anthropic · AWS ECS

🔗 GitHub
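
To make the PII scrubbing step of a gateway like this concrete, here is a minimal regex-based sketch. It is an illustration only, not the gateway's actual code: the pattern set, labels, and function name are assumptions, and a production scrubber would need far broader coverage than three US-style patterns.

```python
import re
from typing import Dict, Tuple

# Hypothetical, deliberately small pattern set (assumption, not the project's rules).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scrub_pii(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count replacements per label."""
    counts: Dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


prompt = "Contact jane.doe@example.com or 505-555-0100, SSN 123-45-6789."
clean_prompt, pii_counts = scrub_pii(prompt)
```

Returning the per-label counts alongside the scrubbed text lets an audit log record that redaction happened without storing the sensitive values themselves.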

Research

IEEE TNNLS · Under Review

Source-Attributed Feature Drift Detection in Multi-Source Healthcare ML Pipelines: Theory, Taxonomy, and a Learning-System Framework

Dileep Kumar Reddy Kapu · IEEE Transactions on Neural Networks and Learning Systems (TNNLS) · 2025

Production healthcare ML systems break not from algorithmic failures but because something upstream changed quietly — a vendor pushed a schema migration, the annual ICD-10-CM update redistributed code frequencies, a new health center came onto the feed. Standard monitoring tells you something is wrong. This work tells you which source, and why, in 23 minutes instead of 4 hours.

The core insight is mathematical: KL divergence decomposes additively under product measures, so when sources are independent the total divergence is the sum of per-source terms. Any monitor watching only the aggregate feature vector discards the per-source breakdown that makes attribution possible. Moving monitoring to the source-table boundary, before any join, recovers it. This paper formalizes that principle, proves when attribution succeeds and when simultaneous multi-source shifts make it provably impossible (Corollary 1), and builds a complete four-component system, ASDM, that runs end-to-end in 130ms.
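
The additivity claim can be checked numerically. For independent sources, D_KL(P1×P2 || Q1×Q2) = D_KL(P1 || Q1) + D_KL(P2 || Q2). The sketch below uses two small hypothetical discrete sources (the distributions and function names are assumptions, not data from the paper) and assumes every reference probability is strictly positive.

```python
import math
from itertools import product


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint(marginals):
    """Joint distribution of independent sources: product of the marginal probabilities."""
    return [math.prod(combo) for combo in product(*marginals)]


# Hypothetical reference (q) and current (p) distributions for two sources.
q_a, p_a = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]   # source A has shifted
q_b, p_b = [0.6, 0.4], [0.6, 0.4]             # source B is unchanged

per_source = {"A": kl_divergence(p_a, q_a), "B": kl_divergence(p_b, q_b)}
aggregate = kl_divergence(joint([p_a, p_b]), joint([q_a, q_b]))

# The aggregate equals the sum of per-source terms, but only the
# per-source view shows that source A carries all of the divergence.
gap = abs(aggregate - sum(per_source.values()))
```

The aggregate number alone says that something moved; the per-source terms say where, which is the reason for computing divergence before the join.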

FIVE-TYPE DRIFT TAXONOMY — Derived from 87 Production Incidents

SED — Schema Evolution

EHR vendor upgrades that pass ETL validation but shift feature distributions. Sudden step pattern. Hardest to localize — 15+ features move at once.

ECD — Encoding Convention

ICD-10-CM annual updates redistributing diagnosis code frequencies over weeks. Most operationally damaging — standard PSI monitoring sees nothing.

SOD — Source Onboarding

New demographically distinct source added to HIE feed. Step-change at registry date. Attribution provably hardest here — overlapping feature dimensions.

SPD — Seasonal Population

Medicaid enrollment cycles, flu-season surges. Expected variation — suppressed unless amplitude exceeds 2 SD from prior year or persists 4+ weeks.

NCD — Null Convention

Vendor changes NULL encoding to sentinel −1. Invisible to PSI and KS monitoring. Immediately visible via dedicated missingness-rate statistics.
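
As an illustration of the NCD case, the sketch below compares raw NULL rates between a baseline batch and a batch where NULLs were re-encoded as −1. It is a simplified assumption of what a missingness-rate check could look like, not the paper's implementation; the function names, sample values, and the 0.1 alert threshold are all hypothetical.

```python
from typing import Iterable, Optional, Sequence


def missing_rate(values: Iterable[Optional[float]], sentinels: Sequence[float] = ()) -> float:
    """Share of entries that are None or equal to one of the given sentinel values."""
    values = list(values)
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v in sentinels)
    return missing / len(values)


def null_rate_shift(baseline, current) -> float:
    """Absolute change in the raw NULL rate between two batches."""
    return abs(missing_rate(current) - missing_rate(baseline))


# Hypothetical feature column before and after a vendor re-encodes NULL as -1.
baseline = [72, None, 65, 80, None, 90, 77, 68, None, 71]
current = [70, -1, 66, 81, -1, 88, 75, 69, -1, 73]

shift = null_rate_shift(baseline, current)               # raw NULL rate falls from 0.3 to 0.0
sentinel_aware = missing_rate(current, sentinels=(-1,))  # true missingness is unchanged
flagged = shift > 0.1                                    # hypothetical alert threshold
```

The raw NULL rate collapsing to zero while the sentinel-aware rate stays flat is the signature of an encoding change rather than a real change in data completeness.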

ASDM SYSTEM — Four Components, 130ms End-to-End

① SDF — Source Distribution Fingerprinter

Runs at source-table boundary before any join. Computes per-source KS/JS divergence and attribution ratio R̂s. 94ms wall-clock, fully parallel across 12 sources.

② ADC — Adaptive Drift Classifier

XGBoost classifier on 14 domain-aware features. Distinguishes schema-level from value-level changes. Macro-F1 = 0.831. Determines remediation action, not just alert.

③ SIE — SHAP-Weighted Impact Estimator

Translates divergence into estimated model impact using production SHAP values. A 0.1 JS shift in a high-importance missingness field outranks a 0.4 KS shift in noise. Quarantine threshold τ = 0.35.

④ HAL — HIPAA Audit Logger

Zero PHI in any monitoring artifact. Append-only structured records per batch. Expert root-cause annotations during incident resolution become ADC training data — compliance generated the labeled dataset.
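
The SIE idea (component ③) of ranking drift by estimated model impact rather than by raw divergence can be sketched as below. The scoring rule here, divergence multiplied by normalized importance, is an assumption made for illustration; the paper's actual estimator and the scale on which τ = 0.35 is applied are not specified in this summary, so the threshold is left out. Feature names and importance values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_by_impact(
    divergences: Dict[str, float],
    importances: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank features by divergence weighted by normalized importance (e.g. mean |SHAP|)."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    scores = {
        name: divergence * importances[name] / total
        for name, divergence in divergences.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs mirroring the example in the text: a small shift in an
# important missingness field versus a larger shift in a low-importance field.
divergences = {"lab_missing_rate": 0.1, "free_text_length": 0.4}
importances = {"lab_missing_rate": 0.45, "free_text_length": 0.05}

ranking = rank_by_impact(divergences, importances)
top_feature = ranking[0][0]
```

Under these assumed weights the 0.1 shift scores 0.09 and the 0.4 shift scores 0.04, so the smaller divergence ranks first because the model leans on that feature far more heavily.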

0.831 ADC Macro-F1

91.6% Detection Recall

91% MTTR Reduction

287 Mean Detection Delay (instances)

500M+ Patient Feature Vectors

130ms End-to-End Latency

VALIDATED ACROSS THREE TIERS

Tier 1 — Production: 71 confirmed incidents from Environment B (3–5TB/month, 25+ ML pipelines, 500M+ patient feature vectors over 18 months). 91.6% recall, 3.2% FPR, 87.7% source localization accuracy — all baselines including ADWIN, CUSUM, and Bu et al. beaten with statistical significance (Wilcoxon, Bonferroni-corrected α=0.01).

Tier 2 — MIMIC-IV: External validation on 190,835 admissions at Beth Israel Deaconess Medical Center (2008–2019). ASDM correctly detected the ICD-9→ICD-10 boundary (ECD) and the EMAR introduction (NCD), attributed each to the correct source module, and suppressed year-over-year influenza surges as expected seasonal variation (SPD).

Tier 3 — Benchmarks: 287-instance mean detection delay vs. 1,847 for ADWIN across five synthetic generators (Cohen's d = 2.31). 80% reduction on gradual drift (Rotating Hyperplane: 388 vs. 1,932 instances).

Why it matters beyond healthcare: Any ML system assembling features from heterogeneous independent sources — financial services, public-sector analytics, digital advertising — faces the same attribution problem. The taxonomy mechanisms are domain-specific. The monitoring principle — move to the source boundary before the join — is not.

💻 Code 📄 Preprint

Research Interests

🔍

ML Robustness & Drift Detection

Monitoring production ML systems for distribution shift, source attribution, and data quality degradation in regulated environments.

🏥

Healthcare AI & Compliance

HIPAA-compliant ML systems for clinical workflows, patient risk stratification, and multi-source healthcare data infrastructure.

📈

Quantitative Finance ML

Ensemble methods, regime detection, volatility modeling, and uncertainty quantification for financial time series.

🤖

LLM Safety & RAG

Secure LLM deployment, prompt injection prevention, and retrieval-augmented generation for enterprise and regulated domains.

PHOTOGRAPHY & DRONE SHOTS

Capturing moments. Exploring perspectives.

📸

🌄

🌆

🌌

🏔️

🌊

Gallery coming soon. Stay tuned for aerial views, travel stories, and moments worth capturing.

📹 YouTube (Drone footage) 📧 dileepkreddy5@gmail.com

VENTURES

Something's Brewing... ☕🍫

☕

BUILDING SOMETHING NEW

Beyond ML engineering and photography, I'm exploring entrepreneurship with a close friend. The intersection of AI, data, and real-world problems is where the most interesting opportunities live.

More details coming soon.

WANT TO FOLLOW THE JOURNEY?

📧 dileepkreddy5@gmail.com 💼 LinkedIn

Stay tuned. 😊

– Dileep