Hey! I'm Dileep 👋

Thanks for stopping by — I'm really glad you're here.

A Little About My Journey

I grew up in India, fascinated by technology and what it could do. A few years back, I made the leap to move to the United States to study Computer Engineering — and that decision changed everything. Today, I'm a Senior Machine Learning Engineer building production ML systems that serve millions of people in healthcare.

My work spans the full ML lifecycle: feature engineering, model training, MLOps pipelines, HIPAA compliance, and production monitoring on AWS. Outside of healthcare, I build things that genuinely fascinate me — like QuantEdge v6.0, a live institutional quant analytics platform combining 8 ML models.

But there's more to my story than just work.

Who I Am Beyond Work

I believe life gets interesting when you stay curious about everything.

When I'm not working, you'll find me:

  • 📸 Behind my camera — capturing stories through photography and drone shots
  • 🏍️ On long rides — there's something freeing about the open road
  • 🥾 Hiking trails — nature has a way of putting everything in perspective
  • 📈 Analyzing market patterns — I built a whole platform for this
  • ✈️ Traveling — new places, new perspectives, always learning

Each of these passions fuels my creativity and makes me better at what I do.

Let's Connect

Thanks for visiting. Feel free to explore and stay in touch.

Cheers,
Dileep
Dileep Kumar Reddy Kapu

DILEEP KUMAR REDDY KAPU

Senior Machine Learning Engineer  |  AI Engineer
📍 Albuquerque, NM — Open to Relocation

Professional Summary

Six years building and deploying ML systems in enterprise healthcare has taught me that the hardest part isn't the model; it's making it production-grade, trustworthy, and something a team can actually ship. I specialize in the full journey from research to deployment: fine-tuning LLMs with Hugging Face, building agentic workflows with LangChain, and training production models in PyTorch. Whether it's an evaluation framework, a secure inference pipeline, or a multi-agent system, I care about building AI that enterprise teams can rely on in the real world.

Key Skills

ML & AI

PyTorch
TensorFlow
scikit-learn
XGBoost
LightGBM
Hugging Face
MLflow
Weights & Biases

GenAI & LLMs

LangChain
LlamaIndex
RAG Architectures
Vector Databases
OpenAI API
Prompt Engineering
FastAPI

Infrastructure & MLOps

AWS SageMaker
AWS (EC2, Lambda, S3, ECR, ECS)
Azure ML
Snowflake
Docker & Kubernetes
GitHub Actions CI/CD
Terraform
Apache Airflow
Apache Spark

Programming & Data

Python
SQL
Pandas / NumPy
PySpark
REST APIs
Git

Professional Experience

Senior Software Engineer (ML & Data Platforms)
New Mexico Health Care Authority — Santa Fe, NM
Jun 2025 – Present

Senior Machine Learning Engineer (Contract)
Optum (Client: UnitedHealth Group) — Remote
Jun 2024 – May 2025

Machine Learning Engineer
University of New Mexico (UNM Health — IT) — Albuquerque, NM
Jan 2023 – May 2024

Software Engineer (Machine Learning)
Carelon Global Solutions (Elevance Health) — Bangalore, India
Jun 2020 – Dec 2022

Education

Master of Science in Computer Engineering
University of New Mexico, Albuquerque, NM
Dec 2024

Bachelor of Engineering in Electronics & Communication Engineering
Gitam University, Bangalore, India
May 2021

Projects

📊

QuantEdge v6.0

Institutional quantitative AI platform combining 8 ML models, including LSTM temporal forecasting, an XGBoost/LightGBM ensemble, 5-state HMM regime detection, GJR-GARCH volatility, FinBERT NLP sentiment, and 100K-path Monte Carlo simulation. Deployed live on AWS ECS Fargate with Redis, PostgreSQL, CloudFront CDN, and Terraform IaC.

LSTM · XGBoost · HMM · GJR-GARCH · FinBERT · FastAPI · React · AWS ECS · Terraform
🚀 Live Demo   🔗 GitHub

🕸️

Enterprise Hybrid Graph-RAG System

Hybrid retrieval-augmented generation system combining knowledge graph reasoning with semantic vector search using Amazon Bedrock (Claude 3). Designed weighted graph + vector ranking to improve contextual accuracy and explainability. Implemented evaluation metrics (Precision@K, Recall@K, MRR) and stage-wise latency instrumentation for performance visibility.
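
For readers unfamiliar with the retrieval metrics named above, here is a minimal sketch of how Precision@K, Recall@K, and MRR are commonly computed. It is illustrative only and not the project's code; the function names and the document IDs are assumptions made for the example.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in set(ranked[:k])) / len(relevant)


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    if not runs:
        return 0.0
    total = 0.0
    for ranked, relevant in runs:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical query: four retrieved document IDs, two of them relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 2))     # 0.5
```

Reporting these per retrieval stage (graph, vector, fused ranking) is one way to see which stage contributes to accuracy, though how the project slices its evaluation is not stated here.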

Amazon Bedrock · Knowledge Graph · Vector Search · RAG · Claude 3 · LangChain
🔗 GitHub

Real-Time ML Feature Store

Production-style real-time ML feature store and low-latency inference system. Kafka-compatible streaming ingestion (Redpanda) with dual-store architecture (Redis for online serving, Parquet for offline training). Multi-worker FastAPI inference service with end-to-end training and deployment pipeline.

Kafka / Redpanda · Redis · FastAPI · Parquet · Prometheus · Docker
🔗 GitHub

🌊

Real-Time Agentic Streaming Data Platform

Cloud-native platform architected for streaming-first, agent-driven data ingestion and enrichment. Autonomous agent logic enriches and routes events before persistence — decisions before storage. Designed for both analytical queries and GenAI workloads, addressing real-time decisioning in regulated, cost-sensitive environments where latency, governance, and transactional consistency all matter simultaneously.

AWS MSK / Kafka · Lambda · Spark Streaming · Agentic AI · Apache Iceberg · GenAI Workloads
🔗 GitHub

🔒

Secure LLM Gateway

Enterprise AI security middleware — LangChain + FastAPI gateway with prompt injection detection, PII scrubbing, and rate limiting. Multi-provider LLM routing (OpenAI, Anthropic, Azure OpenAI) with RBAC, token-budget enforcement, and HIPAA-compliant audit logging for enterprise multi-tenant deployments.
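
To make two of the gateway concerns concrete, here is a small standard-library sketch of regex-based PII scrubbing and per-tenant token-budget enforcement. It is a simplified illustration under stated assumptions, not the gateway's implementation: the patterns, the `TokenBudget` class, and the tenant names are hypothetical, and real PII detection needs far broader coverage than three regexes.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Illustrative patterns only (assumption); production scrubbing needs many more.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scrub_pii(text: str) -> str:
    """Replace each matched span with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


@dataclass
class TokenBudget:
    """Hypothetical in-memory budget: a fixed token allowance per tenant."""
    limit_per_tenant: int
    used: Dict[str, int] = field(default_factory=dict)

    def try_spend(self, tenant: str, tokens: int) -> bool:
        """Record the spend and return True only if it fits the allowance."""
        spent = self.used.get(tenant, 0)
        if spent + tokens > self.limit_per_tenant:
            return False
        self.used[tenant] = spent + tokens
        return True


prompt = "Contact jane@example.com or 505-555-0100, SSN 123-45-6789"
print(scrub_pii(prompt))  # Contact [EMAIL] or [PHONE], SSN [SSN]

budget = TokenBudget(limit_per_tenant=1000)
print(budget.try_spend("tenant-a", 600))  # True
print(budget.try_spend("tenant-a", 500))  # False, would exceed the allowance
```

In a multi-tenant deployment the budget state would presumably live in a shared store rather than process memory; the in-memory dictionary here only keeps the example self-contained.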

LangChain · FastAPI · PII Detection · RBAC · OpenAI · Anthropic · AWS ECS
🔗 GitHub

Research

IEEE TNNLS · Under Review

Source-Attributed Feature Drift Detection in Multi-Source Healthcare ML Pipelines: Theory, Taxonomy, and a Learning-System Framework

Dileep Kumar Reddy Kapu  ·  IEEE Transactions on Neural Networks and Learning Systems (TNNLS)  ·  2025

Production healthcare ML systems break not from algorithmic failures but because something upstream changed quietly — a vendor pushed a schema migration, the annual ICD-10-CM update redistributed code frequencies, a new health center came onto the feed. Standard monitoring tells you something is wrong. This work tells you which source, and why, in 23 minutes instead of 4 hours.

The core insight is mathematical: KL divergence decomposes additively under product measures. Any monitor watching the aggregate feature vector discards the per-source terms that make attribution possible. Moving monitoring to the source-table boundary, before any join, recovers those terms. This paper formalizes that principle, proves when attribution succeeds and when simultaneous multi-source shifts make it provably impossible (Corollary 1), and builds a complete four-component system, ASDM, that runs end-to-end in 130ms.
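
The additivity property can be checked numerically. The sketch below is illustrative and not taken from the paper: it builds two hypothetical independent sources as small discrete distributions, lets only source A drift, and confirms that the KL divergence of the joint (product) distribution equals the sum of the per-source divergences. The distributions and the choice of reference versus current direction are assumptions for the example.

```python
import math
from itertools import product
from typing import Dict, Hashable

Dist = Dict[Hashable, float]


def kl(p: Dist, q: Dist) -> float:
    """KL(p || q) in nats for discrete distributions on a shared support."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0)


def product_measure(a: Dist, b: Dist) -> Dist:
    """Joint distribution of two independent sources."""
    return {(x, y): a[x] * b[y] for x, y in product(a, b)}


# Hypothetical reference (p) and current (q) distributions per source.
p_a = {"x0": 0.5, "x1": 0.5}
q_a = {"x0": 0.8, "x1": 0.2}  # source A has drifted
p_b = {"y0": 0.3, "y1": 0.7}
q_b = {"y0": 0.3, "y1": 0.7}  # source B is unchanged

joint_kl = kl(product_measure(p_a, p_b), product_measure(q_a, q_b))
per_source = {"A": kl(p_a, q_a), "B": kl(p_b, q_b)}

# The joint divergence matches the sum of the per-source terms.
assert abs(joint_kl - sum(per_source.values())) < 1e-12
print(per_source)
```

The joint number alone says that something moved; the per-source terms show that all of the divergence comes from A and none from B, which is the information an aggregate monitor throws away.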

FIVE-TYPE DRIFT TAXONOMY: Derived from 87 Production Incidents

  • SED (Schema Evolution): EHR vendor upgrades that pass ETL validation but shift feature distributions. Sudden step pattern. Hardest to localize: 15+ features move at once.
  • ECD (Encoding Convention): ICD-10-CM annual updates redistributing diagnosis code frequencies over weeks. Most operationally damaging: standard PSI monitoring sees nothing.
  • SOD (Source Onboarding): New demographically distinct source added to HIE feed. Step-change at registry date. Attribution provably hardest here: overlapping feature dimensions.
  • SPD (Seasonal Population): Medicaid enrollment cycles, flu-season surges. Expected variation, suppressed unless amplitude exceeds 2 SD from prior year or persists 4+ weeks.
  • NCD (Null Convention): Vendor changes NULL encoding to sentinel −1. Invisible to PSI and KS monitoring. Immediately visible via dedicated missingness-rate statistics.
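
As a concrete illustration of the NCD case, the sketch below shows how a missingness-rate statistic reacts when a vendor swaps NULL for a sentinel. It is a simplified example, not the ASDM implementation; the function name, the sample column, and the sentinel set are assumptions.

```python
from typing import Collection, Iterable, Optional


def missingness_rate(
    values: Iterable[Optional[float]],
    sentinels: Collection[float] = (),
) -> float:
    """Share of entries that are None or equal to a known sentinel value."""
    items = list(values)
    if not items:
        return 0.0
    missing = sum(1 for v in items if v is None or v in sentinels)
    return missing / len(items)


# Hypothetical batches of one numeric column, before and after the vendor change.
before = [72, None, 65, None, 80, 77, None, 69]
after = [72, -1, 65, -1, 80, 77, -1, 69]

null_rate_before = missingness_rate(before)                # counts None only
null_rate_after = missingness_rate(after)                  # sentinel hides the gaps
true_rate_after = missingness_rate(after, sentinels={-1})  # sentinel-aware

print(null_rate_before, null_rate_after, true_rate_after)  # 0.375 0.0 0.375
```

The NULL rate collapsing from 0.375 to 0.0 between batches is the step change a dedicated missingness monitor would flag, while the sentinel-aware rate shows the underlying gaps never went away.
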

ASDM SYSTEM: Four Components, 130ms End-to-End

① SDF (Source Distribution Fingerprinter): Runs at the source-table boundary before any join. Computes the per-source KS statistic, JS divergence, and attribution ratio R̂s. 94ms wall-clock, fully parallel across 12 sources.
② ADC (Adaptive Drift Classifier): XGBoost classifier on 14 domain-aware features. Distinguishes schema-level from value-level changes. Macro-F1 = 0.831. Determines the remediation action, not just the alert.
③ SIE (SHAP-Weighted Impact Estimator): Translates divergence into estimated model impact using production SHAP values. A 0.1 JS shift in a high-importance missingness field outranks a 0.4 KS shift in noise. Quarantine threshold τ = 0.35.
④ HAL (HIPAA Audit Logger): Zero PHI in any monitoring artifact. Append-only structured records per batch. Expert root-cause annotations during incident resolution become ADC training data: compliance generated the labeled dataset.
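
The SHAP-weighting idea behind the SIE can be sketched in a few lines. This is an assumed scoring rule (divergence multiplied by a feature's share of mean absolute SHAP value), chosen only to illustrate why a small shift in an important feature can outrank a large shift in an unimportant one. The paper's actual estimator, and how scores are scaled against the quarantine threshold τ, are not specified here, so the sketch ranks shifts and nothing more. Feature names and importance values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureShift:
    name: str
    divergence: float       # e.g. a JS divergence or KS statistic in [0, 1]
    shap_importance: float  # assumed share of mean |SHAP| for this feature


def impact_score(shift: FeatureShift) -> float:
    """Assumed rule: weight the observed divergence by feature importance."""
    return shift.divergence * shift.shap_importance


def rank_by_impact(shifts: List[FeatureShift]) -> List[FeatureShift]:
    """Order shifts from highest to lowest estimated impact."""
    return sorted(shifts, key=impact_score, reverse=True)


shifts = [
    FeatureShift("noise_feature", divergence=0.4, shap_importance=0.01),
    FeatureShift("lab_missing_flag", divergence=0.1, shap_importance=0.30),
]
for s in rank_by_impact(shifts):
    print(s.name, round(impact_score(s), 4))
```

With these made-up importances the 0.1 shift scores 0.03 and the 0.4 shift scores 0.004, so the missingness flag is ranked first despite the smaller raw divergence.
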

  • ADC Macro-F1: 0.831
  • Detection Recall: 91.6%
  • MTTR Reduction: 91%
  • Mean Detection Delay: 287 instances
  • Patient Feature Vectors: 500M+
  • End-to-End Latency: 130ms

VALIDATED ACROSS THREE TIERS

Tier 1 (Production): 71 confirmed incidents from Environment B (3–5 TB/month, 25+ ML pipelines, 500M+ patient feature vectors over 18 months). 91.6% recall, 3.2% FPR, 87.7% source localization accuracy. All baselines, including ADWIN, CUSUM, and Bu et al., were beaten with statistical significance (Wilcoxon, Bonferroni-corrected α=0.01).

Tier 2 (MIMIC-IV): External validation on 190,835 admissions at Beth Israel Deaconess Medical Center (2008–2019). ASDM correctly detected the ICD-9→ICD-10 boundary (ECD) and the EMAR introduction (NCD), attributed each to the correct source module, and suppressed year-over-year influenza surges as expected seasonal variation (SPD).

Tier 3 (Benchmarks): 287-instance mean detection delay vs. 1,847 for ADWIN across five synthetic generators (Cohen's d = 2.31). 80% reduction on gradual drift (Rotating Hyperplane: 388 vs. 1,932 instances).
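
The relative reductions follow directly from the instance counts quoted for Tier 3. A short worked check, with a helper name chosen for this example:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - improved / baseline


# Mean detection delay in instances, as quoted above.
overall = relative_reduction(baseline=1847, improved=287)  # ASDM vs. ADWIN
gradual = relative_reduction(baseline=1932, improved=388)  # Rotating Hyperplane

print(f"{overall:.1%} {gradual:.1%}")  # 84.5% 79.9%
```

The gradual-drift figure rounds to the 80% stated above; the roughly 84% overall figure is derived here from the two quoted delays rather than reported separately.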

Why it matters beyond healthcare: Any ML system assembling features from heterogeneous independent sources — financial services, public-sector analytics, digital advertising — faces the same attribution problem. The taxonomy mechanisms are domain-specific. The monitoring principle — move to the source boundary before the join — is not.

Research Interests

🔍
ML Robustness & Drift Detection
Monitoring production ML systems for distribution shift, source attribution, and data quality degradation in regulated environments.
🏥
Healthcare AI & Compliance
HIPAA-compliant ML systems for clinical workflows, patient risk stratification, and multi-source healthcare data infrastructure.
📈
Quantitative Finance ML
Ensemble methods, regime detection, volatility modeling, and uncertainty quantification for financial time series.
🤖
LLM Safety & RAG
Secure LLM deployment, prompt injection prevention, and retrieval-augmented generation for enterprise and regulated domains.

PHOTOGRAPHY & DRONE SHOTS

Capturing moments. Exploring perspectives.

VENTURES

Something's Brewing... ☕🍫

BUILDING SOMETHING NEW

Beyond ML engineering and photography, I'm exploring entrepreneurship with a close friend. The intersection of AI, data, and real-world problems is where the most interesting opportunities live.

More details coming soon.

WANT TO FOLLOW THE JOURNEY?

Stay tuned. 😊

– Dileep