Added NIH spending category predictions for recent grants

data update
Author

Noam Ross

Published

January 23, 2026

We have added predicted NIH spending categories to grants in the Grant Witness database that do not yet have official categorizations from NIH.

Why predict spending categories?

NIH categorizes grants by research topic through the Research, Condition, and Disease Categorization (RCDC) process. These categories—such as “Alzheimer’s Disease,” “Cancer,” “HIV/AIDS,” and over 300 others—help track federal research spending across different health conditions and research areas. However, RCDC categorization typically lags behind grant awards by months, because categories are assigned in bulk once per year.

For Grant Witness, this created a significant gap in our ability to analyze terminated grants by research area. Many of the grants terminated in late 2024 and 2025 did not yet have RCDC categories assigned, making it difficult to understand which health conditions and research topics have been most affected by terminations.

How the predictions work

We developed a machine learning model to predict which RCDC categories apply to grants that don’t yet have official NIH categorizations. The model was trained on more than 88,000 historical NIH grants that already have RCDC categories assigned. It uses the following inputs (a sketch of how they can be assembled into a feature matrix follows the list):

  • Grant terms: 44,000+ unique terms
  • Funding institute: 27 NIH Institutes and Centers
  • Study section: 181 study sections
  • CFDA code: 4 program codes
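
To make this concrete, here is a minimal sketch of how these inputs can be turned into a single sparse feature matrix. The file path, column names, and semicolon-delimited term format are illustrative assumptions, not our exact pipeline.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training extract: ~88,000 grants that already have RCDC categories
grants = pd.read_csv("nih_grants_training.csv")

# Grant terms: binary (multi-hot) encoding over the ~44,000 unique NIH-assigned terms,
# assumed here to be stored as a semicolon-delimited string per grant
term_vectorizer = CountVectorizer(
    tokenizer=lambda s: s.split(";"), token_pattern=None, binary=True
)
X_terms = term_vectorizer.fit_transform(grants["terms"])

# Categorical inputs: funding institute, study section, and CFDA program code
cat_encoder = OneHotEncoder(handle_unknown="ignore")
X_cats = cat_encoder.fit_transform(grants[["institute", "study_section", "cfda_code"]])

# Combined sparse design matrix fed to the per-category classifiers
X = hstack([X_terms, X_cats]).tocsr()
```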

The model uses 342 separate XGBoost classifiers (one per category) to handle the multi-label nature of grant categorization, where each grant can belong to multiple categories (for example, a project studying cardiovascular effects of diabetes would be categorized under both “Heart Disease” and “Diabetes”). A post-hoc step then adjusts prediction thresholds based on whether related categories are also predicted. This accounts for the fact that categories are not independent, and it strongly reduces false positives for rare categories at the cost of a small increase in false negatives.
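
The sketch below shows one plausible form of this setup, assuming a 0/1 label matrix Y (grants × categories) and a hand-built mapping of related categories; the hyperparameters and threshold values are placeholders rather than our tuned settings.

```python
import numpy as np
from xgboost import XGBClassifier

def train_per_category(X, Y, categories):
    """Fit one binary XGBoost classifier per RCDC category (one-vs-rest multi-label)."""
    models = {}
    for j, cat in enumerate(categories):
        clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
        clf.fit(X, Y[:, j])  # Y is an n_grants x n_categories 0/1 matrix
        models[cat] = clf
    return models

def predict_with_adjustment(models, X, related, base=0.5, strict=0.8):
    """One plausible post-hoc adjustment: a category predicted at the base threshold
    is kept only when a related category is also predicted; in isolation it must
    clear a stricter threshold. This suppresses lone false positives for rare
    categories at the cost of a few extra false negatives."""
    probs = {cat: m.predict_proba(X)[:, 1] for cat, m in models.items()}
    base_pred = {cat: p >= base for cat, p in probs.items()}
    final = {}
    for cat, p in probs.items():
        support = np.zeros(len(p), dtype=bool)
        for other in related.get(cat, []):
            support |= base_pred[other]
        # keep the base threshold where related support exists, require `strict` otherwise
        final[cat] = np.where(support, p >= base, p >= strict)
    return final
```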

Grants with model-predicted categories are marked in our database with categories_predicted = TRUE to distinguish them from official NIH RCDC categorizations. As NIH releases updated RCDC data in the future, we will replace our predictions with the official categorizations.
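
For example, when working with an export of the grants table, the flag can be used to separate official from predicted categorizations; apart from categories_predicted, the file and column handling below are assumptions about the export format.

```python
import pandas as pd

grants = pd.read_csv("grant_witness_nih_grants.csv")  # hypothetical export of the NIH grants table

official = grants[~grants["categories_predicted"]]   # categories assigned through NIH's RCDC process
predicted = grants[grants["categories_predicted"]]   # categories predicted by our model

print(f"{len(predicted)} grants currently rely on model-predicted categories")
```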

Model performance

We validated the model on a held-out set of 10,000 grants with known RCDC categories. Performance metrics, weighted by category prevalence, were as follows (a sketch of the computation appears after the list):

  • AUC: 0.984 - Excellent discrimination ability
  • F1 Score: 0.804 - Strong balance between precision and recall
  • False Positive Rate: 4.7% - Low rate of incorrect category assignments
  • False Negative Rate: 7.8% - Some true categories may be missed
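
For readers who want to reproduce this kind of evaluation on their own multi-label predictions, the sketch below shows one way to compute prevalence-weighted AUC, F1, and error rates with scikit-learn; it illustrates the metric definitions rather than reproducing our exact validation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

def weighted_metrics(y_true, y_score, y_pred):
    """Prevalence-weighted AUC, F1, FPR, and FNR for multi-label predictions.
    y_true/y_pred are 0/1 arrays and y_score probabilities, all shaped
    (n_grants, n_categories); assumes each category has both positive and
    negative examples in the hold-out set."""
    weights = y_true.sum(axis=0)  # category prevalence in the hold-out set

    auc = roc_auc_score(y_true, y_score, average="weighted")
    f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

    fprs, fnrs = [], []
    for j in range(y_true.shape[1]):
        tn, fp, fn, tp = confusion_matrix(y_true[:, j], y_pred[:, j], labels=[0, 1]).ravel()
        fprs.append(fp / (fp + tn) if (fp + tn) else 0.0)
        fnrs.append(fn / (fn + tp) if (fn + tp) else 0.0)
    fpr = np.average(fprs, weights=weights)
    fnr = np.average(fnrs, weights=weights)
    return auc, f1, fpr, fnr
```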

Performance by major disease area:

Disease Area                  AUC     FPR     FNR
Cancer                        0.995   2.1%    3.4%
Alzheimer’s Disease           0.998   1.8%    1.9%
Heart Disease                 0.995   3.5%    3.4%
Diabetes/Obesity/Metabolic    0.992   5.4%    2.5%

Performance is best for major disease categories (AUC >0.99) and somewhat lower for very rare categories, which have false-negative rates up to 15%.

What this enables

With spending categories now available for recent grants, users can:

  • Filter the NIH grants table by research topic to see terminations in specific disease areas
  • Analyze termination patterns by health condition (e.g., “How many cancer research grants were terminated?”; see the example sketched after this list)
  • Identify which research topics have been disproportionately affected
  • Track terminated grants in areas of public health concern
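
As an illustration of the first two points, here is a hypothetical query against an export of the grants table; the file and column names (spending_categories, terminated) are assumptions about the export format, not documented field names.

```python
import pandas as pd

grants = pd.read_csv("grant_witness_nih_grants.csv")

# One row per grant/category pair, assuming categories are stored as a delimited string
grants["category"] = grants["spending_categories"].str.split(";")
by_category = grants.explode("category")
by_category["category"] = by_category["category"].str.strip()

terminated_cancer = by_category[(by_category["category"] == "Cancer") & by_category["terminated"]]
print(f"Terminated cancer research grants: {len(terminated_cancer)}")
```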