Methodology
How predictions are generated and validated
Overview
SgTxGNN uses a dual-method approach that combines Knowledge Graph (KG) predictions with Deep Learning (DL) predictions to identify drug repurposing candidates. Predictions supported by both methods (labeled KG+DL) are treated as higher confidence than predictions supported by only one.
Prediction Pipeline
Step 1: Knowledge Graph Prediction (KG)
The Knowledge Graph method uses TxGNN’s biomedical knowledge graph containing:
- 17,080 biomedical entities (drugs, diseases, genes, proteins)
- 80,127 drug-disease relationships
- Biological pathway connections
KG predictions identify drugs that share biological pathways or targets with diseases.
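The idea of a drug sharing targets with a disease can be illustrated with a small set-overlap check. This is only a sketch under stated assumptions: the function name `kg_candidates`, the dictionary layout, and the toy gene and drug names are hypothetical, and the actual KG method in TxGNN may use a richer traversal of the graph.

```python
from typing import Dict, Set, Tuple


def kg_candidates(
    drug_targets: Dict[str, Set[str]],
    disease_genes: Dict[str, Set[str]],
) -> Set[Tuple[str, str]]:
    """Return (drug, disease) pairs whose target and gene sets overlap.

    Hypothetical helper: a simplified stand-in for a knowledge graph query.
    """
    pairs = set()
    for drug, targets in drug_targets.items():
        for disease, genes in disease_genes.items():
            if targets & genes:  # at least one shared target
                pairs.add((drug, disease))
    return pairs


# Toy, made-up data for illustration only.
drug_targets = {"drug_a": {"GENE1", "GENE2"}, "drug_b": {"GENE3"}}
disease_genes = {"disease_x": {"GENE2"}, "disease_y": {"GENE4"}}
candidates = kg_candidates(drug_targets, disease_genes)
```

Here only `drug_a` and `disease_x` share a target (`GENE2`), so that is the single candidate pair.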
Step 2: Deep Learning Prediction (DL)
The Deep Learning method uses TxGNN’s graph neural network model:
- Trained on known drug-disease relationships
- Learns complex patterns in the knowledge graph
- Outputs confidence scores (0.0-1.0) for each drug-disease pair
Step 3: Dual Validation (KG+DL)
Predictions that appear in both KG and DL results are marked as “KG+DL” with higher confidence:
- 1,217 dual-validated predictions in SgTxGNN
- These predictions have convergent evidence from two complementary methods
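Dual validation amounts to intersecting the two result sets. The sketch below assumes each prediction is keyed by a (drug, disease) pair; the function name `label_sources` and the toy data are hypothetical and not taken from the SgTxGNN code.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (drug, disease)


def label_sources(kg_pairs: Set[Pair], dl_scores: Dict[Pair, float]) -> Dict[Pair, str]:
    """Label each pair as "KG", "DL", or "KG+DL" by where it appears."""
    labels = {}
    for pair in kg_pairs | set(dl_scores):
        in_kg = pair in kg_pairs
        in_dl = pair in dl_scores
        if in_kg and in_dl:
            labels[pair] = "KG+DL"
        elif in_kg:
            labels[pair] = "KG"
        else:
            labels[pair] = "DL"
    return labels


# Toy, made-up data for illustration only.
kg_pairs = {("drug_a", "disease_x"), ("drug_b", "disease_y")}
dl_scores = {("drug_a", "disease_x"): 0.97, ("drug_c", "disease_x"): 0.81}
labels = label_sources(kg_pairs, dl_scores)
dual_validated = sorted(p for p, src in labels.items() if src == "KG+DL")
```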
Evidence Classification
L1-L5 Evidence Levels
| Level | Definition | Criteria |
|---|---|---|
| L1 | Multiple Phase 3 RCTs | ≥2 completed Phase 3 trials with positive results |
| L2 | Single RCT or Phase 2 | 1 RCT or ≥2 Phase 2 trials |
| L3 | Observational Studies | Cohort or case-control studies |
| L4 | Preclinical/Mechanistic | In vitro, animal studies, or mechanistic evidence |
| L5 | Prediction Only | AI prediction without clinical evidence |
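The table can be read as a top-down decision rule: assign the strongest level whose criteria are met. The sketch below is one possible encoding, not the project's actual classifier. The `EvidenceCounts` fields are hypothetical, `rcts` is assumed to count every randomized controlled trial regardless of phase, and a single Phase 2 trial on its own is assumed not to qualify for L2.

```python
from dataclasses import dataclass


@dataclass
class EvidenceCounts:
    """Hypothetical summary of the evidence found for one drug-disease pair."""
    positive_phase3: int = 0  # completed Phase 3 trials with positive results
    rcts: int = 0             # all randomized controlled trials, any phase
    phase2: int = 0           # Phase 2 trials
    observational: int = 0    # cohort or case-control studies
    preclinical: int = 0      # in vitro, animal, or mechanistic studies


def evidence_level(c: EvidenceCounts) -> str:
    """Return the strongest evidence level (L1 to L5) whose criteria are met."""
    if c.positive_phase3 >= 2:
        return "L1"
    if c.rcts >= 1 or c.phase2 >= 2:
        return "L2"
    if c.observational >= 1:
        return "L3"
    if c.preclinical >= 1:
        return "L4"
    return "L5"  # AI prediction only
```

For example, `evidence_level(EvidenceCounts(observational=3, preclinical=1))` returns `"L3"`, because the observational studies outrank the preclinical work.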
Evidence Sources
Evidence is collected from:
- ClinicalTrials.gov - Clinical trial registry
- PubMed - Biomedical literature
- DrugBank - Drug mechanism and interaction data
- Singapore HSA - Local registration status
Prediction Quality
Confidence Scores
DL predictions include confidence scores:
- >0.99: Very high confidence
- 0.95 to 0.99: High confidence
- 0.90 to <0.95: Moderate confidence
- 0.50 to <0.90: Lower confidence (still above threshold)
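The bands above can be expressed as a simple lookup. This is a minimal sketch: the function name `confidence_band` is hypothetical, and it assumes a score that sits exactly on a boundary (0.95 or 0.90) belongs to the higher band.

```python
def confidence_band(score: float) -> str:
    """Map a DL score in [0.0, 1.0] to a confidence band label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score > 0.99:
        return "Very high"
    if score >= 0.95:
        return "High"
    if score >= 0.90:
        return "Moderate"
    if score >= 0.50:
        return "Lower"
    return "Below threshold"  # would be excluded by the 0.50 cutoff
```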
Filtering Criteria
All predictions meet these minimum criteria:
- DL score ≥ 0.50 (above random chance)
- Drug is registered with Singapore HSA
- Drug has valid DrugBank mapping
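The three criteria combine as a logical AND. The sketch below shows one way to apply them; the `Prediction` record and its field names are hypothetical, and "valid DrugBank mapping" is simplified to "a non-empty DrugBank ID is present".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """Hypothetical record for one drug-disease prediction."""
    drug: str
    disease: str
    dl_score: float
    hsa_registered: bool
    drugbank_id: Optional[str]


def passes_filters(p: Prediction, min_score: float = 0.50) -> bool:
    """Return True only if all three minimum criteria hold."""
    return p.dl_score >= min_score and p.hsa_registered and bool(p.drugbank_id)


# Toy, made-up records for illustration only.
predictions = [
    Prediction("drug_a", "disease_x", 0.97, True, "ID-EXAMPLE-1"),
    Prediction("drug_b", "disease_x", 0.42, True, "ID-EXAMPLE-2"),   # score too low
    Prediction("drug_c", "disease_y", 0.88, False, "ID-EXAMPLE-3"),  # not HSA-registered
    Prediction("drug_d", "disease_y", 0.91, True, None),             # no DrugBank mapping
]
kept = [p.drug for p in predictions if passes_filters(p)]
```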
Data Processing
Singapore HSA Data
- Drug registration data from data.gov.sg
- 5,485 registered products processed
- Active ingredients mapped to DrugBank IDs
- 745 unique drugs with successful mapping
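The mapping step can be pictured as normalizing each active ingredient name and looking it up in a name-to-ID table. The sketch below is illustrative only: the separator `;`, the function names, and the placeholder IDs are assumptions, and the real HSA export and DrugBank vocabulary may be structured differently.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def normalize(name: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


def map_ingredients(
    ingredient_fields: Iterable[str],
    name_to_id: Dict[str, str],
    sep: str = ";",  # assumed separator for multi-ingredient products
) -> Tuple[Set[str], Set[str]]:
    """Return (mapped IDs, unmapped ingredient names) across all products."""
    mapped, unmapped = set(), set()
    for field in ingredient_fields:
        for raw in field.split(sep):
            name = normalize(raw)
            if not name:
                continue
            if name in name_to_id:
                mapped.add(name_to_id[name])
            else:
                unmapped.add(name)
    return mapped, unmapped


# Toy, made-up lookup table with placeholder IDs (not real DrugBank IDs).
name_to_id = {"ingredient one": "ID-EXAMPLE-1", "ingredient two": "ID-EXAMPLE-2"}
products = ["Ingredient  One", "ingredient two; Ingredient One", "unknown compound"]
mapped, unmapped = map_ingredients(products, name_to_id)
```

Collecting IDs in a set is what turns many registered products into a smaller number of unique drugs; keeping the unmapped names makes imprecise or missing mappings easy to review.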
TxGNN Integration
- DrugBank IDs matched to TxGNN knowledge graph
- Predictions generated for all mapped drugs
- Results filtered by confidence threshold
- Final dataset: 31,543 predictions
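The counts reported on this page can be put in proportion with a little arithmetic. The figures below are taken directly from this page; the per-drug average assumes every mapped drug contributes at least one prediction to the final dataset, which the page does not state explicitly.

```python
# Counts reported on this page.
total_predictions = 31_543
dual_validated = 1_217
mapped_drugs = 745

# Share of the final dataset that is supported by both KG and DL.
dual_share = dual_validated / total_predictions

# Average predictions per mapped drug (assumes all mapped drugs appear).
per_drug = total_predictions / mapped_drugs

print(f"KG+DL share: {dual_share:.1%}")
print(f"Average predictions per mapped drug: {per_drug:.1f}")
```

Under these assumptions, roughly 3.9% of predictions are dual-validated, and each mapped drug has about 42 predictions on average.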
Limitations
Model Limitations
- TxGNN trained on historical data (may miss recent discoveries)
- Some drugs/diseases not in knowledge graph
- Predictions are computational hypotheses, not clinical evidence
Data Limitations
- HSA data may not include all marketed products
- Some ingredient mappings may be imprecise
- Evidence collection limited to English literature
Interpretation
- L5 predictions require clinical validation
- High DL scores don’t guarantee clinical efficacy
- Always consult healthcare professionals
Reproducibility
Code & Data
- Source code: GitHub
- TxGNN model: Harvard Dataverse
- HSA data: data.gov.sg
Version Information
| Component | Version |
|---|---|
| TxGNN Model | v1.0 (Nature Medicine 2024) |
| HSA Data | March 2026 |
| SgTxGNN | v1.0.0 |
Disclaimer
Predictions are computational hypotheses for research purposes only. Clinical validation is required before any therapeutic application.