Technology

From cosmology to oncology

Studying dark matter

Dark matter is a property of the entire universe, yet it cannot be measured directly.

To overcome this, astrophysicists infer the unmeasurable properties of dark matter from individual pictures of galaxies.

These pictures are effectively combined into a single unified dataset to decipher dark matter’s general properties.

Oncology application

Concr teamed up with astrophysicists to adapt these established algorithms for genuine integration of disparate oncology datasets, creating a single holistic model of patient response.

By overcoming the data integration barrier, Concr not only enables powerful analytics but also provides a critical advantage when working with limited or incomplete data.

Case study

Accurate prediction of in vitro drug response with 300x less data

Key takeaways


Isolated high-confidence results by accurately predicting the molecular features most representative of therapeutic efficacy

300x more data-efficient: 2-3 cell lines were sufficient to achieve an RMSE comparable to that of other methods using 600 cell lines

Ability to generalise to novel drugs and cell lines across therapeutic classes and indications

Authors

Matthew Griffiths, Eilish Middlehurst, Matthew Foster

Background and aim

Data availability and quality are often too poor for meaningful and accurate analysis.

Here we address this challenge using data from the Genomics of Drug Sensitivity in Cancer dataset [1] to predict the IC50 for a specified drug and cell-line pairing.

Data & Modelling

Data input:

  • 810 cell lines (WXS and RNAseq)
  • 175 compounds (SMILES)
  • 118,595 dose response curves (IC50)

Modelling:

Illustration 1

Infer cell phenotype

For each drug, the dataset was split 80:20 into training and validation sets, with an even ratio of sensitive to resistant cell lines for each drug. All response data for Olaparib and Niraparib were excluded from training.

The Concr model was trained on the training set to predict IC50 values from the SMILES molecular descriptions and the dose-response data, and was then used to predict the efficacy of Niraparib and Olaparib.
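The split described above can be sketched as follows. This is a minimal illustration, not Concr's pipeline: the record layout `(drug, cell_line, is_sensitive)`, the function name `split_per_drug`, the seed and the toy data are all assumptions, and the source does not say how sensitivity was defined. Here "even ratio" is read as a stratified split, so each partition keeps roughly the same sensitive-to-resistant ratio.

```python
import random
from collections import defaultdict

# Drugs whose response data are withheld from training, per the text.
HELD_OUT_DRUGS = {"Olaparib", "Niraparib"}


def split_per_drug(records, train_fraction=0.8, seed=0):
    """Split (drug, cell_line, is_sensitive) records 80:20 within each drug.

    Sensitive and resistant cell lines are split separately so that both
    partitions keep roughly the same sensitive/resistant ratio.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for drug, cell_line, is_sensitive in records:
        groups[(drug, is_sensitive)].append((drug, cell_line, is_sensitive))

    train, validation, held_out = [], [], []
    for (drug, _), rows in sorted(groups.items()):
        if drug in HELD_OUT_DRUGS:
            held_out.extend(rows)  # never shown to the model during training
            continue
        rng.shuffle(rows)
        cut = round(train_fraction * len(rows))
        train.extend(rows[:cut])
        validation.extend(rows[cut:])
    return train, validation, held_out


# Toy data, invented for illustration only.
records = [("DrugA", f"line{i}", i % 2 == 0) for i in range(20)]
records += [("Olaparib", f"line{i}", i % 2 == 0) for i in range(20)]
train, validation, held_out = split_per_drug(records)
```

Splitting each (drug, sensitivity) group on its own is one simple way to keep the class balance identical across partitions; other stratification schemes would serve equally well.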

Results

Predicted and observed IC50 values are comparable. The model generates its own uncertainty estimates, so the most confident predictions can be extracted. The Concr model had an RMSE* of 0.46, close to the best possible RMSE of 0.4.
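To make the two ideas in this paragraph concrete, the sketch below computes an RMSE and keeps only the predictions with the smallest predictive uncertainty. It is a hedged illustration: the tuple layout, the function names `rmse` and `most_confident`, the `keep_fraction` parameter and all numbers are invented, and the source does not describe how Concr actually ranks or thresholds its uncertainties.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def most_confident(predictions, keep_fraction=0.5):
    """Keep the predictions with the smallest predictive standard deviation.

    `predictions` is a list of (predicted, std, observed) tuples.
    """
    ranked = sorted(predictions, key=lambda row: row[1])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Invented toy values: (predicted IC50, predictive std, observed IC50).
predictions = [
    (1.0, 0.1, 1.1),
    (2.0, 0.2, 1.8),
    (0.5, 0.9, 1.5),
    (3.0, 1.2, 1.0),
]
all_rmse = rmse([p for p, _, _ in predictions], [o for _, _, o in predictions])
confident = most_confident(predictions)
confident_rmse = rmse([p for p, _, _ in confident], [o for _, _, o in confident])
```

In this toy example the low-uncertainty subset has a lower RMSE than the full set, which is the behaviour one would hope for from well-calibrated uncertainties.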

Graph 1

The Concr model was given the dose-response data for 2 randomly selected cell lines for Olaparib and 3 for Niraparib. It accurately predicted the IC50 for the 800 unseen cell lines (RMSE = 0.51 and 0.53, respectively).

Graph 2

Accuracy was comparable to that achieved by state-of-the-art methods [2, 3] with 700 cell lines (RMSE = 0.45).

References

1. Iorio, F., et al. (2016). A Landscape of Pharmacogenomic Interactions in Cancer. Cell, 166(3), 740–754.

2. Chang, Y., et al. (2018). Cancer Drug Response Profile scan (CDRscan): A Deep Learning Model That Predicts Drug Effectiveness from Cancer Genomic Signature. Scientific Reports, 8(1), 1–11.

3. Rahman, R., et al. (2017). Heterogeneity aware random forest for drug sensitivity prediction. Scientific Reports, 7(1), 1–11.

*RMSE = root-mean-square error (also called root-mean-square deviation)

Case study

Concr modelling is generalisable across drugs and cancer types

Key takeaways:

Concr modelling can be generalised to drugs it is 'blind' to

Concr modelling shows early evidence of being generalisable across cancer types

Background and aim

We have previously demonstrated accurate response predictions using the comprehensive Genomics of Drug Sensitivity in Cancer (GDSC) dataset [1]. However, most drugs do not have a database of 800+ cell lines with associated IC50 values. Hence we set out to assess the generalisability of our modelling across cancer types and to drugs the models are ‘blind’ to.

Data & Modelling

Data input:

  • 810 cell lines (WXS and RNAseq)
  • 175 compounds (SMILES)
  • 118,595 dose response curves (IC50)
  • Precision Panc cell line data

Modelling:

The Concr model was trained on GDSC data to predict IC50 values, using the SMILES molecular descriptions and the dose-response data to predict drug efficacy. The trained model was then applied directly to the Precision Panc cell line dataset to test its predictive accuracy.

For drug generalisability, the process of progressively performing cell line viability experiments was simulated, and the simulation was repeated 30 times to provide aggregate statistics. In this study, no structural information about the therapy was provided.
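The simulated campaign can be pictured as a loop that repeatedly "runs" the experiment the model currently expects to be most informative. The sketch below is only an assumption-laden stand-in: it uses greedy selection with a nearest-neighbour predictor in a one-dimensional feature space in place of the Concr model, and the function names, the cutoff of 0.2, the 20 iterations and the toy data are all invented. Only the 30 repeats come from the text.

```python
import random
import statistics


def simulate_campaign(features, true_ic50, n_iterations, rng):
    """Sequentially 'measure' the cell line predicted to be most sensitive.

    The predictor is a stand-in: an unmeasured line's IC50 is guessed from its
    nearest measured neighbour in a one-dimensional feature space.
    """
    n = len(features)
    measured = [rng.randrange(n)]  # start from one random experiment
    for _ in range(n_iterations - 1):
        candidates = [i for i in range(n) if i not in measured]

        def predicted(i):
            nearest = min(measured, key=lambda j: abs(features[i] - features[j]))
            return true_ic50[nearest]

        # Greedy choice: lowest predicted IC50 among unmeasured lines.
        measured.append(min(candidates, key=predicted))
    return measured


def fraction_sensitive_found(measured, true_ic50, cutoff):
    """Share of all lines with IC50 below the cutoff that were measured."""
    sensitive = {i for i, v in enumerate(true_ic50) if v < cutoff}
    return len(sensitive & set(measured)) / len(sensitive)


# Invented toy data: IC50 rises with the feature, plus noise.
data_rng = random.Random(0)
features = [data_rng.random() for _ in range(100)]
true_ic50 = [f + data_rng.gauss(0, 0.05) for f in features]

# Repeat the simulated campaign 30 times, as in the text, and aggregate.
recalls = []
for repeat in range(30):
    measured = simulate_campaign(features, true_ic50, 20, random.Random(repeat))
    recalls.append(fraction_sensitive_found(measured, true_ic50, cutoff=0.2))
mean_recall = statistics.mean(recalls)
```

Repeating the campaign from different random starting points is what makes the aggregate statistic meaningful, since any single run depends heavily on which cell line happens to be measured first.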

Results


The overall predictions of the GDSC model have an R² score of 0.414. Error bars = uncertainty in the prediction; colour of the points = cell line; shape = drug used.
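For readers unfamiliar with the metric, the sketch below shows one common definition of the R² score, the coefficient of determination. This is an assumption: the source does not say whether its score is the coefficient of determination or a squared correlation coefficient, and the function name and toy values are invented.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Invented toy values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
score = r2_score(observed, predicted)
```

Under this definition a score of 1 means perfect prediction and a score of 0 means the model does no better than always predicting the mean of the observed values.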


After fewer than 20 iterations, the model accurately chose cell lines with IC50 values below the targeted cutoff (left panel). After 50 iterations, the majority of the most sensitive cell lines had been selected, and after 100, nearly all of them (right panel).

References

1. Iorio, F., et al. (2016). A Landscape of Pharmacogenomic Interactions in Cancer. Cell, 166(3), 740–754.

Case study

Identifying and validating breast cancer biomarkers for cohort stratification

Key takeaways:

Superior predictive accuracy compared to other methods

7x less patient data required for model training compared to the next best approach

First of its kind: disease-free survival stratification

Authors

Matthew Griffiths, Uzma Asghar, Matthew Foster

Background and aim

Better patient stratification in early-phase trials through more effective biomarkers would not only improve efficacy and speed up drug approval, but also reduce the length and cost of trials, ultimately passing the savings on to healthcare providers.

In this study we used Concr's advanced statistical modelling to predict the cell profiles of responding breast cancer patients and their overall and disease-free survival, segmented into risk groups.

Data & Modelling

Data input - TCGA:

  • 1098 breast cancer patients
  • Therapy used
  • Outcome (OS, DFS)
  • Tumour data (WGS, RNASeq, Illumina 450k Methylation)

Modelling:

A Concr hierarchical Bayesian multi-omic model for identifying risk profiles was built on the cohort, excluding the 500 patients who received alkylating therapy.

The dataset was split into 5 random cohorts: 4 were used to train the OS and DFS models and 1 was used for validation. This was repeated 5 times.
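The five-cohort scheme reads like standard 5-fold cross-validation, which can be sketched as below. This is an interpretation rather than a description of Concr's code: the function name `k_fold_indices`, the seed and the sample count of 100 are invented, and the source does not state exactly which subset of patients was split.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k cohorts of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Toy cohort size, invented for illustration only.
splits = list(k_fold_indices(n_samples=100, k=5))
```

With this scheme every sample appears in exactly one validation cohort, so the five validation results can be aggregated into a single out-of-sample estimate.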

Results


Figure 1. Concr's custom Bayesian cell admixture model identified recurring multi-omic cell profiles in the TCGA cohort, excluding patients who received alkylating therapy. These cell profiles could then be used to infer the subpopulation breakdown of the tumours of patients who had received alkylating agents.


Figure 2. Using k-fold validation, risk profiles were associated with the cell profiles identified by the model, segmenting patients who received alkylating therapy by DFS and OS. Results shown are aggregated predictions of five independent validations. AUC = 0.88, superior to other methods [1] and using 7x fewer patients than the next best approach.
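As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The function name, labels and scores are invented for illustration, and the source does not describe how risk groups were mapped to binary outcomes or how the AUC was aggregated across folds.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` are 1 (event) or 0 (no event); higher scores mean higher risk.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented toy values, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.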


Figure 3. Our models identified specific recurring subtypes of tumour cells commonly seen across the cohort, stratified by genetic, transcriptomic and methylation status into subtypes 1-11. A ‘risk’ score was calculated for each therapy, with significant variation observed across the treatments and subtypes (whiskers = upper / lower quartiles of the confidence interval).

References

1. Dubourg-Felonneau, et al. (2018). A Framework for Implementing Machine Learning on Omics Data. 1–5.


TCGA - The Cancer Genome Atlas
OS - overall survival
DFS - disease-free survival
