
Mutational signature estimation with Hierarchical Dirichlet Process


The Hierarchical Dirichlet Process (HDP) is a popular and elegant model that has gained traction in various fields, including cancer genomics. While researching methods for mutational signature estimation, I noticed that several studies employed HDPs for this purpose. However, many of these studies were published in biological journals and often lacked detailed explanations of how HDPs can be effectively applied to mutational signature estimation.

Nicola Roberts’ dissertation includes some descriptions of the mathematical model in the appendix, but even these are not fully comprehensive. In this blog post, I aim to bridge this gap by introducing the problem of mutational signature estimation using HDPs in a mathematically rigorous manner. I will begin by discussing a simpler mixture component model for mutational signatures, then explore the Dirichlet Process model, and finally delve into the Hierarchical Dirichlet Process model.

Introduction

A key characteristic of cancer is the elevated mutation rate in somatic cells. Recognizing the patterns of somatic mutations in cancer is essential for understanding the underlying mechanisms that drive the disease (Fischer et al., 2013). This knowledge is crucial for advancing cancer treatment and prevention.

One type of somatic mutation is the single nucleotide variant (SNV), also referred to as a single base substitution. This involves the replacement of one base pair in the genome with another. There are six possible base pair mutations: C:G>A:T, C:G>G:C, C:G>T:A, T:A>A:T, T:A>C:G, and T:A>G:C. When considering the context of the immediate 5’ and 3’ neighboring bases, the total number of mutations increases to $6 \times 4 \times 4 = 96$. These are known as trinucleotide mutation channels. Examples of such mutations include ACA:TGT>AAA:TTT and ACT:TGA>AGT:TCA. For simplicity, we will abbreviate these mutations as ACA>AAA and ACT>AGT respectively.
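To make the count concrete, here is a small Python sketch that enumerates the 96 channels. The `A[C>A]A`-style naming below is the COSMIC-style convention rather than the ACA>AAA abbreviation used above; both describe the same channels.

```python
from itertools import product

# The six base substitutions, written with a pyrimidine (C or T) as the mutated base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = ["A", "C", "G", "T"]

# 6 substitutions x 4 possible 5' neighbours x 4 possible 3' neighbours = 96 channels.
channels = [f"{five}[{sub}]{three}" for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)]

assert len(channels) == 96
print(channels[:3])  # ['A[C>A]A', 'A[C>A]C', 'A[C>A]G']
```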

Mutational Signature

A mutational signature $\theta$ with respect to trinucleotide mutation channels is a discrete probability distribution over the 96 mutational channels, i.e. $\theta \in \Delta^{96}$, where

$$\Delta^{n} = \left\{ (x_1, x_2, \ldots, x_{n}) \in \mathbb{R}^{n} \mid x_i \geq 0 \text{ for all } i \text{ and } \sum_{i=1}^{n} x_i = 1 \right\}.$$

A mutational signature $\theta$ can be visualised as follows.

[Figure: visualisation of an example mutational signature over the 96 trinucleotide mutation channels]

The framework for studying somatic mutations through mutational signatures was introduced in a landmark study by Alexandrov et al. (2013), where over 7,000 bulk-sequenced cancer samples were analyzed. Conceptually, a mutational signature represents a biological process acting on the genome, leaving a distinct imprint captured in a probability vector $\theta$. A comprehensive list of mutational signatures and their proposed underlying biological processes is available through the COSMIC project. Some of the mutational signatures and their associated probability vectors $\theta$ in the COSMIC project are consistently identifiable across most examined cohorts and signature identification algorithms (cite).

To estimate mutational signatures and their activities, Alexandrov et al. employed Non-Negative Matrix Factorization (NMF). NMF and its variations remain the most widely used methods for estimating mutational signatures and mutational signature activities.
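As a point of reference for the later sections, here is a minimal sketch of the NMF formulation using scikit-learn on a toy count matrix. The matrix shape, number of signatures, and random data are illustrative assumptions; this is not the exact pipeline of Alexandrov et al.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy mutational catalog: counts of the 96 channels in each of 20 samples (random, for illustration).
rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(20, 96)).astype(float)

# Factorise X ≈ W @ H with K signatures; in plain NMF, K must be chosen by the analyst.
K = 3
model = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)  # per-sample signature activities, shape (20, K)
H = model.components_       # non-negative signature matrix, shape (K, 96)

# Normalise each row of H so that every signature is a probability vector over the 96 channels.
signatures = H / H.sum(axis=1, keepdims=True)
```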

Dirichlet Process (DP)

A key mathematical tool for our study is the Dirichlet Process (DP). Formally, the Dirichlet process, denoted as $\mathrm{DP}(\alpha, H)$, is a distribution over probability measures. When we sample $G \sim \mathrm{DP}(\alpha, H)$, we obtain a probability measure $G$. The Dirichlet Process has two parameters: the concentration parameter $\alpha > 0$, and the base measure $H$. The support of $G$ is a subset of the support of the base measure $H$.

To sample from the Dirichlet process, we begin by drawing an infinite sequence of i.i.d. samples $\tilde{\theta}_1, \tilde{\theta}_2, \ldots \sim H$. Then, we draw a vector $\boldsymbol{e} = \{e_1, e_2, \ldots\} \sim \operatorname{Stick}(\alpha)$, where $\operatorname{Stick}(\alpha)$ refers to the stick-breaking process. Intuitively, the stick-breaking process works by repeatedly breaking off and discarding a random fraction of a “stick” that is initially of length $1$. Each broken-off part of the stick has a length $e_k$. The stick-breaking process generates a sequence $\boldsymbol{e} \in \Delta^\infty$, where the total length of all the broken pieces is $1$.

Finally, we construct the random measure $G$ as:

$$G = \sum_{k=1}^{\infty} e_k \delta_{\tilde{\theta}_k},$$

where $\delta_{\tilde{\theta}_k}$ is a Dirac measure (a point mass) centered at $\tilde{\theta}_k$. The resulting measure $G$ is discrete, with countable support given by $\{\tilde{\theta}_1, \tilde{\theta}_2, \ldots\}$.
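The construction above can be simulated directly. The sketch below draws an approximate sample from $\mathrm{DP}(\alpha, H)$ by truncating the stick-breaking process after a fixed number of breaks; the truncation level and the choice of base measure are assumptions made for illustration.

```python
import numpy as np

def sample_dp(alpha, base_sampler, truncation=1000, rng=None):
    """Approximate draw G ~ DP(alpha, H) via truncated stick-breaking.

    base_sampler(rng) returns one draw from the base measure H.
    Returns the atoms theta_tilde_k and their stick-breaking weights e_k.
    """
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=truncation)                   # fraction broken off each time
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                                      # e_k; sums to ~1 for large truncation
    atoms = [base_sampler(rng) for _ in range(truncation)]           # theta_tilde_k ~ H
    return atoms, weights

# Example: H = Dir(1/96 * 1_96), the base measure used for mutational signatures below.
atoms, weights = sample_dp(alpha=5.0,
                           base_sampler=lambda rng: rng.dirichlet(np.full(96, 1.0 / 96)))
```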

Mathematical properties of Dirichlet Process

The base measure $H$ serves as the mean of the random measure $G$. Specifically, for any measurable set $A$, the expectation of $G(A)$ under the Dirichlet process is given by [@millerDirichletProcessModels]:

$$\mathrm{E}_{G \sim \mathrm{DP}(\alpha, H)}[G(A)] = H(A).$$

As $\alpha \rightarrow \infty$, the distribution of $G$ converges to $H$ in the weak topology, meaning that larger values of $\alpha$ make $G$ more closely resemble the base measure $H$.

In summary, $G$ is centered around the prior distribution $H$, with the concentration parameter $\alpha$ controlling how tightly $G$ clusters around $H$. A higher $\alpha$ reduces variability, causing $G$ to approximate $H$ more closely.
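Both properties are easy to check numerically with a truncated sampler like the one above; the base measure, the set $A$, and the parameter values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0  # try larger values to see G(A) concentrate around H(A)

def draw_G_of_A(truncation=2000):
    # One truncated draw G ~ DP(alpha, H) with H = Uniform(0, 1); report G(A) for A = [0, 0.3].
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining
    atoms = rng.uniform(0.0, 1.0, size=truncation)
    return weights[atoms <= 0.3].sum()

draws = np.array([draw_G_of_A() for _ in range(2000)])
print(draws.mean(), draws.std())  # mean ≈ H(A) = 0.3; the spread shrinks as alpha grows
```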

References

Recommended material for Dirichlet Process

Notebook implementation of the Dirichlet Process

Modelling mutational signature with DP

Dirichlet processes have been utilized in previous studies to model trinucleotide mutations associated with mutational signatures (cite) and (cite). Although Appendix B (p. 234) and Chapter 4 (p. 132) in (cite) provide a mathematical overview of the model, they lack sufficient detail and are not fully comprehensive. This section aims to address that gap by deriving the Dirichlet Process model for trinucleotide mutations in a thorough and accessible manner. We will begin by introducing a more intuitive approach to modeling trinucleotide mutations using a mixture model. To illustrate, consider the trinucleotide mutations within a single sample, which can be encoded as $x \in \{1, \ldots, 96\}$, representing the 96 possible trinucleotide mutation types.

Step 1: Known number of mutational signatures $K$

Let’s begin by assuming we know the number of active mutational signatures, denoted as $K$. Each signature is represented by a discrete probability distribution, $\tilde{\theta}_k \in \Delta^{96}$, where $k=1, \ldots, K$. Since we assume that we have no prior knowledge of the specific mutational signatures, we model these distributions as:

$$\tilde{\theta}_k \sim \operatorname{Dir}\left(\frac{1}{96} \cdot \mathbf{1}_{96}\right) \text{ for } k=1,\ldots, K,$$

where $\operatorname{Dir}$ denotes the Dirichlet distribution, serving as a symmetric prior. Similarly, we assume no prior information about the activity of these mutational signatures. We model the activity of the $K$ mutational signatures $\boldsymbol{e} \in \Delta^K$ in a sample as:

$$\boldsymbol{e} \mid \alpha, K \sim \operatorname{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K\right),$$

where $\alpha$ is a hyperparameter controlling the concentration of the distribution. Next, let $z_i=k$ represent the event that the $i$-th mutation was generated by the $k$-th mutational signature. If we observe a total of $M$ trinucleotide mutations in our sample, we draw each $z_i \in \{1,\ldots, K\}$ according to the mutational signature activity $\boldsymbol{e}$, as follows:

$$z_i \mid \boldsymbol{e} \sim \operatorname{Categorical}(\boldsymbol{e}) \text{ for } i=1,\ldots, M,$$

indicating that the probability $\mathrm{P}(z_i=k)=e_k$, where $e_k$ is the activity level of signature $k$. Finally, the observed trinucleotide mutation $x_i \in \{1, \ldots, 96\}$ is drawn from a categorical distribution based on the corresponding mutational signature $z_i=k$ and its distribution $\tilde{\theta}_k$:

$$\{1,\ldots, 96\} \ni x_i \mid z_i=k, \tilde{\theta}_k \sim \operatorname{Categorical}\left(\tilde{\theta}_{k}\right) \text{ for } i=1,\ldots, M.$$

This completes our model, which describes how the mutational signatures $\{\tilde{\theta}_1, \ldots, \tilde{\theta}_K\}$ generate the observed trinucleotide mutations $\{x_1, \ldots, x_M\}$.
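The generative process of Step 1 can be written as a few lines of NumPy. The values of $K$, $M$, and $\alpha$ below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, alpha = 4, 1000, 1.0

# theta_tilde_k ~ Dir(1/96 * 1_96) for k = 1..K: the K unknown signatures.
signatures = rng.dirichlet(np.full(96, 1.0 / 96), size=K)   # shape (K, 96)

# e ~ Dir(alpha/K * 1_K): signature activities in this sample.
e = rng.dirichlet(np.full(K, alpha / K))                     # shape (K,)

# z_i ~ Categorical(e), then x_i ~ Categorical(theta_tilde_{z_i}).
z = rng.choice(K, size=M, p=e)
x = np.array([rng.choice(96, p=signatures[k]) for k in z])   # observed channels, values in 0..95
```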

Graphical model for known number of components $K$

We can visualize the process that generates trinucleotide mutations in a sample using a graphical model:

[Figure: graphical model of the finite mixture with $K$ components]

Here the grey background color indicates that $x_i$ is an observed random variable.

Step 2: Equivalent model as mixture

In the next step, we aim to integrate out the indicator variable $z_i$ in our model, seeking a different representation of the same model. As before, we draw the mutational signature activities as follows:

$$\boldsymbol{e} \mid \alpha, K \sim \operatorname{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K\right).$$

Let $H=\operatorname{Dir}\left(\frac{1}{96} \cdot \mathbf{1}_{96}\right)$. Similar to the last step, we generate the mutational signatures as

$$\tilde{\theta}_k \sim H \text{ for } k=1,\ldots, K.$$

Now, instead of assuming that each observation $i=1, \ldots, M$ is first assigned a mutational signature $z_i \in \{1, \ldots, K\}$ and then drawn from the corresponding signature $\tilde{\theta}_{z_i}$, we directly draw $\theta_i$ from $\{\tilde{\theta}_1, \ldots, \tilde{\theta}_K\}$, with probabilities $\{e_1, \ldots, e_K\}$, for each observation:

$$\theta_i \mid \boldsymbol{e}, \tilde{\theta}_1, \ldots, \tilde{\theta}_K \sim G=\sum_{k=1}^K e_k \delta_{\tilde{\theta}_k} \text{ for } i=1,\ldots, M.$$

Here, $G$ represents a finite mixture distribution with support on the acting mutational signatures. Finally, we draw the trinucleotide mutation as follows:

$$x_i \mid \theta_i \sim \operatorname{Categorical}(\theta_i) \text{ for } i=1,\ldots, M.$$

Step 3: Unknown, unbounded number of components $K$

Let

$$\Delta^{\infty} = \left\{ (\theta_1, \theta_2, \ldots) \in \mathbb{R}^{\infty} \mid \theta_i \geq 0 \text{ for all } i \text{ and } \sum_{i=1}^{\infty} \theta_i = 1 \right\}.$$

We now extend the model from the previous step to allow for an unknown number of components $K$, with any $K \in \mathbb{N}$ being possible. As in the previous step, we generate the underlying mutational signatures $\tilde{\theta}_k \in \Delta^{96}$ from $H$ as follows:

$$\tilde{\theta}_k \sim H \text{ for } k=1,2,\ldots$$

Previously, we sampled a finite set of $K$ mutational signatures $\tilde{\theta}_k$, but now we generate an infinite sequence of signatures. Accordingly, the mutational signature activities $\boldsymbol{e}$ now lie in $\Delta^{\infty}$. To model these activities, we use the stick-breaking process, which defines a distribution over $\Delta^{\infty}$. Thus, we sample the mutational signature activities as:

$$\boldsymbol{e} \mid \alpha \sim \operatorname{Stick}(\alpha).$$

Recall from the Dirichlet Process section that the stick-breaking process generates a sequence $\boldsymbol{e} \in \Delta^\infty$ whose entries sum to $1$. Consequently, the mixture distribution $G$, which previously had finite support, now has countably infinite support $\{\tilde{\theta}_1, \tilde{\theta}_2, \ldots\}$ and takes the form:

$$\mathcal{P}(\Delta^{96}) \ni G = \sum_{k=1}^{\infty} e_k \delta_{\tilde{\theta}_k},$$

where $\mathcal{P}(\Delta^{96})$ denotes the space of probability distributions on $\Delta^{96}$. We draw $\theta_i$ from the support of $G$, which is $\{\tilde{\theta}_1, \tilde{\theta}_2, \ldots\}$, as follows:

$$\theta_i \mid G \sim G \text{ for } i=1,\ldots, M.$$

Finally, we observe the mutations as

$$\{1,\ldots, 96\} \ni x_i \mid \theta_i \sim \operatorname{Categorical}(\theta_i) \text{ for } i=1,\ldots, M.$$
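Putting Step 3 together, the sketch below simulates mutations from the (truncated) Dirichlet Process mixture; the truncation level and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, M, truncation = 1.0, 1000, 500

# e ~ Stick(alpha), truncated after `truncation` breaks and renormalised.
betas = rng.beta(1.0, alpha, size=truncation)
remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
e = betas * remaining
e = e / e.sum()

# theta_tilde_k ~ H = Dir(1/96 * 1_96) for every retained atom.
signatures = rng.dirichlet(np.full(96, 1.0 / 96), size=truncation)

# theta_i ~ G (pick an atom with probability e_k), then x_i ~ Categorical(theta_i).
picked = rng.choice(truncation, size=M, p=e)
x = np.array([rng.choice(96, p=signatures[k]) for k in picked])
```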

Step 4: Shorthand notation

We summarize the first three equations above (drawing the $\tilde{\theta}_k$, drawing $\boldsymbol{e}$, and forming $G$) and say that $G$ was generated by the Dirichlet Process $\mathrm{DP}$ with the parameters $\alpha$ and $H$. This way, the above equations simplify to

$$\begin{aligned} G &\sim \mathrm{DP}(\alpha, H),\\ \Delta^{96}\ni \theta_i \mid G &\sim G \text{ for } i=1,\ldots, M, \\ \{1,\ldots, 96\} \ni x_i \mid \theta_i & \sim \operatorname{Categorical}(\theta_i) \text{ for } i=1,\ldots, M. \end{aligned}$$

This concludes our derivation of how we can model trinucleotide mutations with a Dirichlet Process, as used in (cite). The Dirichlet Process approach offers an alternative to the classical Non-negative Matrix Factorisation approach used in (cite) for estimating mutational signatures and signature activities.

Graphical Model of Dirichlet Process

We can visualise the Dirichlet process as a graphical model:

[Figure: the graphical model of a Dirichlet Process]

Hierarchical Dirichlet Process (HDP)

By composing multiple Dirichlet Processes, we obtain a Hierarchical Dirichlet Process (HDP) (cite). Recall that Dirichlet processes excel at modelling mixtures with an unknown number of components. HDPs are tailored to the situation where we have groups of mixture components and/or hierarchies between mixture components. For our model of trinucleotide mutations, we can use the HDP to account for multiple samples. From biology, we expect general trends of mutational signature activity, but we expect those trends to be more similar within a sample than between samples. In a mixture modelling approach, this would be captured by separate mixture models for each sample with information sharing between the weights of each sample. In the Dirichlet Process approach, this is reflected by the HDP model. As before, we start with the prior distribution $H$. We draw a common base distribution $G_0$ from a Dirichlet Process as

$$G_0 \sim \mathrm{DP}(\alpha_0, H).$$

Let’s assume we have $N$ samples. For each sample, we draw a separate random measure $G_j$ from a Dirichlet Process with the common base distribution $G_0$ as

$$G_j \mid G_0 \sim \mathrm{DP}(\alpha_j, G_0) \text{ for } j = 1,2,\ldots, N.$$

In other words, we have a Dirichlet Process draw $G_0$ at the top level and further Dirichlet Process draws $G_j$ at the lower level of the hierarchy. As described in the section on the mathematical properties of the Dirichlet Process, each $G_j$ will vary around $G_0$. This allows the different samples $G_j$ to share information via $G_0$. Let’s assume we observe $M_j$ mutations in the $j$-th sample. We draw the mutations as before:

$$\begin{aligned} \theta_{ji} \mid G_j &\sim G_j \text{ for } i = 1,2,\ldots, M_j, \\ x_{ji} \mid \theta_{ji} & \sim \operatorname{Categorical}\left(\theta_{ji}\right). \end{aligned}$$

The corresponding graphical model is:

[Figure: graphical model of the Hierarchical Dirichlet Process]
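A minimal generative sketch of the HDP for $N$ samples is given below. It uses a truncated stick-breaking construction at the top level and, as a finite-dimensional approximation, draws each sample's weights from a Dirichlet distribution centered on the top-level weights; the truncation, this approximation, and all parameter values are assumptions for illustration, not the exact model used by any particular HDP package.

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha0, alpha_j, T = 5, 5.0, 1.0, 200   # samples, concentrations, truncation level
M = [800] * N                              # mutations per sample (illustrative)

# Top level: shared atoms theta_tilde_k ~ H and global weights beta ~ Stick(alpha0), truncated.
signatures = rng.dirichlet(np.full(96, 1.0 / 96), size=T)
b = rng.beta(1.0, alpha0, size=T)
beta = b * np.concatenate([[1.0], np.cumprod(1.0 - b)[:-1]])
beta = beta / beta.sum()

# Each sample: G_j shares the same atoms; its weights pi_j vary around beta.
samples = []
for j in range(N):
    pi_j = rng.dirichlet(alpha_j * beta + 1e-12)             # small constant avoids zero parameters
    z_j = rng.choice(T, size=M[j], p=pi_j)                    # signature assignment of each mutation
    x_j = np.array([rng.choice(96, p=signatures[k]) for k in z_j])
    samples.append(x_j)
```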

Estimation of mutational signature and mutational signature activity

Let’s assume we have the mutational signatures $(\theta_k)_{k\in \mathbb{N}} \subset \Delta^{96}$ acting on the genome. Borrowing notation from the model with a finite number of mutational signatures, let $z_{ji} \in \mathbb{N}$ denote the mutational signature that generated the mutation $x_{ji} \in \{1,\ldots,96\}$. By fitting an HDP to our data, we obtain an estimate $\hat{z}_{ji}$ of $z_{ji}$. Our estimate $\hat{\theta}_k$ for the associated probability vector $\theta_k \in \Delta^{96}$ of mutational signature $k$ is

$$\hat{\theta}_k=\frac{\sum_{l=1}^{96}\sum_{j=1}^N\sum_{i=1}^{M_j} \mathbb{I}(\hat{z}_{ji}=k)\,\mathbb{I}(x_{ji}=l)\,\delta_l}{\sum_{j=1}^N\sum_{i=1}^{M_j} \mathbb{I}(\hat{z}_{ji}=k)},$$

where $\delta_l \in \Delta^{96}$ has mass $1$ at position $l$. Further, we estimate the mutational signature activity $e_{kj}$ of mutational signature $k$ in sample $j$ by

$$\hat{e}_{kj}=\frac{1}{M_j}\sum_{i=1}^{M_j} \mathbb{I}(\hat{z}_{ji}=k).$$
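Given fitted assignments $\hat{z}_{ji}$ (however they were obtained from an HDP implementation), both estimators can be computed directly. The sketch below assumes the assignments and mutations are stored as lists of NumPy integer arrays, one per sample; this data layout is an assumption for illustration.

```python
import numpy as np

def estimate_signatures_and_activities(z_hat, x, n_channels=96):
    """Compute hat{theta}_k and hat{e}_{kj} from fitted assignments.

    z_hat: list of length N; z_hat[j][i] is the signature label assigned to mutation i of sample j.
    x:     list of length N; x[j][i] is the observed channel (0..95) of that mutation.
    """
    labels = np.unique(np.concatenate(z_hat))

    # hat{theta}_k: empirical channel distribution of all mutations assigned to signature k.
    theta_hat = {}
    for k in labels:
        counts = np.zeros(n_channels)
        for z_j, x_j in zip(z_hat, x):
            counts += np.bincount(x_j[z_j == k], minlength=n_channels)
        theta_hat[k] = counts / counts.sum()

    # hat{e}_{kj}: fraction of mutations in sample j assigned to signature k.
    e_hat = np.array([[np.mean(z_j == k) for z_j in z_hat] for k in labels])
    return theta_hat, e_hat
```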

Advantages of the HDP approach compared to the classical NMF approach for mutational signature estimation

The Hierarchical Dirichlet Process (HDP) approach offers several advantages over the classical Non-negative Matrix Factorization (NMF) method. It allows for the incorporation of prior knowledge and the imposition of group structures, such as the expectation that certain features should behave in coordinated ways (cite). In mutational signature analysis, two key tasks, (1) de novo discovery of mutational signatures and (2) estimation of signature activity, can be performed simultaneously using HDP. The method can match the observed mutational catalog to an existing signature library while also identifying new signatures, by pseudo-counting the existing library as observational data. Additionally, HDP facilitates learning the number of signatures directly from the data, overcoming a common challenge in NMF, which often struggles to select the appropriate number of signatures. While some NMF variants can quantify uncertainty, HDP provides this capability more easily.

References

Python Implementations 1

Python Implementations 2

Recommended Material for HDP
