Science
View as Markdown Suggest changes

Extending HDP for Mutational Signatures: Incorporating Evolutionary Trees

· Reading time: 8 min
Extending HDP for Mutational Signatures: Incorporating Evolutionary Trees

In my previous blog post, I introduced the Hierarchical Dirichlet Process (HDP) as an elegant approach for mutational signature estimation. The HDP model allows us to simultaneously discover mutational signatures and estimate their activities across multiple samples while learning the number of signatures directly from the data.

However, the standard HDP approach treats each sample independently, missing a crucial biological insight: evolutionarily related cells should exhibit more similar mutational signature activities. In this follow-up post, I’ll present an extension of the HDP model that incorporates phylogenetic tree structures to capture this evolutionary relationship.

The Biological Motivation

With the advent of high-resolution techniques such as single-cell sequencing, we can now model the varying activities of mutational signatures within a tumor at unprecedented resolution. A key insight is that evolutionarily related cells should have more similar mutational signature activities than distantly related ones.

In cancer, a subclone represents a set of tumor cells descending from the same ancestor and hence sharing mutations. For many measurement modalities, there are established methods to infer relationships between subclones, often represented as tree structures. Examples include:

For our purposes, the specific tree construction method is less important than having a reliable tree topology. Depending on the type of mutational event under consideration and the available data, we can use any of these methods to infer relationships.

Consider the following example tree, where edges represent evolutionary relatedness:

300

Note that we have varying numbers of cells per node. In particular, we can have 0 cells attached to a node of the tree. Let’s drop the number of cells per node and enumerate the nodes:

500

Let T\mathcal{T} represent the topology of the tree. Without loss of generality, we assume a single tree for our input data. If multiple trees exist, such as when separate trees are constructed for different tumors, we can combine them by introducing a new root node and connecting the root nodes of the individual trees to this new root.

This tree consists of nodes V(T)={v1,v2,v3,v4,v5}V(\mathcal{T})=\{\color{#BEBEBE} v_1, \color{#BB6438} v_2, \color{#B25870} v_3, \color{#7630A9} v_4, \color{#6672C9} v_5\}. Each node can represent a subclone or serve as a connecting node. For each node vjv_j we observe trinucleotide mutations xj,1,,xj,Mj{1,,96}x_{j,1},\ldots, x_{j,M_j}\in \{1,\ldots,96\}. Note that Mj=0M_j=0 is allowed. Let ejΔe_{\cdot j}\in \Delta^{\infty} denote the activity catalogue in node jj.

The key biological insight: Subclones with closer evolutionary history should have more similar mutational signature activity. Let dd be a measure of dissimilarity on Δ\Delta^{\infty}. In the above example we expect that d(e4,e5)d(e2,e5)d(\color{#7630A9} e_{\cdot 4}, \color{#6672C9} e_{\cdot 5}) \leq d(\color{#BB6438} e_{\cdot 2}, \color{#6672C9} e_{\cdot 5}). When we have access to tree topology T\mathcal{T}, we can incorporate T\mathcal{T} into our model as prior knowledge when inferring z^ji\hat{z}_{ji} and consequently θ^k\hat{\theta}_k and e^kj\hat{e}_{kj}.

The Tree-Structured HDP Model

Building on the standard HDP framework from my previous post, I now extend it to incorporate tree structure. The key modification is that instead of having all samples share a common base distribution G0G_0, we now have a hierarchical structure where each node in the tree has its own Dirichlet Process that depends on its parent.

Sticking to the color coding from above, the graphical model of the proposed model looks as follows:

500

Each variable GG is a Dirichlet Process. For simplicity, we have omitted the scaling parameters αj\alpha_j for each Dirichlet process from the graph. In this example, we observe trinucleotide mutations in nodes v2,v4,v5v_2, v_4, v_5 but not in v1,v3v_1, v_3.

The key insight: Recall from the mathematical properties of the Dirichlet Process that G4G_4 and G5G_5 vary around G3G_3. Similarly, G2G_2 and G3G_3 vary around G1G_1. Hence, we can expect that the distribution of G5G_5 is more similar to G4G_4 than to G2G_2. This means that e5\color{#6672C9} e_{\cdot 5} should be more similar to e4\color{#7630A9} e_{\cdot 4} than to e2\color{#BB6438} e_{\cdot 2}, which directly addresses our biological motivation.

In the above example, G2G_2 is the base probability distribution of G1G_1. Therefore, the support of G2G_2 is a subset of the support of G1G_1, i.e., supp(G2)supp(G1)=θ~1G1,θ~2G1,\operatorname{supp}(G_2) \subset \operatorname{supp}(G_1) = {\tilde{\theta}_1^{G_1}, \tilde{\theta}_2^{G_1}, \dots}. A major shift in signature activity from node v1v_1 to v2v_2 could present a challenge for the model. It would be valuable to investigate whether such a shift is biologically plausible and, if so, to test the model in that scenario.

Mathematical Formulation

The sampling of G0G_0, θji\theta_{j i}, and xjix_{ji} works exactly as in the classical HDP model from my previous post:

G0DP(α0,H),θjiGjGj for i=1,2,Mj,xjiθjiCategorical(θji).\begin{aligned} G_0 &\sim \mathrm{DP}(\alpha_0, H),\\ \theta_{j i} \mid G_j &\sim G_j \text{ for } i = 1,2,\ldots M_j, \\ x_{ji} \mid \theta_{j i} & \sim \operatorname{Categorical} \left(\theta_{j i}\right). \end{aligned}

The key innovation is in the sampling of GjG_j. Consider a tree topology T\mathcal{T}. We write (vi,vs)T(v_i,v_s)\in \mathcal{T} if and only if the ii-th node is a child of the ss-th node. Then we model the Dirichlet Process at each node vjv_j of the tree T\mathcal{T} as:

GjGsDP(αj,Gs) for (vi,vs)T.\begin{aligned} G_j \mid G_s &\sim \mathrm{DP}(\alpha_j, G_s) \text{ for } (v_i,v_s)\in \mathcal{T}. \end{aligned}

In words, at each node of the tree we draw GjG_j from a Dirichlet Process with an individual scaling parameter αj\alpha_j and the base probability distribution GsG_s from the parent node. By the mathematical properties of the Dirichlet Process, GjG_j varies around the probability distribution GsG_s of its parent node. Hence, distribution properties of GsG_s get propagated to GjG_j and further down the tree, ensuring that evolutionarily related nodes have similar mutational signature activities.

Challenges and Limitations

There are several challenges anticipated in making the proposed model practical. The first set of challenges relates to the input data requirements. To construct a tree of relatedness, we need sequencing at a subtumor level, such as sequencing at single-cell or subclonal level. The tumor sample must contain a sufficient number of trinucleotide mutations, typically seen in late-stage tumors. To be able to measure a sufficient number of the trinucleotide mutations, sequencing must cover a large enough genomic region.

Next Steps and Future Directions

Implementation

The current implemented prototype is a fork of the HDP implementation by Nicola Roberts in R with some helper functions. As a next step, an implementation of the package in Python would be interesting to make it more accessible to the broader community.

Benchmarking Against Standard Methods

A crucial next step is to benchmark the tree-structured HDP against standard approaches. The evaluation strategy would involve:

  1. Dataset Selection: Choose a dataset suitable for the proposed method with available tree topology
  2. Method Comparison: Estimate mutational signatures using:
    • Classical NMF approach: ΘNMF\Theta^{\text{NMF}} and ENMFE^{\text{NMF}}
    • Tree-structured HDP: ΘTree-HDP\Theta^{\text{Tree-HDP}} and ETree-HDPE^{\text{Tree-HDP}}
  3. Evaluation: Compare how well the estimated signature activities DNMFD^{\text{NMF}} and DTree-HDPD^{\text{Tree-HDP}} respect the known tree topology T\mathcal{T}

Alternative Approaches

Instead of the Hierarchical Dirichlet Process approach, we could utilize variations of Non-negative Matrix Factorization. Given observed mutation catalogue XX and a constructed tree topology T\mathcal{T}, we could introduce a penalization term RT(E)R_{\mathcal{T}}(E) to ensure that the tree topology is respected:

minS,E0D(X,SE)+λRT(E).\min_{S, E \geq 0} D(X, S E) + \lambda R_{\mathcal{T}}(E).

A desirable property of the function RTR_{\mathcal{T}} would be that for samples i,j{1,...,l}i,j\in \{1,...,l \} attached to nodes in the graph T\mathcal{T} which are close to each other, we have similar EiE_{\cdot i} and EjE_{\cdot j}.

Alternatively, we could explore hierarchical NMF approaches. There is abundant literature on NMF and its variations, including Sugahara et al., Ding et al., Ferreira et al., and Schmidt & Raphael.

Extension to Other Mutational Events

Instead of focusing solely on trinucleotide mutational events, we could study other mutational events with the proposed model. For example, we could study chromosomal instability events using the approach described in Drews et al.. However, this would introduce additional requirements to the input data, such as high read depth, and such datasets are not yet widely available.

Recommended material for chromosomal instability events

Comparison with Existing Literature

It would be beneficial to explore how the proposed approach compares to existing literature, such as Alam et al., to understand the theoretical and practical advantages of our tree-structured approach.

Conclusion

This follow-up post has presented an extension of the standard HDP model for mutational signature estimation that incorporates evolutionary tree structure. By leveraging the biological insight that evolutionarily related cells should have more similar mutational signature activities, the tree-structured HDP model provides a principled way to incorporate phylogenetic information into mutational signature analysis.

The key innovation lies in the hierarchical structure where each node in the tree has its own Dirichlet Process that depends on its parent, ensuring that distribution properties propagate down the tree. This addresses a fundamental limitation of standard approaches that treat samples independently.

While there are practical challenges related to data requirements and implementation, the theoretical framework provides a solid foundation for future work in this area. The next steps involve implementation, benchmarking, and exploring alternative approaches to validate and improve upon this initial proposal.

AI Chat

Ask me anything about Daniel's experience, skills, or background!