Keywords

1 Introduction

Fragment-based approach is a powerful strategy, used both in academia and pharmaceutical industry to develop potent compounds from fragments. Five drugs designed with this approach were approved by the FDA (Bollag et al. 2012; Tap et al. 2015; Perera et al. 2017; Schoepfer et al. 2018), one of which as recently as 2021 (Souers et al. 2013). Fragments are usually small compounds with low molecular weight, having about 20 heavy atoms (Kirsch et al. 2019; Schuffenhauer et al. 2005). The principle of this strategy is to dock a library of fragments onto the protein surface, and to select those that specifically bind the target with optimal target affinity. One or several initial poses are then extended to form a complete chemical compound.

While this strategy is generally applied to chemical compounds for the design, fragment-based approach has been utilized to predict complexes formed by ssRNAs of known sequence with RBP (RNA-Binding Protein) (Hall et al. 2015; de Beauchene et al. 2016; Kappel and Das 2019). To predict the interaction of RNA-RBP complexes, fragments libraries must embrace a large diversity of RNA fragments, including for instance chemical modifications for the design. Such a diversity is also crucial towards a fragment-based design of therapeutic molecules, most of which require a good affinity towards the target to achieve the desired activity (e.g. antagonist, agonist).

A recent approach (González-Alemán et al. 2021) initially performs a sampling of mononucleotides with Multiple Copy Simultaneously Search (MCSS). From a library of mononucleotides, the principle of MCSS is to randomly dock copies of each nucleotide to obtain a set of fragment poses docked on the target with known orientation and position. However, the assembly of consecutive nucleotides into an optimal oligonucleotide, either of known (ssRNA docking) or unknown (ssRNA design) sequence, cannot reasonably be performed in a brute-force fashion due to the punishing combinatorial explosion of candidates. The success of fragment-based approaches hinges critically on the density of poses which, in turn, greatly impacts the number of candidate positions/sequences. The issue is even worsened by a consideration of modified nucleotides, increasing the basis of the induced exponential growth.

In this work, we revisit the fragment-based docking and design through the prism of color coding (see Fig. 1), an algorithmic technique introduced by Alon, Yuster and Zwick (Alon et al. 1994) which captures self avoidance, a necessary condition for ssRNA design and docking. This elegant technique initially addressed the problem of finding sparse motifs in graphs, and has been utilized in the context Bioinformatics for searching (Dost et al. 2008; Shlomi et al. 2006) and counting occurrences of motifs in biological networks (Alon et al. 2008). In Sect. 2, we show how to adapt color coding, further optimized by a clique decomposition, to obtain exact and probabilistic algorithms for fragment-based docking through energy minimization. Section 2.3 describes how to perform leverage tour techniques by relaxing the requirement to be compatible with a given nucleotide sequence. The framework is finally further extended to produce equilibrium statistics for virtually any feature of interest. Section 3 illustrates the practicality of proposed algorithms in the context of 7 RNA binding proteins, and Sect. 4 discusses some limitations of the approach, and future extensions.

Fig. 1.
figure 1

General workflow: Starting from a graph of docked fragments, for which pairwise connectivity has been assessed by an external tool, our method considers various – random or deterministic – colorings from which solutions to hard computational problems can be obtained.

2 Method and Algorithms

Let us make a few assumptions explicit: Firstly, our docking is assumed to be rigid on the protein level, so that ssRNA fragments (nucleotides or k-mers) can be individually docked onto the protein without overly losing precision; secondly, we assume that the length or nucleotide composition of the input ssRNA forbids the adoption of secondary structure elements; Thirdly, the ssRNA/protein system can be assumed to be at the thermodynamic equilibrium so that minimizing the free-energy coincides with maximizing the probability of the joint configuration. Under these assumptions, the fragment-based docking and design of ssRNAs interacting with a protein can both be reformulated as graph problems.

Definitions and Notations. Namely, we denote as fragment f a nucleotide sequence, each associated with a sequence r(f) and a set of relative 3D atomic coordinates. A fragment pose, or pose x, represents a fragment docked onto a protein surface, and is defined by the 3D position of its atoms relative to the protein. An ordered pair of poses is said to be compatible if their spatial occupancy does not induce unresolvable geometric clashes, and enables the sequential connection of the two fragments into a longer RNA. Compatibility is an oriented relation (associated with the polarity of RNA), whose assessment is a problem in its own right, and the object of specialized tools such as MolPy (Chevrollier 2019) or the upcoming NUCLEAR González-Alemán et al. (2023) in the context of shorter ssRNAs.

Through a systematic docking of fragments, e.g. using constrained molecular dynamics, onto the surface of the target protein, followed by an evaluation of the connectivity of resulting fragment poses, one obtains a poses connectivity graph. It is defined as a directed graph, i.e a pair \(G=(V,E)\) where V is a set of fragment poses, and any directed edge \((v,v')\in E\) implies the possibility to connect v and \(v'\). In the following, we denote by \(n:=|V|\) the number of poses in the graph. Any path of the connectivity graph can be associated with a joint ssRNA/protein conformation, called complex in the following.

Next, we associate a notion of free-energy \(\varDelta G(x)\) to any complex \(x= (v_1,\ldots ,v_k)\), defined as:

$$ \varDelta G(x) = \sum _{i=1}^{k} \delta (v_i) + \sum _{i=1}^{k-1} \delta '(v_i,v_{i+1}) $$

where \(\delta :V\rightarrow \mathbb {R} \cup \{+\infty \}\) and \(\delta ':V\times V\rightarrow \mathbb {R} \cup \{+\infty \}\) are terms, specific to a fragment docking procedure, which capture the contributions of individual and pairwise-connected fragments respectively.

However some pairs of fragments may clash, occupying overlapping or overly proximal geometric regions in the 3D space, leading some of the paths of the connectivity graph do not always represent promising candidates. Trivial instances of such a clash occur within complexes that reuse the same pose twice. Beyond such a simple cases, pairwise clashes can be modeled using a clash function \(C : V\times V \rightarrow \{\textsf {True},\textsf {False}\}\). A path x of length k is self-avoiding, also named a k -path, iff it nodes are pairwise distinct. A path is clash-free if and only if its nodes are pairwise non-clashing, i.e. \(\forall 1\le i<j\le k, C(x_i,x_j) = \textsf {False}\). Note that clash-avoidance induces self-avoidance as long as, for each pose v, one has \(C(v,v) = \textsf {True}\).

Problem Statement and Complexity Aspects. Assuming thermodynamic equilibrium, the most stable/probable complex, for a given nucleotides sequence r of length k, is the one having Minimum Free-Energy. Moreover, fragment assembly should be restricted to complexes compatible with the sequence. The computation of such a complex can be restated as follows:

figure a

Computational complexity-wise, for a general input graph, trivial sequence c and unit-valued energy function (\(\delta (\cdot )=-1\), \(\delta '(\cdot ,\cdot )=0\)), MFEdock solves the problem of deciding the existence of a Hamiltonian path in G, implying NP-hardness. The problem remains robustly intractable even when restricted to subclasses of input graphs that can be drawn on a protein surface, such as grid graphs (Itai et al. 1982). Moreover, for the complete compatibility graph and unit-valued energy, solving MFEdock answers the existence of a k-set of non-clashing nodes in the graph \((V,\{v,v' \mid C(v,v')=\textsf {True}\})\), thus solving the Max Independent Set (MIS) problem. However, MIS is not only NP-hard, but also remains intractable (W[1] hard for k) from the perspective of parameterized complexity on geometric instances, e.g. graphs stemming from intersections of segments/discs (Marx 2006). Taken together, these results indicate a robust computational hardness of the problem, motivating the exploration of alternatives and heuristics.

2.1 Ensuring Self-avoidance Through Color Coding

Given the dire complexity status of MFEdock, we initially address a restricted version of the problem that only considers clashes resulting from the reuse of certain poses. In other words, we optimize the energy optimization over self-avoiding paths, which is equivalent to setting \(C(\cdot ,\cdot )=\textsf {False}\). In this setting, the algorithmic problem remains NP-hard but simplifies into a, practically solvable, Fixed-Parameter Tractable (FPT) problem for the path length k, using the color coding technique (Alon et al. 1994).

Classic Color Coding. The key principle of color coding is to associate a coloring \(\kappa : V \rightarrow [1,k]\) to the input graph \(G=(V,E)\), and to replace the (hard) search for a path (or motif) of length k (k-path) with the (easier) search of a colorful path, using each of the k colors exactly once. Colorful paths can be optimized for, and counted, in time that linear on \(n+|E|\), and only exponential on k. For a single coloring, the set of colorful paths is only a subset of k-paths, the optimal k-path may be overlooked. One may then use derandomization to turn this approach into an efficient, deterministic and exact algorithm. To that purpose, one needs to construct a family of colorings which, taken as a whole, represents every possible k-path. Naor et al. (1995) propose an explicit construct for such a family, consisting of \(e^k k^{\mathcal {O}(\log k)} \log n\) colorings. Iterating the search for optimal colorful paths over the family yields an exact algorithm in overall time \(\mathcal {O}((2e)^k k^{\mathcal {O}(\log k)} (n+|E|)\log n)\), critically using \(\mathcal {O}(2^k n)\) memory.

Well-Colored Path as a Memory-Frugal Alternative. To work around this substantial memory requirement, we instead consider a variant of color coding based on well-colored paths. A well-colored path is a k-path whose colors in \(\kappa \) are not only distinct, but occur in a specific order, assumed to be \(1 \rightarrow 2\rightarrow \cdots \rightarrow k\) without loss of generality. For the sake of simplicity, we say that a coloring \(\kappa \) hits (resp. misses) a k-path x when x is well-colored (resp. not well-colored) by x. For any given coloring, the optimal/MFE k-path can be obtained in \(\mathcal {O}(k.(n+|E|))\) time, e.g. using simple dynamic programming. To be well-colored only constrains the colors assigned to the k nodes of x, leaving only one possibility out of the \(k^k\) possible colorings, so the odds of a random uniform coloring hitting x is simply \(\mathbb {P}(x \text { well colored}) = {1}/{k^k}\). Iterating over \(\alpha \) independently-draw random colorings \(\kappa _1,\ldots ,\kappa _\alpha \), the probability of a k-path being missed by all colorings is then

$$\begin{aligned} \mathbb {P}(x \text { missed by}\, \alpha \,\text {random colorings}) = \left( 1-\frac{1}{k^k}\right) ^\alpha . \end{aligned}$$
(1)

This property holds for any k-path, including the MFE path \(x^\star \). Consequently, for any targeted tolerance \(\varepsilon \in (0,1)\), it suffices to set

$$\alpha := \left\lceil \frac{\log {\varepsilon }}{\log \left( 1-\frac{1}{k^k}\right) }\right\rceil \in \varTheta (k^k\,\log {{1}/{\varepsilon }})$$

and we obtain a probabilistic algorithm that returns \(x^\star \) with probability \(1-\varepsilon \), and runs in total time \(\mathcal {O}(k^{k+2}\,\log {({1}/{\varepsilon })}\,(n+|E|))\) and memory linear in both k and n. Derandomization can also be used in the context of well-colored paths. Here, the constructs of Alon et al. (1995), coupled with the earlier results of Schmidt and Siegel (1990), provide a family of \(k^{\mathcal {O}(k)}\log n\) colorings that hits every k-path, thus implying an exact deterministic algorithm for MFEdock w/o clash constraints. Its complexity is now in \(\mathcal {O}(k^{\mathcal {O}(k)}n.\log n)\) for time, marginally higher than for colorful paths, while using a, much reduced, linear memory.

Rejecting from Suboptimals to Produce the Clash-Free MFE Complex. In order to recover the MFE clash-free complex, and thus provide a solution for MFEdock, we elicit to extract it from the list of \(\varDelta \) (self-avoiding) suboptimals, defined as having energy distance at most \(\varDelta \) from the self-avoiding MFE. The list of \(\varDelta \)-suboptimals can be produced using an adapted version of the Waterman/Byers scheme (Waterman and Byers 1985). It starts by computing the well-colored MFE for a given coloring \(\kappa \) using the following dynamic programming scheme:

figure b

Once the \(\textrm{mfe}\) matrix computed in \(\mathcal {O}(k.(n+|E|))\) time, the exhaustive list of \(\varDelta \)-subopts can be obtained using a modified backtrack:

figure c

Such a backtrack essentially runs in time and memory in \(\varTheta (k\,D)\), where D is the total number of \(\varDelta \)-subopts, and is expected to grow exponentially with \(\varDelta \).

An exact, exponential-time in the worst-case, algorithm for the clash-free MFE then starts by computing the global self-avoiding MFE \(E_{\text {SA}}^\star \), using a derandomizing family \(\boldsymbol{\kappa }:=(\kappa _i)_i\) of colorings. It then iterates again several times over the whole family using increasing values of \(\varDelta \) until

$$\varDelta \ge \varDelta _{\max } := E^{-}_{\text {clash-free}} - E_{\text {SA}}^\star ,$$

where \(E^{-}_{\text {clash-free}}\) denotes the clash-free MFE observed for \(\varDelta \)-suboptimals over \(\boldsymbol{\kappa }\) so far. At this point, the algorithm may simply return the clash-free MFE complex within the \(\varDelta _{\max }\) subopts, i.e. the structure \(S^{\star }\) achieving \(E^{-}_{\text {clash-free}}\), since this structure is then the clash-free MFE, and a valid solution to MFEdock.

Indeed, for any clash-free complex \(S'\ne S^{\star }\), if \(S'\) is found in the combined list of \(\varDelta \)-suboptimals, then it has higher energy than \(E^{-}_{\text {clash-free}}\) by definition. If \(S'\) is not listed as a \(\varDelta _{\max }\)-subopt for \(\boldsymbol{\kappa }\) then, for any coloring \(\kappa \) that hits \(S'\), one has \( \textrm{mfe}_{\kappa } + \varDelta _{\max } \le \varDelta G(S')\). Since \(E^\star _{SA} \le \textrm{mfe}_{\kappa } \), one concludes with

$$ \varDelta G(S^{\star }) = E^{-}_{\text {clash-free}} \le E^\star _{SA} +\varDelta _{\max } \le \textrm{mfe}_{\kappa } + \varDelta _{\max } \le \varDelta G(S').$$

The energy of any alternative \(S'\) is thus higher than that of \(S^{\star }\), from which we conclude that our algorithm is correct.

Of course, the practical performances of the algorithm may critically depend on the \(\varDelta _{\max }\) value, i.e. the energy difference between the MFE self-avoiding and clash-free complexes. To mitigate the issue, we introduce in the next Sect. 2.2 an optimization based on cliques, which we illustrate in Fig. 2 along with our algorithm.

Fig. 2.
figure 2

Example of \(\varDelta \)-suboptimal color coding based on monochromatic cliques. From a MFEdock instance (A), including compatibility arcs (blue) and clashes edges (red), a clique cover is heuristically computed (gray). A family of coloring is then generated (B; random or deterministic), and a dynamic programming algorithm allows to build a list of (self-avoiding) k-paths (C). Among those, only presents a clash (red box) and is filtered out to obtain a merged list of clash-free \(\varDelta \)-suboptimals. Notably, the two colorings above are sufficient to hit all clash-free paths. Moreover, , a valid k-path which features two clashing nodes, cannot be well-colored in the current clique cover. (Color figure online)

2.2 Reducing Clashes Through Monochromatic Clique Covers

During the initial docking phase, individual fragments usually cluster around hotspots on the protein surface. On the one hand, such an accumulation is beneficial to the resolution of the selected poses as, to a degree, it enables simulating flexibility. On the other hand, high local densities may result in a combinatorial explosion of clashes, drastically reducing the density of clash-free complexes, all the while hindering the performances of our algorithms. Instead, we would rather focus our effort on a subset of self-avoiding paths featuring a good density of clash-free associated complexes.

Cliques of Clashing Nodes Can be Safely Set to a Single Color. To achieve such a goal, we trivially remark that the nodes of a clashing clique \(\mathcal {C}\), i.e. a set of pairwise clashing poses, may not occur more than once within a reasonable candidate complex. Nicely, such clashes can be avoided by enforcing the monochromaticity of \(\kappa \) with respect to \(\mathcal {C}\), the use of a single color for all \(v\in \mathcal {C}\). This restriction is conservative with respect to clash-free complexes, since any clash-free k-path, hit by a coloring \(\kappa \) and featuring a node \(v_c\in \mathcal {C}\), is also hit by a monochromatic coloring \(\kappa '\) such that

$$\kappa ': v \rightarrow {\left\{ \begin{array}{ll} \kappa (v_c) &{} \text {if }v\in \mathcal {C}\\ \kappa (v) &{} \text {otherwise.} \end{array}\right. }$$

Meanwhile, any k-path that borrowed two or more nodes from \(\mathcal {C}\) is no longer admissible, thereby increasing the density of clash-free complexes within the search space. Moreover, from the perspective of derandomization, this restriction enables the complete clash-free paths to be hit by a smaller family of colorings, since whole cliques can be treated as a single nodes.

This observation, and overall strategy, provably generalizes to collections of disjoint cliques in the clash graph. Two overlapping cliques \(\mathcal {C}\) and \(\mathcal {C}'\), however, should not be forced to be simultaneously monochromatic. Indeed the color of \(\mathcal {C}\) would, due to its overlap with \(\mathcal {C}'\), spread to the latter. This would result in treating \(\mathcal {C}\cup \mathcal {C}'\) as a single clique, thereby potentially eliminating some clash-free k-paths from the search space. In order to minimize runtime, and maximize the density of clash-free paths within the runspace, we preprocess clashes by decomposing them as a clique cover, a partition of nodes into a set of cliques \(\mathfrak {C}=\{\mathcal {C}_i\}\) while attempting to maximize the number of clashing pairs occurring within a clique \(\mathcal {C}_i\). Though not strictly equivalent, this problem is related to min clique cover and likely hard.

A Pragmatic Solution to Decompose Clashes into Non-overlapping Cliques. To pragmatically solve the problem, we implemented an greedy heuristic for min clique cover which initializes the cover \(\mathfrak {C}:= \varnothing \) and, at each iteration, starts from the node \(v^+\) having max degree node in the remaining graph. It initializes a clique \(\mathcal {C}:=\{v^+\}\) and a list of common neighbors \(\mathcal {N} := \text {neighbors}(v^{+})\). Then, until \(\mathcal {N}=\varnothing \), it alternates:

  1. 1.

    Choose a node \(v\in \mathcal {N}\) having max degree within \(\mathcal {N}\), is added to \(\mathcal {C}\);

  2. 2.

    Update of list of common neighbors through \(\mathcal {N} := \mathcal {N} \cap \text {neighbors}(v);\)

The clique \(\mathcal {C}\) is then added to the cover \(\mathfrak {C}\), removed from the clique graph for future iterations (choice of \(v^{+}\), construction of \(\mathcal {C}\)...) until all nodes have been removed from the clash graph and added as part of a clique to \(\mathfrak {C}\). While this heuristic does not provide formal guarantees regarding its result, we found it performs adequately for our typical instances, as shown in Sect. 3.2.

2.3 Rational ssRNA Design as a Relaxation of Docking

Rational design in the context of a fragment-based docking usually requires two properties to be fulfilled by the designed RNA aptamer: Positive design requires the ligand to have optimal affinity, or low interaction free-energy, towards the target protein or targeted pocket; Negative design constrains the ligand to be specifically binding to a given region of the protein. Interestingly, both criteria are at least partially addressed by a simple relaxation of the MFEdock problem.

The required modification simply consists in partially specifying (e.g. using IUPAC codes), or even disregarding altogether (e.g. poly-N mask), the input ssRNA sequence without added complexity. In this setting, solving the MFEdock problem provides an MFE complex, from which both an ssRNA sequence \(r^\star := r(x_1).r(x_2)\cdots r(x_k)\) and its MFE conformation can be derived. More precisely, we can show that: i) No alternative sequence has higher affinity than \(r^\star \) towards the protein (positive design); ii) The binding site induced on the protein surface by \(x^\star \) is the most likely target for \(r^\star \) (negative design). Admittedly, this approach does not enable targeting of a specific site or pocket, since the best complex location of the returned designs is induced by the MFE criterion. Nevertheless, by generating suboptimals and only retaining the first occurrence of each sequence (i.e associated with their MFE complex), one can produce a diversity of sequences that are both stable, and specifically target various sites.

2.4 Equilibrium Statistics

While clearly an important – computationally challenging – problem, docking through energy minimization is hindered by its single focus on the MFE conformation. Indeed, at the thermodynamic equilibrium, the probability of a clash-free complex x follows a Boltzmann distribution

$$ \mathbb {P}(X=x\mid r) = \frac{e^{-\beta .\varDelta G(x)}}{\mathcal {Z}_r} \text { where } \mathcal {Z}_r=\sum _{\begin{array}{c} x'\text { clash-free}\\ \text {and comp. with }r \end{array}} e^{-\beta .\varDelta G(x')} $$

is the partition function for a nucleotide sequence r, \(\mu = RT\) with R the Boltzmann constant and T the absolute temperature. Since the number of valid complexes typically grows (at least) exponentially with k, the probability of the MFE complex becomes abysmally small in larger systems. As an extreme example, for the clique input graph, the number of complexes grows in \(\varTheta (n^k)\) when \(k\ll n\), and even \(n! \asymp (n/e)^{n}\) when \(k=n\), thereby completely crushing the probability of any single complex.

Boltzmann Statistics. This motivates a computation of equilibrium statistics, i.e expected properties of the system under a Boltzmann distribution. Such properties are measured by a set of real-valued feature functions \(\{f_1, f_2, \ldots \}\), each mapping a valid complex to some numerical value in \(\mathbb {R}\). Features can represent any relevant quantity (free-energy, %occupancy of druggable pocket...), provided that they can be effectively computed from a fully-specified complex. The expectation of a feature f is defined as:

$$\begin{aligned} \mathbb {E}(f(X)\mid r) = \sum _{\begin{array}{c} x\text { self avoiding}\\ \text {and comp. with }r \end{array}}f(x)\times \mathbb {P}(X=x\mid r)\end{aligned}$$

and can be interpreted as a collective variables. Probabilities can also be computed as expectations of (0/1)-valued features. Indeed, setting \(f_c(r)= 1\) or 0 depending on the presence/absence of a contact with a targeted residue A, the expectation simplifies into

$$\begin{aligned} \mathbb {E}(f_c(X)\mid r) &= \sum _{\begin{array}{c} x \end{array}} f_c(x)\times \mathbb {P}(x\mid r) = \sum _{\begin{array}{c} x\text { s.t. }f_c(x)=1 \end{array}} \mathbb {P}(x\mid r) = \mathbb {P}\left( f_c(X) = 1\mid r\right) .\end{aligned}$$

Higher moments of the distributions can finally be computed from the expectations of \(f, f^2, f^3\ldots \) enabling access to finer characteristics of the distribution, such as its variance/stddev, skewness, kurtosis...or even correlations between multiple features. Complexity-wise, computing the partition function is provably harder than its, already-hard, optimization version (MFEdock). Worse, as defined for optimization, families of coloring used for derandomization would typically introduces a bias in the subsequent estimates, and thus cannot be used.

Statistical Estimators from Colored Statistics. To work around those hurdles, we adopt an approach that estimates the expectation based on a sequence of random colorings \(\kappa _1,\kappa _2 \ldots \) Namely, we introduce the color-restricted expectation of a feature f given a coloring \(\kappa \) as:

$$\begin{aligned} \mathbb {E}(f(X) \mid r,\kappa ) &= \sum _{\begin{array}{c} x\text { clash-free,}\\ \text {comp. with }r\\ \text {well col. by }\kappa \end{array}} f(x)\,\mathbb {P}(x\mid r,\kappa ) = \sum _{\begin{array}{c} x\text { clash-free,}\\ \text {comp. with }r\\ \text {well col. by }\kappa \end{array}} f(x)\,\frac{e^{-\beta \varDelta G(x)}}{\mathcal {Z}_{r,\kappa }} \text { where }\mathcal {Z}_{r,\kappa } = \sum _{\begin{array}{c} x'\text { clash-free,}\\ \text {comp. with }r\\ \text {well col. by }\kappa \end{array}}e^{-\beta \varDelta G(x')}. \end{aligned}$$

To estimate this quantity, we first introduce a dynamic programming scheme to compute the (coloring-restricted) partition function:

$$\begin{aligned} \mathcal {Z}_\kappa &= \sum _{\begin{array}{c} v\in V\\ \text {such that }r(v)=r_{1} \end{array}} \mathcal {Z}_{v,1} {} & {} \text {and}& \mathcal {Z}_{v,m} &= {\left\{ \begin{array}{ll} e^{-\beta E(v)} &{}\text {if }m=k\\ \displaystyle \sum _{\begin{array}{c} (v,v')\in E\text { s.t.}\\ \kappa (v') = m+1\\ \text {and }r(v')=r_{m+1} \end{array} } e^{-\beta ( E(v)+ E'(v,v'))} \mathcal {Z}_{v',m+1}& \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

A stochastic backtrack then consists in, starting from \(\mathcal {Z}_k\), the repeated choice a node with probability proportional to its contribution to \(\mathcal {Z}_\kappa \), recursing until the \(m=k\) condition is met. The returned random complex is then Boltzmann distributed within well-colored k-paths. The average value of f on a set of generated complexes, further filtered to retain only clash-free paths, represents an unbiased estimator for \(\mathbb {E}(f(X) \mid r,\kappa )\). Our, provably consistent, final estimator takes a collection of random uniformly-distributed colorings, and returns:

$$\begin{aligned} \widehat{f} (\kappa _1,\kappa _2\ldots \kappa _M) = \frac{\sum _{i=1}^M \mathcal {Z}_{r,\kappa _i}\times \mathbb {E}(f(X)\mid r,\kappa _i)}{\sum _{j=1}^M \mathcal {Z}_{r,\kappa _j}} \end{aligned}$$
(7)

3 Results

Implementation. We implemented our algorithms for MFEdock (optimization; subopts; ± sequence constraints) and statistical estimators into the ColorDocking software, a collection of Python scripts interfacing C code, freely downloadable with datasets and further information to reproduce experiments at https://gitlab.inria.fr/amibio/colordocking. All experiments were performed on a PBS cluster with a Linux kernel, using 125 GB of memory. Each calculation was done on a single CPU.

Datasets. We selected seven ssRNA/protein complexes to validate our method. Among them, six complexes are RNA-RRM complexes (RRM; 1B7F, 1CVJ, 2MGZ, 2YH1, 3NNH and 4BS2) and the remaining one is a Pumilio domain (PUF; 3BX3). This generally coincides with the benchmark selected by de Beauchene et al. (2016), removing two structures: 4N0T, which natively interacts with a double-stranded RNA; and 5BZV, another PUF which we saw as redundant with 3BX3. To provide a realistic setting for docking, proteins were prepared and minimised using the CHARMM36 force field in the absence of the ssRNA ligand and solvent. As a result, the protein surface at the RNA binding site may have been altered, and in a more specific way at some specific position of the RNA chain.

For each of our targets, we used the MCSS (Miranker and Karplus 1991) method to generate a distribution of 10.000 fragment poses. These fragments are composed of Adenine (A), Cytosine (C), Guanine (G) and Uracil (U). Only the first 4.000 fragments poses (2.000 anti-conformations and 2.000 syn-conformations) were used for this study. From these, the python package NUCLEAR (González-Alemán et al. 2023) was used to cluster poses within 0.5Å RMSD, and to build matrices (connectivity, clashes, scores) using 4.5Å as a max. value for the O3\('\)-C5\('\) distance. The max contact distance between the protein and a fragment was set to 3.5Å. All atoms of amino acids were considered, as well as for nucleotides, except for the terminal patch which was omitted to improve connectivity. All NUCLEAR runs were constrained to only generate the various matrices (run_type = partial option).

A directed graph (+ clash matrix) was then generated and fed as input to our implementation of ColorDocking algorithm, using collections of clique-level random uniform colorings. We uniformly set the tolerance to \(\varepsilon = 0.01\), i.e. the clash-free MFE complex was predicted with probability of \(p=99\%\). Clash-free MFE candidates were produced, based on a suboptimality cutoff \(\varDelta = 10\) kcal.mol\(^{-1}\).

Table 1. Summary of our benchmark and paths cardinality analysis.

3.1 Stability Analysis

Setting the RNA length to \(k=5\) enables the execution of our algorithm in modest time (about a dozen seconds per run). Such a runtime enables a comparison the results obtained over successive executions, in order to assess the impact of the random generation of colorings on the stability of predictions. Namely, for all of the 7 complexes, and setting \(k=5\), we performed 100 independent experiments. In addition, we used a brute-force approach to compute both self-avoiding and clash-free MFE complexes. As expected, we consistently recover the clash-free MFE complex in our experiments, namely between 97/100 and 100/100 of experiments over all targets. Such a behavior is expected from our choice of \(\varepsilon =0.01\), implying a 1% chance of missing the MFE but could, at least in theory, have been affected by setting \(\varDelta \) to a fixed value. Setting \(\varepsilon =0.63\) expectedly reduces the runtime by a factor 10, at the cost of degraded performances, with the MFE clash-free structure being only found between 25/100 and 47/100 of experiments.

In a second analysis, we investigated whether the top-100 clash-free complex present a strong overlap across independent executions of the algorithm. We filtered the output of \(\varDelta \)-suboptimal version of the MFEdock algorithm (\(\varDelta =10\), \(\varepsilon =0.01\)) to produce the 100 lowest-energy clash-free complexes. We iterated the experiment 100 times, and found the average pairwise overlap between two runs to be of 98%, with very limited variations.

3.2 Impact of Monochromatic Clique Covers

Next we turn to an investigation of the effect of clique-based coloring on the density of clash-free paths, the runtime and energy distance between the self-avoiding MFE and the clash-free MFE. To investigate those points, in addition to the MFE obtained as above, we used a brute-force approach to compute the numbers of unconstrained, self-avoiding and clash-free paths.

We first report in Table 2 the clique covers returned by our greedy heuristic. While the number of poses is in the order of (dozens of) thousands, the number of clashing cliques scales between 40 and 100, with larger cliques representing a sizeable proportion of the vertex set (10 to 20%). Moreover, a large proportion (50% to 70%) of clashing edges are internal to a clique. Such clashes can no longer occur upon restricting to clique-based coloring, substantially reducing the probability of a k-path featuring a clash.

As can be seen in Table 1, the number of paths is typically reduced by 75% when clique-monochromatic self-avoidance is ensured. This results in a runtime reduction an overall factor 1.5 to 4. Meanwhile, as can be seen in Table 2, the energy distance between the self-avoiding and clash-free MFEs can be greatly reduced (e.g. −10 kcal.mol\(^{-1}\) for 2 MGZ) when cliques are used to reduce the search space. Overall, the consideration of monochromatic cliques represents a very positive addition: it greatly improves the runtime and purifies self-avoiding paths to increase the density of clash-free paths.

Table 2. Properties and impact of cliques cover on runtime. Values observed for \(k=5\), averaged over 100 iterations (std dev \(\approx 1\)).

3.3 Docking Through Energy-Minimization Under Different Fragment Definitions

We showcased the versatility of our approach by supplementing the MCSS set of poses with the connectivity graphs associated with overlapping trinucleotide fragments, following de Beauchene et al. (2016). For 1CVJ, we used the ATTRACT software (de Vries et al. 2015) to generate 1.000.000 non-redundant (0.2Å RMSD threshold) fragment poses, in coarse-grained representation. We built a connectivity matrix, using as connectivity criteria a 1.8Å RMSD cutoff for overlapping nucleotides between consecutive fragments. We created a directed graph of connected poses, using the ssRNATTRACT package as described in de Beauchene et al. (2016). We used a new fragment library of RNA trinucleotides extracted from the PDB with the ProtNAff software (Moniot et al. 2022), using our Radius clustering method.

From the 1.000.000 initial poses, 5327 could be assembled in a 5-fragments chain, and were therefore retained in the final graph of connected poses. We then constructed a clash matrix of those poses, using a 1.5 Å distance criteria between two clashing heavy atoms (excluding overlapping nucleotides of connected poses). In addition, we considered a pair of poses as incompatible if both can be connected only at the same position in a 5-fragments chain, since they can not be together in the same chain. The full matrix of poses incompatibility (either clashing or only at the same position in chains) was used to define cliques of incompatible poses.

The resulting MFEdock instance only needs to be executed for \(k=6\), since each fragment represents a trinucleotide, and assembling 6 fragments is sufficient to reach the size of 8 nucleotides. This allows to execute our algorithm in as little as 34 s. Meanwhile, the runtime required by 1CVJ for our MCSS dataset, implying \(k=8\), is of 2.10\(^{3}\) s, or approximately 33 min. Beyond the demonstrated ability of ColorDocking to support multiple fragment definition, we did not analyze further the quality of the produced fragments (e.g. RMSD to native complex), since our goal is not to compare different force fields/fragment definitions.

3.4 Design

To illustrate the capacity of MFEdock to address design, we considered a design study recently published by Perzanowska et al. (2022), where an oligonucleotide targeting a poly(A)-binding protein (PDBID: 1CVJ) was designed. Since this study included modified nucleotides, we considered an extended list of nucleotides: two without modification (A,G), and 5 with modifications: adenosine and guanosine with a phosphorothioate (A\(_{P}\), G\(_{P}\)), protonated adenosine (A\(_\psi \)), N6-methyladenosine (m\(^{6}\) A), N6-methyladenosine including phosphorothioate (m\(^{6}\) A\(_{P}\)), 2’O-methyladenosine (Am) and 2’O-methyladenosine including phosphorothioate (Am\(_{P}\)). All were used in anti- and syn-conformations, for a total of 1 000 poses (500 syn/500 anti) per nucleotide type. They were clustered at 0.5 Å RMSD for a remaining number of 5 785 individual poses.

To generate solutions of length k = 8, we considered a maximum value of \(\varDelta _{max}:=3\), and gradually increased \(\varDelta \) by unit steps, to reach a maximum of 100 clash-free per coloring. We initially did not consider sequence constraints. The number of unique sequences was 75 out of 562 clash-free solutions. We report below the top 10 of unique sequence, along with their minimum free energy:

figure f

Interestingly, those differ from the sequences investigated by Perzanowska et al. (2022). In particular, the pair of sequences having highest affinity in the study, was not found in our list. This is not entirely surprising, since the authors limited their investigation to a single modified nucleotide per design. We further analyzed their two best sequences, running a sequence-constrained instance of MFEdock

figure g

and found that their MFE is significantly higher (+16/+26 kcal.mol\(^{-1}\)), suggesting that our ability to tame the combinatorial explosion grants us access to promising alternatives.

4 Conclusions and Perspectives

We have introduced a new algorithmic framework, based on color coding, to solve natural combinatorial problems arising in the context of fragment-based docking and design of ssRNAs interacting with a protein conformation. We have illustrated their utility in the context of seven RNA binding proteins, showing that color coding provides a versatile toolkit for the study and design of ssRNAs. Our method is not restricted to individually docked nucleotides, and could accommodate other fragment libraries, as demonstrated on trinucleotide fragments.

A key strength of our exact algorithm resides in its linear complexity on the number of pairwise connected poses, only being exponential on the length k of the ssRNA to ensure self-avoidance. As such, it can be seen as a parameterized complexity algorithm, showing that the MFEdock problem is Fixed Parameter Tractable (FPT) for the ssRNA length k. On a practical level, much larger sets of poses/connections could be supported, allowing to explore the impact of various sampling depth/density of poses on the quality of predictions.