Understanding the three-dimensional (3D) spatial organization of gene expression and morphology in biological tissues is crucial for elucidating developmental processes, disease mechanisms and tissue function. Capturing both genetic information and spatial context in 3D, without prior knowledge of genetic or spatial composition, remains a notable challenge in the field.
Traditional imaging methods, such as fluorescence microscopy and in situ hybridization of genetic probes1,2,3, provide spatial information but are limited in throughput and often require prior knowledge of the targets. Sequencing-based spatial transcriptomics technologies have advanced our ability to map gene expression with spatial context4,5. However, these methods typically involve trade-offs among spatial resolution, depth of coverage and signal density and often depend on targeted probes or spatial barcodes that enforce a predefined coordinate system on specimens.
DNA microscopy is a distinct imaging modality that encodes a spatial genetic map of a specimen directly into DNA molecules through a stand-alone chemical reaction, without reliance on prior spatial or genetic information. This method allows for the simultaneous acquisition of nucleotide-level genetic information and spatial organization from within the specimen itself. DNA microscopy has been demonstrated in dense two-dimensional (2D) multicellular specimens6 and applied to study cell-surface protein localization using antibody–oligo conjugates7. It has also been used for mapping DNA barcode arrays for spatial transcriptomics8,9,10. Additionally, computational advances have been shown to improve image reconstruction11,12 and theoretical variations on the technological idea have been proposed13,14.
Unlike methods that measure physical proximities between known15,16,17,18 reference sequences, DNA microscopy reconstructs a physical image using a proximity matrix assembled from interactions between unique, previously unknown and arbitrarily organized DNA or RNA molecules. This approach enables imaging without prior knowledge of the specimen’s spatial organization or genetic sequences, potentially allowing for the detection of single-nucleotide variations and genetic heterogeneity within a single tissue sample.
DNA microscopy begins by randomly tagging biomolecules inside a specimen with unique molecular identifiers (UMIs), consisting of synthetic sequences of randomized nucleotides (Fig. 1a). These UMIs are then converted into an intercommunicating molecular network, where molecular copies migrate through diffusion and link to form dimers (Fig. 1b), whose formation is recorded by integrating random nucleotides, called unique event identifiers (UEIs), into each new UMI–UMI pairing.
a, In situ synthesis of cDNA amplicons in fixed and permeabilized tissue. b, In the unanchored diffusion protocol, an in situ overlap extension PCR amplifies and links copies of UMIs and UMI-tagged cDNA molecules, introducing UEIs (DNA identifiers that label each linking event). c, Sequencing these UEIs, along with UMIs and gene inserts, allows inference of the spatial genetic map of the specimen. d, Unanchored diffusion (over length scale Ldiff) over tens of microns limits resolution by reducing the information content of each UEI. e, UEIs can collectively reduce positional uncertainty, analogous to how multiple photons improve resolution in stochastic super-resolution microscopy. f, Introducing anchored diffusion, such as through anchored polymer extension, reduces the PSF width, improving resolution and bridging multiple length scales. g–i, Starting from DNA adaptor-tagged cDNA (g), precircularized UMIs are annealed and undergo RCA (h), generating tandem copies of UMIs that remain near their points of origin (i). Oligonucleotides bridge adjacent UMIs to form new UEI amplicons. j, cDNA and UEI amplicons undergo in situ amplification by IVT, collectively encoding both proximity and cDNA sequence information.
The frequencies of these UEIs encode spatial proximities; pairs of UMIs that are close together interact more frequently and end up with more UEIs associating them than those that are far apart. After sequencing, the data form a sparse UEI matrix in which both rows and columns represent individual UMI-tagged molecules (Fig. 1c). By solving a statistical inverse problem on this matrix, the relative spatial coordinates of the original UMIs can be inferred. The DNA or RNA sequences associated with these UMIs can then be mapped to these same coordinates, completing a spatial genetic image of the specimen.
DNA microscopy is theoretically capable of volumetric spatial genetic imaging with nucleotide-level readouts because it captures images from within the specimen. This potential has broad implications for the study of gene sequence heterogeneity in 3D tissue, a phenomenon central to the functioning of the immune19 and nervous systems20.
However, three key barriers have limited the application of DNA microscopy in intact 3D tissues. First, the requirement for high-temperature thermal cycling during in situ PCR complicates uniform diffusion because of interfacial stress between the sample and its surroundings and the fact that nondegassed material (including tissue) undergoing this process nucleates bubbles21. Second, reconstructing tissues with structures and inhomogeneities across multiple length scales requires the UEI matrix to accurately represent these spatial variations. Third, reconstructing spatial information from a network of millions of UMIs distributed across 3D specimens poses substantial computational scalability challenges.
In this work, we overcome the experimental barriers by introducing layered in situ chemistries that capture multiple length scales in UEI data without high-temperature thermal cycling. Furthermore, to address the computational barriers, we develop an inference methodology that scalably and robustly reconstructs molecular positions across large length scales in 3D. We demonstrate the effectiveness of our approach by applying whole-transcriptome volumetric DNA microscopy to intact zebrafish embryos, showing that 3D image inference recapitulates zebrafish morphology and known gene expression patterns. Our extension of DNA microscopy to 3D, independent of prior templates, paves the way for integrated analysis of genomics and morphology in intact biological tissues.
DNA microscopy encodes proximity by locally dispersing copies of UMI-tagged molecules (Fig. 1d). Earlier DNA microscopy experiments6 (Fig. 1a,b) used unanchored diffusion in a polyethylene glycol (PEG) hydrogel22 to disperse UMI copies over tens of microns. This length scale is important because the positional uncertainty of a UMI pair with a single UEI can be thought of as a point spread function (PSF) dictated by the migration of UMI copies from their point of origin. This is analogous to optical microscopy6, where the PSF can be understood as the probability distribution of a point source location given the detection of a single photon from its vicinity.
Effective resolution of a reconstructed DNA microscopy image not only depends on the PSF but also scales with the inverse square root of the number of UEIs (Fig. 1e). This dependence mirrors the relationship in stochastic super-resolution light microscopy, where resolution improves with the square root of the number of detected photons because of their statistical independence6. Just as multiple photons reduce localization uncertainty in stochastic super-resolution light microscopy, multiple UEIs between UMIs reduce positional uncertainty in DNA microscopy.
Although ideally a DNA microscopy reaction could achieve arbitrary resolution by reducing UMI dispersion to individual molecules and their immediate neighbors; to accomplish this, UMIs would need to be uniformly and extremely densely packed to guarantee the formation of an intact UEI matrix. In any realistic specimen, our control over this molecular distribution would be limited.
We, therefore, reasoned that, to achieve a global mapping of an intact organism from molecular proximity data, one would need to encode both small and large length scales into a single dataset. In practice, this would require engineering how UEIs either localize or delocalize from their UMIs of origin. Over short length scales, we sought to generate UEIs through anchored diffusion or, specifically, to extend an anchored polymer of tandem copies of the same UMI and then to link UMI copies on spatially adjacent polymers. This process would reduce the width of the PSF itself by limiting the dispersion of UMI copies (Fig. 1f). Over larger length scales, we sought to generate UEIs through unanchored diffusion of the kind used for earlier experiments6.
DNA microscopy6 previously achieved the larger of the two length scales using biocompatible and reversible PEG hydrogels formed around the sample to eliminate convection and limit the range of DNA molecule migration during the reaction to approximately 50-μm diameters22. We sought to execute this in parallel with approximately 1-μm-diameter DNA dispersion achieved by rolling circle amplification (RCA)23, in which the leading end of a DNA molecule, polymerizing along a circular template, diffuses while anchored to its point of origin (Fig. 1f). With this combination of approaches, we aimed to integrate both local and larger length scales to enable comprehensive mapping of 3D biological structures.
The designed multiscale-encoding reaction is depicted in Fig. 1g–i and Supplementary Figs. 1–4. Briefly, RNA molecules in fixed and permeabilized cells were reverse-transcribed using random primers into complementary DNA (cDNA) and 3′ DNA overhangs were added (Fig. 1g) using a Tn5 transposase24 and a double-stranded DNA ligase (Methods). Precircularized25 single-stranded DNA molecules containing UMI sequences with 31 randomized nucleotides were then annealed to the ends of the protruding adaptors (Fig. 1h). These UMIs served as unique labels for individual cDNA molecules throughout the specimen.
As in the first demonstration of DNA microscopy6, we used two distinct UMI types (types I and II) to prevent homodimerization (the pairing of a UMI with itself) later in the experiment. A strand-displacing DNA polymerase would elongate the annealed DNA oligonucleotides by RCA to create DNA nanoballs with tandem copies of the same UMI.
During RCA, the 5′ end of the ligated oligonucleotide would remain anchored at its original location (Fig. 1h) while the polymerase would extend the 3′ end, synthesizing long DNA strands to form DNA nanoballs (Fig. 1i) containing deoxyuridine. Anchoring of the RCA product ensured that interaction events (UEIs) formed during this stage reflected spatial proximities to within the length scale of the DNA nanoballs themselves.
Subsequently, to capture longer-range interactions, we used unanchored diffusion during an in situ in vitro transcription (IVT) reaction inside a reversible PEG hydrogel6,22 (Methods). This reaction included uracil endonucleases (USER) that removed the RCA product during the IVT itself while leaving other DNA products (such as the cDNA) unaffected. The IVT amplified DNA in the form of RNA copies and, at low frequency, allowed templated recombination of resulting RNA products over longer length scales (Fig. 1i). This combination of anchored diffusion (through RCA) and unanchored diffusion (through IVT) would encode multiple length scales into the DNA microscopy dataset.
Following amplification, reverse transcription (RT)–PCR of the IVT products would be performed to generate sufficient material for sequencing, using recombination-interfering oligonucleotides to mitigate noise6,26. The PCR products would then be subjected to high-throughput sequencing to read the UMI and UEI sequences (Fig. 1j). All of these products, including RCA-UEIs (UEIs generated from nanoball adjacency), IVT-UEIs (UEIs generated through UMI recombination during IVT, as illustrated in Supplementary Fig. 4) and cDNA (cDNA–UMI pairs), would then be sequenced for image inference and genome alignment27 (Methods and Supplementary Table 1).
We first sought to determine whether RCA polonies could be efficiently formed by this cDNA-anchoring strategy in whole-mount zebrafish embryos. Zebrafish embryos at 24 h post fertilization (hpf) were subjected to volumetric DNA microscopy chemistry (Methods). We compared RCA reactions that incorporated fluorescent deoxyuridine triphosphates (dUTPs) by annealing either linear (Fig. 2a,b) or precircularized (Fig. 2c,d) UMIs. We found, as expected, that RCA products were generated in the latter but not the former. This signal was further increased by additional proteinase permeabilization of embryos (Methods and Fig. 2e,f). This demonstrated the permeability of the embryo under fixation conditions to circularized UMIs, enzymes and other reagents and the effectiveness of the enzymatic reaction in situ.
a–f, Using fluorescent nucleotides during RCA shows the absence of signal when linear UMIs (a) were used compared to circularized UMIs (b) in zebrafish embryos permeabilized by methanol alone (c,d) or with proteinase K (e,f). Scale bars, 100 μm. g, End-to-end experiments amplify products both from cDNA amplicons and UEI amplicons that are sequenced (Rd1, paired-end sequencing read 1; Rd2, read 2; Pr1, PCR primer 1; Pr2, PCR primer 2). UEIs and UMIs are analyzed and grouped separately by low-tolerance clustering. h, UEI matrices that link type I and type II UMIs are then pruned and segregated to identify the largest contiguous UEI matrix. Sequencing rarefaction is shown for end-to-end reactions on four zebrafish embryos, with zebrafish 1 and 2 (zf-1 and zf-2) undergoing end-to-end (normal extension condition) and zebrafish 3 and 4 (zf-3 and zf-4) undergoing end-to-end (enhanced extension condition) experiments. i–o, UMIs (types I and II combined) derived from cDNA amplicons with gene inserts (i,j) and UEI amplicons (k,l) show saturation, whereas UEIs themselves (m,n) and contiguous UEI matrix sizes (o) do not saturate. p, Rarefaction of unique genes to which gene inserts (of the total 31,226 annotated) are mapped.
Next, we performed ‘end-to-end’ in situ reactions on 24-hpf embryos. To prepare the sequencing data for spatial inference, we performed several processing steps to ensure data accuracy and reliability. Sequencing reads were demultiplexed to extract UMIs and UEIs (Fig. 2g), followed by quality filtering and error correction to mitigate the impact of sequencing errors. We used the extended-abundance single-linkage (EASL) clustering algorithm6, which efficiently clusters UMIs and associates them with their corresponding UEIs. The EASL algorithm scales log-linearly with data size and requires minimal computational infrastructure (by relying only on in-place sort and merge functions), making it suitable for handling the large DNA microscopy datasets in this study.
Sequences were then consolidated into UMI–UEI–UMI triplets. We included in all image inference analyses only those UMIs that had a minimum of two UEIs. For volumetric DNA microscopy datasets, we required all UMI–UMI associations to have at least two UEIs. The largest contiguous UEI matrix (Fig. 2h) to come out of this pruning process served as the input for spatial inference.
Sequencing rarefaction analyses of UMIs (from both cDNA and UEI amplicons), UEIs and gene inserts provided a gauge of sequencing coverage impact on sampling. cDNA–UMIs with gene inserts (Fig. 2i,j) and UEI–UMI amplicons (Fig. 2k,l) reached saturation quickly. UEIs (Fig. 2m,n) and the largest contiguous UEI matrix size (Fig. 2o) did not saturate, pointing to substantial spatial information accessible from further sequencing. Lastly, we performed rarefaction analysis on the number of unique genes identified within each cDNA library (Fig. 2p), which we found to be saturated rapidly with increased sequencing depth. Imbalance across UMI types, albeit limited (Supplementary Table 1), arose because of the required difference in circular DNA (circDNA)-annealing sequences of 3′ adaptors.
The task of estimating d-dimensional UMIs or molecular coordinates from a DNA microscopy dataset begins with a statistical model (Fig. 3a) premised on the effective independence of UEI generation, as illustrated earlier in Fig. 1. The underlying generation rate between UMIs i and j will fall off as some characteristic function of the pairwise distance between their respective points of origin, ({overrightarrow{x}}_{i}) and ({overrightarrow{x}}_{j}), for which we use the notation (| | {overrightarrow{x}}_{ij}| |).
a, DNA microscopy image inference is broadly premised on a statistical model of independent measurements (UEI counts, nij) of pairwise proximities between type I and type II UMIs, whose amplified copies are spatially dispersed about points in space. b, This statistical model dictates an optimization that can be written as either the maximization of a probability (of the UEI data nij given the unknown coordinates ({overrightarrow{x}}_{i},{overrightarrow{x}}_{j})) or the minimization of a statistical distance between expected UEIs (given some set of coordinates) and observed UEIs. c, sMLE infers the coordinates of UMIs by minimizing the statistical distance between the expected UEI counts (given coordinates) and observed UEI counts using projected gradient descent along an E-dimensional subspace, Z, constructed from the eigenvectors of the row-normalized UEI count matrix, N. d,e, GSE proceeds similarly (d) but constructs the subspace from the eigenvectors of a matrix product (e) that includes a geodesic kernel matrix (tilde{{bf{W}}}) calculated using a kNN analysis in a rescaled subspace Z. f, A hierarchical version of GSE can be performed by first using coarse inference on randomly chosen data subsets (mS), followed by linear interpolation to global candidate solutions and projected gradient descent using these candidate solutions as a vector basis.
Looking forward through time, we can claim the probability that nij UEIs generated by this UMI pair will be proportional to this rate raised to that number of UEIs (Fig. 3b). Looking backward through time, we can say that this probability is proportional to what is commonly called the Bayesian likelihood that a given observable nij resulted from distance (| | {overrightarrow{x}}_{ij}| |). Maximizing this likelihood is mathematically equivalent to minimizing a statistical distance, namely the Kullback–Leibler divergence, between UEI counts expected from a given set of UMI coordinates and those UEI counts that have been observed (Methods).
Optimizing each UMI coordinate independently can give rise to nonunique solutions that become especially pernicious in the presence of data noise6. To mitigate these issues, the practical deployment of DNA microscopy to specimens with empty space has used6,11 spectral maximum-likelihood estimation (sMLE), which confines the solution space to a linear combination of the top E eigenvectors of the row-normalized UEI count matrix (Fig. 3c). Intuitively, this can be understood as a form of ‘low-pass filter’ on the raw UEI count matrix, whose top eigenvectors isolate patterns of variation that correspond to the most significant spatial relationships.
The drawback of sMLE is the fact that it assesses these patterns much like principal component analysis does28, by seeking to explain the data matrix as a summation of linear relationships to the exclusion of nonlinear relationships. Although DNA microscopy datasets can be locally linear6, over larger length scales, this approximation increasingly breaks down. To resolve this, we developed Geodesic spectral embedding (GSE).
GSE works by preconditioning the UEI count matrix N so that the eigenvector basis on which gradient descent is performed will capture longer-range positional relationships between UMIs and enhance the accuracy of inference at larger length scales (Fig. 3d). GSE begins by constructing ‘global’ eigenvectors from the UEI matrix, as sMLE does. Rather than using these eigenvectors verbatim as the basis for projected gradient descent, the eigenvectors are instead first rescaled to make the mean-square UEI displacements of UMIs as uniform as possible (Methods). GSE then identifies the k = 2 × E-nearest neighbors for each UMI i, so that the neighbors will generally span all E dimensions. Pairwise distances are then calculated. In all analyses of real data, we adjusted these distances using a piecewise linear estimate of local tangent spaces (Supplementary Fig. 5a). This smoothing operation enhances the robustness of the inference to noise over larger length scales (Supplementary Fig. 5b).
GSE next uses these nearest-neighbor distances to compute a Gaussian kernel matrix with nonzero elements only at nearest neighbors (effectively setting non-nearest neighbor distances to infinity, making these distances collectively a form of geodesic distance approximation along the surface swept out by the data in high-dimensional space). This kernel matrix, (tilde{{bf{W}}}), is then projected into the original row-normalized UEI matrix N (Fig. 3e). The top eigenvectors of this matrix product then finally serve as the basis for projected gradient descent toward a coordinate solution, again in a manner similar to sMLE.
The kernel matrix (tilde{{bf{W}}}) on its own is still limited in its capturing complex nonlinear relationships in arbitrarily large datasets because the geodesic distance estimates used to calculate it rely on a nearest neighbor analysis of the raw data’s eigenvectors. To improve the robustness of GSE, particularly for long-range interactions, we introduced a two-stage image inference process, called hierarchical GSE (Fig. 3f and Methods). In the first stage of hierarchical GSE, GSE solutions are computed on small, contiguous subsets of UMIs (approximately 10% of the dataset, with subset size given the symbol mS), followed by linear interpolation (Supplementary Fig. 7) to estimate the remaining UMI positions. This initial inference, collated from a large number of distinct subsets, is then used to form a new E-dimensional subspace from which 2 × E-nearest neighbors can be found (as before) and the remainder of GSE inference is performed.
We benchmarked the GSE algorithm using simulated UEI count matrices derived from ground-truth positions to assess its capability for accurately embedding points in 2D and 3D. Simulations were performed using a negative binomial distribution with p = 0.8 to generate pairwise proximity relationships. In the 2D case, 1 × 104 UMIs and 1 × 105 UEIs were used (Fig. 4a), while the 3D scenario involved 5 × 104 UMIs and 2.5 × 106 UEIs (Fig. 4k). The grid sizes were set to the ‘diffusion unit’ (du), corresponding to a 1/e falloff in the UEI formation rate.
a,k, Ground-truth positions are used to simulate UEI count matrices, using a negative binomial model to sample the distribution of pairwise proximities of points across the dataset: 1 × 104 UMIs with 10 UEIs per UMI in 2D (a) and 5 × 104 UMIs with 50 UEIs per UMI in 3D, with grid sizes set to the diffusion unit of a 1/e falloff in UEI formation rate. b,l, Image inference using sMLE with E = 100. c,d,m,n, GSE at varying values of E to which embeddings were insensitive. e,f,o,p, Hierarchical GSE applies coarse inference on subset sizes mS, followed by finer-length-scale inference. g–j,q–t, UMAP embeddings under varying parameters for the same data. u, Decoherence curves quantifying local and global position fidelity are generated by randomly sampling reference points from the ground-truth coordinates, identifying local neighborhoods at different scales and performing a rigid alignment of each neighborhood in the embedding to its ground-truth counterpart. The mean point spread measures the average distance of each neighborhood in the ground truth, while the y axis (mean r.m.s.d.) shows how well each neighborhood aligns after registration. Shaded regions denote ±1 s.d. over the n = 500 sampled neighborhoods. v,w, GSE demonstrates broad superiority across length scales over alternative inference methods (parametrizations enumerated in the order in which they appear in earlier panels) and recapitulates the expected UEI resolution limits (gray arrows). x, Hierarchical GSE corrects for solution instabilities when UEIs per UMI are reduced to 10. y, Application to earlier unanchored diffusion 2D DNA microscopy datasets (scale bar, 100 μm; photograph reproduced with permission from ref. 6, Elsevier).
Image inference was performed using sMLE with E = 100 eigenvectors (Fig. 4b,l) and the results were compared to those obtained from ‘complete’ GSE at varying values of E (Fig. 4c,d,m,n). GSE showed improvements in reconstructing spatial embeddings, particularly in capturing local features, and showed little sensitivity to the eigenvector parameter E. For both sMLE and GSE, projected gradient descent using increasing numbers of eigenvectors displayed asymptotic convergence in accuracy or root-mean-square deviation (r.m.s.d.) values (Supplementary Fig. 6). Hierarchical GSE was performed as described in Fig. 3, by first performing GSE on coarsely sampled subsets of size mS, followed by finer inference at smaller length scales (Fig. 4e,f,o,p). For comparison, uniform manifold approximation and projection (UMAP) embeddings were generated under varying parameters (Fig. 4g–j,q–t).
To quantify the fidelity of local and global positions, reference points were sampled from ground-truth coordinates and rigid registrations were then performed on different local neighborhoods (Fig. 4u). Under these ‘decoherence curves’, GSE demonstrated broad improvements across length scales over alternative inference methods (Fig. 4v,w), with parametrizations enumerated in the order they appeared in earlier panels and shading plotted as the mean enveloped by the s.d. Critically, hierarchical GSE corrected for artifacts introduced by data undersampling (Fig. 4x and Supplementary Fig. 8a–e) and was robust to changes in the characteristics of the PSF (Supplementary Fig. 8f–j).
The inverse-square-root relationship quantifying the resolution limit as a function of UEIs per UMI (10 in the case of 2D simulations and 50 in the case of 3D simulations) was recapitulated by applying hierarchical GSE across these scenarios (Fig. 4v–x, gray arrows) compared to alternatives. We finally applied hierarchical GSE to unanchored diffusion DNA microscopy datasets (Fig. 4y and Supplementary Fig. 9) from earlier experiments6, recapitulating the ground truth.
Having demonstrated GSE’s effectiveness on accurately reconstructing earlier 2D DNA microscopy datasets and 3D simulations, we sought to evaluate its ability to reconstruct whole-zebrafish datasets. We applied GSE to zebrafish embryo data to assess the method’s scalability and accuracy in large-scale 3D inferences.
The four distinct zebrafish specimens are shown under bright field in Fig. 5a–d. To compare the performance of GSE in reconstructing the spatial structure of these datasets, we evaluated the observed UEI distributions with respect to the distances between GSE-embedded UMIs and those expected from the Gaussian dispersion model (Fig. 5e–l). For comparison, we ran the same analysis on simulated datasets (Supplementary Fig. 8k–n). Under an approximately uniform distribution of UMIs in 3D, we would expect that pairwise distances ∣∣x∣∣ between UMIs associated through UEI to follow the frequency distribution (sim | | x| {| }^{2}{e}^{-| | x| {| }^{2}/{L}^{2}}), commonly known as the Maxwell–Boltzmann distribution, because, for any given UMI, ∣∣x∣∣2 is proportional to the volume of the thin shell of space at a distance of exactly ∣∣x∣∣ from its location. Fitting one Maxwell–Boltzmann distribution (Fig. 5e–h, solid red line) at Ldiff = 1 deviates considerably from the observed frequencies. Fitting the sum of two (Fig. 5e–h dotted green line) with a second Ldiff ≈ 0.1 in each case brings the fit considerably closer.
a–d, Zebrafish specimens 1–4 under bright field (scale bars, 200 μm). e–h, Observed UEI distributions relative to the distance between GSE-embedded UMIs and a uniform Maxwell–Boltzmann (M–B) model of pairwise interactions, following from Gaussian dispersion in 3D. Shown is a single M–B distribution with a single length scale, Ldiff = 1, compared to the sum of two M–B distributions with fitted length scales (L1 and L2). i–l, Rank-order plots of total UEIs for each UMI and the MSD of UMIs in GSE-embedding space. m–p, GSE-embedded UMIs of specimen 1 (m; 7.7 × 106 UMIs), specimen 2 (n; 7.2 × 106 UMIs), specimen 3 (o; 12.0 × 106 UMIs) and specimen 4 (p; 12.6 × 106 UMIs). q–t, Close-up views of m–p, respectively. Points are shaded by local density. All GSE scale bars are 25 du.
The 3D GSE embeddings for the UMIs of each specimen are shown in Fig. 5m–t. Specimen 1 contained 7.7 million UMIs (Fig. 5m,q and Supplementary Video 1), specimen 2 contained 7.2 million UMIs (Fig. 5n,r and Supplementary Video 2), specimen 3 contained 12.0 million UMIs (Fig. 5o,s) and specimen 4 contained 12.6 million UMIs (Fig. 5p,t). Images show consistent structural patterns across the specimens, recapitulating gross zebrafish morphology.
Noting that, in Fig. 5, 200-μm scale bars in bright-field images were close to those for 25 du in DNA microscopy images, we may assign a unit conversion of ~10 μm:1 du and perform a simple check. Because the fitted short-length Maxwell–Boltzmann distribution had Lshort ≈ 0.1 du, we can rewrite Lshort as ~0.1 du × 10 μm per du ≈ 1 μm, the expected size of each RCA polony.
Next, UMI rank plots in Fig. 5i–l show falloffs of total UEI counts (per UMI) with median values between 5 and 6 and r.m.s.d. distances with median values of (sim sqrt{0.1},{rm{du}}) (because 1/mean-square displacement (MSD) = 6–13). We can, therefore, estimate that the median 3D resolution of UMIs across the specimen, conditionalized on the position of each UMI’s UEI-associated neighbors, is (10,upmu {rm{m}}/{rm{du}}times sqrt{0.1},{rm{du}}/sqrt{{n}_{{rm{UEI}}}}approx 1.5,upmu {rm{m}}).
We compared spatial gene expression patterns captured by DNA microscopy and Stereo-seq29, an array-based method that differs from our experiment in its nonuse of formaldehyde fixation, poly(T) capture and 2D sectioning. Despite these key differences, Stereo-seq datasets nevertheless offer an ideal comparison of internally normalized gene expression enrichment because of their covering the entire anterior–posterior axis of 24-hpf embryos. We began by identifying genes shared between the two datasets, ensuring that only genes detected in both modalities and meeting an abundance threshold of at least five spatial locations in the Stereo-seq data were considered.
To classify genes on the basis of their spatial expression patterns, we applied a spatial proximity approach using the Stereo-seq data. We randomly selected a spatial point within the zebrafish embryo to serve as a reference. For each gene shared across our datasets, we calculated the Euclidean distances from all Stereo-seq spatial locations where the gene was expressed to this random point. By computing the 10% trimmed mean of these distances, we obtained a robust measure of each gene’s typical spatial distance from the reference point.
Genes were then ranked on the basis of their mean distances to the reference point. The top five genes with the smallest mean distances were designated as ‘type A’ genes, representing genes with localized expression near the random point. Conversely, the bottom five genes with the largest mean distances were classified as ‘type B’ genes, representing genes expressed farther from the reference point. This classification was repeated to generate five nonoverlapping sets of type A and type B genes exhibiting spatial polarity between the ‘cephalic’ and ‘caudal’ regions of the Stereo-seq data.
To quantify gene enrichment patterns, we calculated the fractional representation of type A genes at each spatial location. In the Stereo-seq data, for each spot, we computed the fraction of type A gene expression relative to the combined expression of type A and type B genes. This fraction was then rank-transformed to facilitate comparisons (Fig. 6a–e). To enhance spatial coherence and reduce local variability, we applied a smoothing operation where each spot was assigned the mean enrichment value across its ten nearest neighbors with assigned enrichment values, highlighting spatial trends in gene expression (Fig. 6f–j).
a–e, Random point selection within 24-hpf zebrafish Stereo-seq data allows for the unbiased selection of five sets of ten genes, split into five colocalized type A genes and five colocalized type B genes, represented among mapped DNA microscopy gene inserts, with strong polarization signals. Enrichment was quantified by rank-transforming the fractional representation of type A genes. f–j, Smoothing of this data was performed by assigning, to each Stereo-seq spot, the mean across its closest ten neighbors with an assigned enrichment value. k–o, The same operation, using UMIs instead of spots, was performed on DNA microscopy specimens reconstructed with 12 million UMIs. All GSE scale bars are 25 du.
We applied an analogous procedure to the DNA microscopy data, using UMIs instead of spots. Each UMI was assigned a type A fraction according to the gene it represented and smoothing was performed using the same approach with ten nearest neighbors. The resulting patterns from DNA microscopy mirrored those observed in the Stereo-seq data (Fig. 6k–o), demonstrating similar gene enrichment patterns across the anterior–posterior axis of the embryos.
To explore a subset of the genetic readout in details pertinent to 3D spatial genetic organization, we mapped the UEI network connectivity among individually identified protocadherin isoforms. Deciphering the 3D spatial organization of stochastically expressed clustered protocadherin isoforms is key to understanding genetic control of the developing nervous system30 and clonally related neurons were shown to express similar isoforms compared to one another in mice31. We mapped the pairwise shortest distances along the UEI graph between pairs of transcripts mapping uniquely to protocadherin isoforms (Supplementary Fig. 10). By showing no strong signature of isoform–family colocalization but permitting the measurement nevertheless, these results provide a footing for more comprehensive mapping 3D proximities of alternately spliced transcripts in developing organisms.
In this study, we established that a massive distributed molecular network comprising 107 nodes can be used to achieve volumetric imaging of biological specimens from the inside out. This advancement carries implications in several key areas.
First, we demonstrated that DNA molecules can serve as carriers of vast image datasets, which can be decoded using conventional DNA sequencing equipment. This obviates the need for specialized imaging instruments and lays a foundation for the scalability and accessibility of 3D spatial genetic imaging technologies. Such accessibility is particularly impactful in biological systems where somatic mutations and genomic idiosyncrasies have profound effects. Moreover, the readouts require no prior knowledge of the organism’s genome, opening avenues to investigate spatial genetic complexities in nonmodel organisms. This does not, however, preclude its potential integration into spatial transcriptomic pipelines that use prior knowledge of cell types.
Second, the inferred images inherently embody both connectivity maps and representations in Cartesian coordinates, presenting a fundamental tension between these two aspects. We demonstrated, both in simulations and in experimental practice, that GSE, a methodology for dimensionality reduction, provides a scalable solution to reconcile these disparate properties. GSE complements existing dimensionality reduction techniques and may have broader applications in other large datasets where reconciling node connectivity with low-dimensional representations is essential. Future work will be needed to investigate the method’s behavior in these other contexts.
Third, ongoing advancements in volumetric DNA microscopy chemistry and inferential methods will establish a platform distinct from yet potentially complementary to conventional light and electron microscopy for analyzing biological circuits. This platform offers unique capabilities for probing the molecular architecture of cells and tissues, enabling insights that are difficult to achieve with traditional imaging modalities.
The effective resolution of DNA microscopy can be approximated by the length scale over which UMIs are dispersed divided by the square root of the number of UEIs they generate. This dependence on dispersion length scale and UEI counts is analogous to how the resolution of stochastic super-resolution light microscopy depends on wavelength and photon counts6, respectively. Unlikely in optical microscopy, however, the resolution in DNA microscopy of any single UMI is conditionalized on the positions of its UEI-associated neighbors. In our experiments, RCA polonies approximately 1 μm in size with a median of 5–6 UEIs per UMI (Fig. 5) yield a median resolution, conditionalized on the positions of its neighbors, slightly above 1 μm.
The conditionalization of UMI position resolution on the positions of UMI neighbors means that any subnetwork within the specimen can only achieve single-cell resolution if the subnetwork’s gaps are smaller than that same length scale. By design, UMI RCA products form at the sites of transcripts; thus, nonuniformities in the spatial distributions of UMIs (Fig. 2a–f) mean that our datasets currently do not achieve this throughout the organism (a 24-hpf embryo having 2.5 × 104 cells32 means that, at 107 UMIs, each cell has 400 UMIs associated with it on average).
Achieving organism-wide single-cell resolution will require further optimization of DNA microscopy chemistry to fully space-fill areas of tissue where gaps exist. Applying shape constraints similar to those explored in genetic cartography33 may substantially increase global conformity across replicates.
Ultimately moving resolution to subcellular length scales will require resolving a key open question: whether reducing or eliminating long-range UEIs (needed to form a contiguous graph across large distances) is possible. The relative contributions of these two different UEI sources (RCA versus IVT) was estimated in Fig. 5e–h. However, the fact that these two UEI types are indistinguishable once sequenced means that every observed UEI carries, on average, less positional information than it would otherwise. Eliminating this information loss by improved experimental design will be key to future improvements to resolution.
We demonstrated that volumetric DNA microscopy can capture spatial information for RNA molecules within biological specimens. Looking ahead, it will remain a crucial task to improve the genic sequence capture capabilities of the chemistry to comprehensively annotate the 3D network with biological information at cellular resolution. We anticipate extending this capability to acquire proteomic information through the use of oligo–antibody conjugates. As endeavors intensify to generate maps of neuronal circuitry and other complex biological networks, data that pair physical connectivity with molecular genetics, including gene expression patterns, genomic mosaicism and other analytes, will become increasingly valuable. We envision volumetric DNA microscopy contributing to this comprehensive understanding.
All reagents used are enumerated in Supplementary Table 2. All oligonucleotides were ordered from Integrated DNA Technologies and are enumerated in Supplementary Table 3. The detailed workflow of the ‘end-to-end’ experiment is shown in Supplementary Figs. 1c and 3; All steps from the preparation of embryos through IVT occurred in situ and all other steps were performed ex situ.
AB-wildtype zebrafish were kept and crossed in accordance with the approved protocols and ethical guidelines of the University of Chicago Institutional Animal Care and Use Committee. Embryos were collected at 24 hpf and dechorionated with 1 mg ml−1 pronase for 5 min at 28 °C. The dechorionated embryos were fixed in 4% paraformaldehyde in 1× PBS at 4 °C overnight. After dehydration in 100% methanol for 15–30 min at room temperature, the embryos were stored at −80 °C for at least 2 h before use. Embryos were successively rehydrated with 75%, 50% and 25% methanol in 1× PBS for 5 min each and washed four times with PBST (1× PBS + 0.1% Tween-20) at 5 min per wash at room temperature. The embryos were then permeabilized with 5 × 10−5 U per μl thermolabile proteinase K for 12 min at room temperature, followed by inactivation at 55 °C for 15 min. Samples were then washed four times in PBST at 5 min per wash.
After permeabilization, embryos were incubated at 4 °C for 1 h under slow rotation (10 rpm) followed by 10 min of 65 °C incubation in a pre-RT buffer comprising 20% formamide, 0.5 U per μl Superase-In, 4.4 mM DTT, 0.5 μg μl−1 recombinant albumin and 1× PBS and then immediately cooled to 4 °C. After one water rinse, RT mix (4.4 mM DTT, 400 μM deoxynucleotide triphosphate (dNTP), 32 μM aminoallyl-dUTP, 0.5 μg μl−1 recombinant albumin, 1 U per μl Superase-In, 10 U per μl Superscript III and 1 μM 21.068C-8N RT primer in 1× FS buffer) was added and underwent 4 °C incubation for 1 h, 60 °C incubation for 3 min and 37 °C incubation overnight under slow horizontal orbital rotation, followed by incubation for 1 h at 50 °C. After RT, embryos were washed three times in PBST and once with water and then incubated in Exo I mix (1.43 U per μl Exo I in 1× Exo I buffer) at 4 °C for 1 h under slow rotation, followed by 1 h at 37 °C to remove the RT primer and displaced cDNA. Embryos were again washed three times in PBST and once with water.
Transposomes were assembled according to the manufacturer’s protocol (Supplementary Fig. 1a). Briefly, oligos 22tn5.003A and 22tn5.MOS-3p were resuspended to 100 μM individually in annealing buffer (40 mM Tris-HCl pH 8 and 50 mM NaCl) and annealed at an equimolar ratio at 95 °C for 2 min, followed by 1 °C decrements per min for 70 min. Transposome assembly mix (25 μM of the annealed oligo-duplex and 1 μg μl−1 tagmentase) was incubated at 23 °C for 30 min. Glycerol was then added to a final concentration of 50% and stored until use at −20 °C. Following RT, the transposome glycerol stock was diluted 3:10 in tagmentase dilution buffer. This diluted solution was then added at a further 1:50 dilution to transposome reaction mix containing 5 mM MgCl2, 10 mM Tris-HCl pH 7.5, 10% N,N-dimethylformamide, 9% PEG8000 and 850 μM adenosine triphosphate (ATP)24. This mix was added to samples and incubated at 4 °C for 1 h under slow rotation followed by 1 h at 55 °C. Samples were then washed twice in PBST and once in 1× PBS.
The 5 mM BS(PEG)5 solution was prepared in 1× PBS and added to samples for incubation at room temperature for 1 h under slow rotation. Samples were then rinsed with 1 M Tris pH 8 and quenched in this buffer for 30 min at room temperature. Samples were then incubated in 4× saline–sodium citrate (SSC) at 4 °C for 10 min under slow rotation, followed by 15 min at 70 °C. Afterward, samples were incubated in 10% formamide and 2× SSC at 4 °C under slow rotation for 10 min and then at 50 °C for 10 min. Samples were then washed three times in PBST for 5 min each.
After rinsing with water, 3′ adaptor ligation mix (500 nM 22tn5.005, 500 nM 22tn5.006, 1.25 U per μl SplintR ligase in 1× SplintR ligase reaction buffer containing ATP) was added to samples, which were incubated at 4 °C for 1 h followed by 23 °C overnight. After ligation, samples were washed three times in PBST for 5 min each wash and then rinsed with water; the 3′ phosphates on the ligated oligos were removed with 0.5 U per μl Quick CIP in 1× rCutSmart buffer by incubation at 4 °C for 1 h under slow rotation, followed by 37 °C for 1 h. Samples were then again washed three times in 2× SSCT (SSC + Tween-20) at 5 min per wash. circDNAs (circ6G1 and circ7G1) were prepared using T4 DNA ligase and short splint oligos25 (splint6F5 and splint7F5, respectively) (Supplementary Figs. 1b and 2) and purified using a Zymo Oligo Concentrator spin column. Products were checked for size and purity using TBE–urea Gels. circDNA annealing mix (100 nM circ6G1 and 100 nM circ7G1 in 1× hybridization buffer containing 2× SSC, 10% formamide and 0.1% Tween-20) was added to samples and incubated overnight at 40 °C under slow rotation. Samples were then washed in hybridization buffer at 40 °C for 30 min under slow rotation3 and then washed in 2× SSCT, 1× SSCT and finally PBST at 5 min per wash.
After rinsing with water, 3′ adaptors were initially extended along the circDNAs under two different conditions: ‘normal’ pre-RCA extension (100 μM dNTP, 0.04 U per μl T4 DNA polymerase in 1× NEBuffer2.1) and ‘enhanced’ pre-RCA extension (25 ng μl−1 T4 gene 32, 1 mM dNTP, 0.04 U per μl T4 DNA polymerase in 1× NEBuffer2.1). Following the initial extension of the 3′ adaptors, DNA nanoballs were generated by adding the RCA mix (25 ng μl−1 T4 gene 32, 0.5 μg μl−1 recombinant albumin, 250 μM d(AUGC)TP and 0.2 U per μl phi29 polymerase in 1× phi29 reaction buffer), before incubation at 4 °C for 1 h under slow rotation and then 30 °C overnight. For fluorescence observation, 20 μM fluorescein-12–dUTP was added to the RCA mix. Samples were washed three times in 2× SSCT.
UEI annealing mix (100 nM 21.004G1/2-BC oligo mix, 100 nM 21.073pt, 100 nM 21.074B, 2× SSC, 5% formamide and 0.1% Tween-20) was added to samples, before incubation at at 4 °C for 1 h under slow rotation and then 50 °C for 2 h. After being brought to room temperature, samples were washed in UEI hybridization buffer (2× SSC, 5% formamide and 0.1% Tween-20) for 1 h at 50 °C, followed by washes in 2× SSCT, 1× SSCT and then PBST. After water rinse, extension–ligation mix (1 mM dNTP, 0.15 U per μl T4 DNA polymerase and 20 U per μl T4 DNA ligase in 1× T4 ligase reaction buffer including ATP) was added, before incubation for 1 h under slow rotation at 4 °C, followed by 40 min at room temperature. Samples were then washed three times in PBST, followed by a water rinse.
IVT mix (Supplementary Fig. 4) was prepared by combining the following components: 100 nM 21.075, 7.5 mM rNTP mix, 100 ng μl−1 T4g32, 0.05 U per μl USER enzyme, 10% T7 Enzyme Mix, 1× reaction buffer, 6.4 μg μl−1 THIOCURE ETTMP 1300 and 73.6 μg μl−1 4arm-PEG20K-vinylsulfone (PEG reagents were thawed from −80 °C immediately before reaction). The mixture was added to individual zebrafish embryos at a total volume of 30 μl. Hydrogels were allowed to form around samples for 2 h at room temperature. Reactions were then incubated at 37 °C for 20 h. Afterward, the hydrogels were denatured by adding 12 μl of denaturation solution (457.5 mM KOH, 100 mM EDTA and 42.5 mM DTT) for 2 h at 4 °C. Denaturation was stopped by adding 12 μl of stop solution (600 mM Tris-HCl pH 7.5 and 0.4 N HCl). After mixing, 30 μl of proteinase K mix (0.28% Tween-20, 0.09 U per μl proteinase K and 8.6 mM Tris-HCl pH 7.5) was added to the 54-μl samples for a total of 84 μl. This mixture was incubated at 50 °C for 1 h.
IVT RNA was purified by addition of 1.2× RNAClean XP beads, following the manufacturer’s protocol, and eluted into water. DNase I digestion was performed (final concentrations of 0.8 U per μl Superase-In, 0.1 U per μl DNase I and 1× DNase I reaction buffer) at 37 °C for 30 min. RNA was again purified using 1.2× RNAclean XP and eluted into water. RT was carried out in a final concentration of 500 nM each of RT primers (21.077 and 21.085), 500 μM dNTP, 1× FS buffer, 5 mM DTT, 1 U per μl Superase-In and 10 U per μl Superscript III. Primers and dNTP were added first to RNA and water eluent and incubated at 65 °C for 5 min, after which the mixture was placed promptly on ice. The remainder of the reaction mixture was then added and samples were incubated for 1 h at 50 °C, followed by inactivation for 15 min at 70 °C, before being kept at 4 °C. Exo I was then added directly to the product to final concentration of 3.3 U per μl and incubated at 37 °C for 30 min, followed by heat inactivation at 80 °C for 20 min.
cDNA products (from IVT products) were then amplified in two separate PCR reactions. cDNA amplicons were amplified by adding Exo I-digested product at a final 1:80 dilution into four separate reactions (per embryo) containing final concentrations of 300 nM 21.046G1-BC primer, 300 nM 21.081b primer, 1× HiFi PCR buffer, 200 μM dNTP, 2 mM MgSO4 and 0.02 U per μl Platinum Taq HiFi. These reactions were thermocycled as follows: 95 °C for 2 min, 5× (95 °C for 30 s, 56 °C for 30 s and 68 °C for 2 min), 20× (95 °C for 30 s and 68 °C for 2 min), 68 °C for 5 min and 4 °C. UEI amplicons were amplified by adding Exo I-digested product at a final 1:40 dilution into two separate reactions (per embryo) containing final concentrations of 300 nM 21.077-G1 primer, 300 nM 21.076BB primer, 3.3 μM each of 4E4.interf1 and 4E.interf2 (3′ phosphate-capped oligos to interfere with PCR recombination6,26), 5% DMSO, 1× HiFi PCR buffer, 200 μM dNTP, 2 mM MgSO4 and 0.02 U per μl Platinum Taq HiFi. These reactions were thermocycled as follows: 95 °C for 2 min, 1× (95 °C for 30 s, 66 °C for 30 s and 68 °C for 2 min), 18× (95 °C for 30 s and 68 °C for 2 min), 68 °C for 5 min and 4 °C. PCR products were then purified using a 0.75× volume of Ampure XP beads, following the manufacturer’s protocol. Products were quantified and sequenced on an Illumina NextSeq 500 instrument using 150-bp cycle kits (112 nt for read 1 and 44 nt for read 2), including the sequencing primer SBS3B as a custom spike-in, according to the manufacturer’s protocol. Illumina’s Local Run Manager version 4.0.0.649 was used on the instrument for data collection.
Initial experiments in cultured cells optimized key chemistries to enhance the DNA microscopy protocol’s efficiency and specificity. Oligonucleotides were redesigned to reduce hairpin structures and minimize nonspecific annealing. An Exo I treatment after RT effectively removed residual RT primers. SplintR ligase was selected over other ligases for efficient 3′ adaptor ligation because of its higher efficiency and specificity for RNA splinted DNA ligation. Introducing Tn5 transposase-mediated tagmentation increased cDNA yield and created more 3′ ends for adaptor ligation; a transposome dilution of 1:50 balanced DNA fragmentation with sample uniformity. Adjustments to PEG8000 concentration and ATP levels improved tagmentation efficiency. Including 32 nM aminoallyl-dUTP during in situ RT and crosslinking with BS(PEG)5 immobilized fragmented RNA and cDNA. The phi29 DNA polymerase was chosen over Bst DNA polymerase for RCA because of superior performance. UEI oligo annealing conditions were optimized by adjusting formamide concentration, incubation time and temperature. Incorporating IVT for linear amplification of cDNA and UEIs reduced material loss before PCR and minimized amplification errors; inclusion of T4 gene 32 protein during IVT further enhanced UEI formation. RNA and DNA purification steps were optimized using RNAClean XP and Ampure XP beads to improve recovery and reduce noise.
Applying the protocol to 24-hpf zebrafish embryos required additional optimization because of densely packed cells hindering reagent diffusion. Effective permeabilization was achieved using 5 × 10−5 U per μL thermolabile proteinase K under specific conditions. Including 20% formamide before in situ RT enhanced RNA denaturation and cDNA synthesis. To mitigate crowding from DNA nanoballs formed during RCA, the amplification strategy was modified; T4 DNA polymerase (lacking strand displacement activity) and dNTPs were used to extend 3′ adaptors along circDNAs, copying UMIs only once. Subsequent RCA with phi29 DNA polymerase, substituting deoxythymidine triphosphate with dUTP, produced DNA nanoballs digestible by USER enzyme after UEI formation, reducing crowding and improving reagent diffusion. Including T4 gene 32 protein in RCA further enhanced UEI formation. Following UEI oligo annealing, a protocol combining T4 DNA ligation and extension with high dNTP concentration greatly improved UEI yield. PCR conditions were adjusted by lowering template concentration to minimize nonspecific products and adding 5% DMSO in UEI amplification reactions to enhance specificity.
Sequence analysis was performed using the pipeline previously described6, with code updated for the larger scale of data (https://github.com/wlab-bio/vdnamic).
Briefly, sequencing reads were demultiplexed using the barcodes depicted in Supplementary Fig. 1. Subsequently, for each amplicon type, sequence elements (UMI type I, UMI type II, UEIs and cDNA inserts) were separately clustered with a 1-bp difference criterion using the EASL algorithm6.
For UEI datasets, each UEI was assigned a UMI pair by plurality (relevant only if a specific UEI appeared to show two different pairings of UMIs, a signature of PCR recombination). The resulting ‘consensus pairings’ were then pruned, with each UMI required to be associated with two UEIs and each association (unique UMI–UMI pair) required to be associated in at least two reads. The largest contiguous matrix (found by single-linkage clustering, with rarefaction depicted in Fig. 2o) was retained for image inference.
Although the above thresholding was the extent to which the reprocessed6 data for Fig. 4y were subjected, for volumetric DNA microscopy datasets, a more stringent threshold was additionally applied that every UMI–UMI association was required to possess at least two UEIs.
For cDNA-insert sequence data, reads grouped by the same UMI had a sequence consensus generated by majority vote. These sequences were then trimmed to eliminate sequence adaptors. Those inserts retaining at least 25 bp of nonartificial sequences (at least among known artificial sequences) were then counted toward the cDNA-insert UMIs (depicted as rarefaction in Fig. 2i,j). These consensus inserts were then input into STAR alignment27 version 2.7.9a using the Danio rerio genome assembly GRCz11. Gene annotations were obtained from Ensembl release 109 (February 2023) for D. rerio using the chromosomal assembly GTF file (Danio_rerio.GRCz11.109.chr.gtf) based on the GRCz11 reference genome. Ribosomal RNA (rRNA)-annotated loci were extended by 1 kb upstream and downstream of their respective annotated regions to ensure comprehensive labeling.
Alignments were filtered on the basis of alignment score and match percentage to retain high-quality matches. Mismatches and mutations were identified by analyzing compact idiosyncratic gapped alignment report strings and mismatch description tags, constructing mutation strings detailing each mismatch. Gene annotations were obtained by mapping genomic coordinates to features in the D. rerio GTF file, associating each sequence with gene names and biotypes. To avoid false-positive identification of mRNAs when sequences were actually rRNAs, the script checked for rRNA or Mt_rRNA biotypes and explicitly set the gene name accordingly, assigning an attention rank to prioritize accurate classification of rRNA sequences. Those not mapped to rRNA, Mt_rRNA, mRNA or long intergenic noncoding RNA loci were assigned the label ‘genomic DNA or other’ in Supplementary Table 1. Lastly, the data from alignments and annotations were merged and sorted to produce the final annotated output, consolidating information for each UMI and ensuring that each sequence was associated with its corresponding genomic annotations and read counts, thereby facilitating detailed analysis while minimizing the risk of misclassification.
The UMIs from genome-mapped cDNA-insert libraries (Supplementary Table 1) were then matched back to the UMIs in the UEI amplicon libraries of the corresponding specimen, allowing for single-nucleotide mismatches. The gene calls were then applied to label UMIs in the UEI-inferred image.
All simulations were performed by taking the raw coordinates depicted in Fig. 4 and calculating Gaussian PSFs. For UMI i and UMI j at ground-truth positions ({overrightarrow{x}}_{i}) and ({overrightarrow{x}}_{j}), respectively, with N being the sum-total of all UEI counts in the simulated dataset, we assigned a raw count nij ← NegativeBinomial(mean = μij, p) where ({mu }_{ij}leftarrow N{e}^{-| |;{overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}}/{sum }_{ij}{e}^{-| |; {overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}}) and p ← 0.8.
The DNA microscopy dataset comprised UEI counts nij, corresponding to type I UMI index i and type II UMI index j. In discussing image inference, we use the below-described notation that effectively describes the dataset as a bipartite graph, whose weights are UEI counts and whose nodes are UMIs. First, the ‘raw’ and symmetric UEI matrix, with elements (UEI counts) nij, is written as follows:
$${bf{n}}equiv left(begin{array}{cc}0&{{bf{n}}}_{{m}_{{bf{I}}}times {m}_{{bf{II}}}}\ {{bf{n}}}_{{m}_{{bf{II}}}times {m}_{{bf{I}}}}&0end{array}right)$$
On the other hand, the row-normalized matrix is written as follows:
$${bf{N}}equiv ,text{diag},{({n}_{icdot })}^{-1}{bf{n}}$$
Here and elsewhere in this paper, subscript ‘⋅’ denotes index summation, such that ni⋅ ≡ ∑jnij. The top eigenvectors of N are preliminary estimates of basis vectors to compute the final dimensional embedding. The reason for this is discussed below. Take the Kullback–Leibler minimization problem,
$$mathop{{rm{arg}},{rm{min}}}limits_{overrightarrow{x}}{D}_{rm{KL}}(pparallel q)=mathop{{rm{arg}},{rm{min}}}limits_{overrightarrow{x}}sum _{ij}{p}_{ij}log {p}_{ij}/{q}_{ij}(overrightarrow{x})$$
where pij ≡ nij/n⋅⋅ and ({q}_{ij}(overrightarrow{x})equiv {e}^{-| |;{overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}}/{sum }_{ij}{e}^{-| |;{overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}}). ({q}_{ij}(overrightarrow{x})) here denotes reaction probabilities given the coordinates (({overrightarrow{x}}_{i},{overrightarrow{x}}_{j})).
Writing this out, we want to solve
$$begin{array}{rcl}mathop{{rm{arg}},{rm{min}}}limits_{overrightarrow{x}}{D}_{rm{KL}}(;pparallel q)&=&mathop{{rm{arg}},{rm{min}}}limits_{overrightarrow{x}}left(-sum _{ij}{p}_{ij}log {q}_{ij}(overrightarrow{x})right)\ &=&mathop{{rm{arg}},{rm{min}}}limits_{overrightarrow{x}}left(sum _{ij}{n}_{ij}| |;{overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}+{n}_{cdot cdot }log sum _{ij}{e}^{-| |;{overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}}right)end{array}$$
We can rewrite this as a least squares problem plus additional terms that we deal with later:
$$mathop{{rm{arg}},{rm{min}}}limits_{overrightarrow{x}}{D}_{rm{KL}}(;pparallel q)=sum _{ij}{n}_{ij}| |;{overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}+,text{nonlinear terms},$$
(1)
Taking the derivative of the sum of squares with respect to (overrightarrow{x}) and setting to 0 translates in matrix algebra to ({bf{N}};overrightarrow{x}approx overrightarrow{x}) (that is, we want (overrightarrow{x}) to be eigenvectors of N with eigenvalues closest to 1). Instead of allowing each point to be moved independently (previously called ptMLE6), we can prioritize finding putative solutions to the convex eigenvector problem; despite approximating the original problem, it has a unique solution. These putative solutions are used to explore the wider solution space to minimize the full objective function.
In sMLE, these putative solutions (that is, eigenvectors) are spectrally ordered (by eigenvalue) and are used incrementally to construct a solution. The basic procedure is that we start with the first d eigenvectors (where d is the number of dimensions in the embedding) and perform gradient descent on those initial d2 linear coefficients (because any dimension can use any set of the starting d eigenvectors). The gradient being used itself is that of the nonlinear objective above projected onto the (orthogonalized) set of eigenvectors under consideration. After each projected gradient descent is complete, a new eigenvector is introduced with linear coefficient zero and allowed to change to further minimize the objective.
In GSE, we extend this idea by identifying a preconditioning operation on N so that the eigenvectors of the preconditioned matrix (what we call the GSE matrix) are individually closer approximations of the optimization problem; therefore, the solution to the projected gradient descent operation is closer to the UMIs or point coordinates that gave rise to the data.
In broad strokes, GSE examines the curvature of the d-dimensional manifold swept out by the E eigenvectors from earlier. However, to evaluate the curvature of this manifold, it is important for us to define some notion of distance within this E-dimensional subspace.
We begin with our row-normalized UEI matrix N and note that the quadratic portion of the objective in Eq. (1) is minimized when the matrix product ({overrightarrow{x}}^{T}({bf{I}}-{bf{N}});overrightarrow{x}) is minimized. If we apply this to a scenario in which each eigenvector ({overrightarrow{z}}_{i}) of N occupies its own dimension and is rescaled by some factor αi, then the contribution to the objective of each dimension i goes as ({alpha }_{i};{overrightarrow{{z}_{i}}}^{T}({bf{I}}-{bf{N}});overrightarrow{{z}_{i}}{alpha }_{i}={alpha }_{i}^{2}(1-{lambda }_{i})), where λi is the corresponding eigenvalue. Consequently, letting ({alpha }_{i}=1/sqrt{1-{lambda }_{i}}) will equalize the contribution of each dimension to the objective.
To perform this rescaling on the data, we construct an orthonormal subspace U by singular value decomposition of the matrix of N’s top E eigenvectors in the m × E matrix Zraw. Specifically, we use the left singular vectors of Zraw, U, to project N into an E × E matrix Nreduced ≡ UTNU. We then perform a full eigenvector decomposition on Nreduced:
$${{bf{N}}}_{{rm{reduced}}}{overrightarrow{v}}_{i}={lambda }_{i};{overrightarrow{v}}_{i}$$
where ({overrightarrow{v}}_{i}) denotes the E eigenvectors and λi denotes the eigenvalues in the subspace U. The Ritz approximation of the eigenvectors in the original space is given by ({overrightarrow{x}}_{i}approx {bf{U}};{overrightarrow{v}}_{i}). The rescaled subspace is then set to the rescaled Ritz-vectors:
$${overrightarrow{z}}_{i}^{{rm{rescaled}}}={bf{U}};{overrightarrow{v}}_{i}/sqrt{1-{lambda }_{i}}$$
(2)
Each of these vectors then forms columns of the matrix subspace Z.
Next, a k-nearest neighbors (kNN) search is conducted, with k set to 2E, to identify the local neighborhoods for each UMI in Z, with k here chosen so that the neighbors generally span all E dimensions. For each UMI i, the set of its k-nearest neighbors ({mathcal{N}}(i)) is determined on the basis of the Euclidean distances (parallel {overrightarrow{z}}_{i}-{overrightarrow{z}}_{j}parallel), where ({overrightarrow{z}}_{i}) and ({overrightarrow{z}}_{j}) represent the ith and jth rows of Z, respectively. For kNN searches with small numbers of dimensions ≤ 3, we used Python’s Scikit-learn KDtree search and Python’s Annoy library otherwise. Both operate with (O(mlog m)) time complexity.
Following the identification of nearest neighbors, singular vectors (representing local tangent spaces) are computed for each point and then geodesic distances gij between each UMI i and its neighbors (jin {mathcal{N}}(i)) are estimated. Before computing the singular vectors, we define a weight wi for each point i as the reciprocal of its degree in the kNN graph, with a small constant to avoid division by zero:
$${w}_{i}=frac{1}{| {;j:iin {mathcal{N}}(j)}| }$$
where (| {;j:iin {mathcal{N}}(j)}|) denotes the number of times point i appears in the neighborhood of other points.
For each point i, the local tangent space Ui is estimated by performing a singular value decomposition (SVD) on the weighted difference matrix:
$${{bf{D}}}_{i}=,text{diag},{left(sqrt{{w}_{j}}right)}_{jin {mathcal{N}}(i)}left({{bf{Z}}}_{{mathcal{N}}(i),:}-{overrightarrow{1}}_{| {mathcal{N}}(i)| };{overrightarrow{z}}_{i}^{T}right)={{bf{U}}}_{i}{{mathbf{Sigma }}}_{i}{{bf{V}}}_{i}^{T}$$
where ({{bf{Z}}}_{{mathcal{N}}(i),:}) is a matrix whose rows are the coordinate vectors (in subspace Z) of the neighbors of point (i,{overrightarrow{z}}_{i}) is the coordinate vector (in subspace Z) of point (i,{overrightarrow{1}}_{| {mathcal{N}}(i)| }) is a column vector of ones with length equal to the number of neighbors of point i and (,text{diag},{(sqrt{{w}_{j}})}_{jin {mathcal{N}}(i)}) is a diagonal matrix of the square roots of the weights for the neighbors of point i. The tangent space Ui is then given by the first d left singular vectors of Di, where d is the dimensionality of the tangent space.
To estimate the geodesic distance gij between point i and its neighbor j, we define ({overrightarrow{z}}_{ij}={overrightarrow{z}}_{j}-{overrightarrow{z}}_{i}) and use the following formula:
$${g}_{ij}=(1-{beta }_{ij})cdot left(leftVert {{bf{U}}}_{i}^{T}{overrightarrow{z}}_{ij}rightVert +leftVert {{bf{U}}}_{j}^{T}{overrightarrow{z}}_{ij}rightVert right)+{beta }_{ij}cdot parallel {overrightarrow{z}}_{ij}parallel$$
where
$${{bf{M}}}_{ij}={{bf{U}}}_{i}^{T}{{bf{U}}}_{j}={tilde{{bf{U}}}}_{ij}{tilde{{mathbf{Sigma }}}}_{ij}{tilde{{bf{V}}}}_{ij}^{T}$$
and
$${beta }_{ij}={text{GeomMean}}_{l}({tilde{sigma }}_{ij,l})$$
where ({tilde{sigma }}_{ij,l}) denotes the diagonal values of ({tilde{{mathbf{Sigma }}}}_{ij}). βij, as defined, represents the ‘tangent closeness metric’ calculated as the geometric mean of these singular values. Note that Mij is a d × d matrix, reflecting the inner product of the d-dimensional tangent spaces. This formulation adjusts the Euclidean distance to more accurately reflect the underlying manifold structure.
The estimated geodesic distances gij, defined for all i and j where (jin {mathcal{N}}(i)), are subsequently transformed into similarity weights using a Gaussian (radial basis function) kernel:
$${W}_{ij}leftarrow frac{{left({sigma }_{i}^{2}+{sigma }_{j}^{2}right)}^{-d/2}{e}^{-{g}_{ij}^{2}/left({sigma }_{i}^{2}+{sigma }_{j}^{2}right)}}{{rm{row}},{rm{norm}}}$$
where σi is the median value of gij over all (jin {mathcal{N}}(i)). The resulting sparse matrix W is then used in the final determination of the GSE eigenvector basis. The matrix product can now be defined as follows:
$${{bf{W}}}_{{rm{GSE}}}equiv tilde{{bf{W}}}{bf{N}}$$
(3)
GSE computes the top eigenvectors of the normalized and symmetrized version of this matrix, ({{mathbf{Lambda }}}^{-1}({{bf{W}}}_{{rm{GSE}}}+{{bf{W}}}_{{rm{GSE}}}^{T})), where Λ is a diagonal matrix with elements being the row sum of ({{bf{W}}}_{{rm{GSE}}}+{{bf{W}}}_{{rm{GSE}}}^{T}).
The top eigenvectors of this matrix are considered to be putative solutions to the d-dimensional embedding problem, taking into account the curvature of the data manifold. Because we consider the top eigenvectors of ({tilde{{bf{W}}}}_{{rm{GSE}}}) to fit the data to the global curvature of the dataset, we now apply an incremental projected gradient descent on a global objective function (Fig. 3). The objective function we use here is the Kullback–Leibler divergence:
$${D}_{{rm{KL}}}equiv sum _{ij}left(frac{{n}_{ij}}{{n}_{cdot cdot }}right)log left(frac{{n}_{ij}/{n}_{cdot cdot }}{{w}_{ij}/{w}_{cdot cdot }}right)$$
This is a statistical distance between the UEI counts nij and the proximities in the embedding space ({w}_{ij}equiv {e}^{-| |;{overrightarrow{x}}_{i}-{overrightarrow{x}}_{j}| {| }^{2}+{A}_{i}+{A}_{j}}), where ({overrightarrow{x}}_{i}) is the d-dimensional embedding coordinate of point i and where the amplitude ({A}_{i}leftarrow log {sum }_{j}{n}_{ij}/{n}_{cdot j}).
Note that minimizing DKL is equivalent to maximizing the log likelihood ({mathcal{L}}) of the multinomial that uses UEI counts as ‘independent trials’ on the space of all possible UMI pairings (the framing in earlier DNA microscopy work)6. This is because the following statements are true:
$${mathcal{L}}=log left({prod }_{ij}{left(frac{{w}_{ij}}{{w}_{cdot cdot }}right)}^{{n}_{ij}}right)+{rm{constant}}$$
(4)
$$=sum _{ij}{n}_{ij}log frac{{w}_{ij}}{{w}_{cdot cdot }}+{rm{constant}}$$
(5)
which is the same as the negative expression for DKL once constant coefficients are factored out.
Writing out the objective gradient yields
$${partial }_{{x}_{k}}{D}_{{rm{KL}}}=sum _{ij}left(frac{{n}_{ij}}{{n}_{cdot cdot }}right){partial }_{{x}_{k}}log left(frac{{w}_{ij}}{{w}_{cdot cdot }}right)=frac{1}{{n}_{cdot cdot }}sum _{j}frac{{n}_{kj}}{{w}_{kj}}{partial }_{{x}_{k}}{w}_{kj}-frac{1}{{w}_{cdot cdot }}{partial }_{{x}_{k}}{w}_{kcdot }$$
(6)
Although the first term is a sparse sum and can be calculated in O(m) time, the second term summing all pairs of molecules presents unique challenges. In earlier work6, the N-body algorithm known as the fast Gauss transform (FGT) was implemented as part of the gradient calculation. Although FGT provided a way to scalably calculate the gradient contributions of large numbers of points within a small spatial volume in 2D, performing a similar calculation on still larger numbers of points and larger volumes in 3D presents a crucial challenge of scale.
Because of the fact that sMLE and GSE both involve projected gradient descent on the eigenvectors of a matrix, a much faster numerical approximation becomes possible. Specifically, we can select representative pairings at each iteration of GSE’s project gradient descent for each point.
At each iteration of the projected gradient descent, for each point, we first find its k = 2(d + 1)-nearest neighbors across all points in the d-dimensional embedding; we refer to this as pairing group SI. For each point, we next randomly select k points from the nearest 1/2d quantile of points to that original point (pairing group SII) and k points from the remaining 1 − 1/2d points (pairing group SIII). Then, we assign the corresponding fractions:
$${f}_{{bf{I}}}leftarrow frac{k}{m},,;,,{f}_{{bf{II}}}leftarrow frac{1}{{2}^{d}}-frac{k}{m},,;,,{f}_{{bf{III}}}leftarrow 1-frac{1}{{2}^{d}}$$
where the sum fI + fII + fIII = 1 corresponds to the total number of point-to-point (or UMI-to-UMI) comparisons. We can then calculate full approximations of the low-dimensional gradient by multiplying the gradient contribution of randomly selected UMI–UMI pairings by their expected representation within the ensemble of all UMI–UMI pairings.
Here, we review the first-order approximation of resolution limits in DNA microscopy experiments and image reconstruction6. When a UMI k accumulates nk⋅ total UEIs with neighboring UMIs, its positional uncertainty can be estimated by a Gaussian approximation to the likelihood encountered earlier in Eq. (4). Under slowly varying UMI density conditions on the scale of the characteristic diffusion length Ldiff, each interaction contributes a term in the log likelihood that is approximately quadratic in the UMI’s position. Denoting by ({overrightarrow{x}}_{k}) the maximum-likelihood position of UMI k, one finds that
$$log p(;{overrightarrow{X}}_{k})approx -frac{1}{{L}_{{rm{diff}}}^{2}}sum _{j}{n}_{kj},parallel {overrightarrow{X}}_{k}-{overrightarrow{x}}_{j}{parallel }^{2}+,text{constant},$$
where nkj is the number of UEIs between UMIs k and j. Expanding about ({overrightarrow{x}}_{k}) gives a Gaussian form,
$$p(;{overrightarrow{X}}_{k}),propto ,exp left(-frac{parallel {overrightarrow{X}}_{k}-{overrightarrow{x}}_{k}{parallel }^{2}}{2,{sigma }^{2}}right),$$
with σ determined by the curvature of (log p({overrightarrow{X}}_{k})) at the peak. This curvature is proportional to the total UEI count nk⋅ = ∑jnkj, leading to
$$sigma ={left(frac{2{n}_{kcdot }}{{L}_{{rm{diff}}}^{2}}right)}^{-frac{1}{2}}=frac{{L}_{{rm{diff}}}/sqrt{2}}{sqrt{{n}_{kcdot }}}.$$
Each such σ should be viewed as defining a characteristic linear dimension within which UMI k is likely to reside; therefore, when extended to 3D, one interprets σ3 as the ‘voxel’ volume of uncertainty.
In the main text, we introduced hierarchical GSE as an implementation of GSE that uses sparse sampling of the global UEI graph to generate a collection of S subgraphs, each containing mS UMIs. Initial reconstructions from these subgraphs are then performed. Then, a linear interpolation extrapolates these small solutions to the remaining UMIs. These putative global solutions are then input as the starting E-dimensional subspace into GSE.
Subgraph selection for hierarchical GSE was performed through a series of iterations to generate S subgraphs with as close to mS UMIs each as possible. Points were randomly selected for analysis, alternating between uniform and nonuniform (local) selections. For uniform distributions, subgraphs were assembled from UMIs that were sorted purely on the basis of their coverage by prior generated subgraphs, prioritizing those with lower inclusion counts to ensure uniform sampling across the dataset. For nonuniform (local) selections, the subspace Z, derived from the rescaled eigenvectors of the entire dataset, was used to identify regions that were undersampled.
In nonuniform (local) selections, the process began by selecting a seed point that maximized its distance from UMIs included in prior subgraphs. Starting from this seed, UMIs were added to the subgraph by selecting the top 1/2d fraction of points (with d being the embedding dimension) closest to that seed in order of lowest prior coverage by other subgraphs. UMIs that did not have at least two UEI links were pruned to ensure robust connectivity within the subgraph. Subsequently, the largest connected component within the pruned subset was identified and retained for subset GSE inference.
In all simulations S = 25 subgraphs were generated before consensus was generated in the manner described below. In all real datasets, S = 50 was used.
Let the desired subsample size be mS and the number of subsamples be S. For each S subsample iteration, the algorithm sorts the points on the basis of their coverage scores from all previous subsamplings cumulatively, prioritizing those that have not yet been sampled and randomizing those that are tied. The algorithm constructs a subgraph of size mS by agglomerating points in that order. The selected subsample is stored for further processing. Eigenvectors are computed from each subsample’s UEI matrix and the coverage scores are updated on the basis of which points were included, ensuring diversity across iterations. GSE is then applied to these subsets and the resulting embedding is extrapolated back to the original dataset as described below.
Following on from Eq. (1) and its linear approximation, let N represent the row-normalized UEI matrix. We can formulate our objective as bringing the normalized vector gradient of the objective in Eq. (1):
$$({bf{N}}-{bf{I}});overrightarrow{x}to 0$$
(7)
Let S be a subset of UMIs with known values XS and (bar{{bf{S}}}) be the complement set such that (| {bf{S}}| +| bar{{bf{S}}}| =m). We partition the matrices and vectors as follows:
$${bf{N}}equiv left(begin{array}{cc}{{bf{N}}}_{{bf{S}}times {bf{S}}}&{{bf{N}}}_{{bf{S}}times bar{{bf{S}}}}\ {{bf{N}}}_{bar{{bf{S}}}times {bf{S}}}&{{bf{N}}}_{bar{{bf{S}}}times bar{{bf{S}}}}end{array}right),quad overrightarrow{x}equiv left(begin{array}{c}{overrightarrow{x}}_{{bf{S}}}\ {overrightarrow{x}}_{bar{{bf{S}}}}end{array}right)$$
The gradient Eq. (7) with respect to ({overrightarrow{x}}_{bar{{bf{S}}}}) is
$${{bf{M}}}_{bar{{bf{S}}}times {bf{S}}};{overrightarrow{x}}_{{bf{S}}}={{bf{M}}}_{bar{{bf{S}}}times bar{{bf{S}}}};{overrightarrow{x}}_{bar{{bf{S}}}}$$
(8)
where ({{bf{M}}}_{bar{{bf{S}}}times {bf{S}}}equiv {{bf{N}}}_{bar{{bf{S}}}times {bf{S}}}) and ({{bf{M}}}_{bar{{bf{S}}}times bar{{bf{S}}}}equiv ({bf{I}}-{{bf{N}}}_{bar{{bf{S}}}times bar{{bf{S}}}})).
We solve this system using the conjugate gradient method with regularization. The conjugate gradient method is an iterative algorithm that minimizes the quadratic form (frac{1}{2}{overrightarrow{x}}^{T}{bf{A}};overrightarrow{x}-{overrightarrow{b}}^{T}overrightarrow{x}) where ({bf{A}}={{bf{M}}}_{bar{{bf{S}}}times bar{{bf{S}}}}) and (overrightarrow{b}={{bf{M}}}_{bar{{bf{S}}}times {bf{S}}};{overrightarrow{x}}_{{bf{S}}}). The algorithm converges when the residual (parallel {overrightarrow{r}}_{k}parallel =parallel overrightarrow{b}-{bf{A}};{overrightarrow{x}}_{k}parallel) is below a specified tolerance.
The regularization parameter λ is calculated as
$$lambda ={sigma }^{2}| bar{{bf{S}}}|$$
where ({sigma }^{2}equiv parallel {overrightarrow{x}}_{{bf{S}}}{parallel }^{2}/| {bf{S}}|) is the empirical average L2-norm of the known values. This regularization helps to stabilize the solution and prevent overfitting, especially when the UEI matrix is not well connected.
At each iteration of the regularized conjugate gradient, we solve
$$({{bf{M}}}_{bar{{bf{S}}}times bar{{bf{S}}}}+lambda {bf{I}});{overrightarrow{x}}_{bar{{bf{S}}}}={{bf{M}}}_{bar{{bf{S}}}times {bf{S}}};{overrightarrow{x}}_{{bf{S}}}$$
To maintain consistency with the known values, we iteratively rescale the intermediate result:
$${overrightarrow{x}}_{bar{{bf{S}}}}leftarrow {overrightarrow{x}}_{bar{{bf{S}}}}frac{sqrt{parallel {overrightarrow{x}}_{{bf{S}}}{parallel }^{2}/| {bf{S}}| }}{sqrt{parallel {overrightarrow{x}}_{bar{{bf{S}}}}{parallel }^{2}/| bar{{bf{S}}}| }}$$
(9)
This rescaling ensures that the variance of the interpolated values matches that of the known values, preserving the overall scale of the solution.
After solving the regularized conjugate gradient, we apply a final empirical quantile transformation to ({overrightarrow{x}}_{bar{{bf{S}}}}). This transformation is designed to ensure that the distribution of interpolated values closely matches the distribution of known values, while preserving the relative ordering obtained from the interpolation. The process begins by computing the empirical cumulative distribution function (ECDF) for both ({overrightarrow{x}}_{{bf{S}}}) and ({overrightarrow{x}}_{bar{{bf{S}}}}). These ECDFs represent the proportion of values in each vector that are less than or equal to a given value. For each value in ({overrightarrow{x}}_{bar{{bf{S}}}}), we then determine its quantile (that is, its position in the sorted list of values) according to the ECDF of ({overrightarrow{x}}_{bar{{bf{S}}}}). This quantile is then used to find the corresponding value in the ECDF of ({overrightarrow{x}}_{{bf{S}}}). By mapping the quantiles in this way, we effectively transform the distribution of ({overrightarrow{x}}_{bar{{bf{S}}}}) to match that of ({overrightarrow{x}}_{{bf{S}}}), while maintaining the order relationships among the interpolated values.
After each of the S d-dimensional interpolated solutions are generated, several random rotations are applied to each to generate an ‘inflated’ interpolated solution space of size Efinal. Each vector in this inflated solution space is rank-transformed to stabilize variance. The resulting concatenated vector matrix m × S ⋅ d matrix is then used to construct a new m × E-dimensional rescaled subspace to be fed into GSE.
This is achieved by first using SVD to orthogonalize the m × S ⋅ d concatenated vector matrix into a new matrix ({{bf{U}}}_{{rm{interp}}}{{bf{S}}}_{{rm{interp}}}{{bf{V}}}_{{rm{interp}}}^{T}). We then simply rewrite Eq. (2):
$${overrightarrow{z}}_{i}^{{rm{rescaled}}}={{bf{U}}}_{{rm{interp}}}{{bf{S}}}_{{rm{interp}}}{overrightarrow{v}}_{i}/sqrt{1-{lambda }_{i}}$$
(10)
As indicated in simulated datasets in Supplementary Fig. 8e,j, 2D DNA microscopy data in Fig. 4y and Supplementary Fig. 9d and all 3D DNA microscopy data in Fig. 5, a 0.2% UMI filter was applied to eliminate the top fringe of overdispersed UEIs. To do this, over two iterations, S = 6 subgraphs were generated (three alternations between uniform and nonuniform selections) to generate an estimated eigenvector basis and these were used to remove 0.1% of UMIs with the greatest RMS dispersion of UEIs between neighboring UMIs. We incorporated this filter, albeit affecting datasets minimally (Supplementary Fig. 9b,d), to reduce plausible influence of contaminating PCR products from other experiments. Further experimentation with this will better clarify its utility in such experiments going forward.
Decoherence plots in Fig. 4v–x and Supplementary Figs. 5b and 6 quantify the quality of local registration at different length scales. To do this, n = 500 reference points were sampled from the ground-truth coordinates. For each sampled point, neighborhoods of size k (with the 500 samples drawn from values k = 100, 200, 400, 800, 2,000, 4,000, 5,000, 10,000, 15,000, 20,000, 25,000, 27,000, 30,000 and 33,000 for 50,000-point 3D simulations and values k = 100, 200, 400, 800, 1,600, 3,200 and 6,400 for 10,000-point 2D simulations) were identified through kNN analysis of the ground-truth coordinates. The corresponding points were then extracted in the embedding. Each embedded neighborhood was then rigidly aligned to its ground-truth counterpart through Procrustes alignment and the r.m.s.d. was computed. The average distance of the points from the reference point in the ground truth defined the ‘spread’. This procedure yielded (spread, r.m.s.d.) pairs, which were binned according to the spread value using 20 equally sized bins between the minimum and maximum values. Shaded bands indicated ±1 s.d. about the mean r.m.s.d. within each bin.
The paired figure panels Fig. 2a–f are individually representative but single zebrafish embryos from each chemistry shown at different locations within the embryo. Figure 4y, drawn from an earlier dataset6, corresponds to a single randomly deposited cell population that underwent both fluorescence imaging and unanchored diffusion DNA microscopy. Figures 5a–d,m–t and 6k–o are the same four embryos, as indicated in the figure caption, corresponding to the biological replicates in the current experiment.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.