METHODS

This is intended as a brief summary of the method. For full experimental details, see the preprint at https://www.biorxiv.org/content/10.1101/2025.06.05.658101v1.full

Dimethyl sulfate transcriptome-wide RNA accessibility mapping by sequencing (DMS-TRAM-seq)

U2OS cells were treated with 3% DMS in DMEM for 3 minutes, followed by on-plate washes in PBS and lysis in the 1% BME lysis buffer from the PureLink RNA Mini Kit, which was also used for RNA isolation. After ribodepletion and fragmentation, reverse transcription was performed with TGIRT using random primers, enabling inclusion of noncoding RNAs. The IDT xGen Broad-Range RNA library kit (formerly the Swift Biosciences RNA library kit, which was used to collect these data) was used for the rest of the library preparation, and the resulting libraries were sequenced at 100-200 million paired reads per sample on NovaSeq.

Data analysis

Sequencing reads were trimmed using Cutadapt and deduplicated using Clumpify (BBTools package). Trimmed and deduplicated reads were then mapped to the human genome (hg38) using STAR, with only uniquely mapped reads included in the final alignment. Mismatch rates for every reference base were calculated for the entire genome using bcftools mpileup. These rates were normalized by subtracting the signal from non-DMS-treated replicates, which corrects for SNPs and endogenous modifications. After filtering for bases with sufficient coverage (≥ 100 filtered reads per base) across all samples, the three biological replicates were averaged to yield the signal used on this site.
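
The last three steps above (background subtraction, coverage filtering, replicate averaging) can be sketched for a single base as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the input layout, and the choice to subtract the mean untreated rate are all assumptions.

```python
from statistics import mean

MIN_COVERAGE = 100  # minimum filtered reads per base, as stated in the text


def normalized_signal(treated, untreated, min_coverage=MIN_COVERAGE):
    """Return the background-subtracted, replicate-averaged signal for one base.

    treated / untreated: lists of (mismatch_rate, coverage) tuples, one per
    replicate (hypothetical layout). Returns None if any sample falls below
    the coverage threshold.
    """
    if any(cov < min_coverage for _, cov in treated + untreated):
        return None  # base is filtered out
    # Assumption: the background is the mean mismatch rate of untreated samples.
    background = mean(rate for rate, _ in untreated)
    return mean(rate - background for rate, _ in treated)


treated = [(0.05, 200), (0.07, 150), (0.06, 300)]
untreated = [(0.01, 200), (0.01, 120), (0.01, 500)]
print(normalized_signal(treated, untreated))
```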

Structure prediction

Full code and data repository information for the structure prediction pipeline used here is available at https://github.com/whitehead/humanrnamap

User-defined input coordinates are used to pull the relevant data from the genome-wide mutational profile. The data are then winsorized with a top 5% limit to remove outliers (values above the 95th percentile are capped at that percentile), and the scaled reactivities are passed to RNAstructure’s Fold command as a constraint file. RNAstructure Fold outputs a base-pairing prediction for the region, which is then drawn as an image with the DMS signal overlaid via VARNA.
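
As a rough illustration of the winsorization step, the sketch below caps values at the 95th percentile and rescales them to the range 0 to 1. The nearest-rank percentile, the division by the cap, and the clipping of negative values to zero are assumptions made for this example; the repository linked above is the reference for what the pipeline actually does.

```python
import math


def winsorize_and_scale(values, upper_pct=95):
    """Cap values at the upper_pct percentile, then scale to [0, 1].

    values: per-base signals, with None for bases lacking data.
    None entries are passed through unchanged.
    """
    observed = sorted(v for v in values if v is not None)
    if not observed:
        return list(values)
    # Nearest-rank percentile (assumption; other definitions interpolate).
    rank = math.ceil(len(observed) * upper_pct / 100)
    cap = observed[max(rank, 1) - 1]
    if cap <= 0:
        return [None if v is None else 0.0 for v in values]
    # Assumption: negative values (possible after background subtraction)
    # are clipped to zero before scaling.
    return [None if v is None else min(max(v, 0.0), cap) / cap for v in values]


signal = [float(i) for i in range(1, 21)]  # 1.0 .. 20.0
scaled = winsorize_and_scale(signal)
print(scaled[-3:])  # the largest value is capped to match the 95th percentile
```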

For each structure, an AUC value is calculated based on the area under the receiver operating characteristic (ROC) curve. In this calculation, “true positives” are highly modified bases that are predicted to be unpaired, while “false positives” are highly modified bases that are predicted to be paired. This value measures how well the DMS signal agrees with the predicted structure, where a value of 1 is perfect agreement and a value of 0.5 is random performance.
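
One way to compute such an AUC is sketched below. It relies on the standard result that the area under the ROC curve equals the probability that a randomly chosen unpaired base has a higher signal than a randomly chosen paired base, with ties counted as one half. The function name and the dot-bracket input are assumptions for illustration; the pipeline may compute the value differently.

```python
def structure_auc(dot_bracket, reactivities):
    """ROC AUC for DMS signal versus a predicted structure.

    dot_bracket: structure string where '.' marks an unpaired base.
    reactivities: per-base signals, None where no data is available.
    Returns None if either class (paired or unpaired) has no data.
    """
    unpaired, paired = [], []
    for symbol, value in zip(dot_bracket, reactivities):
        if value is None:
            continue  # bases without DMS data are ignored
        (unpaired if symbol == "." else paired).append(value)
    if not unpaired or not paired:
        return None
    wins = 0.0
    for u in unpaired:
        for p in paired:
            if u > p:
                wins += 1.0
            elif u == p:
                wins += 0.5
    return wins / (len(unpaired) * len(paired))


# Hairpin with two unpaired loop bases carrying the highest signal.
print(structure_auc("((..))", [0.0, 0.1, 0.9, 0.8, 0.1, 0.0]))
```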

ACKNOWLEDGEMENTS

The data presented in this resource were generated and analyzed by Kelsey Farenhem of the Jain lab at the Whitehead Institute for Biomedical Research, with data analysis assistance from Troy Whitfield at the Bioinformatics and Research Computing core facility at the Whitehead Institute. Code streamlining user-defined secondary structure prediction from this data was written by Alina Chouloute and later modified by K.F. and A.N.H.

Special thanks to Andy Nutter-Upham and Scott McCallum at the Whitehead Institute for website development, code streamlining, and generally making this database possible.

For full protocols and to cite this work, reference our publication at https://www.biorxiv.org/content/10.1101/2025.06.05.658101v1.full


How to predict by coordinate

Input the genomic coordinates of your region of interest, using hg38 as the reference genome. This input is the most flexible, and is unconstrained by sequencing coverage or any annotation. Be sure to double-check that the output sequence matches your expected region, and check that the output coverage of A/C bases, which reflects the fraction of DMS-modifiable bases meeting all coverage and quality filters, is above the recommended 70%.
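
For users working with downloaded data, the A/C coverage check can be approximated as the fraction of A and C bases in the region that carry a signal value. The sketch below assumes a hypothetical layout in which filtered bases are represented as None; the site's own calculation may differ in detail.

```python
RECOMMENDED_AC_COVERAGE = 0.70  # recommended minimum from the text


def ac_coverage(sequence, signal):
    """Fraction of A/C bases in the region that have DMS data.

    sequence: region sequence (A, C, G, T/U).
    signal: per-base values, None where a base failed the filters.
    Returns None if the region contains no A or C.
    """
    ac_values = [v for base, v in zip(sequence.upper(), signal) if base in "AC"]
    if not ac_values:
        return None
    return sum(v is not None for v in ac_values) / len(ac_values)


coverage = ac_coverage("ACGUAC", [0.1, None, None, None, 0.2, 0.3])
print(coverage, coverage >= RECOMMENDED_AC_COVERAGE)
```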

For most users, only one set of coordinates will be used to define their region of interest. However, two sets of coordinates may be needed when joining two separate regions together, such as when crossing a splice junction.

When analyzing a region within a larger transcript, it is generally recommended to add “buffer” regions by extending the region of interest by 20-50 nt on each end. This reduces the likelihood of structures being arbitrarily interrupted by the region borders.

Due to computational and server constraints, users are limited to a maximum input length of 500 nucleotides and a maximum output of 5 predicted structures, though some regions may yield fewer. If this does not suit your needs, consider downloading and running the data and code locally (see Download).
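
The buffer recommendation and the 500-nucleotide limit described above can be combined when preparing coordinates. The helper below is a hypothetical sketch that assumes 1-based, inclusive coordinates and shrinks the buffer evenly when the padded region would exceed the limit; it is not part of the site's code.

```python
MAX_INPUT_LENGTH = 500  # maximum input length stated in the text


def add_buffer(start, end, buffer=30, max_len=MAX_INPUT_LENGTH):
    """Extend a region by `buffer` nt on each end without exceeding max_len.

    start, end: 1-based inclusive coordinates (assumed convention).
    Returns the padded (start, end) tuple.
    """
    length = end - start + 1
    if length > max_len:
        raise ValueError(f"region is {length} nt; maximum is {max_len} nt")
    # Shrink the buffer evenly if the padded region would be too long.
    buffer = min(buffer, (max_len - length) // 2)
    # Do not extend past the first base of the chromosome.
    return max(1, start - buffer), end + buffer


print(add_buffer(1000, 1099, buffer=30))
print(add_buffer(1000, 1479, buffer=50))
```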
