Tutorial on SNAP-enabled auto-curation of PDB TF-DNA complexes containing 5-methyl-cytosines

Search criteria

A search was performed on 2019-09-28 using RCSB PDB with the following criteria (see the screenshot):

Chain Type: there is a Protein and a DNA chain but not any RNA or Hybrid and Representative Structures at 90% Sequence Identity

Summary report in CSV format

Clicking the "Submit Query" button leads to the results page, which shows a total of 4,819 hits matching the search criteria. Next, click the "Reports" drop-down list and select the "Structure" option under the "Summary Reports" section, as shown in the screenshot below:

The "Structure Summary Report" page then appears. Save the 4,819 hits in CSV format to a file named "proDNA-2019sep26-summary.csv" (see the screenshot below). The summary CSV contains the following fields (including the PDB ID), which form the starting point of the subsequent SNAP analyses.

PDB ID, Structure Title, Exp. Method, NDB ID, Resolution, Classification, Rel. Date, Dep. Date, Rev. Date, Structure Author, Structure MW, Macromolecule Type, Residue Count, Atom Site Count, PDB DOI
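As a rough illustration of how the summary report can feed the later steps, the Python sketch below pulls the PDB IDs out of the CSV text. It assumes the header row uses the column name "PDB ID" exactly as listed above; the function name read_pdb_ids and the placeholder rows in the usage note are hypothetical, and the real export may need extra handling.

```python
import csv
import io


def read_pdb_ids(csv_text, id_column="PDB ID"):
    """Return lower-cased PDB IDs from summary-report CSV text.

    IDs are returned in file order with duplicates removed.
    The column name is an assumption based on the field list above.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    ids = []
    for row in reader:
        pdb_id = (row.get(id_column) or "").strip().lower()
        if pdb_id and pdb_id not in seen:
            seen.add(pdb_id)
            ids.append(pdb_id)
    return ids
```

For a file on disk, the text could be read with open("proDNA-2019sep26-summary.csv", newline="").read() and passed to read_pdb_ids.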

Identification and characterization of 5mC-DNA/protein interactions using SNAP

Download PDB coordinates files

Given its PDB ID, the atomic coordinates file for each of the 4,819 DNA-protein complexes can be downloaded from the RCSB PDB. In the following sections, PDB entry 4m9e is used as an example; its coordinates file (4m9e.pdb) was retrieved with the command:
wget https://files.rcsb.org/download/4m9e.pdb
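To repeat the download for many entries, the same URL pattern can be generated programmatically. The Python sketch below is one possible way to do this, not the procedure actually used for this tutorial; the helper names pdb_url and download_pdb are hypothetical, and only the URL pattern is taken from the wget command above.

```python
import os
import urllib.request

# URL pattern taken from the wget command above.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_url(pdb_id):
    """Build the download URL for one PDB entry (ID lower-cased)."""
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id.strip().lower())


def download_pdb(pdb_id, dest_dir="."):
    """Download one coordinates file and return its local path.

    Requires network access; errors are left to propagate to the caller.
    """
    path = os.path.join(dest_dir, pdb_id.strip().lower() + ".pdb")
    urllib.request.urlretrieve(pdb_url(pdb_id), path)
    return path
```

Calling download_pdb("4m9e") would then save 4m9e.pdb in the current directory, mirroring the wget command.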

Run SNAP on PDB coordinates files

SNAP is a general-purpose program, with many command-line options, for characterizing DNA/RNA-protein interactions. Running SNAP on PDB entry 4m9e using the default options is as simple as specifying the input (4m9e.pdb) and output file (4m9e.out):
x3dna-snap -i=4m9e.pdb -o=4m9e.out

The default SNAP output file 4m9e.out includes a list of 2 modified nucleotides, 1 double helix, 48 nucleotide/amino-acid interactions, 40 base-pair/amino-acid interactions, 4 phosphate/amino-acid H-bonds, 13 base/amino-acid H-bonds, 8 base/amino-acid pairs, and 2 base/amino-acid stacks.

The list of the 2 base/amino-acid stacking interactions is shown below. One is 5CM (5-methylcytosine on chain B, residue number 5) with arginine#443 (on chain A), and the other is T7 over histidine#416.

 List of 2 base/amino-acid stacks
       id   nt-aa   nt           aa      vertical-distance   plane-angle
   1  4m9e  C-arg  B.5CM5       A.ARG443        3.40              5
   2  4m9e  T-his  B.DT7        A.HIS416        3.33             11
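For downstream processing, the rows of this list can be turned into records. The Python sketch below assumes the whitespace-separated, seven-column layout shown above (serial number, id, nt-aa, nt, aa, vertical-distance, plane-angle); the function name parse_stack_rows and the dictionary keys are hypothetical, and other SNAP versions may format the list differently.

```python
def parse_stack_rows(text):
    """Parse the data rows of a 'base/amino-acid stacks' list.

    Title and header lines are skipped because they do not have
    seven fields starting with an integer serial number.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        serial, pdb_id, nt_aa, nt, aa, distance, angle = fields
        rows.append({
            "serial": int(serial),
            "pdb_id": pdb_id,
            "nt_aa": nt_aa,
            "nt": nt,
            "aa": aa,
            "vertical_distance": float(distance),
            "plane_angle": float(angle),
        })
    return rows
```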

The SNAP --methyl-C option

For an automatic analysis of 5-methyl-cytosine-containing TF-DNA complexes, the --methyl-C (short form: --methyl) option was introduced into SNAP (and documented as of v1.0.6-2019sep30). Note that in the PDB, 5-methylcytosine (5mC) in DNA is designated 5CM, and the 5-methyl carbon atom is named C5A (see "5CM and 5MC, two forms of 5-methylcytosine in the PDB"). Moreover, the --type=base option is employed to ensure that base atoms of 5mC (as opposed to its sugar-phosphate atoms) are directly involved in interactions with amino acids. The SNAP command thus becomes:
x3dna-snap --methyl-C --type=base -i=4m9e.pdb -o=4m9e-5mC.out
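When the same command has to be issued for many entries, it can be assembled from the PDB ID. The Python sketch below simply reproduces the command line above as an argument list; the helper names snap_5mc_command and run_snap_5mc are hypothetical, and running it assumes that x3dna-snap is installed and on the PATH and that the coordinates file is in the working directory.

```python
import subprocess


def snap_5mc_command(pdb_id):
    """Return the SNAP command shown above as an argument list."""
    return [
        "x3dna-snap",
        "--methyl-C",
        "--type=base",
        "-i=" + pdb_id + ".pdb",
        "-o=" + pdb_id + "-5mC.out",
    ]


def run_snap_5mc(pdb_id):
    """Run SNAP for one entry; raises CalledProcessError on failure."""
    return subprocess.run(snap_5mc_command(pdb_id), check=True)
```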

With the SNAP --methyl option, two additional files are also generated: a text file 4m9e-5mC.txt (shown below) and a corresponding PDB file 4m9e-5mC.pdb (potentially multi-model; it contains two models in this case).

4m9e:B.5CM5: stacking-with-A.ARG443 is-WC-paired is-in-duplex [+]:GcG/cGC
4m9e:C.5CM5: other-contacts is-WC-paired is-in-duplex [-]:cGT/AcG

For a DNA-protein complex that has no 5mC (e.g., PDB entry 1oct), or that has 5mC but no 5mC/amino-acid interactions (e.g., 1odg), SNAP does not generate these two additional files. Thus, the presence of these two files indicates that the entry contains 5mC interacting with amino acids. Running SNAP with the --methyl-C --type=base options on the 4,819 DNA-protein complexes, we identified 107 entries with 5mC whose base atoms interact with amino acids. The 107 PDB entries are listed below:
10mh, 1bsu, 1dct, 1ig4, 2c7o, 2c7p, 2c7q, 2c7r, 2ky8, 2moe, 2uyc, 2uyh, 2uz4, 2zkd, 2zke, 2zkf, 2zo0, 2zo1, 3c2i, 3clz, 3f8i, 3f8j, 3fde, 3q0b, 3q0c, 3q0d, 3q0f, 3ssc, 3ssd, 3vxv, 3vxx, 3vyb, 3vyq, 4aqu, 4aqx, 4da4, 4dkj, 4f6n, 4gjp, 4gjr, 4gzn, 4hp1, 4lg7, 4lt5, 4m9e, 4m9v, 4mht, 4nm6, 4pw7, 4qen, 4qeo, 4qep, 4r28, 4r2a, 4r2e, 4r2r, 4r2s, 4x9j, 5bt2, 5cg9, 5cpj, 5cpk, 5ef6, 5ego, 5emc, 5gse, 5ke7, 5ke8, 5kl4, 5kl5, 5kl7, 5lty, 5lux, 5mcv, 5mcw, 5mht, 5szx, 5t00, 5t01, 5vmu, 5vmv, 5vmw, 5vmx, 5vmy, 5vmz, 6a5n, 6c1a, 6c1t, 6c1u, 6c1y, 6c2f, 6cc8, 6ccg, 6ceu, 6cnp, 6cnq, 6d1t, 6e93, 6e94, 6jnm, 6jnn, 6mg2, 6mg3, 6mht, 6ml6, 6ml7, 6ogk
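Because the presence of the two extra files marks an entry with 5mC/amino-acid interactions, such a list can be collected with a simple file check after the SNAP runs. The Python sketch below assumes the file-naming pattern of the 4m9e example (ID-5mC.txt and ID-5mC.pdb) and that all outputs sit in one directory; the function name entries_with_5mc_contacts is hypothetical, and this is not necessarily how the list above was compiled.

```python
import os


def entries_with_5mc_contacts(pdb_ids, directory="."):
    """Return the IDs for which both extra SNAP files exist.

    The '<id>-5mC.txt' and '<id>-5mC.pdb' names follow the 4m9e example
    and are an assumption for other entries.
    """
    hits = []
    for pdb_id in pdb_ids:
        txt_file = os.path.join(directory, pdb_id + "-5mC.txt")
        pdb_file = os.path.join(directory, pdb_id + "-5mC.pdb")
        if os.path.isfile(txt_file) and os.path.isfile(pdb_file):
            hits.append(pdb_id)
    return hits
```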

Among the 107 PDB entries, 41 are enzymes (e.g., ligases, transferases, etc.), two were solved by NMR (1ig4 and 2ky8), and four X-ray crystal structures have a resolution worse than 3.0 Å (4qep, 5cpj, 5gse and 5lux). Excluding these 47 entries leaves a total of 60 transcription factor-DNA complexes, as listed below:
3c2i, 3ssc, 3ssd, 4f6n, 4gjp, 4gjr, 4gzn, 4hp1, 4m9e, 4m9v, 4qen, 4qeo, 4r2a, 4r2e, 4r2r, 4r2s, 4x9j, 5bt2, 5cpk, 5ef6, 5ego, 5emc, 5ke7, 5ke8, 5kl4, 5kl5, 5kl7, 5lty, 5mcv, 5mcw, 5szx, 5t00, 5t01, 5vmu, 5vmv, 5vmw, 5vmx, 5vmy, 5vmz, 6a5n, 6c1a, 6c1t, 6c1u, 6c1y, 6c2f, 6cc8, 6ccg, 6ceu, 6cnp, 6cnq, 6d1t, 6e93, 6e94, 6jnm, 6jnn, 6mg2, 6mg3, 6ml6, 6ml7, 6ogk
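The bookkeeping behind this selection (107 - 41 - 2 - 4 = 60) amounts to removing three non-overlapping groups of IDs from the full list. The Python sketch below illustrates the idea; the function name select_tf_entries is hypothetical, and the enzyme IDs must be supplied by the caller because the tutorial does not enumerate them.

```python
def select_tf_entries(entries, enzyme_ids, nmr_ids, low_resolution_ids):
    """Drop enzymes, NMR structures and low-resolution entries.

    The order of the remaining entries is preserved.
    """
    excluded = set(enzyme_ids) | set(nmr_ids) | set(low_resolution_ids)
    return [entry for entry in entries if entry not in excluded]


# Counts quoted in the text above.
TOTAL_WITH_5MC = 107
EXCLUDED_COUNTS = {"enzymes": 41, "NMR": 2, "worse than 3.0 Å": 4}
REMAINING = TOTAL_WITH_5MC - sum(EXCLUDED_COUNTS.values())
```

With the NMR IDs (1ig4, 2ky8) and low-resolution IDs (4qep, 5cpj, 5gse, 5lux) named above, plus the set of 41 enzyme IDs, the function would reduce the 107 entries to the 60 listed here.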

Three modes of 5mC recognition by TF

In this work, 5mC recognition by a TF is classified into three modes: (1) stacking interactions with a planar amino-acid side chain, most notably arginine; (2) hydrophobic interactions with ILE, VAL, ALA, MET, LEU, PRO, or GLY; and (3) contacts other than these two modes. As shown below, modes (1) and (3) are detected for PDB entry 4m9e, whilst 4gjr provides an example of mode (2).

4m9e:B.5CM5: stacking-with-A.ARG443 is-WC-paired is-in-duplex [+]:GcG/cGC
4m9e:C.5CM5: other-contacts is-WC-paired is-in-duplex [-]:cGT/AcG
4gjr:I.5CM8: hydrophobic-with-A.GLY437 hydrophobic-with-A.GLY438 is-WC-paired is-in-duplex [+]:TcT/AGA

In the SNAP output, "is-WC-paired" means the 5mC is in a Watson-Crick base pair, and "is-in-duplex" indicates that the base pair is in a DNA double helix. The last portion shows the sequence context of 5mC (abbreviated to lower-case 'c') with its immediate 5' and 3' neighbors: [+] means the leading strand, whilst [-] signifies the reverse complementary strand.
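The annotation lines are regular enough to be split into fields automatically. The Python sketch below is based only on the three example lines shown above, so the token names it recognizes (stacking-with-, hydrophobic-with-, other-contacts, is-WC-paired, is-in-duplex) may not cover every case SNAP can emit; the function names parse_5mc_annotation and recognition_modes and the dictionary keys are hypothetical.

```python
def parse_5mc_annotation(line):
    """Split one '<id>:<nt>: ...' annotation line into a dictionary."""
    tokens = line.split()
    pdb_id, nt = tokens[0].rstrip(":").split(":", 1)
    record = {
        "pdb_id": pdb_id, "nt": nt,
        "stacking": [], "hydrophobic": [], "other_contacts": False,
        "wc_paired": False, "in_duplex": False,
        "strand": None, "context": None,
    }
    for token in tokens[1:]:
        if token.startswith("stacking-with-"):
            record["stacking"].append(token[len("stacking-with-"):])
        elif token.startswith("hydrophobic-with-"):
            record["hydrophobic"].append(token[len("hydrophobic-with-"):])
        elif token == "other-contacts":
            record["other_contacts"] = True
        elif token == "is-WC-paired":
            record["wc_paired"] = True
        elif token == "is-in-duplex":
            record["in_duplex"] = True
        elif token.startswith("[") and "]:" in token:
            # e.g. "[+]:GcG/cGC" -> strand "+", context "GcG/cGC"
            strand, context = token.split("]:", 1)
            record["strand"] = strand.lstrip("[")
            record["context"] = context
    return record


def recognition_modes(record):
    """Map a parsed record to the mode numbers (1, 2, 3) defined above."""
    modes = set()
    if record["stacking"]:
        modes.add(1)
    if record["hydrophobic"]:
        modes.add(2)
    if record["other_contacts"]:
        modes.add(3)
    return modes
```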

Bibliographic information

For the 60 entries with TF-5mC DNA interactions, we extracted the primary citations from the PDB with the following fields: structureId, classification, structureTitle, experimentalTechnique, resolution, title, citationAuthor, journalName, publicationYear, volumeId, firstPage, lastPage, pubmedId. When a pubmedId is available for a PDB entry, we downloaded the abstract from PubMed. The bibliographic information and SNAP annotations are presented in a dynamic table using DataTables.
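The tutorial does not say how the abstracts were downloaded. One common route, shown here purely as an assumption, is the NCBI E-utilities efetch service, which returns a plain-text abstract for a given PubMed ID. The helper names pubmed_abstract_url and fetch_pubmed_abstract are hypothetical, and the fetch itself needs network access.

```python
import urllib.parse
import urllib.request

# NCBI E-utilities endpoint (an assumed route, not confirmed by the tutorial).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pubmed_abstract_url(pubmed_id):
    """Build an efetch URL requesting a plain-text abstract."""
    query = urllib.parse.urlencode({
        "db": "pubmed",
        "id": str(pubmed_id),
        "rettype": "abstract",
        "retmode": "text",
    })
    return EFETCH + "?" + query


def fetch_pubmed_abstract(pubmed_id):
    """Download the abstract text for one PubMed ID."""
    with urllib.request.urlopen(pubmed_abstract_url(pubmed_id)) as response:
        return response.read().decode("utf-8")
```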

Molecular visualization using DSSR, PyMOL and 3Dmol.js

Each of the TF-5mC PDB entries also has a detailed annotation webpage, with bibliographic information and molecular images.

DSSR-PyMOL block images

For the PDB entry as a whole, DSSR-PyMOL schematic block images are generated in six orthogonal views (front, right, top; back, left, bottom). Specifically, the DSSR options used are --blocview --block-file=wc-minor, which orient the structure in the most extended view and show WC pairs as long blocks with the minor-groove edge colored black. The PyMOL session file corresponding to the top-left image is available for download so that users can reproduce the molecular images exactly.

Please visit the 'schematics' section in the SNAP-annotation page for PDB entry 4m9e for an example.

Molecular images for 5mC interactions

Each 5mC is also shown in a zoomed-in view, together with the bases and amino acids that interact with it. Moreover, the cluster of interacting residues is oriented in the standard base reference frame of 5mC, allowing for easy comparison and direct overlap of multiple clusters. The atomic coordinates are transformed directly by SNAP and stored in files like 4m9e-5mC.pdb, as noted above.

The static image is generated by DSSR with the options --block-file=fill-hbond --cartoon-block=sticks-label and rendered in PyMOL. It shows filled base rings and H-bonds, with the residues drawn as sticks and labeled. Moreover, a 3Dmol.js viewer is embedded for interactive visualization.

Please visit the 'contacts' section in the SNAP-annotation page for PDB entry 4m9e for an example.

Last updated on 2019-09-30 by Xiang-Jun Lu <xiangjun@x3dna.org>