Network analysis of synonymous codon usage
Milenkovic Lab

Contact: Tijana Milenkovic, tmilenko AT nd DOT edu

Introduction: Most amino acids are encoded by multiple synonymous codons. For an amino acid, some of its synonymous codons are used much more rarely than others. Analyses of positions of such rare codons in protein sequences revealed that rare codons can impact co-translational protein folding and that positions of some rare codons are evolutionarily conserved. Analyses of positions of rare codons in proteins’ 3-dimensional structures, which are richer in biochemical information than sequences alone, might further explain the role of rare codons in protein folding. We analyze a protein set recently annotated with codon usage information, considering non-redundant proteins with sufficient structural information. We model the proteins’ structures as networks and study potential differences between network positions of amino acids encoded by evolutionarily conserved rare, evolutionarily non-conserved rare, and commonly used codons. In 84% of the proteins, at least one of the three codon categories occupies significantly more or less network-central positions than the other codon categories. Many of the protein groups showing different codon centrality trends (i.e., different types of relationships between network positions of the three codon categories) are enriched in unique biological functions, implying a possible existence of a link between codon usage, protein folding, and protein function.

Reference: Khalique Newaz, Gabriel Wright, Jacob Piland, Jun Li, Patricia Clark, Scott Emrich, and Tijana Milenkovic (2019), Network analysis of synonymous codon usage, submitted.

Data set: Starting with a recent large data set consisting of ∼280,000 proteins spanning 76 species for which codon usage information is available, we consider a subset of these proteins that are non-redundant (at most 90% sequence-similar) to each other and that have sufficient 3-dimensional protein structural information in the Protein Data Bank, which results in 63 proteins spanning seven species. For each of the 63 proteins, we provide a mapping file and the corresponding protein structure network (PSN).

The mapping files can be downloaded from here. Each of the mapping files has the following information:
- Column 1 indicates the PDB sequence number of the amino acids
- Column 2 indicates the amino acid codes for the PDB sequence
- Column 3 indicates the amino acid codes for the PDB sequence that has been resolved in the 3D structure. Non-resolved positions have a "-" sign.
- Column 4 indicates the amino acid codes for the part of the sequence that has been mapped from a protein sequence to its reciprocal best hit in the PDB sequence. Non-resolved positions have a "-" sign.
- Column 5 indicates the sequence number of the portion of the protein sequence that has been mapped to its reciprocal best hit in the PDB sequence. Non-mapped positions have a "-" sign.
- Column 6 indicates the %MinMax value for each position of the protein sequence. Sequence positions for which we do not have a %MinMax value have a "-" sign.
- Column 7 contains binary values indicating if a sequence position is annotated as rare (1) or not rare (0).
- Column 8 contains binary values indicating if a sequence position is annotated as conserved (1) or not conserved (0).
- Columns 9 to 15 contain node centrality values corresponding to degree, graphlet degree, k-coreness, closeness, clustering coefficient, eccentricity, and betweenness centrality, respectively.
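As an illustration of the column layout above, the following is a minimal Python sketch of how a mapping file might be parsed. It assumes that columns are whitespace-separated, that there is no header row, and that centrality columns may also hold "-" for positions without a value; none of this is stated on this page, so check it against the downloaded files. The names `MappingRow`, `parse_mapping`, and `codon_category` are hypothetical, and the way columns 7 and 8 are combined into the three codon categories is our reading of the introduction, not a confirmed rule.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Order follows the description of columns 9 to 15.
CENTRALITY_NAMES = [
    "degree", "graphlet_degree", "k_coreness", "closeness",
    "clustering_coefficient", "eccentricity", "betweenness",
]


def _opt(value: str, cast):
    """Return None for the "-" placeholder, otherwise cast(value)."""
    return None if value == "-" else cast(value)


@dataclass
class MappingRow:
    pdb_pos: str                      # column 1 (kept as text)
    pdb_aa: str                       # column 2
    resolved_aa: Optional[str]        # column 3
    mapped_aa: Optional[str]          # column 4
    seq_pos: Optional[int]            # column 5
    minmax: Optional[float]           # column 6
    rare: Optional[int]               # column 7
    conserved: Optional[int]          # column 8
    centralities: Dict[str, Optional[float]]  # columns 9 to 15


def parse_mapping(lines: Iterable[str]) -> List[MappingRow]:
    """Parse mapping-file lines (ASSUMED whitespace-separated, no header)."""
    rows = []
    for line in lines:
        f = line.split()
        if len(f) != 15:
            continue  # skip blank or malformed lines
        rows.append(MappingRow(
            pdb_pos=f[0],
            pdb_aa=f[1],
            resolved_aa=_opt(f[2], str),
            mapped_aa=_opt(f[3], str),
            seq_pos=_opt(f[4], int),
            minmax=_opt(f[5], float),
            rare=_opt(f[6], int),
            conserved=_opt(f[7], int),
            centralities={
                name: _opt(v, float)
                for name, v in zip(CENTRALITY_NAMES, f[8:15])
            },
        ))
    return rows


def codon_category(row: MappingRow) -> Optional[str]:
    """ASSUMED combination of columns 7 and 8 into three categories."""
    if row.rare is None:
        return None
    if row.rare == 0:
        return "common"
    return "conserved_rare" if row.conserved == 1 else "nonconserved_rare"
```

With a file on disk, one could call `parse_mapping(open(path))` and then group the rows by `codon_category` to compare centrality values between categories.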
The mapping files provide, for each protein, the centrality values of its amino acids according to seven node centrality measures. The program for computing the node centrality values can be found here. To facilitate the use of node centrality measures other than these seven on our PSNs, we provide all 63 PSNs in two formats: LEDA (".gw") format and edge list (".list") format. The node names in a PSN file correspond to the sequence positions of the corresponding amino acids. The PSNs can be downloaded from here.
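As an example of applying an additional centrality measure to a PSN, the sketch below reads an edge list and computes harmonic centrality with a breadth-first search, using only the Python standard library. It assumes that each line of a ".list" file holds one undirected edge as two whitespace-separated node names; this is not stated on this page and should be checked against the downloaded files. Harmonic centrality is chosen purely for illustration and is not one of the seven measures listed above; the function names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set


def read_edge_list(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from 'u v' lines (ASSUMED format)."""
    adj = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        u, v = parts[0], parts[1]
        if u == v:
            continue  # ignore self-loops
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def harmonic_centrality(adj: Dict[str, Set[str]]) -> Dict[str, float]:
    """Sum of 1/d(u, v) over all nodes v reachable from u (v != u)."""
    scores = {}
    for source in adj:
        # Breadth-first search gives shortest-path lengths in an
        # unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        scores[source] = sum(1.0 / d for d in dist.values() if d > 0)
    return scores
```

Node names are kept as strings here, so the resulting scores can be matched against the sequence positions in the corresponding mapping file.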