Open Access Highly Accessed Research

Imperfect DNA mirror repeats in the gag gene of HIV-1 (HXB2) identify key functional domains and coincide with protein structural elements in each of the mature proteins

Dorothy M Lang

Author Affiliations

School of Contemporary Sciences, University of Abertay-Dundee, Bell Street, Dundee DD1 1HG, Scotland, UK

Virology Journal 2007, 4:113  doi:10.1186/1743-422X-4-113


The electronic version of this article is the complete one and can be found online at: http://www.virologyj.com/content/4/1/113


Received: 28 September 2007
Accepted: 26 October 2007
Published: 26 October 2007

© 2007 Lang; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

A DNA mirror repeat is a sequence segment delimited on the basis of its containing a center of symmetry on a single strand, e.g. 5'-GCATGGTACG-3'. It is most frequently described in association with a functionally significant site in a genomic sequence, and its occurrence is regarded as noteworthy, if not unusual. However, imperfect mirror repeats (IMRs) having ≥ 50% symmetry are common in the protein coding DNA of monomeric proteins and their distribution has been found to coincide with protein structural elements – helices, β sheets and turns. In this study, the distribution of IMRs is evaluated in a polyprotein – to determine whether IMRs may be related to the position or order of protein cleavage or other hierarchal aspects of protein function. The gag gene of HIV-1 [GenBank:K03455] was selected for the study because its protein motifs and structural components are well documented.

Results

There is a highly specific relationship between IMRs and structural and functional aspects of the Gag polyprotein. The five longest IMRs in the polyprotein translate a key functional segment in each of the five cleavage products. Throughout the protein, IMRs coincide with functionally significant segments of the protein. A detailed annotation of the protein, which combines structural, functional and IMR data illustrates these associations. There is a significant statistical correlation between the ends of IMRs and the ends of PSEs in each of the mature proteins. Weakly symmetric IMRs (≥ 33%) are related to cleavage positions and processes.

Conclusion

The frequency and distribution of IMRs in HIV-1 Gag indicate that DNA symmetry is a fundamental property of protein coding DNA and that different levels of symmetry are associated with different functional aspects of the gene and its protein. The interaction between IMRs and protein structure and function is precise and interwoven over the entire length of the polyprotein. The distribution of IMRs and their relationship to structural and functional motifs in the protein that they translate suggest that DNA-driven processes, including the selection of mirror repeats, may be a constraining factor in molecular evolution.

Background

A DNA mirror repeat is a sequence segment delimited on the basis of its containing a center of symmetry on a single strand and identical terminal nucleotides. For example, in the sequence below, TACACG is the mirror image of GCACAT.

    <----------  ---------->

5'- T A C A C G  G C A C A T -3'

3'- A T G T G C  C G T G T A -5'

Imperfect DNA mirror repeats (IMRs) are less than 100% symmetrical.
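The degree of symmetry of a candidate segment can be illustrated with a minimal Python sketch. The function name `mirror_symmetry` and the choice to count every mirrored pair (including the terminal pair, and ignoring a central nucleotide in odd-length segments) are assumptions made for illustration, not the scoring code used in this study.

```python
def mirror_symmetry(seq: str) -> float:
    """Fraction of mirrored position pairs that carry the same nucleotide.

    Assumption: all pairs (i, len-1-i) are scored equally; the central
    nucleotide of an odd-length segment has no partner and is ignored.
    """
    seq = seq.upper()
    half = len(seq) // 2
    if half == 0:
        return 0.0
    matches = sum(seq[i] == seq[-1 - i] for i in range(half))
    return matches / half


# The perfect mirror repeat from the example above scores 1.0.
print(mirror_symmetry("TACACGGCACAT"))
# Changing one nucleotide on one side leaves five of six pairs matching.
print(mirror_symmetry("TACACGGCTCAT"))
```

Under this scoring, a segment qualifies as an IMR at the 50% criterion when at least half of its mirrored pairs match.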

The identification of mirror repeats is highly dependent on how they are defined. One method is to identify all mirror repeats within a sequence by systematically evaluating the symmetry of each string within it. This method identifies relatively long (or maximal) symmetric strings (mIMRs). Using symmetry criteria of ≥ 50% and discounting strings completely contained within other strings, the longest mIMRs in TnsA were found to coincide with key structural domains [1].
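The exhaustive search described above might be sketched as follows. This is an illustrative reading, not the original program: the names `find_imrs` and `drop_contained`, the `min_len` parameter, and the centre-expansion strategy are assumptions. Spans are reported as (start, end) with the end exclusive.

```python
def find_imrs(seq, min_sym=0.5, min_len=6):
    """Spans whose terminal nucleotides are identical and whose mirrored
    pairs match in at least `min_sym` of cases (hypothetical helper)."""
    seq = seq.upper()
    n = len(seq)
    # Every centre of symmetry: between two bases (even length)
    # or on a single base (odd length).
    starts = [(c - 1, c) for c in range(1, n)]
    starts += [(c - 1, c + 1) for c in range(1, n - 1)]
    spans = set()
    for left, right in starts:
        matches = pairs = 0
        while left >= 0 and right < n:
            pairs += 1
            same = seq[left] == seq[right]
            matches += same
            long_enough = right - left + 1 >= min_len
            if same and long_enough and matches / pairs >= min_sym:
                spans.add((left, right + 1))
            left -= 1
            right += 1
    return sorted(spans)


def drop_contained(spans):
    """Discard every span that lies completely inside a different span."""
    return [s for s in spans
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1]
                       for o in spans)]


print(drop_contained(find_imrs("TACACGGCACAT")))
```

Expanding outward from each centre evaluates every substring once, so the search is quadratic in sequence length, which is adequate for a gene of about 1500 nt.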

Another type of mirror repeat is identified by progressively evaluating, from the start to the end of a sequence, symmetric sub-strings bounded by reverse dinucleotides (rdIMRs). These are generally shorter than and often contained within mIMRs. Lang [1] found statistically significant correlations for the coincidence of the ends of rdIMRs and the ends of protein structural elements – helices, β-sheets and turns – in 17 monomeric proteins. In TnsA (E. coli), 88% of the known or potential functional motifs occur within rdIMRs and the longest mIMRs translate key functional and/or structural sequences of the protein.

In this study, the distribution of IMRs is evaluated in a gene that translates a polyprotein. The specific goals were to determine whether IMRs span the entire polyprotein, to identify the relationship of IMRs in the precursor to IMRs in the mature cleavage products and to assess the relationship between IMRs and protein functional and structural motifs. The HIV-1 gag sequence used for this analysis is HXB2_LAI_IIIB_BRU [GenBank:K03455], the most commonly used reference sequence for the HIV-1 genome [2]. The gag gene of HIV-1 is about twice as long as TnsA, and translates the following proteins (in the order of their occurrence within the sequence): matrix (MA), capsid (CA), p2 (SP1), nucleocapsid (NC), and either (a) p1 (SP2) and p6 or (b) GagTF. CA is about the same length as TnsA. The cleavage positions for each of the mature proteins of Gag (HXB2) are summarized in Table 1.

Table 1. Nucleotide and amino acid sequences adjacent to cleavage sites in Gag (HXB2) [2]

Gag proteins are the structural components of the HIV-1 virus, and cleavage of the Gag polyprotein into several mature proteins is essential to replication. Near the C-terminus of Gag (at the NC-p1 cleavage site), the coding sequence becomes polycistronic. The ribosome "slips" within the motif "tttttt" in about one of every 20 rounds of Gag translation, and the resulting product is GagTF-Pol. At maturation, the Pol segment is cleaved into enzymatic proteins. Gag and Gag-Pol are cleaved differentially and in stages. This process is summarized in Table 2.

Table 2. Gag and Gag-Pol are differentially cleaved at maturation

In order to facilitate the comparison of multiple types of data within the context of the protein, a comprehensive annotation of complete Gag sequence was made (Additional file 1) that combines experimentally determined functional and structural motifs, and the sequence positions of IMRs found in this study.

Additional file 1. Functional, structural and IMR motifs in Gag (HXB2). This table compares experimentally determined structural and functional positions of the Gag sequence with IMRs. The Gag sequence has a grey background. Annotations based on experimental evidence occur above the sequence; those that are translated by IMRs are bolded. The secondary structure of the sequence (its PDB file indicated to the right) is below the sequence (H = helix, B = residue in isolated beta bridge, E = extended beta strand, G = 3₁₀ helix, T = hydrogen bonded turn, S = bend). Below the structural information are the protein translations of DNA-IMRs identified in this study; to the right is this author's interpretation of the relationship between the indicated IMR and the known function indicated above the sequence. The IMR number indicates its rank, according to length. A hash mark (#) indicates an mIMR; a dollar sign ($) indicates an rdIMR. Sequences that are protein translations of mIMRs are in bold letters. In order to simplify the descriptions of function or structure for each motif, the earliest publication is referenced; if subsequent findings for the motif substantially altered interpretation, the motif is repeated with the new reference. References for this file are available in Additional file 2.


Results

The five longest mIMRs in gag that are ≥ 50% symmetric each translate an essential protein motif in a different cleavage product, indicating that the association between mIMR length and function may be related to selection in both the polyprotein and its cleaved products. Most IMRs translate distinct, functionally significant protein motifs. At symmetry ≥ 50% there are significant statistical correlations between the ends of both mIMRs and rdIMRs, and the ends of protein structural elements (PSEs). Several mIMRs that are ≥33% symmetric start or stop at cleavage positions.

The DNA and amino acid sequence positions of the longest L1 mIMRs are listed in Table 3. The designation L1 means that it is the longest IMR for a unique span of the DNA sequence. MIMRs are identified by evaluating the symmetry of every possible sub-string of a DNA sequence, then nesting them sequentially, beginning at the 5' end. The span of the first IMR is designated L1; all shorter IMRs within the span are designated progressively higher levels (L2, L3, etc.) based on whether they are completely contained within another IMR. The next L1 IMR ends downstream from the end of the preceding IMR; it may begin within a preceding IMR or downstream from it. For the remainder of this article, all references to IMRs refer to L1 IMRs. Each (L1) mIMR is assigned an ID number based on rank by length, and is preceded by a hash mark (e.g. #1-gag). The positions of some mIMRs differ by only a few amino acids, so it is possible to simplify the data by discounting mIMRs that substantially overlap. Table 4 summarizes this simplification and illustrates that, although mIMRs occur throughout most of the Gag protein, each span is associated with distinct structural or functional domains.
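The nesting of IMRs into levels can be sketched as below. This is one possible reading of the description, in which the level of a span is one more than the highest level among the spans that completely contain it; the function name `assign_levels` and the example coordinates are hypothetical.

```python
def assign_levels(spans):
    """Map each (start, end) span to its nesting level (L1 = 1, L2 = 2, ...).

    Assumption: a span's level is 1 + the highest level of any span that
    completely contains it; spans contained in nothing are L1.
    """
    # Sorting by start, then by descending end, guarantees that every
    # containing span is processed before the spans nested inside it.
    ordered = sorted(set(spans), key=lambda s: (s[0], -s[1]))
    levels = {}
    for span in ordered:
        containing = [levels[o] for o in levels
                      if o[0] <= span[0] and span[1] <= o[1]]
        levels[span] = 1 + max(containing, default=0)
    return levels


# Made-up coordinates: (40, 90) begins inside (0, 60) but ends downstream
# of it, so it is also an L1 span.
example = [(0, 60), (10, 30), (12, 20), (40, 90), (50, 55)]
for span, level in assign_levels(example).items():
    print(span, "L%d" % level)
```

Filtering the result for level 1 yields the L1 set to which the remainder of the article refers.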

Table 3. mIMRs in gag that are ≥50% symmetrical

Table 4. Simplification of Table 3 by removal of slightly overlapping mIMRs

MIMRs were found separately for the Gag polyprotein and each of the cleavage products. It was anticipated that the mIMRs for the Gag CDS would differ from those for the components, but they did not, except that two mIMRs in NC attain L1 status only when NC is evaluated separately (not as part of gag). The distribution of mIMRs in Gag indicates that most of the largest mIMRs do not span sequences that will be cleaved into separate proteins. The single exception is E419..E454 (#3-gag), which spans NC-p1, and terminates at the p1-p6 cleavage site; this is the segment that is differentially cleaved in Gag and Gag-Pol.

Table 5 lists the DNA and amino acid sequence positions of the longest rdIMRs. RdIMRs are identified by sequentially evaluating, from 5' to 3', the symmetry of each substring delineated by each dinucleotide and the next downstream reverse dinucleotide. They are nested by the same process described for mIMRs. Most of the protein segments translated by rdIMRs coincide with experimentally determined structural or functional motifs of the protein.
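The rdIMR search can be sketched in a few lines. This is an illustrative interpretation only: the name `find_rdimrs` is hypothetical, the reverse dinucleotide is assumed not to overlap the starting dinucleotide, and only the nearest downstream reverse dinucleotide is considered for each start position. Spans are (start, end) with the end exclusive.

```python
def find_rdimrs(seq, min_sym=0.5):
    """Segments bounded by a dinucleotide and the next downstream reverse
    dinucleotide, kept when at least `min_sym` of mirrored pairs match."""
    seq = seq.upper()
    found = []
    for i in range(len(seq) - 3):
        reverse_di = seq[i + 1] + seq[i]
        # Next occurrence of the reversed dinucleotide downstream of i.
        j = seq.find(reverse_di, i + 2)
        if j == -1:
            continue
        segment = seq[i:j + 2]
        half = len(segment) // 2
        matches = sum(segment[k] == segment[-1 - k] for k in range(half))
        if matches / half >= min_sym:
            found.append((i, j + 2))
    return found


print(find_rdimrs("TACACGGCACAT"))
```

Because the two terminal pairs match by construction, short segments pass the 50% criterion easily, which is consistent with rdIMRs being shorter and more numerous than mIMRs.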

Table 5. rdIMRs in gag ranked by length

MIMRs and rdIMRs vary in distribution, beyond that which would occur due to the differences in their lengths. MIMRs occur throughout most of gag, as a series of overlapping, or nearly overlapping spans; within many mIMRs, there are one or two spatially separated rdIMRs. MIMRs are, however, noticeably absent in some segments of gag; in these segments, e.g. M1..R91 (MA) and P133..G248 (CA), rdIMRs form a nearly continuous series, end-to-end. The sequence spans in MA and CA that do not contain mIMRs are illustrated in Figure 1. These regions are both highly reactive and mobile (detailed in the legend).

Figure 1. The distribution of mIMRs in the immature Gag protein [NCBI:1L6N, [8]]. MIMRs that are ≥ 50% symmetric are noticeably absent from some segments of the protein. These regions are characterized by a series of rdIMRs, arranged end-to-end (illustrated in black). The spans lacking mIMRs are highly reactive and mobile. The A3..C87 region of matrix undergoes structural transformation at several stages of the virion life cycle, and contains basic residues that target Gag to the plasma membrane [9], a calmodulin-binding motif [10] and a nuclear localization signal [11]. The T204..E245 region of capsid includes the exposed loop on the virion core [8, 12], and the CypA binding site [12].

Figures 2A and 2B illustrate the protein translation of the two largest mIMRs in gag – the largest helix in MA (2A) and CA (2B) and the adjacent turns essential to the tertiary structure. The PDB structure used for this illustration – 1L6N – is of the immature Gag protein; the structure of MA and CA is not substantially different in the mature proteins, except that the long loop between them is cut and refolded [8]. The MA-H5 helix is distinct from the other matrix components, and in the mature protein projects directly into the center of the virion [13]; the MA-H5 helix may also contain a nuclear localization signal [11]. The CA-H7 helix stabilizes interface 1 (planar strips) of the viral core [14].

Figure 2. The longest IMRs coincide with key protein functional motifs. Figures 2A and 2B [NCBI:1L6N [8]] illustrate the two longest mIMRs in the Gag polyprotein – #1-gag in matrix and #2-gag in capsid. These mIMRs translate the MA H5 and CA H7 helices which (in the illustrated structure) are approximately parallel to each other at a pitch of about 45°. Both are essential to the structure and function of each protein. Figure 2C illustrates the largest rdIMRs in matrix and Figure 2D the largest rdIMRs in capsid, that do not coincide with mIMRs.

Figures 2C and 2D illustrate the three largest rdIMRs in MA and CA. The protein translation of $3-gag spans a nuclear localization signal; $6-gag and $10-gag are essential to structural transformation at maturation [15]. The protein translation of $16-gag spans a region that refolds to create a CA-CA interface essential to assemble the core [16]; $18-gag spans the MA-CA cleavage site; $22-gag translates part of the loop on the surface of the virion core and interacts with CypA [12].

Figure 3 illustrates the two largest mIMRs in the nucleocapsid. The largest (Fig. 3A) spans the entire region connecting the two Cys-His boxes. The second largest (Fig. 3B) spans the EF1α binding site and first Cys-His box. The largest rdIMRs in the NC overlap (Fig. 3C), and a Zn ion is bound within the region translated by the overlap. The Cys-His boxes are zinc finger binding domains which enable NC to bind to nucleic acids, and the Zn ion increases the affinity of NC for nucleic acids; NC also has unwinding properties, resembling a DNA topoisomerase [17].

Figure 3. The largest mIMR in the nucleocapsid spans the two Cys-His boxes [NCBI:1F6U [18]]. Figure 3A illustrates the largest mIMR in the nucleocapsid – #6-gag. This mIMR spans both zinc knuckles and the spacer between them. Each of the next largest mIMRs in the NC translates one of the Cys-His boxes. Figure 3B illustrates the first Cys-His box. Figure 3C (same polar orientation as A and B, but rotated) illustrates the two longest rdIMRs in Gag that occur in the nucleocapsid – $1-gag and $4-gag – which overlap; within the overlap region (in purple) two amino acids bind the zinc ion [19].

The coincidence of the ends of IMRs and PSEs was tested for several gene segments – MA-CA-p2-NC, MA, CA and NC segments – using Fisher's exact test (FET) [20]. The Kabsch and Sander [21] secondary structure prediction was used with the 1L6N tertiary structure (PDB) and statistically significant values were found for the MA-CA-p2-NC, CA and NC segments; PROMOTIF secondary structure annotation was used for MA. These results are summarized in Table 6.
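How such a test might be set up can be sketched with the standard library alone. The 2×2 table layout, the helper names `coincidence_table` and `fisher_exact_greater`, the one-sided alternative, and all positions in the example are assumptions for illustration; the source does not state how the contingency table was constructed.

```python
from math import comb


def coincidence_table(imr_ends, pse_ends, seq_len):
    """2x2 counts over sequence positions (hypothetical layout):
    a = IMR end and PSE end, b = IMR end only,
    c = PSE end only,        d = neither."""
    imr, pse = set(imr_ends), set(pse_ends)
    a = len(imr & pse)
    b = len(imr - pse)
    c = len(pse - imr)
    d = seq_len - a - b - c
    return a, b, c, d


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: probability of observing at least
    `a` coincidences under the hypergeometric null distribution."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)


# Made-up positions, for illustration only.
table = coincidence_table({3, 10, 20}, {10, 20, 30}, seq_len=100)
print(table, fisher_exact_greater(*table))
```

A small p-value here would indicate that IMR ends and PSE ends share positions more often than expected if both were placed independently along the sequence.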

Table 6. Both mIMRs and rdIMRs coincide with PSEs in each mature protein and the polyprotein

The mIMRs included in the test are all ≥58 nt and often span more than a single protein structural element. The rdIMRs included in the test are all ≥15 nt. Both mIMRs and rdIMRs begin and end at various positions within codons; therefore, the composition of the two nucleotides at each end (which delimit the rdIMRs) is unlikely to be strongly influenced by preferences related to secondary structure composition or codon preference. More than 50% of the mIMRs are terminated by reverse dinucleotides.

For almost all measurements, the coincidence of the ends of IMRs and PSEs was statistically significant over a range of 3 nt, similar to the span found in TnsA. The position at which the coincidence is maximal is listed in Table 6. The coincidence of IMR and PSE at position 0 indicates that the span of a PSE exactly coincides with the span of an IMR. When the position is negative, the IMR begins slightly upstream of the start of the PSE; when the position is positive, the IMR begins slightly downstream. The difference is expressed as a nucleotide position, however, so in the protein the equivalent distance is 1–2 amino acids, which is similar to the variability among different structure prediction methods.
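The position of maximum coincidence can be illustrated by counting coincidences at a range of offsets. The sketch below assumes the sign convention given above (offset = IMR position minus PSE position, so a negative offset means the IMR end lies upstream); the function name, the offset range and the example positions are hypothetical.

```python
def coincidences_by_offset(imr_ends, pse_ends, offsets=range(-9, 10)):
    """Count IMR ends that fall exactly `offset` nt from some PSE end.

    Assumption: offset = IMR position - PSE position, so negative offsets
    correspond to IMR ends upstream of the PSE end.
    """
    pse = set(pse_ends)
    return {off: sum((end - off) in pse for end in imr_ends)
            for off in offsets}


# Made-up positions: two of three IMR ends lie 2 nt upstream of a PSE end.
counts = coincidences_by_offset([8, 23, 40], [10, 25, 40])
best = max(counts, key=counts.get)
print(best, counts[best])
```

Each offset would then be tested for significance separately, and the offset with the strongest coincidence reported.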

Differences in the position of maximum coincidence between the segments occur for several reasons. The measurement includes coincidences over the entire range of the sequence, and the position of maximum coincidence would be expected to be somewhat different for each protein due to differences in secondary and tertiary structure. The values, however, are consistent; the largest segment – MA-CA-p2-NC – has a maximum coincidence at position 5 (for rdIMR ≥16 nt), which is central to positions 3, -2 and 7, which are maximal for MA, CA and NC, respectively.

The coincidence of IMRs with PSEs may be enhanced by the greater than expected numbers of them in the Gag polyprotein. The following formula predicts the expected number of occurrences.

P(t) predicted number of occurrences of mIMRs in the sequence

P(o) probability of the occurrence of a mirror repeat in a random sequence consisting of 4 nucleotides present in approximately equal amounts

P(e) probability of the ends of a segment matching, for mIMRs, P(e) = 1/4

P(m) probability of number of matches required for symmetry

l number of potential matches (1/2 total sequence length, odd values disregarded)

m number of matches required for symmetry

P(o) = P(e) * P(m)

$$P(m) = \frac{l!}{m!\,(l-m)!} \left(\frac{1}{4}\right)^{m} \left(\frac{3}{4}\right)^{l-m}$$

In gag, 18 L1 mIMRs were identified that were ≥ 63 nt. Therefore, as a generalization, this length will be evaluated. Since we are only concerned that one side of the segment matches the other, l = 30 and m = 14.

$$P(m) = \frac{30!}{14!\,16!} \left(\frac{1}{4}\right)^{14} \left(\frac{3}{4}\right)^{16}$$

P(m) = 0.005430

Adding the criteria that the ends must match,

P(o) = 0.001357

The length of gag is 1500 nt, from which is subtracted the required length for the match (62), resulting in 1438 potential sites ≥ 63 nt.

P(t) = P(o) * 1438 = 1.95

This value indicates that about two mIMRs ≥ 63 nt are expected to occur by chance. Since each possible site of an mIMR is included to obtain this estimate, it should be compared with the total number of mIMRs ≥ 63 nt that were identified (= 49), not just L1 mIMRs (= 18). Therefore, the observed frequency (49) is about 25-fold greater than the expected frequency (2).

A similar estimate can be made for rdIMRs; the only change is P(e) = (1/4)*(1/4), which reflects the reverse dinucleotide delimiter criterion. The estimate is for rdIMRs ≥20 nt, the length summarized in Table 5.

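$$P(m) = \frac{l!}{m!\,(l-m)!} \left(\frac{1}{4}\right)^{m} \left(\frac{3}{4}\right)^{l-m}$$

$$P(m) = \frac{8!}{3!\,5!} \left(\frac{1}{4}\right)^{3} \left(\frac{3}{4}\right)^{5} = 0.2076$$

$$P(o) = P(e) \cdot P(m) = \frac{1}{16} \cdot 0.2076 = 0.01298$$

$$P(t) = P(o) \cdot (1500 - 19) = 19.2$$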

The observed frequency for rdIMRs ≥20 nt is 53, approximately 2.8 times the predicted number.
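The two estimates can be reproduced with a short calculation. The sketch below follows the formula as defined above (exactly m matches among l potential matches, each with probability 1/4, multiplied by the end-match probability and by the number of candidate sites); the function name `expected_imrs` and its parameter names are hypothetical.

```python
from math import comb


def expected_imrs(l, m, p_end, seq_len, imr_len):
    """Return (P(m), P(o), P(t)) for the formula in the text.

    l       number of potential matches
    m       number of matches required for symmetry
    p_end   P(e), probability that the delimiting ends match
    seq_len length of the sequence in nt
    imr_len length of the IMR being evaluated in nt
    """
    p_m = comb(l, m) * 0.25 ** m * 0.75 ** (l - m)
    p_o = p_end * p_m
    sites = seq_len - (imr_len - 1)
    return p_m, p_o, p_o * sites


# mIMRs of 63 nt: l = 30, m = 14, P(e) = 1/4, 1438 candidate sites.
print(expected_imrs(30, 14, 1 / 4, 1500, 63))
# rdIMRs of 20 nt: l = 8, m = 3, P(e) = 1/16, 1481 candidate sites.
print(expected_imrs(8, 3, 1 / 16, 1500, 20))
```

Rounded, the first call gives P(m) ≈ 0.00543 and P(t) ≈ 1.95, and the second gives P(m) ≈ 0.2076 and P(t) ≈ 19.2, in agreement with the values in the text. Note that the formula assumes equal nucleotide frequencies, which is only an approximation for HIV-1.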

Both mIMRs and rdIMRs occur at greater than expected numbers, although the greater than expected number of mIMRs is much greater than for rdIMRs. These values demonstrate that it is unlikely that the multiple occurrences of mIMRs ≥63 nt occur by chance. It is also unlikely that chance occurrences will be at positions that are highly significant to the function of the protein.

The effect of modifying symmetry criteria on IMR identity was examined for both lower and higher levels of symmetry. No evidence of a relationship between mIMRs and protein cleavage sites for the entire Gag polyprotein was found at levels of symmetry ≥50%. Table 7 summarizes L1 mIMRs that are ≥33% symmetrical. Using the formula described previously, fewer than one (0.1128) mIMR that is 704 nt in length and ≥33% symmetric is expected within the gag sequence of 1500 nt; in contrast, five are observed and there are an additional 237 that are longer than 705 nt, indicating that mirror symmetry pervades the gene. About half of the L1 mIMRs translate protein segments that would end at or near cleavage sites, and one mIMR coincides with the start of CA and the end of p6. MIMRs that are not associated with cleavage sites begin and end at functionally related domains.

Table 7. MIMRs ≥ 33% begin and end at cleavage sites (bold) and sites that have related functions in the translated protein

The region M1..K32 encompasses the start of four mIMRs (≥33% symmetrical) and is the region that targets Gag to the cell membrane [22]. Two of these mIMRs terminate within capsid D235..E260 which is a region of small helices and loops adjacent to the CypA binding site that is probably essential to disassembling the core upon infection [14]; these mIMRs, then, begin at sequences that localize Gag to the cell membrane – a process essential to core formation – and end at sequences that dissolve the virion core (upon infection). Similarly, E12..N271 begins within the membrane localization domain, and ends at CA-H7, the largest component of the structural core, which stabilizes its constituent planar strips [14]. The fourth mIMR, R15..Q379, begins within the membrane localization region and terminates one amino acid downstream from the p2-NC cleavage site; cleavage at p2-NC is the initial step in the Gag cleavage sequence [3]. MIMR E52..K410 begins at positions essential to particle formation, trimerization and virus assembly, and terminates immediately upstream of the second Cys-His box (zinc finger) which is essential to packaging. Several mIMRs begin within the region L101..D121, which includes most of the MA-H5; this helix projects away from the plasma membrane, directly into the center of the virion [23] and deleterious deletions within it have been found to block viral entry [13]. MIMRs that begin at the MA-H5 helix terminate at the NC-p1 cleavage site and the end of Gag-Pol TF and p6. The association of weakly symmetrical mIMRs with cleavage sites in the polyprotein and functionally related protein motifs suggests that different levels of IMR symmetry may be related to different functional aspects of the translated protein.

At higher criteria for symmetry (≥66%), the sequence positions and distributions of mIMRs and rdIMRs are nearly identical. These results are summarized in Table 8.

Table 8. mIMRs and rdIMRs that are ≥66% symmetric

Discussion

In this study, IMRs were found to occur in gag in greater than expected numbers, and in a hierarchal order in which multiple shorter IMRs occur within the span of a longer IMR. The longest IMRs coincide with protein functional motifs that are highly significant to the gene. Some mIMRs and rdIMRs overlap, and others are uniquely positioned in the gene.

Because there are so many IMRs, the question arises whether the coincidence of IMRs and functional motifs occurs by chance. This possibility is further complicated by the uncertainty of the boundaries of functional motifs, which becomes apparent in the detailed annotation in the Additional File 1.

Functional motifs have been determined primarily through the study of engineered mutants. However, a slightly different experimental design seems to have frequently led to the identification of a slightly different functional motif. Additionally, there is the possibility that a motif may not be complete. Therefore it is unlikely that a probability for the coincidence of IMRs with functional motifs can be computed. However, when IMRs are identified solely on the basis of length, the longest of them coincide with key functional motifs in the protein. The relationship between length and significance first becomes apparent in the polyprotein, but persists independently in each of the mature proteins.

It is less problematic to identify the position of protein structural elements, although, again, differences in experimental design may result in slightly different boundaries for helices, turns and β-sheets (see Additional File 1). In this study, the ends of rdIMRs were found to coincide with the ends of protein structural elements over a range of about three nucleotides, a result consistent with a previous study of monomeric proteins. In HIV-1 Gag, this property is also found in mIMRs, and reverse dinucleotide pairs terminate 55% of the longest mIMRs in Gag. This feature may be related to the structural nature of Gag proteins, a premise that would also be consistent with the absence of mIMRs in highly mobile segments of MA and CA.

IMRs at low levels of symmetry begin and/or end at cleavage positions in the protein. IMRs having higher levels of symmetry coincide with PSEs and significant functional motifs in the protein. The highest levels of symmetry delineate essential functional sites in the protein. Analysis of the distribution of IMRs in the Gag polyprotein indicates that the gene sequence exhibits a high degree of regularity, is stabilized by multiple levels of mirror symmetry, and consists of sequence segments that are specifically associated with functional attributes of the protein segments that they translate.

Conclusion

Key structural and functional features of each protein are almost always translations of IMRs. The distribution, by length, of the segments that translate the most significant motifs in each protein over the span of the polypeptide indicates that the polypeptide is the functional unit of organization for DNA motifs. The five longest mIMRs in gag that are ≥ 50% symmetric each translate the most significant protein motif in a different cleavage product.

Various thresholds for DNA symmetry differentiate functional and structural properties of the polyprotein that is translated. MIMRs that are ≥33% symmetric start or stop at cleavage positions, and at positions that are functionally related in the mature proteins. IMRs that are ≥50% symmetric coincide with most of the functional motifs in the mature proteins. At ≥ 66% symmetry, the distributions of mIMRs and rdIMRs overlap and most of these motifs are related to structural features.

The frequency and distribution of IMRs in HIV-1 Gag indicate that DNA symmetry is a fundamental property of protein coding DNA and that different levels of symmetry are associated with different functional aspects of gene and protein. The interaction between DNA and protein structure and function is precise and interwoven over the entire length of the protein. The distribution of mIMRs and rdIMRs and their relationship to structural and functional motifs in the protein that they translate suggest that DNA-driven processes, including selection for mirror repeats, may be a constraining factor in molecular evolution.

Methods

Sequence analysis

The HIV-1 gag sequence used for this analysis is HXB2_LAI_IIIB_BRU [GenBank:K03455], the most commonly used reference sequence for the HIV-1 genome [2]. All numbering in this paper refers to positions from the start of gag, unless stated otherwise.

Determination of mIMRs and rdIMRs

The mIMRs and rdIMRs were determined for the differential cleavage products of HXB2 Gag: the polyprotein, the segments at the first cleavage – MA-CA-p2 and NC-p1-p6 – and MA, CA, NC, p6 and spacer proteins p2 and p1. MIMRs were evaluated at symmetry criteria of ≥ 33%, 45%, 50%, 55% and 66%; rdIMRs were evaluated at ≥50% and ≥66%.

Evaluation of the coincidence of IMRs with PSEs

The coincidence of rdIMRs with PSEs was evaluated for the entire polyprotein and separately for each of its cleavage products. Because a high number of sub-strings might contribute to a false positive for the correlation of the ends of PSEs and IMRs, the number of IMRs was reduced by sequentially eliminating shorter lengths of IMRs, and testing whether the Fisher's exact test (FET) remained significant. The length of IMRs that have a positive FET correlation when all shorter IMRs are removed is identified as the "essential value"; this value was determined for each cleavage product.

The p6 region was not included in the rdIMR-PSE analysis because its tertiary structure has not been determined.

Detailed annotation of Gag combined IMRs and functional and structural data

The sequence motifs of experimentally determined functional and structural data, and the sequence positions of the translations of mIMRs and rdIMRs were summarized and compared. Observed and expected frequencies of mIMRs and rdIMRs were determined. The largest IMRs were mapped to 3D structures from the NCBI Structure Database [19].

Abbreviations

IMR: imperfect mirror repeat

mIMR: maximal imperfect mirror repeat

rdIMR: reverse dinucleotide imperfect mirror repeat

MA: matrix

CA: capsid

SP1: p2

NC: nucleocapsid

PSE: protein structural element

L1: refers to largest IMR for a particular sequence span

FET: Fisher's exact test

PDB: Protein Data Bank

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

DML performed all computer-based analysis. DML wrote the manuscript and approved its final copy.

Additional file 2. References for Additional file 1, not listed in main manuscript. References cited solely in Additional file 1 are listed in this document.


Acknowledgements

Dr. John Palfreyman read the manuscript and made many helpful suggestions. Doug MacLean provided technical support. The support of the University of Abertay-Dundee made the work possible. All are deeply appreciated.

References

  1. Lang DM: Imperfect DNA mirror repeats in E. coli TnsA and other protein-coding DNA.

    Biosystems 2005, 81(3):183-207. PubMed Abstract | Publisher Full Text OpenURL

  2. Korber BT, Foley BT, Kuiken CL, Pillai SK, Sodroski G: Numbering Positions in HIV Relative to HXB2CG. In Human Retroviruses and AIDS 1998: A Compilation and Analysis of Nucleic Acid and Amino Acid Sequences. Edited by Korber B, Kuiken CL, Foley B, Hahn B, McCutchan F, Mellors JW, Sodroski J. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM; OpenURL

  3. Wiegers K, Rutter G, Kottler H, Tessmer U, Hohenberg H, Krausslich HG: Sequential steps in human immunodeficiency virus particle maturation revealed by alterations of individual Gag polyprotein cleavage sites.

    J Virol 1998, 72(4):2846-54. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Pettit SC, Moody MD, Wehbie RS, Kaplan AH, Nantermet PV, Klein CA, Swanstrom R: Free in PMC The p2 domain of human immunodeficiency virus type 1 Gag regulates sequential proteolytic processing and is required to produce fully infectious virions.

    J Virol 1994, 68(12):8017-27. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Swanstrom RA, Wills JW: Synthesis, assembly and processing of viral proteins. In Retroviruses. Edited by Coffin JM, Hughes SH, Varmus HE. Cold Spring Harbor Laboratory Press; 1997:263-334. OpenURL

  6. Freed EO: HIV-1 gag proteins: diverse functions in the virus life cycle.

    Virology 1998, 251(1):1-15. PubMed Abstract | Publisher Full Text OpenURL

  7. Shehu-Xhilaga M, Kraeusslich HG, Pettit S, Swanstrom R, Lee JY, Marshall JA, Crowe SM, Mak J: Proteolytic processing of the p2/nucleocapsid cleavage site is critical for human immunodeficiency virus type 1 RNA dimer maturation.

    J Virol 2001, 75(19):9156-64.

  8. Tang C, Ndassa Y, Summers MF: Structure of the N-terminal 283-residue fragment of the immature HIV-1 Gag polyprotein.

    Nat Struct Biol 2002, 9(7):537-43.

  9. Yuan X, Yu X, Lee TH, Essex M: Mutations in the N-terminal region of human immunodeficiency virus type 1 matrix protein block intracellular transport of the Gag precursor.

    J Virol 1993, 67(11):6387-94.

  10. Radding W, Williams JP, McKenna MA, Tummala R, Hunter E, Tytler EM, McDonald JM: Calmodulin and HIV type 1: interactions with Gag and Gag products.

    AIDS Res Hum Retroviruses 2000, 16(15):1519-25.

  11. Bukrinsky MI, Haggerty S, Dempsey MP, Sharova N, Adzhubel A, Spitz L, Lewis P, Goldfarb D, Emerman M, Stevenson M: A nuclear localization signal within HIV-1 matrix protein that governs infection of non-dividing cells.

    Nature 1993, 365(6447):666-9.

  12. Luban J: Absconding with the chaperone: essential cyclophilin-Gag interaction in HIV-1 virions.

    Cell 1996, 87(7):1157-9.

  13. Hill CP, Worthylake D, Bancroft DP, Christensen AM, Sundquist WI: Crystal structures of the trimeric human immunodeficiency virus type 1 matrix protein: implications for membrane association and assembly.

    Proc Natl Acad Sci USA 1996, 93(7):3099-104.

  14. Gamble TR, Vajdos FF, Yoo S, Worthylake DK, Houseweart M, Sundquist WI, Hill CP: Crystal structure of human cyclophilin A bound to the amino-terminal domain of HIV-1 capsid.

    Cell 1996, 87(7):1285-94.

  15. Massiah MA, Worthylake D, Christensen AM, Sundquist WI, Hill CP, Summers MF: Comparison of the NMR and X-ray structures of the HIV-1 matrix protein: evidence for conformational changes during viral assembly.

    Protein Sci 1996, 5(12):2391-8.

  16. von Schwedler UK, Stemmler TL, Klishko VY, Li S, Albertine KH, Davis DR, Sundquist WI: Proteolytic refolding of the HIV-1 capsid protein amino-terminus facilitates viral core assembly.

    EMBO J 1998, 17(6):1555-68.

  17. Priel E, Aflalo E, Seri I, Henderson LE, Arthur LO, Aboud M, Segal S, Blair DG: DNA binding properties of the zinc-bound and zinc-free HIV nucleocapsid protein: supercoiled DNA unwinding and DNA-protein cleavable complex formation.

    FEBS Lett 1995, 362(1):59-64.

  18. Amarasinghe GK, De Guzman RN, Turner RB, Chancellor KJ, Wu ZR, Summers MF: NMR structure of the HIV-1 nucleocapsid protein bound to stem-loop SL2 of the psi-RNA packaging signal. Implications for genome recognition.

    J Mol Biol 2000, 301(2):491-511.

  19. Omichinski JG, Clore GM, Sakaguchi K, Appella E, Gronenborn AM: Structural characterization of a 39-residue synthetic peptide containing the two zinc binding domains from the HIV-1 p7 nucleocapsid protein by CD and NMR spectroscopy.

    FEBS Lett 1991, 292(1–2):25-30.

  20. Langsrud O: Fisher's exact test. [http://www.matforsk.no/ola/fisher.htm]

  21. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

    Biopolymers 1983, 22(12):2577-637.

  22. Zhou W, Parent LJ, Wills JW, Resh MD: Identification of a membrane-binding domain within the amino-terminal region of human immunodeficiency virus type 1 Gag protein which interacts with acidic phospholipids.

    J Virol 1994, 68(4):2556-69.

  23. NCBI Structure Database [http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?Dopt=s&uid=19925]