Sequence Formats - a description of file formats used for HLA Alleles
Nomenclature of HLA Alleles
Sequence Alignments
The alignment files use the following nomenclature and numbering conventions. These conventions are based on the recommendations published for human gene mutations, which were prepared by a nomenclature working group looking at how to name and store sequences for human allelic variants. The recommendations can be found in Human Mutation 11:1-3, 1998.
- Only alleles officially recognised by the WHO HLA Nomenclature Committee for Factors of the HLA System are included in the sequence alignments.
- As recommended for all human gene mutations, a standard reference sequence should be used for all alignments. A complete list of reference sequences for each allele can be seen below.
- The reference sequence will always be associated with the same (original) accession number, unless this sequence is shown to be in error.
- All alleles are aligned to the reference sequences.
- Naming of the sequence is based upon the previously published naming conventions.
In the sequence alignments the following conventions are used.
- The entry for each allele is displayed with respect to the reference sequence.
- Where identity to the reference sequence is present the base will be displayed as a hyphen (-).
- Non-identity to the reference sequence is shown by displaying the appropriate base at that position.
- Where an insertion or deletion has occurred this will be represented by a period (.).
- If the sequence is unknown at any point in the alignment, this will be represented by an asterisk (*).
- In protein alignments for null alleles, the 'Stop' codons will be represented by a capital X.
- In protein alignments, sequence following the termination codon will not be marked and will appear blank.
- These conventions are used for both nucleotide and protein alignments.
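The display conventions above can be reversed mechanically. The following is a minimal sketch in Python, assuming the reference and the allele entry have already been read as two strings of equal aligned length; the function names expand_entry and strip_gaps are illustrative and are not part of any IMGT/HLA tool.

```python
def expand_entry(reference: str, entry: str) -> str:
    """Replace identity hyphens in an aligned entry with the reference characters."""
    if len(reference) != len(entry):
        raise ValueError("reference and entry must have the same aligned length")
    # '-' means "same as the reference"; bases, '.', '*' and 'X' are kept as written.
    return "".join(ref if cur == "-" else cur for ref, cur in zip(reference, entry))


def strip_gaps(aligned: str) -> str:
    """Drop the '.' padding to obtain the unaligned sequence ('*' is kept)."""
    return aligned.replace(".", "")


# Made-up toy alignment, not real HLA data.
reference = "ATGC.GT"
entry = "--A.T-*"
full = expand_entry(reference, entry)
print(full)              # ATA.TG*
print(strip_gaps(full))  # ATATG*
```

In this toy case the entry differs from the reference by one substitution, one deleted base, one inserted base, and one unknown position at the end.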
In order to provide standardised sequences for all loci, the following numbering system has been established to represent the sequence accurately at both the nucleotide and protein level. We have looked at the recommendations of the HUGO Gene Nomenclature Committee (1) for the numbering of genomic sequences, and use a similar model for the HLA sequences held in the IMGT/HLA Database. Many of their proposals already match our current strategy. HUGO recommends that all nomenclature systems use a standard reference sequence for each locus. In the case of HLA sequences, a standard reference sequence is already established for each gene. The remaining recommendations for nucleotide sequences are as follows:
- The numbering of the nucleotides in the reference sequence should remain constant.
- For both gDNA and cDNA the A of the ATG initiator Methionine codon has been denoted nucleotide +1. In some non-expressed genes this codon is not present and in these cases the first base of the reference sequence has been denoted as nucleotide +1.
- The nucleotide immediately preceding the A of the ATG initiator Methionine codon has been denoted nucleotide -1. Note that there is no nucleotide 0.
- cDNA sequences are numbered consecutively from the A of the ATG initiator Methionine codon.
- Nucleotide sequences may be displayed in codons; in this case the numbering follows that for protein sequences.
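Because there is no nucleotide 0, converting between a string index and this numbering needs a small adjustment around the start codon. The sketch below assumes the reference sequence is held as a Python string and that atg_index is the 0-based index of the A of the initiator ATG (or 0 for a non-expressed gene without that codon); both function names are illustrative.

```python
def index_to_position(index: int, atg_index: int) -> int:
    """Map a 0-based string index to the +1/-1 numbering (there is no 0)."""
    offset = index - atg_index
    return offset + 1 if offset >= 0 else offset


def position_to_index(position: int, atg_index: int) -> int:
    """Map a numbered position back to a 0-based string index."""
    if position == 0:
        raise ValueError("there is no nucleotide 0")
    return atg_index + position - 1 if position > 0 else atg_index + position


# Made-up toy sequence with three bases upstream of the ATG.
sequence = "CCGATGAAA"
atg_index = sequence.find("ATG")
print(index_to_position(atg_index, atg_index))      # 1
print(index_to_position(atg_index - 1, atg_index))  # -1
print(sequence[position_to_index(-3, atg_index)])   # C
```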
The following recommendations are used for describing mutations in nucleotide sequences:
- Nucleotide substitutions are designated using the nucleotide number, followed by the substitution. For example, 997G>T denotes a substitution of G to T at position 997 of the DNA sequence.
- Deletions are designated by 'del' after the nucleotide number. For example, 997delT denotes the deletion of a T at position 997 of the DNA. A deletion of several consecutive bases should be described in the form 997-998delTG, which denotes a deletion of TG at positions 997 and 998 of the DNA.
- Insertions are designated by 'ins' after the nucleotide numbers bordering the insertion. For example, 997-998insT represents an insertion of T between bases 997 and 998 of the DNA. In the alignments produced this will be represented by a period (.), but the numbering of the reference sequence will not be altered to include this base. Insertions of multiple bases are designated using the same form: 997-998insTG denotes an insertion of TG between positions 997 and 998 of the DNA.
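The three notations above are regular enough to be parsed with simple patterns. The following sketch is one possible reading of them, limited to positive positions and the bases A, C, G and T; the NucleotideChange type and the parse_nucleotide_change function are illustrative names, not an existing API.

```python
import re
from typing import NamedTuple


class NucleotideChange(NamedTuple):
    kind: str   # "substitution", "deletion" or "insertion"
    start: int  # first nucleotide number involved
    end: int    # last nucleotide number involved
    ref: str    # bases removed or replaced ("" for an insertion)
    alt: str    # bases added ("" for a deletion)


_SUBSTITUTION = re.compile(r"^(\d+)([ACGT])>([ACGT])$")
_DELETION = re.compile(r"^(\d+)(?:-(\d+))?del([ACGT]+)$")
_INSERTION = re.compile(r"^(\d+)-(\d+)ins([ACGT]+)$")


def parse_nucleotide_change(text: str) -> NucleotideChange:
    """Parse strings such as 997G>T, 997delT, 997-998delTG or 997-998insT."""
    m = _SUBSTITUTION.match(text)
    if m:
        pos = int(m.group(1))
        return NucleotideChange("substitution", pos, pos, m.group(2), m.group(3))
    m = _DELETION.match(text)
    if m:
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        if end - start + 1 != len(m.group(3)):
            raise ValueError(f"deleted bases do not match the range in {text!r}")
        return NucleotideChange("deletion", start, end, m.group(3), "")
    m = _INSERTION.match(text)
    if m:
        start, end = int(m.group(1)), int(m.group(2))
        if end != start + 1:
            raise ValueError(f"insertion must lie between adjacent bases in {text!r}")
        return NucleotideChange("insertion", start, end, "", m.group(3))
    raise ValueError(f"unrecognised nucleotide change: {text!r}")


for example in ("997G>T", "997delT", "997-998delTG", "997-998insT"):
    print(parse_nucleotide_change(example))
```

The range checks reject strings whose base count does not agree with the stated positions, such as 997-999delTG.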
The recommendations for protein sequence numbering are as follows:
- For amino acid-based systems, the start codon of the mature protein is labeled codon 1.
- The codon 5' to this is numbered -1.
- All numbering is based on the reference sequence.
- The single letter amino acid code is used in all protein alignments.
- To avoid confusion with the nucleotide numbering, the prefix 'p.' may be added to the nomenclature to denote a protein sequence.
Mutations in protein sequences follow a similar format:
- For amino acid nomenclature the reference amino acid is listed first, followed by the codon and then the mutation. For example, Y97S represents the replacement of the Tyrosine at codon 97 by a Serine.
- Stop codons are always designated by X. For example, T97X represents a Threonine substituted by a stop codon.
- Deletions are again designated using 'del'. For example, T97del is the deletion of a Threonine at codon 97.
- Insertions again follow the 'ins' convention. For example, T97-98ins represents a Threonine inserted between codons 97 and 98.
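The protein notation can be handled in the same way. This sketch assumes single-letter amino acid codes, positive codon numbers only, and an optional 'p.' prefix; parse_protein_change and the dictionary keys it returns are illustrative choices. Note that in the insertion form the leading letter is the inserted residue, following the T97-98ins example above.

```python
import re

_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_CHANGE = re.compile(
    rf"^(?:p\.)?(?P<aa>[{_AMINO_ACIDS}])(?P<codon>\d+)"
    rf"(?:(?P<sub>[{_AMINO_ACIDS}X])|(?P<deletion>del)|-(?P<end>\d+)ins)$"
)


def parse_protein_change(text: str) -> dict:
    """Parse strings such as Y97S, T97X, T97del or T97-98ins."""
    m = _CHANGE.match(text)
    if m is None:
        raise ValueError(f"unrecognised protein change: {text!r}")
    codon = int(m["codon"])
    if m["sub"] == "X":
        return {"kind": "stop", "codon": codon, "ref": m["aa"]}
    if m["sub"]:
        return {"kind": "substitution", "codon": codon, "ref": m["aa"], "alt": m["sub"]}
    if m["deletion"]:
        return {"kind": "deletion", "codon": codon, "ref": m["aa"]}
    end = int(m["end"])
    if end != codon + 1:
        raise ValueError(f"insertion must lie between adjacent codons in {text!r}")
    return {"kind": "insertion", "after": codon, "before": end, "inserted": m["aa"]}


for example in ("Y97S", "p.T97X", "T97del", "T97-98ins"):
    print(parse_protein_change(example))
```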
FASTA
Sequences in FASTA/Pearson format are represented by two main line types. The first line always begins with a "greater than" (>) sign and contains sequence information. In the files provided, the sequence information contains the name of the HLA allele. The remaining lines contain plain text representing the coding nucleotide sequence. There can be any number of these sequence lines, of any length, to represent the nucleotide sequence.
Example DRB1*01:01:01 in FASTA format:
>DRB1*01:01:01
GGGGACACCCGACCACGTTTCTTGTGGCAGCTTAAGTTTGAATGTCATTT
CTTCAATGGGACGGAGCGGGTGCGGTTGCTGGAAAGATGCATCTATAACC
AAGAGGAGTCCGTGCGCTTCGACAGCGACGTGGGGGAGTACCGGGCGGTG
ACGGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACAGCCAGAAGGACCT
CCTGGAGCAGAGGCGGGCCGCGGTGGACACCTACTGCAGACACAACTACG
GGGTTGGTGAGAGCTTCACAGTGCAGCGGCGAGTTGAGCCTAAGGTGACT
GTGTATCCTTCAAAGACCCAGCCCCTGCAGCACCACAACCTCCTGGTCTG
CTCTGTGAGTGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCGGA
ACGGCCAGGAAGAGAAGGCTGGGGTGGTGTCCACAGGCCTGATCCAGAAT
GGAGATTGGACCTTCCAGACCCTGGTGATGCTGGAAACAGTTCCTCGGAG
TGGAGAGGTTTACACCTGCCAAGTGGAGCACCCAAGTGTGACGAGCCCTC
TCACAGTGGAATGGAGAGCACGGTCTGAATCTGCACAGAGCAAGATGCTG
AGTGGAGTCGGGGGCTTCGTGCTGGGCCTGCTCTTCCTTGGGGCCGGGCT
GTTCATCTACTTCAGGAATCAGAAAGGACACTCTGGACTTCAGCCAACAG
GATTCCTGAGCTGA
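A file in this layout can be read with a few lines of code. The sketch below assumes the file content is already available as a string and that the whole header line after the ">" sign should be used as the record name; parse_fasta is an illustrative name, and the sample text is a shortened, made-up excerpt rather than a full allele.

```python
def parse_fasta(text: str) -> dict:
    """Return a mapping of header text to the concatenated sequence."""
    records = {}
    name = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequence lines may be of any length
    if name is not None:
        records[name] = "".join(chunks)
    return records


sample = """>DRB1*01:01:01
GGGGACACCC
GACC
"""
records = parse_fasta(sample)
print(records["DRB1*01:01:01"])       # GGGGACACCCGACC
print(len(records["DRB1*01:01:01"]))  # 14
```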
PIR
The format of sequences in PIR/NBRF format is more complex. The first line of each sequence entry begins with a "greater than" (>) sign. This is immediately followed by a two character sequence type specifier: for the HLA alleles this is "DL", meaning DNA linear. Space four must contain a semi-colon. Beginning in space five is the sequence name or identification code: for HLA alleles this is the official allele name. The second line of each sequence entry contains a brief description, including the sequence length, and an internal checksum for PIR files. The coding nucleic acid sequence begins on the third line. The sequence is free format, but to aid in reading the sequences, the nucleotides have been arranged in blocks of 10 nucleotides. The last character is an asterisk (*), and acts as a termination character.
All PIR files have been generated using "ReadSeq", a freely available sequence format conversion program written by D. Gilbert.
Example DRB1*01:01:01 in PIR format:
>DL;DRB1*01:01:01
DRB1*01:01:01, 714 bases, A686B796 checksum.
GGGGACACCC GACCACGTTT CTTGTGGCAG CTTAAGTTTG AATGTCATTT
CTTCAATGGG ACGGAGCGGG TGCGGTTGCT GGAAAGATGC ATCTATAACC
AAGAGGAGTC CGTGCGCTTC GACAGCGACG TGGGGGAGTA CCGGGCGGTG
ACGGAGCTGG GGCGGCCTGA TGCCGAGTAC TGGAACAGCC AGAAGGACCT
CCTGGAGCAG AGGCGGGCCG CGGTGGACAC CTACTGCAGA CACAACTACG
GGGTTGGTGA GAGCTTCACA GTGCAGCGGC GAGTTGAGCC TAAGGTGACT
GTGTATCCTT CAAAGACCCA GCCCCTGCAG CACCACAACC TCCTGGTCTG
CTCTGTGAGT GGTTTCTATC CAGGCAGCAT TGAAGTCAGG TGGTTCCGGA
ACGGCCAGGA AGAGAAGGCT GGGGTGGTGT CCACAGGCCT GATCCAGAAT
GGAGATTGGA CCTTCCAGAC CCTGGTGATG CTGGAAACAG TTCCTCGGAG
TGGAGAGGTT TACACCTGCC AAGTGGAGCA CCCAAGTGTG ACGAGCCCTC
TCACAGTGGA ATGGAGAGCA CGGTCTGAAT CTGCACAGAG CAAGATGCTG
AGTGGAGTCG GGGGCTTCGT GCTGGGCCTG CTCTTCCTTG GGGCCGGGCT
GTTCATCTAC TTCAGGAATC AGAAAGGACA CTCTGGACTT CAGCCAACAG
GATTCCTGAG CTGA*
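The fixed layout described above (type code in columns two and three, semi-colon in column four, description on the second line, asterisk terminator) can be checked while reading an entry. This is a minimal sketch for a single entry held in a string; parse_pir_entry is an illustrative name, the checksum is not verified, and the sample entry is a shortened, made-up one with a description line that omits the checksum.

```python
def parse_pir_entry(text: str) -> dict:
    """Parse one PIR/NBRF entry into its type code, name, description and sequence."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if len(lines) < 3:
        raise ValueError("a PIR entry needs a header, a description and a sequence")
    header = lines[0]
    if not header.startswith(">") or len(header) < 5 or header[3] != ";":
        raise ValueError("header must look like '>XX;name'")
    # Remove the spaces between blocks of 10 and the line breaks.
    sequence = "".join("".join(lines[2:]).split())
    if not sequence.endswith("*"):
        raise ValueError("sequence must end with the '*' termination character")
    return {
        "type": header[1:3],      # "DL" means DNA linear
        "name": header[4:],
        "description": lines[1],
        "sequence": sequence[:-1],
    }


sample = """>DL;DRB1*01:01:01
Shortened example entry, 14 bases.
GGGGACACCC GACC*
"""
entry = parse_pir_entry(sample)
print(entry["type"], entry["name"])  # DL DRB1*01:01:01
print(entry["sequence"])             # GGGGACACCCGACC
```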