Overview of protein structure

Proteins are macromolecules consisting of one or more polypeptides (Table 2.1). Each polypeptide consists of a chain of amino acids linked together by peptide (amide) bonds. The exact amino acid sequence is determined by the gene coding for that specific polypeptide. When synthesized, a polypeptide chain folds up, assuming a specific three-dimensional shape (i.e. a specific conformation) that is unique to it. The conformation adopted is dependent upon the polypeptide's amino acid sequence, and this conformation is largely stabilized by multiple, weak non-covalent interactions. Any influence (e.g. certain chemicals and heat) that disrupts such weak interactions results in disruption of the polypeptide's native conformation, a process termed denaturation. Denaturation usually results in loss of functional activity, clearly demonstrating the dependence of protein function upon protein structure. A protein's structure currently cannot be predicted solely from its amino acid sequence. Its conformation can, however, be determined by techniques such as X-ray diffraction and nuclear magnetic resonance (NMR) spectroscopy.

Proteins are sometimes classified as 'simple' or 'conjugated'. Simple proteins consist exclusively of polypeptide chain(s), with no additional chemical components present or required for biological activity. Conjugated proteins, in addition to their polypeptide component(s), contain one or more non-polypeptide constituents known as prosthetic group(s). The most common prosthetic groups found in association with proteins include carbohydrates (glycoproteins), phosphate groups (phosphoproteins), vitamin derivatives (e.g. flavoproteins) and metal ions (metalloproteins).

Pharmaceutical biotechnology: concepts and applications Gary Walsh © 2007 John Wiley & Sons, Ltd ISBN 978 0 470 01244 4 (HB) 978 0 470 01245 1 (PB)

Table 2.1 Selected examples of proteins. The number of polypeptide chains and amino acid residues constituting the protein are listed, along with its molecular mass and biological function

Protein | No. polypeptide chains | Total no. amino acids | Molecular mass (Da) | Biological function
--- | --- | --- | --- | ---
Insulin (human) | 2 | 51 | 5 800 | Complex, includes regulation of blood glucose levels
Lysozyme (egg) | 1 | 129 | 13 900 | Enzyme capable of degrading peptidoglycan in bacterial cell walls
IL-2 (human) | 1 | 133 | 15 400 | T-lymphocyte-derived polypeptide that regulates many aspects of immunity
EPO (human) | 1 | 165 | 36 000 | Hormone that stimulates red blood cell production
Chymotrypsin (bovine) | 3 | 241 | 21 600 | Digestive proteolytic enzyme
Subtilisin (Bacillus amyloliquefaciens) | 1 | 274 | 27 500 | Bacterial proteolytic enzyme
Tumour necrosis factor (human TNF-α) | 3 | 471 | 52 000 | Mediator of inflammation and immunity
Haemoglobin (human) | 4 | 574 | 64 500 | Gas transport
Hexokinase (yeast) | 2 | 800 | 102 000 | Enzyme capable of phosphorylating selected monosaccharides
Glutamate dehydrogenase (bovine) | ~40 | ~8 300 | ~1 000 000 | Enzyme interconverts glutamate and α-ketoglutarate and NH4+

Table 2.2 The 20 commonly occurring amino acids. They may be subdivided into five groups on the basis of side-chain structure. Their three- and one-letter abbreviations are also listed (one-letter abbreviations are generally used only when compiling extended sequence data, mainly to minimize writing space and effort). In addition to their individual molecular masses, the percentage occurrence of each amino acid in an 'average' protein is also presented. These data were generated from sequence analysis of over 1000 different proteins

R group classification | Amino acid | Abbreviation (3 letters) | Abbreviation (1 letter) | Molecular mass | Occurrence in 'average' protein (%)
--- | --- | --- | --- | --- | ---
Nonpolar, aliphatic | Glycine | Gly | G | 75 | 7.2
Nonpolar, aliphatic | Alanine | Ala | A | 89 | 8.3
Nonpolar, aliphatic | Valine | Val | V | 117 | 6.6
Nonpolar, aliphatic | Leucine | Leu | L | 131 | 9
Nonpolar, aliphatic | Isoleucine | Ile | I | 131 | 5.2
Nonpolar, aliphatic | Proline | Pro | P | 115 | 5.1
Aromatic | Tyrosine | Tyr | Y | 181 | 3.2
Aromatic | Phenylalanine | Phe | F | 165 | 3.9
Aromatic | Tryptophan | Trp | W | 204 | 1.3
Polar but uncharged | Cysteine | Cys | C | 121 | 1.7
Polar but uncharged | Serine | Ser | S | 105 | 6
Polar but uncharged | Methionine | Met | M | 149 | 2.4
Polar but uncharged | Threonine | Thr | T | 119 | 5.8
Polar but uncharged | Asparagine | Asn | N | 132 | 4.4
Polar but uncharged | Glutamine | Gln | Q | 146 | 4
Positively charged | Arginine | Arg | R | 174 | 5.7
Positively charged | Lysine | Lys | K | 146 | 5.7
Positively charged | Histidine | His | H | 155 | 2.2
Negatively charged | Aspartic acid | Asp | D | 133 | 5.3
Negatively charged | Glutamic acid | Glu | E | 147 | 6.2

2.2.1 Primary structure

Polypeptides are linear, unbranched polymers, potentially containing up to 20 different monomer types (i.e. the 20 commonly occurring amino acids) linked together in a precise predefined sequence. The primary structure of a polypeptide refers to its exact amino acid sequence, along with the exact positioning of any disulfide bonds present (described later). The 20 commonly occurring amino acids are listed in Table 2.2, along with their three-letter and one-letter designations. The structures of these amino acids are presented in Figure 2.1. Nineteen of these amino acids contain a central (α) carbon atom, to which is attached a hydrogen atom (H), an amino group (NH2), a carboxyl group (COOH), and an additional side chain (R) group, which differs from amino acid to amino acid. The amino acid proline is unusual in that its R group forms a direct covalent bond with the nitrogen atom of what is the free amino group in other amino acids (Figure 2.1).

Figure 2.1 The chemical structure of the 20 amino acids commonly found in proteins

As will be evident from Section 2.2.2, peptide bond formation between adjacent amino acid residues entails the establishment of covalent linkages between the amino and carboxyl groups attached to their respective central (α) carbon atoms. Hence, the free functional (i.e. chemically reactive) groups in polypeptides are almost entirely present as part of the constituent amino acids' side chains (R groups). In addition to determining the chemical reactivity of a polypeptide, these R groups also very largely dictate the final conformation adopted by a polypeptide. Stabilizing/repulsive forces between different R groups (as well as between R groups and the surrounding aqueous media) largely dictate what final shape the polypeptide adopts, as will be described later.


The R groups of the non-polar, aliphatic amino acids (Gly, Ala, Val, Leu, Ile and Pro) are devoid of chemically reactive functional groups. These R groups are noteworthy in that, when present in a polypeptide's backbone, they tend to interact with each other non-covalently (via hydrophobic interactions). These interactions have a significant stabilizing influence on protein conformation.

Glycine is noteworthy in that its R group is a hydrogen atom. This means that the α-carbon of glycine is not asymmetric, i.e. is not a chiral centre. (To be a chiral centre the carbon would have to have four different chemical groups attached to it; in this case, two of its four attached groups are identical.) As a consequence, glycine does not occur in multiple stereo-isomeric forms, unlike the remaining amino acids, which occur as either D or L isomers. Only L-amino acids are naturally found in polypeptides.

The side chains of the aromatic amino acids (Phe, Tyr and Trp) are not particularly reactive chemically, but they all absorb ultraviolet (UV) light. Tyr and Trp in particular absorb strongly at 280 nm, allowing detection and quantification of proteins in solution by measuring the absorbance at this wavelength.
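The quantification described above rests on the Beer-Lambert law (A = ε × c × l). The following is a minimal sketch of that calculation, not a validated assay. The per-residue absorption coefficients for Trp and Tyr are commonly quoted approximate values and are an assumption here (they do not come from this text), the function names are hypothetical, and the small contribution of disulfide bonds is ignored.

```python
# Approximate molar absorption coefficients at 280 nm (M^-1 cm^-1).
# ASSUMPTION: commonly quoted estimates, not values taken from this chapter.
EPSILON_280 = {"W": 5500, "Y": 1490}


def molar_absorptivity_280(sequence: str) -> int:
    """Estimate a polypeptide's molar absorption coefficient at 280 nm
    by summing the contributions of its Trp (W) and Tyr (Y) residues."""
    return sum(EPSILON_280.get(residue, 0) for residue in sequence.upper())


def concentration_molar(a280: float, sequence: str, path_cm: float = 1.0) -> float:
    """Apply the Beer-Lambert law: c = A / (epsilon * l)."""
    epsilon = molar_absorptivity_280(sequence)
    if epsilon == 0:
        raise ValueError("sequence contains no Trp or Tyr; A280 is not informative")
    return a280 / (epsilon * path_cm)


if __name__ == "__main__":
    # Hypothetical peptide containing one Trp and two Tyr residues.
    peptide = "WYYA"
    print(molar_absorptivity_280(peptide))
    print(concentration_molar(0.848, peptide))
```

A polypeptide lacking both Trp and Tyr gives essentially no usable signal at 280 nm, which is why the sketch raises an error in that case rather than returning a value.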

Of the six polar but uncharged amino acids, two (cysteine and methionine) are unusual in that they contain a sulfur atom. The side chain of methionine is non-polar and relatively unreactive, although the sulfur atom is susceptible to oxidation. In contrast, the thiol (-C-SH) portion of cysteine's R group is the most reactive functional group of any amino acid side chain. In vivo, this group can form complexes with various metal ions and is readily oxidized, forming 'disulfide linkages' (covalent linkages between two cysteine residues). When formed within a single polypeptide backbone, these linkages help stabilize the three-dimensional structure of that polypeptide. Interchain disulfide linkages can also form, in which cysteines from two different polypeptides participate. This is a very effective way of covalently linking adjacent polypeptides.

Of the four remaining polar but uncharged amino acids, the R groups of serine and threonine contain hydroxyl (OH) groups and the R groups of asparagine and glutamine contain amide (CONH2) groups. None are particularly reactive chemically; however, upon exposure to high temperatures or extremes of pH, the latter two can deamidate, yielding aspartic acid and glutamic acid respectively.

Aspartic and glutamic acids are themselves negatively charged under physiological conditions. This allows them to chelate certain metal ions, and also to markedly influence the conformation adopted by polypeptide chains in which they are found.

Lysine, arginine and histidine are positively charged amino acids. The lysine R group consists of a hydrophobic chain of four -CH2- groups (Figure 2.1), capped with an amino (NH2) group, which is ionized (NH3+) under most physiological conditions. However, within most polypeptides there is normally a fraction of un-ionized lysines, and these (unlike their ionized counterparts) are quite chemically reactive. Such lysine side chains can be chemically converted into various analogues. The arginine side chain is also quite bulky, consisting of three CH2 groups, an amino group (-NH2) and an ionized guanido group (=NH2+). The 'imidazole' side chain of histidine can be described chemically as a tertiary amine (R3-N), and thus it can act as a strong nucleophilic catalyst (the nitrogen atom houses a lone pair of electrons, making it a 'nucleus lover' or nucleophile; it can donate its electron pair to an 'electron lover' or electrophile). As such, the histidine side chain often constitutes an essential part of enzyme active sites.

In addition to the 20 'common' amino acids, some modified amino acids are also found in several proteins. These amino acids are normally altered via a process of post-translational modification (PTM) reactions (i.e. modified after protein synthesis is complete). Almost 200 such modified amino acids have been characterized to date. The more common such modifications are discussed separately in Section 2.5.

Figure 2.2 (a) Peptide bond formation. (b) Polypeptides consist of a linear chain of amino acids successively linked via peptide bonds. (c) The peptide bond displays partial double-bonded character

2.2.2 The peptide bond

Successive amino acids are joined together during protein synthesis via a 'peptide' (i.e. amide) bond (Figure 2.2). This is a condensation reaction, as a water molecule is eliminated during bond formation. Each amino acid in the resultant polypeptide is termed a 'residue', and the polypeptide chain will display a free amino (NH2) group at one end and a free carboxyl (COOH) group at the other end. These are termed the amino and carboxyl termini respectively.
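Because one water molecule is eliminated per peptide bond, the mass of a polypeptide is less than the sum of the masses of its free amino acids. The sketch below illustrates that bookkeeping using the rounded molecular masses listed in Table 2.2. The function name is hypothetical, water is rounded to 18 Da, and disulfide bonds and other modifications are ignored, so the result is only an estimate.

```python
# Molecular masses (Da) of the free amino acids, as listed in Table 2.2.
AMINO_ACID_MASS = {
    "G": 75, "A": 89, "V": 117, "L": 131, "I": 131, "P": 115,
    "Y": 181, "F": 165, "W": 204,
    "C": 121, "S": 105, "M": 149, "T": 119, "N": 132, "Q": 146,
    "R": 174, "K": 146, "H": 155,
    "D": 133, "E": 147,
}

WATER_MASS = 18  # Da (rounded); one water is lost per peptide bond formed


def polypeptide_mass(sequence: str) -> int:
    """Estimate the mass of a linear polypeptide from its one-letter sequence.

    A chain of n residues contains n - 1 peptide bonds, so n - 1 water
    molecules are subtracted from the summed free amino acid masses.
    """
    residues = sequence.upper()
    if not residues:
        return 0
    total = sum(AMINO_ACID_MASS[residue] for residue in residues)
    return total - WATER_MASS * (len(residues) - 1)


if __name__ == "__main__":
    # Dipeptide Gly-Ala: 75 + 89 - 18 = 146 Da
    print(polypeptide_mass("GA"))
```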

The peptide bond has a rigid, planar structure and is in the region of 1.33 Å in length. Its rigid nature is a reflection of the fact that the amide nitrogen lone pair of electrons is delocalized across the bond (i.e. the bond structure is a halfway house between the two forms illustrated in Figure 2.2c). In most instances, peptide groups assume a 'trans' configuration (Figure 2.2b). This minimizes steric interference between the R groups of successive amino acid residues.

Figure 2.3 Fragment of polypeptide chain backbone illustrating rigid peptide bonds and the intervening N-Cα and Cα-C backbone linkages, which are free to rotate

Whereas the peptide bond is rigid, the other two bond types found in the polypeptide backbone (i.e. the N-Cα bond and the Cα-C bond, Figure 2.3) are free to rotate. The polypeptide backbone can thus be viewed as a series of planar 'plates' that can rotate relative to one another. The angle of rotation around the N-Cα bond is termed φ (phi) and that around the Cα-C bond is termed ψ (psi) (Figure 2.3). These angles are also known as rotation angles, dihedral angles or torsion angles. By convention, these angles are defined as being 180° when the polypeptide chain is in its fully extended, trans form. In principle, each bond can rotate to any value between -180° and +180°. However, the degrees of rotation actually observed are restricted due to the occurrence of steric hindrance between atoms of the polypeptide backbone and those of amino acid side chains.
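A torsion angle such as φ or ψ is defined by four consecutive backbone atoms and can be computed from their coordinates. The following is a minimal sketch using the standard atan2 formulation of a dihedral angle. The function name and the example coordinates are hypothetical, and real structures would supply the atomic coordinates (for φ: C, N, Cα, C; for ψ: N, Cα, C, N).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Return the torsion angle (degrees) about the p1-p2 bond.

    The value is 0 when p0 and p3 eclipse one another (cis) and has a
    magnitude of 180 when they lie on opposite sides (trans).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane containing p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane containing p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Hypothetical coordinates: central bond along the x axis.
    print(dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)))  # trans
    print(dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))   # cis
```

Using atan2 rather than an inverse cosine keeps the sign of the angle, so the full range from -180° to +180° is recovered.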

For each amino acid residue in a polypeptide backbone, the actual φ and ψ angles that are physically possible can be calculated, and these angle pairs are often plotted against each other in a diagram termed a Ramachandran plot. Sterically allowable angles fall within relatively narrow bands in most instances. A greater than average degree of φ/ψ rotational freedom is observed around glycine residues, due to the latter's small R group, which minimizes steric hindrance. On the other hand, bond angle freedom around proline residues is quite restricted due to this amino acid's unusual structure (Figure 2.1). The φ and ψ angles allowable around each Cα in a polypeptide backbone obviously exert a major influence upon the final three-dimensional shape assumed by the polypeptide.

2.2.3 Amino acid sequence determination

The amino acid sequence of a polypeptide may be determined directly via chemical sequencing or by physical fragmentation and analysis, usually by mass spectrometry. Direct chemical sequencing was the only method available until the 1970s. Insulin was the first protein to be sequenced by this approach (in 1953), requiring several years and several hundred grams of protein to complete. The method has been refined and automated over the years, such that, today, polypeptides containing 100 amino acids or more can be automatically sequenced within a few hours, using microgram to milligram levels of protein. The actual chemical sequencing procedure employed is termed the Edman degradation method.

Table 2.3 Representative organisms whose genomes have been or will soon be completely/almost completely sequenced. Data taken largely from http://wit.integratedgenomics.com/GOLD/eucaryoticgenomes.html and http://www.tigr.org/tdb/mdb/mdcomplete.html. Updated information is available on these sites

Organism | Classification | Genome size (Mb) (a)
--- | --- | ---
Aeropyrum pernix | Archaea | 1.67
Archaeoglobus fulgidus | Archaea | 2.18
Pyrococcus horikoshii | Archaea | 1.80
Pyrococcus furiosus | Archaea | 2.10
Sulfolobus solfataricus | Archaea | 2.99
Thermoplasma acidophilum | Archaea | 1.56
Aquifex aeolicus | Eubacteria | 1.50
Bacillus subtilis | Eubacteria | 4.20
Bacillus anthracis | Eubacteria | 4.50
Bordetella pertussis | Eubacteria | 3.88
Brucella suis | Eubacteria | 3.30
Chlamydia pneumoniae | Eubacteria | 1.23
Clostridium tetani | Eubacteria | 4.40
Corynebacterium diphtheriae | Eubacteria | 3.10
E. coli | Eubacteria | 5.23
Lactobacillus acidophilus | Eubacteria | 1.90
Listeria monocytogenes | Eubacteria | 2.94
Mycobacterium leprae | Eubacteria | 2.80
Mycobacterium tuberculosis | Eubacteria | 4.40
Neisseria meningitidis | Eubacteria | 2.18
Pseudomonas aeruginosa | Eubacteria | 6.3
Salmonella enterica | Eubacteria | NL
Staphylococcus aureus | Eubacteria | 2.80
Streptococcus pneumoniae | Eubacteria | 2.04
Treponema pallidum | Eubacteria | 1.14
Vibrio cholerae | Eubacteria | 4.0
Aspergillus nidulans | Fungi | 31.0
Candida albicans | Fungi | 15.0
Neurospora crassa | Fungi | 47.0
Schizosaccharomyces pombe | Fungi | 14.0
Babesia bovis | Protozoa | NL
Cryptosporidium parvum | Protozoa | 10.4
Leishmania major | Protozoa | 33.6
Arabidopsis thaliana (thale cress) | Plant | 70.0
Hordeum vulgare (barley) | Plant | 5.0
Gossypium hirsutum (cotton) | Plant | NL
Triticum aestivum (wheat) | Plant | NL
Zea mays (maize) | Plant | NL
Danio rerio (zebra fish) | Fish | NL
Gallus gallus (chicken) | Bird | NL
Bos taurus (cow) | Mammal | NL
Canis familiaris (dog) | Mammal | NL
Rattus norvegicus (rat) | Mammal | NL
Ovis aries (sheep) | Mammal | NL
Sus scrofa (pig) | Mammal | NL
Ape | Primate | NL
Homo sapiens | Primate | NL

(a) NL: not listed in source publication.

Table 2.4 The major primary sequence (protein and nucleic acid) databases and the web addresses from which they may be accessed

Type | Database | Web address
--- | --- | ---
Protein | PIR | http://www-nbrf.georgetown.edu/
Protein | Swiss-Prot | http://www.ebi.ac.uk/swissprot/
Protein | MIPS | http://www.mips.biochem.mpg.de/
Protein | NRL-3D | http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html
Protein | TrEMBL | http://www.ebi.ac.uk/index.html
Protein | Owl | http://www.bis.med.jhmi.edu/Dan/proteins/owl.html
Nucleic acid | EMBL | http://www.ebi.ac.uk/embl/index.html/
Nucleic acid | GenBank | http://www.ncbi.nlm.nih.gov
Nucleic acid | DDBJ | http://www.ddbj.nig.ac.jp/

An alternative approach to determining a polypeptide's amino acid sequence is to sequence the gene that encodes it (Chapter 3). The amino acid sequence can then be inferred from the nucleotide sequence obtained. This approach has gained favour in recent years. Refinements to DNA sequencing methodologies and equipment have made such sequence analysis both rapid and relatively inexpensive. The ongoing genome projects continue to generate enormous amounts of sequence data. By the early 2000s, substantial/complete sequence data for some 300 organisms were available (Table 2.3). As a result, the putative amino acid sequences of an enormous number of proteins (most of unknown function/structure) had been determined.
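Inferring an amino acid sequence from a nucleotide sequence amounts to reading the coding strand three bases at a time and looking each codon up in the genetic code. The sketch below assumes the standard genetic code and a sequence that is already in the correct reading frame. The function name and the example DNA fragment are illustrative only, and no attempt is made to locate start codons, remove introns or handle ambiguous bases.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_sequence: str) -> str:
    """Translate an in-frame coding sequence (DNA or mRNA) into a
    one-letter amino acid sequence, stopping at the first stop codon."""
    dna = coding_sequence.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Illustrative six-codon fragment.
    print(translate("ATGGCCCTGTGGATGCGC"))
```

As noted elsewhere in this section, a sequence inferred this way says nothing about post-translational modifications of the mature polypeptide.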

Upon its generation, sequence information is normally submitted to various databases. The major databases in which protein primary sequence data are available are listed in Table 2.4. Also included in this table are the major nucleic acid sequence databases, as amino acid sequence information can potentially be derived from these.

The Swiss-Prot database is probably the most widely used protein database. It is maintained collaboratively by the European Bioinformatics Institute (EBI) and the Swiss Institute for Bioinformatics. It is relatively easy to access and search via the World Wide Web (Table 2.4). A sample entry for human insulin is provided in Figure 2.4. Additional information detailing such databases is available via the web addresses provided in Table 2.4 and in the bioinformatics publications listed at the end of this chapter.

A polypeptide's amino acid sequence can thus be determined by direct chemical (Edman) or physical (mass spectrometry) means, or indirectly via gene sequencing. In practice, these methods are complementary to one another and can be used to cross-check sequence accuracy. If the target gene/messenger RNA (mRNA) has been previously isolated, then DNA sequencing is usually most convenient. However, this approach reveals little information regarding any PTMs present in the mature polypeptide, many of which are of critical significance in the context of therapeutic proteins (discussed in Section 2.5).

General information about the entry
Entry name: INS_HUMAN
Primary accession number: P01308
Secondary accession number(s): None
Entered in SWISS-PROT in: Release 01, July 1986
Sequence was last modified in: Release 01, July 1986
Annotations were last modified in: Release 39, May 2000

Name and origin of the protein
Protein name: INSULIN [Precursor]
Synonym(s): None
Gene name(s): INS
From: Homo sapiens (Human) [TaxID: 9606]
Taxonomy: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

Features (keys as listed): SIGNAL; CHAIN; PROPEP; CHAIN; DISULFID; DISULFID; DISULFID; VARIANT (6 entries); TURN; HELIX; STRAND; HELIX; TURN; HELIX; STRAND

Sequence information
Length: 110 AA [This is the length of the unprocessed precursor]
Molecular weight: 11981 Da [This is the Mw of the unprocessed precursor]
CRC64: C2C3B23B85E520E5 [This is a checksum on the sequence]

        10         20         30         40         50         60
MALWMRLLPL LALLALWGPD PAAAFVNQHL CGSHLVEALY LVCGERGFFY TPKTRREAED
        70         80         90        100        110
LQVGQVELGG GPGAGSLQPL ALEGSLQKRG IVEQCCTSIC SLYQLENYCN

P01308 in FASTA format

Figure 2.4 Sample entry for human insulin as present in the Swiss-Prot database. Refer to text for further details. Reproduced from the Swiss-Prot database on the Uniprot website http://www.ebi.uniprot.org/

2.2.4 Polypeptide synthesis


Full-scale polypeptide characterization usually requires modest/large (milligram to gram) amounts of the purified target polypeptide. Even larger quantities are then generally required if the polypeptide has a commercial application. In some cases a polypeptide can be obtained in sufficient quantities by direct extraction from its natural producer source. However, polypeptides may also be produced by direct chemical synthesis, as long as their amino acid sequence (and any PTMs) has been elucidated. Synthesis can also be undertaken via a biological route (recombinant DNA technology), as is the case for virtually all modern therapeutic proteins.
