WWW services for sequence analysis
Contact: Reinhard Schneider (schneider@EMBL-Heidelberg.de) and Burkhard Rost
Introductions to the World Wide Web
WWW searches
WWW search engines (general)
Search engines are programs that search the entire WWW for given
keywords. Some engines allow the search to be restricted to certain subjects.
- http://cuiwww.unige.ch/meta-index.html
Subject oriented searching (e.g. Information servers, Software, People)
- http://www.ebi.ac.uk/htbin/bwurld.pl
Searching for keywords at sites related to particular subjects (e.g. Databases,
BioUtilities, Journals, Software).
- http://www.yahoo.com/Science/Biology/
Yahoo page for Science-biology, permitting subject driven searching
(e.g. Anatomy, Biochemistry, Biomedical Engineering, Biotechnology)
- http://metacrawler.cs.washington.edu:8080/
MetaCrawler, general search engine
- http://webcrawler.com/
Web Crawler, general search engine
- http://altavista.digital.com/
Alta Vista, general search engine
- http://home.netscape.com/home/internet-search.html
NetSearch, general search engine
- http://minbar.cs.washington.edu:6060/
Ahoy! the Homepage Finder: search for email addresses and WWW homepages
WWW searches (topic specific)
General sites
Central sites
Groups & Universities, America
Groups & Universities, Europe
- CBS: http://www.cbs.dtu.dk/
Center for Biological Sequence Analysis, Copenhagen, Denmark.
- CNB: http://gredos.cnb.uam.es/
Protein Design Group at the CNB, Madrid, Spain.
- CSC: http://www.csc.fi:80/molbio/
The Finnish EMBnet node.
- Stockholm Univ: http://www.biokemi.su.se/~server/
Services provided by Stockholm University.
- Dublin: http://biotech.bio.tcd.ie//
Trinity College, Dublin, Ireland, Bioinformatics.
- ETHZ: http://cbrg.inf.ethz.ch/
The Computational Biochemistry Server at ETHZ (Swiss Federal Institute of Technology, Zuerich, Switzerland)
- MIPS: http://speedy.mips.biochem.mpg.de/
Martinsried Institute for Sequence Analysis, Munich, Germany.
- Univ. Cambridge: http://www.bio.cam.ac.uk/
Univ. of Cambridge, U.K., School of Biological Sciences.
Groups & Universities, Other
Extremely useful private sites
The following two links are maintained privately and are probably
the best starting points for an overview of what you might want
to search for in molecular biology.
Literature (Medline) and Journals
Miscellaneous sites for molecular biologists
- Dictionary of cell biology: http://www.mblab.gla.ac.uk/~julian/Dict.html
The Dictionary of Cell Biology was first published in 1989, and has
since been translated into several languages. It is intended to provide
quick access to easily-understood and cross-referenced definitions of terms
frequently encountered in reading the modern biology literature. This server
contains the text of the Second edition, published in April 1995, together
with enhancements, hypertext links and new entries which are destined for
the third edition.
- Molecular Biology Protocols: http://research.nwfsc.noaa.gov/protocols.html
Molecular Biology Protocols (Microbial Pathogenesis/Utilization Research
Division, Northwest Fisheries Science Center (NMFS/NOAA), USA). Lists various
protocols and collects information on techniques (e.g., DNA purification
techniques, DNA transformation/library preparation, southern/northern blotting,
DNA sequencing, PCR and related methods, Protein electrophoresis).
- The antibody resource page: http://www.antibodyresource.com/
Meta collection of links to various antibody resources!
- Biobase: http://biobase.dk/cgi-bin/celis
The Danish Centre for Human Genome Research's 2-D PAGE Databases at
the University of Aarhus contain data on proteins identified on various
reference maps. Available are:
- Taxa: http://ucmp1.berkeley.edu/taxaform.html
Web 'lift' through any taxon, including an introduction to phylogeny
and the origins of life (Berkeley, USA).
- List of species in SWISS-PROT: http://expasy.hcuge.ch/cgi-bin/speclist
- Phylogeny: http://www.no.embnet.org/phylogeny.html
Evolution and Phylogeny Laboratory, Norway.
- Courses: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/courses.html
Collection of bioinformatics courses (Univ College, London, England)
- BioMOO: http://bioinformatics.weizmann.ac.il/BioMOO/
BioMOO is a virtual meeting place for biologists, connected to the
Globewide Network Academy. The main physical part of the BioMOO is located
at the BioInformatics Unit of the Weizmann Institute of Science, Israel.
- OPAL: http://www.elsevier.nl:80/section/life/opal/doc/demos.htm
Open programs for associative learning (Elsevier, NL), an interactive
system useful for teaching and learning across various disciplines of cell
biology and biomedical sciences.
Metabolic and other pathways (genome analysis)
Information about genome projects
General
Arabidopsis
C. elegans
Drosophila
- BDGP:
Berkeley Drosophila Genome Project, Berkeley, U.S.A.
Fish
Human
Yeast
Databases, general
Databases, miscellaneous
(note: see also 'Metabolic and other pathways')
- Malaria db: http://www.wehi.edu.au/biology/malaria/who.html
Dep. of Microbiol., Monash Univ. and The Walter and Eliza Hall Inst.
of Medical Research, Australia.
- Malaria (Parasitology): http://www.wehi.edu.au/biology/malaria/sites.html
Other Malaria and Parasitology Sites, Monash Univ., Australia.
- Parasite genome db: http://www.ebi.ac.uk/parasites/parasite-genome.html
Parasite genome databases and genome research resource, EBI, U.K.
- PDD Protein Disease Db: http://www-pdd.ncifcrf.gov/PDD/GEN-docs/indexGEN.html
- GIF_DB: http://www-biol.univ-mrs.fr/~lgpd/GIF_DB/GIF_entries/GIF_DB_listing.html
Genes Interactions in the Fly DataBase, Marseille, France. A specialized
database for Interactions involved in Pattern formation in Drosophila.
- CySPID: http://paella.med.yale.edu/cyspid/
The Cytoskeletal Protein Interactions Database, Yale, U.S.A.
- TBASE The Transgenic/Targeted Mutation db: http://www.gdb.org/Dan/tbase/tbase.html
Databases, nucleotide sequences
- EMBL Nucleotide Sequence Database (EBI): http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
- GenBank Nucleotide Sequence db (NCBI): http://www.ncbi.nlm.nih.gov/Web/Search/index.html
- NDB Nucleic Acid db (Rutgers): http://ndbserver.rutgers.edu
- The Genome Sequence DataBase (NCGR): http://www.ncgr.org/gsdb/gsdb.html
- The TIGR Human cDNA Database: http://www.tigr.org/tdb/hcd/hcd.html
- Vector sequence db (Queen's Univ, Canada): http://biology.queensu.ca/~miseners/vector.html
- The Ribosomal Database Project (Univ of Illinois, Urbana, USA): http://rdpwww.life.uiuc.edu/
- Large ribosomal subunit db (Univ Antwerpen, Belgium): http://rrna.uia.ac.be/rrna/lsuform.html
- Small ribosomal subunit db (Univ Antwerpen, Belgium): http://rrna.uia.ac.be/rrna/ssuform.html
- uRNA db (Univ of Texas, Tyler, USA): http://pegasus.uthct.edu/uRNADB/uRNADB.html
- RNA modification db (University of Utah, Salt Lake City, USA): http://medstat.med.utah.edu/RNAmods/
- The molecular probe db (IST, Genova, Italy): http://www.biotech.ist.unige.it/interlab/mpdb.html
- PCR primers db (Univ Nijmegen, Netherlands): http://www.ebi.ac.uk/primers_home.html
- Codon usage db (Kazusa DNA Research Institute, Japan): http://www.dna.affrc.go.jp/~nakamura/
- DOGS (Database Of Genome Sizes, CBS, Denmark): http://www.cbs.dtu.dk/DOGS/index.html
For other database related issues, e.g., 'Carbohydrates resource', 'Species
specific databases', 'Gene(s)/protein(s) specific databases/resources',
and 'Dictionaries, primers, courses, nomenclature, etc.', see the WWW links
of Amos Bairoch: http://expasy.hcuge.ch/www/amos_www_links.html
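As a rough illustration of what a codon usage table (see the Codon usage db entry above) summarises, the following sketch tallies triplets in a single in-frame coding sequence. It is an assumption-laden toy, not the procedure used by any listed database; the function name `codon_usage` and the example sequence are hypothetical.

```python
from collections import Counter
from typing import Dict


def codon_usage(cds: str) -> Dict[str, float]:
    """Return relative codon frequencies for an in-frame coding sequence."""
    # Accept RNA input by mapping U to T (assumption: plain A/C/G/T/U letters).
    seq = cds.upper().replace("U", "T")
    # Only complete triplets are counted; a trailing partial codon is ignored.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any database.
    for codon, freq in sorted(codon_usage("ATGGCTGCTGCATAA").items()):
        print(codon, round(freq, 2))
```

Real codon usage tables are compiled over many annotated genes per organism; this sketch only shows the counting step.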
Databases, protein sequence information
- SWISS-PROT database of protein sequences: http://expasy.hcuge.ch/sprot/sprot-top.html
- PIR international protein sequence database: http://www.gdb.org/Dan/proteins/pir.html
- PROSITE: http://expasy.hcuge.ch/sprot/prosite.html
Dictionary of protein sites and patterns. PROSITE is a method
for determining the function of uncharacterized proteins translated
from genomic or cDNA sequences. It consists of a database of biologically
significant sites, patterns and profiles that help to reliably identify
the known protein family (if any) to which a new sequence belongs.
- BLOCKS: http://www.blocks.fhcrc.org
Blocks are multiply aligned ungapped segments corresponding to the
most highly conserved regions of proteins. Block Searcher, Get
Blocks and Block Maker are aids to detection and verification
of protein sequence homology. They compare a protein or DNA sequence to
a database of protein blocks, retrieve blocks, and create new blocks, respectively.
- PRINTS : http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
PRINTS is a compendium of protein fingerprints. A fingerprint is a
group of conserved motifs used to characterise a protein family; its diagnostic
power is refined by iterative scanning of OWL. Usually the motifs do not
overlap, but are separated along a sequence, though they may be contiguous
in 3D-space. Fingerprints can encode protein folds and functionalities
more flexibly and powerfully than can single motifs: the database thus
provides a useful adjunct to PROSITE.
- MOTIFS : http://www.genome.ad.jp/SIT/MOTIF.html
A set of motif libraries and search programs (Kyoto Univ., Japan) for
retrieval and analysis of protein sequence and structural motifs. The program
currently available is
- ProDom: http://protein.toulouse.inra.fr/
The ProDom protein domain database consists of an automatic compilation
of homologous domains detected in the SWISS-PROT database by the DOMAINER
algorithm (Sonnhammer, E.L.L. & Kahn, D., 1994, Protein Sci. 3:482-492).
It has been devised to assist with the analysis of the domain arrangement
of proteins.
- PUU: ftp://ftp.embl-heidelberg.de/pub/databases/protein_extras/puu/domains.puu
Putative protein structural domains.
- Yeast db: http://quest7.proteome.com/YPDhome.html
YPD contains physical, functional, and genetic information for the
proteins of budding yeast, Saccharomyces cerevisiae.
- Kabat db: http://immuno.bme.nwu.edu/
The Kabat database of sequences of proteins of immunological interest.
- REBASE - The Restriction Enzyme db: http://www.gdb.org/Dan/rebase/rebase.html
- EC-Enzyme classification db: http://www.gdb.org/Dan/proteins/ec-enzyme.html
- ENZYME (nomenclature db): http://expasy.hcuge.ch/sprot/enzyme.html
--- ftp://expasy.hcuge.ch/databases/enzyme
- Enzyme Structures db: http://www.biochem.ucl.ac.uk/bsm/enzymes/index.html
- TBASE The Transgenic/Targeted Mutation db: http://www.gdb.org/Dan/tbase/tbase.html
- PDD Protein Disease Db: http://www-pdd.ncifcrf.gov/PDD/GEN-docs/indexGEN.html
- O-GlycBase: http://www.cbs.dtu.dk/OGLYCBASE/cbsoglycbase.html
O-GLYCBASE is a revised database of O-glycosylated proteins (CBS, Denmark).
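To make the idea of a PROSITE pattern (see the PROSITE entry above) more concrete, here is a minimal sketch that translates the basic pattern syntax into a Python regular expression and scans a sequence. The helper names `prosite_to_regex` and `find_sites` and the test sequence are hypothetical, and only the common syntax elements are handled (x, [..], {..}, repeat counts, and the < and > anchors outside brackets); it is not the scanning software used by PROSITE itself.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2,4) or A(3).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core in ("x", "X"):           # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # Bracketed sets and single residues are already valid regex.
        count = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched segment) for all matches, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    # N-glycosylation site pattern; the sequence is a made-up example.
    print(find_sites("N-{P}-[ST]-{P}", "MNGTAANPSA"))
```

Short patterns like this one match by chance very often, so a hit is a hint rather than evidence of function.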
Databases, protein structure information
- PDB (Brookhaven): http://pdb.pdb.bnl.gov/
The Protein Data Bank (PDB) is an archive of experimentally determined
three-dimensional structures of biological macromolecules, serving a global
community of researchers, educators, and students.
- BMCD: http://ibm4.carb.nist.gov:4400/bmcd/bmcd.html
The biological macromolecule crystallization db and the NASA archive
for protein crystal growth data.
- BioMagResBank: http://www.bmrb.wisc.edu
Protein, peptide and nucleic acid NMR spectroscopy db.
- Klotho: http://ibc.wustl.edu/klotho
Biochemical compounds declarative db.
- DSSP: file://ftp.embl-heidelberg.de/pub/databases/dssp/
DSSP database of secondary structure assignments for proteins of known
structure. Contains information about secondary structure, solvent accessibility
and some contacts for all PDB proteins.
- HSSP: file://ftp.embl-heidelberg.de/pub/databases/hssp/
HSSP database of homology-derived secondary structure of proteins.
It contains the alignments of all known structures against the SWISS-PROT
sequence database.
- FSSP: file://ftp.embl-heidelberg.de/pub/databases/fssp/
FSSP database of fold classification based on structure-structure alignment
of proteins. Contains the all-against-all structural alignments for PDB.
- PDBFINDER: http://www.sander.embl-heidelberg.de/pdbfinder/
The PDBFINDER database is constructed from the PDB,
DSSP and HSSP databases. Many of the fields contained in the PDBFINDER
database are difficult to access from the original databases. Some information
is retrieved from the original literature.
- Enzyme Structures db: http://www.biochem.ucl.ac.uk/bsm/enzymes/index.html
- SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
Classification of protein structures into structural families and display
of 3D structures.
- CATH: http://www.biochem.ucl.ac.uk/bsm/cath/CATHintro.htm
Classification of protein structures into structural families. CATH
is based on both structural and sequence relationships between proteins
at several levels of similarity.
- ProDom: http://protein.toulouse.inra.fr/
The ProDom protein domain database consists of an automatic compilation
of homologous domains detected in the SWISS-PROT database by the DOMAINER
algorithm (Sonnhammer, E.L.L. & Kahn, D., 1994, Protein Sci. 3:482-492).
It has been devised to assist with the analysis of the domain arrangement
of proteins.
- PUU: ftp://ftp.embl-heidelberg.de/pub/databases/protein_extras/puu/domains.puu
Putative protein structural domains.
- SWISS-3DIMAGE: http://expasy.hcuge.ch/sw3d/sw3d-top.html
High quality pictures of biological macromolecules.
- Protein Motions: http://hyper.stanford.edu/~mbg/ProtMotDB
A db of domain, loop and subunit motions
Services, Hotlists
Services, general
- ECACC: http://www.gdb.org/annex/ecacc/HTML/ecacc.html
European collection of animal cell cultures. The European Collection
of Animal Cell Cultures is a self financed part of the Centre for Applied
Microbiology and Research. The collection is supported from a combination
of sources, the UK Research Councils (MRC, AFRC, SERC, NERC), the Commission
of the European Communities, the World Health Organisation and revenue
from sales and the provision of technical services. The Collection accepts
deposits from a wide range of institutions including industry and aims
to provide as comprehensive a service as possible to its users. Further
information is provided on technical matters and the increased scope of
back-up services from ECACC.
- HyperCLDB: http://www.biotech.ist.unige.it/cldb/indexes.html
HyperCLDB, the hypertext on cell culture availability extracted from
the Cell Line Data Base of the Interlab Project.
- QUEST: http://siva.cshl.org/
The QUEST Protein Database Center is a facility for the construction
and analysis of Protein Databases. The data is generated by two-dimensional
(2D) electrophoresis of proteins on polyacrylamide gels. We are located
at the Cold Spring Harbor Laboratory (CSHL) on Long Island, New York, and
we have a computer facility where gels are analyzed and 2D gel protein
databases are built. Our goal is the construction of protein databases
for scientific investigations.
- Compute pI/Mw: http://expasy.hcuge.ch/ch2d/pi_tool.html
Compute pI/Mw is a tool which allows the computation of the theoretical
pI (isoelectric point) and Mw (molecular weight) for a list of SWISS-PROT
entries or for a user entered sequence.
- K2d server: http://kal-el.ugr.es/k2d/k2d.html
Estimation of the percentages of protein secondary structure from UV
circular dichroism spectra using a neural network.
- Biotech validation: http://biotech.embl-heidelberg.de:8400/
Biotech validation suite for protein structures (quality checks of
protein structures). The server gives you a comprehensive check report
of your protein.
- Dali server: http://www.embl-heidelberg.de/dali/dali.html
The Dali server is a network service for comparing protein structures
in 3D. You submit the coordinates of a query protein structure and Dali
compares them against those in the Protein Data Bank. A multiple alignment
of structural neighbours is mailed back to you. In favourable cases, comparing
3D structures may reveal biologically interesting similarities that are
not detectable by comparing sequences.
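The theoretical Mw mentioned for the Compute pI/Mw tool above can be approximated by summing average residue masses and adding one water molecule for the free termini. The sketch below assumes commonly tabulated average masses (in Da) and the 20 standard amino acids only; it is not the implementation behind the listed server, and the values should be checked against an authoritative table before serious use.

```python
# Average residue masses in Da (assumed values from commonly published tables).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # one H2O for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Approximate average molecular weight of an unmodified peptide chain."""
    seq = sequence.upper()
    unknown = set(seq) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError("unsupported residue codes: " + ", ".join(sorted(unknown)))
    if not seq:
        return 0.0
    return sum(AVERAGE_RESIDUE_MASS[r] for r in seq) + WATER_MASS


if __name__ == "__main__":
    # Glycylglycine as a small sanity check.
    print(round(molecular_weight("GG"), 2))
```

Post-translational modifications, signal peptide removal and disulphide bonds all change the real mass and are ignored here.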
Services, alignments and database searches
Central sites
- EBI (England): http://www.ebi.ac.uk/searches/searches.html
Sequence similarity searches (FASTA, BLITZ, PROSITE, BLAST, MAXHOM-PredictProtein).
- BCM (USA): http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html
General protein sequence/pattern searches. Programs include fast methods
(BLAST, FASTA, PROSITE) and full dynamic programming methods (BLITZ,
MPSEARCH).
- BioSCAN: http://genome.cs.unc.edu/online.html
The BioSCAN Server allows searching, retrieving and comparing of protein
and DNA sequences.
- NCSA Biology Workbench: http://biology.ncsa.uiuc.edu/BW/BW.cgi
The NCSA Biology Workbench provides a point and click interface for
rapid access to biological databases and analysis tools.
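Several entries in this list refer to "full dynamic programming" searches. As a minimal sketch of what that means, the following computes a Smith-Waterman style local alignment score for two sequences. The scoring values (match, mismatch, linear gap penalty) are illustrative assumptions; production services typically use substitution matrices and affine gap penalties, and this is not the code behind any listed server.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Made-up sequences sharing the segment ACGT.
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Every cell of the table is filled, so the cost grows with the product of the two sequence lengths; this is why the heuristic "fast" methods exist alongside it.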
Fast database searches
- BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
BLAST performs fast database searching combined with rigorous statistics
for judging the significance of matches. Five BLAST programs search all
combinations of query and database sequences.
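The five programs and the query/database combinations they cover can be written down as a small lookup table. The sketch below is only a reading aid: the function name `blast_programs` and the type labels "protein" and "nucleotide" are hypothetical, and nothing here contacts the BLAST server.

```python
from typing import Dict, List, Tuple

# (query type, database type) -> BLAST programs covering that combination.
BLAST_PROGRAMS: Dict[Tuple[str, str], List[str]] = {
    ("protein", "protein"): ["blastp"],
    # blastn compares nucleotides directly; tblastx translates both sides.
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    # blastx translates the nucleotide query in all six reading frames.
    ("nucleotide", "protein"): ["blastx"],
    # tblastn translates the nucleotide database instead.
    ("protein", "nucleotide"): ["tblastn"],
}


def blast_programs(query_type: str, database_type: str) -> List[str]:
    """Return the BLAST programs applicable to a query/database combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError("types must be 'protein' or 'nucleotide'")
    return list(BLAST_PROGRAMS[key])


if __name__ == "__main__":
    print(blast_programs("nucleotide", "protein"))
```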
Full dynamic programming
Refinement by profile-based analysis
Refinement by hidden Markov models
Analysing and displaying alignments
- ToPLign: http://cartan.gmd.de/ToPLign.html
ToPLign implements standard pairwise and multiple alignment methods
with flexible parameter handling. The analysis of alignments is supported
by offering different visualisations of alignments. Furthermore, the stability
of the resulting alignments can be explored.
- BOX: http://ulrec3.unil.ch/software/BOX_form.html
Pretty Printing and Shading of Multiple-Alignment files.
Finding motifs
Analysing composition bias
Other services for sequence analysis
- Sequence Alerting System: http://swan.embl-heidelberg.de:8080/Alerting/
The sequence alerting system in its present form will search each day
in several databases for news on (homologues of) "your" sequence
and will inform you by email if it has detected a new relative.
- PSORT: http://psort.nibb.ac.jp/
Prediction of protein sorting signals and localisation sites in amino
acid sequences.
Services, analysing nucleotide sequences
- GRAIL: http://avalon.epm.ornl.gov/
GRAIL (Gene Recognition and Assembly Internet Link) is a DNA sequence
analysis tool. The GenQuest sequence comparison server is designed
for rapid and sensitive comparison of DNA and Protein sequence to existing
DNA and Protein sequence databases. Full database entries of any sequence
found in the course of a search are retrieved.
- GenQuest: http://www.gdb.org/Dan/gq/gq.form.htm
Running BLAST, FASTA or a full dynamic programming alignment of nucleotide
sequences against Nucleotide and protein databases.
- Splice site predictions : http://www.cbs.dtu.dk/bsnn.html
The Center for Biological Sequence Analysis (CBS, Copenhagen, Denmark)
offers a service for predicting intron splice sites in human and Arabidopsis
thaliana DNA.
- Gene recognition : http://www-hto.usc.edu/software/procrustes/
The gene recognition algorithm PROCRUSTES (USC, USA) is based on the spliced
alignment algorithm which explores all possible exon assemblies and finds
the multi-exon structure with the best fit to a related protein via spliced
alignments.
Services, protein structure prediction
Collection of tools
Secondary structure prediction
Solvent accessibility prediction
- PHDacc: http://www.embl-heidelberg.de/predictprotein/
Multiple alignment-based neural network system.
Accuracy: > 75% (+/-10%, one standard deviation), higher for more reliably
predicted residues. Evaluated by cross-validation on 720 unique proteins;
comparisons to other methods based on identical sets.
Transmembrane helix and signal peptide prediction
- PHDhtm: http://www.embl-heidelberg.de/predictprotein/
Multiple alignment-based neural network system predicting the locations
of transmembrane helices.
Accuracy: > 95% (+/-10%, one standard deviation), higher for more reliably
predicted residues. Evaluated by cross-validation on 132 proteins; comparisons
to other methods based on identical sets.
- TMAP: http://www.embl-heidelberg.de/tmap/tmap_sin.html
Single sequence-based statistical prediction of the locations of transmembrane
helices.
Accuracy: > 95%. Evaluated on 28 proteins WITHOUT cross-validation.
- PHDtopology: http://www.embl-heidelberg.de/predictprotein/
Refinement of PHDhtm by dynamic programming and prediction of topology
(orientation of N-term with respect to membrane).
Accuracy: for > 85% of all proteins all helices and topology are predicted
correctly. Evaluated by cross-validation on 132 proteins; comparisons to
other methods based on identical sets.
- TMpred: http://ulrec3.unil.ch/software/TMPRED_form.html
Single sequence-based prediction of location and topology for helical
transmembrane proteins using statistics and similarity matrices.
- DAS: http://www.biokemi.su.se/~server/DAS/
Single sequence-based prediction of location for helical transmembrane
proteins.
- TopPred2: http://www.biokemi.su.se/~server/toppred2/
Single sequence-based prediction of topology for helical transmembrane
proteins.
- SignalP: http://www.cbs.dtu.dk/services/SignalP/
Neural network prediction of presence and location of signal peptide
cleavage sites in amino acid sequences from different organisms: Gram-positive
and Gram-negative prokaryotes, and eukaryotes.
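For readers unfamiliar with single sequence-based prediction of transmembrane helices, the simplest classical approach is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6, which are the textbook defaults and an assumption here; it is a generic illustration and not the algorithm of PHDhtm, TMAP, TMpred, DAS, TopPred2 or any other server listed above.

```python
from typing import List

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> List[float]:
    """Mean hydropathy of every full window; empty if the sequence is too short."""
    seq = sequence.upper()
    if window < 1 or len(seq) < window:
        return []
    # Non-standard residue codes raise KeyError in this sketch.
    values = [KYTE_DOOLITTLE[r] for r in seq]
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]   # slide the window by one
        profile.append(total / window)
    return profile


def candidate_tm_windows(sequence: str, window: int = 19,
                         threshold: float = 1.6) -> List[int]:
    """1-based start positions of windows whose mean hydropathy exceeds threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, score in enumerate(profile) if score > threshold]


if __name__ == "__main__":
    # Made-up sequence: polar flanks around a leucine/alanine-rich stretch.
    print(candidate_tm_windows("DDKKSS" + "LLALLAVLLALLAVLLALL" + "KKDDSS"))
```

A profile like this flags hydrophobic stretches only; it says nothing about topology and will also flag buried helices of soluble proteins and signal peptides.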
Prediction of coiled-coils
Prediction of O-glycosylation sites
Homology modelling
Threading
- PHDthreader: http://www.embl-heidelberg.de/predictprotein/
Prediction-based threading detecting the fold type and aligning a protein
of unknown structure and a protein of known structure for low levels of
sequence identity ( < 25%).
Accuracy: < 30%, i.e., less than 30% of the predicted first hits are
true remote homologues. Evaluated by cross-validation on 89 unique protein
structures.
- T3P2: http://www.mbi.ucla.edu/people/frsvr/frsvr.html
Prediction-based threading detecting the fold type and aligning a protein
of unknown structure and a protein of known structure for low levels of
sequence identity ( < 25%).
- PSCANN: http://www.biokemi.su.se/~arne/pscan/
Threading method combining sequence and structure profiles. Performance
accuracy: more likely to recognise similar folds than simple sequence alignment.
Software