🔍 FASTA (Protein-Nucleotide): Fast Protein to DNA/RNA Search

FASTA (Protein-Nucleotide) is a bioinformatics tool that compares a protein query sequence against a nucleotide (DNA or RNA) sequence database. It translates the nucleotide database in all six possible reading frames and then searches your protein query against these translated sequences.

❓ What is FASTA (Protein-Nucleotide)?

For protein-nucleotide searches, the FASTA suite provides two programs, TFASTX and TFASTY. Each takes an amino acid (protein) query sequence and searches it against a chosen nucleotide sequence database, with every database sequence translated in all six possible reading frames (three on each strand). This allows you to find potential coding regions in DNA/RNA databases that are homologous to your protein query.

  • Protein Query vs. Translated Nucleotide Database: Compares protein to a six-frame translated nucleotide database.
  • Gene Finding: Identifies candidate coding regions in raw genomic or EST data.
  • Versatile Search: Can find protein homologs even if the nucleotide sequence is unannotated.
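
To make the six-frame idea concrete, here is a minimal Python sketch of translating one nucleotide sequence in all six reading frames with the standard genetic code. It is an illustration only, not how TFASTX or TFASTY are implemented; the function names (reverse_complement, translate, six_frame_translation) are made up for this example, and unknown codons are simply shown as X.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered
# TTT, TTC, TTA, TTG, TCT, ... with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalise(seq):
    """Upper-case the sequence and treat RNA (U) as DNA (T)."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Return the reverse complement of a DNA/RNA sequence."""
    return normalise(seq).translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    seq = normalise(seq)
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Map frame number (+1..+3 forward, -1..-3 reverse) to its translation."""
    forward = normalise(seq)
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(frame, protein)
```

A protein query is then compared against each of these translations; a real search tool additionally scores the alignments and, in the case of TFASTX and TFASTY, allows the alignment to switch frames.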

🎯 Why Use FASTA (Protein-Nucleotide)? For Gene Discovery with Protein Evidence

FASTA (Protein-Nucleotide) is well suited to:

  • 🔍 Gene Discovery: Identifying novel genes or coding regions in uncharacterized nucleotide sequences (e.g., newly sequenced genomes, ESTs, transcriptomes) using a known protein as a guide.
  • 🧬 Pseudogene Identification: Helping to distinguish functional genes from non-functional pseudogenes. An alignment to a known protein that is interrupted by frameshifts or in-frame stop codons can point to a pseudogene.
  • 📊 Functional Annotation: Inferring the function of a genomic region by finding homologous proteins.
  • 🎯 Cross-Species Homology: Finding homologs of a protein in the nucleotide databases of other organisms.
  • 📈 Sequence Validation: Confirming the coding potential of a nucleotide sequence based on protein similarity.

🧑‍💻 How to Use FASTA (Protein-Nucleotide) on Job Dispatcher: A Step-by-Step Guide

Follow these simple steps to perform a protein to translated nucleotide FASTA search:

1️⃣ Navigate to the Tool

  1. From the main menu, go to All Tools (or search for "FASTA (Protein vs. Nucleotide Search)").
  2. Click the prominent Use Tool button located next to "FASTA (Protein vs. Nucleotide Search)."

2️⃣ Input Your Protein Sequence

  • Locate the input box (large text area) or the "upload a Sequence File" option.

  • Paste your protein sequence(s) in FASTA format or upload a FASTA file.

    >my_protein_query
    MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
    
  • Important: Provide your sequence either by typing or pasting it into the text area or by uploading a file, not both. If both are filled in, clear one before you proceed.
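
If you prepare queries in a script, a quick sanity check before pasting can catch a missing header line or a nucleotide sequence submitted by mistake. The sketch below is a hypothetical helper, not part of Job Dispatcher: parse_fasta and looks_like_protein are names invented here, and the accepted residue alphabet (the 20 standard amino acids plus B, Z, X, U, O and *) is an assumption about what a protein query may contain.

```python
# Letters accepted as protein residues in this sketch (an assumption):
# the 20 standard amino acids plus ambiguity/rare codes and the stop symbol.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBZXUO*")
NUCLEOTIDE_LETTERS = set("ACGTUN")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_protein(seq):
    """True if seq uses only protein letters and is not purely nucleotide letters."""
    letters = set(seq)
    if not letters or letters - VALID_RESIDUES:
        return False
    return not letters <= NUCLEOTIDE_LETTERS


if __name__ == "__main__":
    query = """>my_protein_query
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
"""
    for name, seq in parse_fasta(query):
        print(name, len(seq), looks_like_protein(seq))
```

The nucleotide check is deliberately crude: a short peptide made only of A, C, G and T would be rejected, so treat a False result as a prompt to look at the sequence, not as proof.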

3️⃣ Configure Parameters

  • 📝 Title: Provide a descriptive title for your job (e.g., "My FASTA-NP Search").

  • 💡 Program: Select the specific FASTA program to run.

    • TFASTX - Default (allows frameshifts between codons)
    • TFASTY (also allows frameshifts within codons; can be more sensitive on error-prone sequence)
  • 🗄️ Databases: Select one or more nucleotide databases to search against.

    • Default: em_est_env, em_gss_env, em_htc_env, em_htg_env, em_pat_env, em_std_env, em_sts_env, em_tsa_env
    • (Many other options available in the Nucleotide Databases Tree on the form)
  • 📊 Matrix (matrix): Select the scoring matrix for protein alignments.

    • BL62 (BLOSUM62) - Default
    • BL50 (BLOSUM50)
    • BP62 (BLASTP62)
    • BL80 (BLOSUM80)
    • P250 (PAM250), P120 (PAM120)
    • M40, M20, M10 (MDM matrices)
    • VT160, VT120, VT80, VT40, VT20, VT10 (VTML matrices)
  • ➖ Gap Open (gapopen): The penalty for opening a new gap.

    • Default: -10
    • Options: default (-10), 0, -1, ..., -19
  • ➖ Gap Extend (gapext): The penalty for extending an existing gap.

    • Default: -2
    • Options: default (-2), 0, -1, ..., -16
  • 📏 KTUP (ktup): The size of the word (k-tuple) used for initial seeding. Higher values are faster but less sensitive.

    • Default: 6
    • Options: 6, 5, 4, 3, 2, 1, N/A (-1)
  • 📈 EXPECTATION UPPER LIMIT (expupperlim): Maximum E-value for reported matches. Lower values are stricter.

    • Default: 10
    • Options: 1e-300, 1e-100, 1e-50, 1e-10, 1e-5, 0.001, 0.1, 1, 2, 5, 10, 20, 50
  • 📉 EXPECTATION LOWER LIMIT (explowlim): Minimum E-value for reported matches. Allows excluding very closely related hits.

    • Default: 0
    • Options: 0, 1e-300, 1e-100, 1e-50, 1e-10, 1e-5, 0.001, 0.1, 1, 2, 5, 10, 20, 50
  • ↔️ STRAND (strand): For nucleotide database sequences, specify the sequence strand to be translated and used for the search.

    • both - Default
    • top
    • bottom
  • 📊 HISTOGRAM (hist): Display a histogram of scores in the FASTA result.

    • false (no) - Default
    • true (yes)
  • 🧹 FILTER (filter): Filter regions of low sequence complexity.

    • none - Default
    • dust (DUST filter)
  • 📊 STATISTICAL ESTIMATES (stats): Method for calculating statistical significance.

    • 1 (Regress) - Default
    • 2 (MLE), 3 (Altschul-Gish), 4 (Regress/shuf.), 5 (MLE/shuf.)
  • 🔢 SCORES (scores): Maximum number of match score summaries to report.

    • Default: 50
    • Options: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 500, 750, 1000
  • ↔️ ALIGNMENTS (alignments): Maximum number of alignments to report.

    • Default: 50
    • Options: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 500, 750, 1000
  • 📏 SEQUENCE RANGE (seqrange): Specify a range within the query sequence to search.

    • Default: START-END (entire sequence)
  • 🗄️ DATABASE RANGE (dbrange): Specify a length range for database sequences to search against.

    • Default: START-END (all lengths)
  • 🔢 MULTI HSPS (hsps): Display all significant High-scoring Segment Pairs (HSPs) between query and library sequence.

    • no (false) - Default
    • yes (true)
  • 📝 ANNOTATION FEATURES (annotfeats): Turn on/off annotation features from UniProtKB.

    • no (false) - Default
    • yes (true)
  • ⚙️ SCORE REPORT FORMAT (scoreformat): Choose the format for the score report.

    • default - Default
    • -m 8 (BLAST tabular), -m 8C (BLAST tabular with comments), etc. (various tabular and ASN.1 formats)
  • 📚 TRANSLATION TABLE (transltable): Select the genetic code table for translating the database sequences.

    • Default: 1 (Standard SGC0)
    • Options: N/A (-1), 1 (Standard SGC0), 2 (Vertebrate Mitochondrial), 3 (Yeast Mitochondrial), ..., 23 (Thraustochytrium Mitochondrial)
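
When the same search is run repeatedly, it can help to keep the chosen settings in one place and check them against the options listed above before submitting. The following Python sketch does only that: it collects the documented defaults in a dictionary and rejects values that are not on the form. The parameter names in parentheses above (matrix, gapopen, ktup and so on) are reused as keys; the "program" key, its lowercase values and the build_params function are assumptions made for this example, and the sketch does not submit anything.

```python
# Defaults as documented in the parameter list above.
DEFAULTS = {
    "program": "tfastx",   # key name and lowercase spelling are assumptions
    "matrix": "BL62",
    "gapopen": -10,
    "gapext": -2,
    "ktup": 6,
    "expupperlim": 10,
    "explowlim": 0,
    "strand": "both",
    "filter": "none",
    "stats": 1,
    "scores": 50,
    "alignments": 50,
    "transltable": 1,
}

REPORT_SIZES = set(range(10, 101, 10)) | {150, 200, 250, 500, 750, 1000}

# Discrete choices copied from the option lists above.
ALLOWED = {
    "program": {"tfastx", "tfasty"},
    "matrix": {"BL62", "BL50", "BP62", "BL80", "P250", "P120", "M40", "M20",
               "M10", "VT160", "VT120", "VT80", "VT40", "VT20", "VT10"},
    "ktup": {6, 5, 4, 3, 2, 1, -1},
    "strand": {"both", "top", "bottom"},
    "filter": {"none", "dust"},
    "stats": {1, 2, 3, 4, 5},
    "scores": REPORT_SIZES,
    "alignments": REPORT_SIZES,
}


def build_params(**overrides):
    """Merge overrides into the defaults and validate them against the form options."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    for key, allowed in ALLOWED.items():
        if params[key] not in allowed:
            raise ValueError(f"{key}={params[key]!r} is not an available option")
    if not -19 <= params["gapopen"] <= 0:
        raise ValueError("gapopen must be between -19 and 0")
    if not -16 <= params["gapext"] <= 0:
        raise ValueError("gapext must be between -16 and 0")
    # Sanity check added for this sketch: an inverted E-value window matches nothing.
    if params["explowlim"] > params["expupperlim"]:
        raise ValueError("explowlim must not exceed expupperlim")
    return params


if __name__ == "__main__":
    print(build_params(program="tfasty", expupperlim=0.001))
```

Keeping defaults and overrides separate makes it easy to record exactly which settings differed from the form defaults for a given run.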

4️⃣ Submit Your Job

  • Once your sequence is entered and parameters are set, click the Submit or Run button.
  • Your job will be dispatched to the EMBL-EBI Web Service. You will be automatically redirected to a Job Status page to monitor its progress.

5️⃣ Interpret Results

  • On the results page, you will find a summary of your FASTA (Protein-Nucleotide) search, including a list of significant hits, their scores, and alignments.
  • Pay attention to the E-value and the specific reading frame in the database sequence that best matches your protein query.
  • ⭐ Tip: TFASTY can be more sensitive than TFASTX for distant protein homologs when frameshifts fall inside codons, as in error-prone EST or draft genomic sequence, because TFASTX only allows frameshifts between codons.
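
If you chose a BLAST tabular score report format (-m 8), the hit list can be post-processed with a short script. The sketch below assumes the conventional 12-column BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score) and assumes that a subject start greater than the subject end indicates a match on the reverse strand; check both against your own output before relying on it. The Hit type, the function names and the three example rows are invented for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    evalue: float
    bits: float
    strand: str


def parse_tabular(text):
    """Parse 12-column BLAST-style tabular lines into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines (as in -m 8C)
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a hit line in the assumed layout
        s_start, s_end = int(cols[8]), int(cols[9])
        strand = "forward" if s_start <= s_end else "reverse"
        hits.append(Hit(cols[0], cols[1], float(cols[2]),
                        float(cols[10]), float(cols[11]), strand))
    return hits


def filter_hits(hits, lower=0.0, upper=10.0):
    """Keep hits whose E-value lies in [lower, upper], best (smallest) first."""
    kept = [h for h in hits if lower <= h.evalue <= upper]
    return sorted(kept, key=lambda h: h.evalue)


if __name__ == "__main__":
    # Made-up rows for illustration only; not real search results.
    example = "\n".join([
        "# example comment line",
        "my_protein_query\tSUBJ_A\t98.1\t105\t2\t0\t1\t105\t400\t86\t1e-50\t210.0",
        "my_protein_query\tSUBJ_B\t45.0\t100\t55\t1\t3\t102\t10\t309\t3.2e-05\t52.4",
        "my_protein_query\tSUBJ_C\t30.2\t60\t40\t2\t20\t79\t5\t184\t4.7\t28.1",
    ])
    for hit in filter_hits(parse_tabular(example), upper=0.001):
        print(hit.subject, hit.evalue, hit.strand)
```

The lower and upper arguments mirror the EXPECTATION LOWER LIMIT and EXPECTATION UPPER LIMIT parameters, so the same window can be tightened after the search without rerunning it.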

💬 Need Help?

If you run into issues, please visit our Contact Us page for support. Happy FASTA searching!