MAFFT
Developer(s): Kazutaka Katoh
Stable release: 7.475 / 23 November 2020
Written in: C
Operating system: UNIX, Linux, Mac, MS-Windows
Type: Bioinformatics tool
Licence: BSD[1]
Website: mafft.cbrc.jp/alignment/software/

In bioinformatics, MAFFT (for multiple alignment using fast Fourier transform) is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version of MAFFT used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform.[2] Subsequent versions of MAFFT have added other algorithms and modes of operation,[3] including options for faster alignment of large numbers of sequences,[4] higher accuracy alignments,[5] alignment of non-coding RNA sequences,[6] and the addition of new sequences to existing alignments.[7]

History

There have been many variations of the MAFFT software, some of which are listed below:

A timeline outlining the different versions of MAFFT since 2002, with brief descriptions of each notable generation of the software.

Algorithm

The MAFFT algorithm proceeds in five steps: pairwise alignment, distance calculation, guide tree construction, progressive alignment, and iterative refinement.[8]
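The distance calculation and guide tree construction steps can be illustrated with a minimal Python sketch. It is not MAFFT's implementation: the k-mer Jaccard distance stands in for whatever distance MAFFT derives from its pairwise comparisons, UPGMA (average linkage) is assumed as the clustering method, and the function names `kmer_distance` and `guide_tree` are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Jaccard distance between k-mer sets: 0.0 = same k-mers, 1.0 = none shared.

    Assumption: a cheap stand-in for an alignment-derived distance.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def guide_tree(seqs, k=3):
    """Build a guide tree by UPGMA and return it as nested tuples of names."""
    # Each cluster key is a tree (a name or a nested tuple); the value lists its leaves.
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset((x, y)): kmer_distance(seqs[x], seqs[y], k)
        for x, y in combinations(seqs, 2)
    }

    def average_distance(c1, c2):
        # Mean distance over all leaf pairs spanning the two clusters.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


sequences = {"a": "ACGTACGTAC", "b": "ACGTACGTAA", "c": "TTTTGGGGCC"}
tree = guide_tree(sequences)
```

In a progressive aligner, the resulting tree determines the order in which sequences and groups are aligned: the most similar pair first, more distant sequences later.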

Input/Output

Web Form  

Input

Steps for using MAFFT with other programs to view an MSA

This program can take in multiple sequences as input, which can be entered in two ways:

Sequence Input Window  
Example of FASTA format. More examples of accepted input formats are available at: https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Multiple+Sequence+Alignment+Tool+Input+Examples

The user can directly enter three or more sequences in the input window in any of the following formats: GCG, FASTA, EMBL (nucleotide only), GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot (protein only). It is important to note that partially formatted sequences are not accepted, and adding a return to the end of the sequence may help certain applications understand the input. It is also advised to avoid using data from word processors as hidden/control characters may be present.[11]
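The input requirements above (fully formatted records, at least three sequences) can be checked before submission. The following Python sketch handles FASTA only; the function names `parse_fasta` and `check_alignment_input` are hypothetical, and the web form's own validation may differ.

```python
def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dict, preserving input order."""
    records = {}
    name = None
    for raw in text.splitlines():  # splitlines() accepts both \n and \r\n endings
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if not name:
                raise ValueError("header line has no sequence name")
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = ""
        elif name is None:
            # Partially formatted input: residues with no preceding header.
            raise ValueError("sequence data found before any '>' header")
        else:
            records[name] += line
    return records


def check_alignment_input(records, minimum=3):
    """Raise ValueError unless there are enough non-empty sequences."""
    if len(records) < minimum:
        raise ValueError(f"need at least {minimum} sequences, got {len(records)}")
    empty = [name for name, seq in records.items() if not seq]
    if empty:
        raise ValueError(f"sequences with no residues: {empty}")
    return records


example = ">seq1\r\nACGT\r\nAC\r\n>seq2\r\nACGTTT\r\n>seq3\r\nAGGT\r\n"
records = check_alignment_input(parse_fasta(example))
```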

Sequence File Upload  

The user can upload a file containing three or more valid sequences in any format mentioned above. Word processor files may yield unpredictable results due to the presence of hidden/control characters, so it is best to save files with the Unix format option to avoid hidden Windows characters. Once the file is uploaded, it can be used as input for multiple sequence alignment.[11]

Text files saved in DOS/Windows format have different line endings from those saved in Unix/Linux format. DOS/Windows uses a combination of carriage return and line feed characters ("\r\n") to indicate the end of a line, while Unix/Linux systems use only a line feed character ("\n").[12]

When transferring files between Windows and Unix-based systems, it's important to be aware of these differences to ensure that the line endings are correctly translated. Otherwise, the hidden carriage return characters in the Windows-formatted files may cause issues when viewed or edited on Unix-based systems, and vice versa.[12]
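A file with Windows line endings can be converted before upload. The following Python sketch works on raw bytes so that no other characters are altered; the function names and the example paths are hypothetical.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Replace Windows (\\r\\n) and bare carriage-return line endings with \\n."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path with Unix line endings."""
    # Binary mode prevents Python from translating newlines on its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_newlines(data))


converted = to_unix_newlines(b">seq1\r\nACGT\r\n>seq2\r\nACGA\r\n")
```

Calling `convert_file("input_windows.fasta", "input_unix.fasta")` would write a cleaned copy and leave the original file untouched.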

Output

The user will have the option to request the Multiple Sequence Alignment (MSA) to be generated in one of the two available formats:

Example of ClustalW output
Output Format | Description | Abbreviation
Pearson/FASTA | Pearson or FASTA sequence format | fasta
ClustalW | ClustalW alignment format without base/residue numbering | clustalw

Default value is: Pearson/FASTA [fasta]

Understanding ClustalW output:
Symbol | Definition | Meaning
* | asterisk | Conserved sequence (identical)
: | colon | Conservative mutation
. | period | Semi-conservative mutation
( ) | blank | Non-conservative mutation
- | dash | Gap
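The symbols in the consensus line can be reproduced for a protein alignment with a short Python sketch. The "strong" and "weak" residue groups below are the ones commonly documented for Clustal-format output; treat them as an assumption and check them against the documentation of the tool in use. Columns containing a gap are left blank here, and the function names are hypothetical.

```python
# Residue groups as commonly documented for Clustal consensus lines (assumption).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return the consensus symbol for one alignment column."""
    residues = {c.upper() for c in column}
    if "-" in residues:
        return " "  # gapped columns get no conservation mark
    if len(residues) == 1:
        return "*"  # identical in every sequence
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # conservative substitution
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # semi-conservative substitution
    return " "      # non-conservative substitution


def consensus_line(aligned):
    """Build the consensus line for equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(column_symbol(col) for col in zip(*aligned))


line = consensus_line(["MKTA-", "MRSCW", "MKASW"])
```

For the three example sequences, the columns are identical (M), conservative (K/R), conservative (T/S/A), semi-conservative (A/C/S), and gapped, in that order.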

Settings

Many settings affect how the MAFFT algorithm works, and adjusting them to suit the data set helps produce accurate and meaningful results. The most important settings to understand are the scoring matrix, the gap open penalty, and the gap extension penalty.
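For the command-line program, these three settings correspond to the `--bl` (BLOSUM matrix number), `--op` (gap opening penalty), and `--ep` (offset that acts like a gap extension penalty) options. The Python sketch below builds such a command; the wrapper functions are hypothetical, the values shown are the documented command-line defaults, and how the web form maps its fields onto these options is an assumption.

```python
import subprocess


def mafft_command(input_path, matrix=62, gap_open=1.53, gap_extend=0.123,
                  clustal_output=False):
    """Build the argument list for a MAFFT run with explicit scoring settings."""
    if matrix not in (30, 45, 62, 80):
        raise ValueError("--bl accepts 30, 45, 62 or 80")
    command = [
        "mafft",
        "--bl", str(matrix),      # scoring matrix (BLOSUM number)
        "--op", str(gap_open),    # gap opening penalty
        "--ep", str(gap_extend),  # offset acting as gap extension penalty
    ]
    if clustal_output:
        command.append("--clustalout")  # Clustal format instead of FASTA
    command.append(input_path)
    return command


def run_mafft(input_path, **settings):
    """Run MAFFT (must be installed and on PATH) and return the alignment text."""
    result = subprocess.run(
        mafft_command(input_path, **settings),
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # MAFFT writes the alignment to standard output


default_command = mafft_command("input.fasta")
```

Raising the gap opening penalty generally yields fewer gaps, while raising the extension value discourages long gaps; suitable values depend on how divergent the sequences are.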

Accuracy and Results

MAFFT is widely considered one of the most accurate and versatile tools for multiple sequence alignment. Comparative studies have found that it performs well against other popular programs such as ClustalW and T-Coffee, particularly for larger datasets and for highly divergent sequences.[16] For example, in a study comparing alignment programs on sequences of increasing length, MAFFT's FFT-NS-2 method was the fastest program for all tested sequence sizes. This speed is attributed to its use of the fast Fourier transform (FFT), which enables rapid alignment of even highly divergent sequences. With the FFT, the algorithm runs in either O(n^2) or O(n) time, depending on the data set. MAFFT requires less CPU time than other methods of the same or similar accuracy, notably T-Coffee, ClustalW, and Needleman-Wunsch.[2]
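How an FFT can speed up the search for matching regions can be shown with a simplified Python sketch. It encodes nucleotide sequences as four indicator vectors (one per base) and uses the FFT to count, for every relative shift (lag) at once, how many positions match. This is an illustration of FFT-based correlation, not MAFFT's code: the encoding is a simplifying assumption, and the pure-Python radix-2 FFT is included only to keep the example self-contained.

```python
import cmath

BASES = "ACGT"


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is unscaled: divide by len(values) afterwards.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts_by_lag(seq1, seq2):
    """Return {lag: number of n with seq1[n] == seq2[n + lag]} for all lags."""
    size = 1
    while size < len(seq1) + len(seq2):  # zero-pad to avoid circular wrap-around
        size *= 2
    spectrum = [0j] * size
    for base in BASES:
        x = [1.0 if c == base else 0.0 for c in seq1] + [0.0] * (size - len(seq1))
        y = [1.0 if c == base else 0.0 for c in seq2] + [0.0] * (size - len(seq2))
        fx, fy = fft(x), fft(y)
        # Cross-correlation theorem: the transform of c is conj(X) * Y.
        for i in range(size):
            spectrum[i] += fx[i].conjugate() * fy[i]
    correlation = fft(spectrum, inverse=True)
    return {
        lag: round(correlation[lag % size].real / size)
        for lag in range(-(len(seq1) - 1), len(seq2))
    }


counts = match_counts_by_lag("ACGTTGCA", "GGACGTTGCA")
best_lag = max(counts, key=counts.get)
```

Computing the same counts directly takes time proportional to the product of the two sequence lengths, whereas the FFT route takes O(N log N) for padded length N. In the example, the second sequence is the first with two extra bases in front, so the peak is at a lag of 2.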

Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences,[9] higher accuracy alignments,[17] alignment of non-coding RNA sequences,[18] and the addition of new sequences to existing alignments.[19]

MAFFT is distinguished from other popular programs such as ClustalW and T-Coffee by its accuracy, versatility, and range of features. It offers several alignment methods and strategies, including iterative refinement and consistency-based approaches, that further improve the accuracy and robustness of its alignments. As a result, MAFFT is widely used for multiple sequence alignment.[20]

See also

References

  1. ^ The base MAFFT software is distributed under the BSD license, while versions for Microsoft Windows are licensed under the GNU General Public License. Some distributions of MAFFT contain software licensed under other licenses. https://mafft.cbrc.jp/alignment/software/
  2. ^ a b c d Katoh, Kazutaka; Misawa, Kazuharu; Kuma, Kei-ichi; Miyata, Takashi (2002). "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform". Nucleic Acids Research. 30 (14): 3059–66. doi:10.1093/nar/gkf436. PMC 135756. PMID 12136088.
  3. ^ a b c d "MAFFT ver.7 - a multiple sequence alignment program". mafft.cbrc.jp. Retrieved 28 April 2021.
  4. ^ Katoh, K; Toh, H (2006). "PartTree: An algorithm to build an approximate tree from a large number of unaligned sequences". Bioinformatics. 23 (3): 372–4. doi:10.1093/bioinformatics/btl592. PMID 17118958.
  5. ^ Katoh, K; Kuma, K; Miyata, T; Toh, H (2005). "Improvement in the accuracy of multiple sequence alignment program MAFFT". Genome Informatics. International Conference on Genome Informatics. 16 (1): 22–33. PMID 16362903.
  6. ^ Katoh, Kazutaka; Toh, Hiroyuki (2008). "Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework". BMC Bioinformatics. 9: 212. doi:10.1186/1471-2105-9-212. PMC 2387179. PMID 18439255.
  7. ^ Katoh, Kazutaka; Frith, Martin C (2012). "Adding unaligned sequences into an existing alignment using MAFFT and LAST". Bioinformatics. 28 (23): 3144–6. doi:10.1093/bioinformatics/bts578. PMC 3516148. PMID 23023983.
  8. ^ The base MAFFT software is distributed under the BSD license, while versions for Microsoft Windows are licensed under the GNU General Public License. Some distributions of MAFFT contain software licensed under other licenses. https://mafft.cbrc.jp/alignment/software/
  9. ^ a b c d e f Katoh, K.; Standley, D. M. (April 2013). "MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability". Molecular Biology and Evolution. 30 (4): 772–780. doi:10.1093/molbev/mst010. PMC 3603318. PMID 23329690.
  10. ^ a b c Katoh, Kazutaka; Toh, Hiroyuki (July 2008). "Recent developments in the MAFFT multiple sequence alignment program". Briefings in Bioinformatics. 9 (4): 286–298. doi:10.1093/bib/bbn013. PMID 18372315.
  11. ^ a b "MAFFT Help and Documentation - Job Dispatcher Sequence Analysis Tools - EMBL-EBI". www.ebi.ac.uk. Retrieved 2023-04-24.
  12. ^ a b "Windows vs. Unix Line Endings". www.cs.toronto.edu. Retrieved 2023-04-27.
  13. ^ Pearson, William R. (October 2013). "Selecting the Right Similarity‐Scoring Matrix". Current Protocols in Bioinformatics. 43 (1): 3.5.1–3.5.9. doi:10.1002/0471250953.bi0305s43. PMC 3848038. PMID 24509512.
  14. ^ "ROSALIND | Glossary | Gap penalty".
  15. ^ Carroll, Hyrum; Clement, Mark; Ridge, Perry; Snell, Quinn (October 2006). "Effects of Gap Open and Gap Extension Penalties". Faculty Publications.
  16. ^ Edgar, Robert; Batzoglou, Serafim (June 2006). "Multiple sequence alignment". Current Opinion in Structural Biology. 16 (3): 368–373. doi:10.1016/j.sbi.2006.04.004. PMID 16679011.
  17. ^ Katoh, Kazutaka (2010-04-28). "Parallelization of the MAFFT multiple sequence alignment program". Bioinformatics. 26 (15): 1899–1900. doi:10.1093/bioinformatics/btq224. PMC 2905546. PMID 20427515.
  18. ^ Yamada, Kazunori (4 July 2016). "Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees". Bioinformatics. 32 (21): 3246–3251. doi:10.1093/bioinformatics/btw412. PMC 5079479. PMID 27378296.
  19. ^ Katoh, Kazutaka (27 September 2012). "Adding unaligned sequences into an existing alignment using MAFFT and LAST". Bioinformatics. 28 (23): 3144–3146. doi:10.1093/bioinformatics/bts578. PMC 3516148. PMID 23023983.
  20. ^ Edgar, R. C. (8 March 2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput". Nucleic Acids Research. 32 (5): 1792–1797. doi:10.1093/nar/gkh340. PMC 390337. PMID 15034147.