Trimming algorithms
Manual methods
Custom columns
This algorithm removes a set of columns specified by the user. The columns to remove are given as individual column numbers separated by commas, and/or as blocks of consecutive columns given by the first and last column numbers separated by a hyphen. The general form is:
-selectcols { n,l,m-k }
where n and l are interpreted as single column numbers, while m-k is a range of columns (from column m to column k, both included) to be deleted. Note that column numbering starts from 0. For instance, the command:
-selectcols { 2,7,20-25,80-100 }
will remove columns 2 and 7, along with two blocks of columns ranging from column 20 to 25 and 80 to 100, respectively.
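As an illustration of how such a column specification expands, here is a minimal Python sketch. The function names parse_column_spec and remove_columns are hypothetical helpers written for this example and are not part of trimAl; the sketch only mirrors the rules stated above (0-based numbering, inclusive ranges).

```python
def parse_column_spec(spec: str) -> list[int]:
    """Expand a spec such as "2,7,20-25" into sorted 0-based column indices."""
    columns = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # A block "m-k" covers columns m..k, both included.
            first, last = (int(part) for part in token.split("-"))
            if first > last:
                raise ValueError(f"invalid range: {token}")
            columns.update(range(first, last + 1))
        else:
            columns.add(int(token))
    return sorted(columns)


def remove_columns(sequences: list[str], spec: str) -> list[str]:
    """Return the aligned sequences with the listed columns removed."""
    drop = set(parse_column_spec(spec))
    return ["".join(ch for i, ch in enumerate(seq) if i not in drop)
            for seq in sequences]
```

For example, parse_column_spec("2,7,20-25") yields columns 2 and 7 plus every column from 20 to 25.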
Threshold-based trimming
The user can choose to remove all columns that do not meet a specified threshold or a combination of thresholds. The gap threshold (-gt) and similarity threshold (-st) represent the minimum values of the respective scores explained above and can be used individually or in combination. Like the scores they refer to, both thresholds range from 0 to 1.
trimAl provides two shortcuts to commonly used thresholds: -nogaps (equivalent to -gt 1), which deletes all columns with at least one gap, and -noallgaps, which removes columns composed solely of gaps.
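The relationship between -gt, -nogaps and -noallgaps can be sketched in Python as follows. This is an illustrative sketch, not trimAl's implementation: it assumes the gap score of a column is the fraction of sequences without a gap in that column (which is consistent with -nogaps being equivalent to -gt 1), and all function names are hypothetical.

```python
GAP = "-"


def gap_scores(sequences: list[str]) -> list[float]:
    """Assumed gap score: fraction of sequences WITHOUT a gap in each column."""
    n = len(sequences)
    length = len(sequences[0])
    return [sum(seq[c] != GAP for seq in sequences) / n for c in range(length)]


def columns_passing_gap_threshold(sequences: list[str], gt: float) -> list[int]:
    """Columns whose gap score is at least gt (gt = 1 mimics -nogaps)."""
    return [c for c, score in enumerate(gap_scores(sequences)) if score >= gt]


def columns_not_all_gaps(sequences: list[str]) -> list[int]:
    """Columns that contain at least one residue (mimics -noallgaps)."""
    return [c for c, score in enumerate(gap_scores(sequences)) if score > 0]
```

With the toy alignment ["AC-T", "A--T", "AG-T"], a threshold of 1 keeps only the gap-free columns 0 and 3, while the all-gaps filter only drops column 2.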
In addition, the user can set a conservation threshold (-cons), indicating the minimum percentage of columns from the input alignment that should be retained in the trimmed alignment. This threshold is defined between 0 and 100 and takes precedence over all other thresholds. If any other threshold would result in a trimmed alignment with fewer columns than specified by the conservation threshold, trimAl adds more columns to meet the conservation threshold. These columns are added based on their scores, with a preference for columns with higher scores. In the case of equal scores, columns adjacent to already selected column-blocks and closer to the center of the alignment are added first, prioritizing the extension of longer and central blocks.
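The way the conservation threshold tops up a trimmed alignment can be sketched as below. This is a simplified illustration under stated assumptions: the required column count is rounded up, ties are broken only by distance to the alignment center (the preference for columns adjacent to already selected blocks is omitted), and enforce_min_conservation is a hypothetical name.

```python
import math


def enforce_min_conservation(scores: list[float], kept: list[int],
                             cons_percent: float) -> list[int]:
    """Add the best-scoring discarded columns until cons_percent is reached."""
    total = len(scores)
    required = math.ceil(total * cons_percent / 100)  # rounding up is an assumption
    selected = set(kept)
    center = (total - 1) / 2
    # Higher scores first; among equal scores, columns nearer the center first.
    candidates = sorted((c for c in range(total) if c not in selected),
                        key=lambda c: (-scores[c], abs(c - center)))
    for c in candidates:
        if len(selected) >= required:
            break
        selected.add(c)
    return sorted(selected)
```

If the other thresholds already keep enough columns, the function returns the selection unchanged.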
When provided with a set of multiple sequence alignments, trimAl calculates a consistency score for each alignment in the set and selects the alignment with the highest score. The chosen alignment can then be trimmed with various methods, one of which removes columns that are less consistent with the other alignments. To do so, the user sets the -ct parameter to the minimum consistency score, within the range of 0 to 1. Any column that does not reach this value is removed. The conservation threshold, as explained previously, can also be applied here. Moreover, consistency-based trimming can be combined with the gap and/or similarity thresholds.
Overlap trimming
trimAl can effectively eliminate poorly aligned or incomplete sequences while considering the entire multiple sequence alignment (MSA). To achieve this, users need to specify two thresholds:
Residue Overlap Threshold (-resoverlap): This threshold corresponds to the minimum residue overlap score required for each residue.
Sequence Overlap Threshold (-seqoverlap): This threshold establishes the minimum percentage of residues within each sequence that must surpass the residue overlap threshold to retain the sequence in the new alignment. Sequences failing to meet this criterion will be excluded from the final alignment.
Additionally, columns exclusively filled with gaps in the new alignment will be systematically removed.
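The two-threshold logic can be sketched in Python as follows. The residue overlap score is not defined in this section, so the sketch assumes a simple stand-in: for a residue in a given column, the fraction of the other sequences that also have a residue (not a gap) in that column. The function name trim_by_overlap is hypothetical, and the sketch takes the sequence overlap threshold as a percentage of each sequence's residues.

```python
GAP = "-"


def trim_by_overlap(sequences: list[str], resoverlap: float,
                    seqoverlap: float) -> list[str]:
    """Drop sequences with too few well-overlapping residues, then all-gap columns."""
    n = len(sequences)
    if n < 2:
        return list(sequences)
    length = len(sequences[0])
    residues_per_column = [sum(seq[c] != GAP for seq in sequences)
                           for c in range(length)]
    kept = []
    for seq in sequences:
        residue_cols = [c for c in range(length) if seq[c] != GAP]
        # Assumed overlap score: share of the OTHER sequences with a residue here.
        passing = sum(1 for c in residue_cols
                      if (residues_per_column[c] - 1) / (n - 1) >= resoverlap)
        if residue_cols and 100 * passing / len(residue_cols) >= seqoverlap:
            kept.append(seq)
    # Remove columns that contain only gaps in the remaining sequences.
    keep_cols = [c for c in range(length) if any(seq[c] != GAP for seq in kept)]
    return ["".join(seq[c] for c in keep_cols) for seq in kept]
```

In the toy alignment used in the test, the last sequence shares no columns with the others, so it is dropped and the two columns it occupied alone disappear with it.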
Automated methods
Gappyout method
This method relies on the gap distribution within the multiple sequence alignment (MSA). Initially, the method calculates gap scores for all columns and sorts the columns by this score, generating a plot of potential gap score thresholds versus the percentage of the alignment below each threshold (see Fig. 2). In the subsequent step, for every set of three consecutive points on this plot, trimAl computes the slope between the first and third point, represented by blue lines. After comparing all slopes, trimAl identifies the point with the maximal variation between consecutive slopes, indicated by a vertical red line in Fig. 2.
After determining a gap score cut-off point, trimAl removes all columns that do not meet this specified value (see Fig. 3). In practical terms, this method effectively identifies the bimodal distribution of gap scores (columns rich in gaps and columns with fewer gaps) within an alignment. Subsequently, it eliminates the mode associated with a higher concentration of gaps. Our benchmarks indicate that this method efficiently eliminates a significant portion of poorly aligned regions.
Strict method
This method combines gappyout trimming with subsequent trimming based on an automatically selected similarity threshold. To determine the similarity threshold, trimAl utilizes the residue similarity scores distribution from the multiple sequence alignment (MSA). This distribution is transformed to a logarithmic scale (refer to Fig. 4), and the residue similarity cutoff is selected as explained below.
From this similarity distribution, trimAl selects the values at percentiles 20 and 80 of the alignment length (vertical blue lines in Fig. 4). The residue similarity threshold (vertical red line in Fig. 4) is computed as follows:
\[ \begin{align}\begin{aligned}P_{20} = \log(\text{Simvalue}_{20})\\P_{80} = \log(\text{Simvalue}_{80})\\SimThreshold = \left(P_{80} + \frac{{P_{20} - P_{80}}}{10} \right)^{10}\end{aligned}\end{align} \]
This process is equivalent to establishing upper and lower boundaries for the threshold at percentiles 20 and 80, respectively, of the similarity score distribution in that alignment. The threshold is then placed between these two boundaries on the logarithmic scale, at one tenth of the distance from the lower boundary (the similarity at P80) toward the upper boundary (the similarity at P20).
This method of setting the similarity threshold performed best in our benchmarks. The lower and upper boundaries ensure that the 20% most conserved columns in the alignment are preserved, while the 20% most dissimilar columns are discarded.
The specific similarity threshold will lie between these boundaries depending on the distribution of similarity scores in the alignments. Alignments with steep similarity score curves and significant differences between the most similar and dissimilar columns will set more columns below the threshold. Conversely, alignments with more columns having scores similar to the most-conserved fraction will apply more relaxed cutoffs. However, the removal of a specific column will depend on its context.
Once trimAl has calculated the residue similarity cutoff, the following steps are taken:
The gappyout method is applied to identify columns that would be deleted with that method.
Residues below the similarity cutoff are marked.
After applying these filters, trimAl recovers (unmarks) columns that have not passed the gap and/or similarity thresholds but where three of the four most immediate neighboring columns (two on each side) have passed them.
Finally, trimAl removes all columns that do not fall within a block of at least five consecutive columns unmarked for deletion.
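The last two steps (neighbor-based recovery and the minimum block size) can be sketched in Python as below. This is an illustrative sketch rather than trimAl's code: it assumes the neighbor check uses the marks as they stood before any recovery, that columns near the alignment ends simply have fewer neighbors, and recover_and_filter_blocks is a hypothetical name.

```python
def recover_and_filter_blocks(passed: list[bool], min_block: int = 5) -> list[bool]:
    """passed[c] is True if column c passed the gap and similarity filters."""
    n = len(passed)
    recovered = list(passed)
    for c in range(n):
        if passed[c]:
            continue
        # Two neighbors on each side, evaluated on the original marks (assumption).
        neighbours = [passed[j] for j in (c - 2, c - 1, c + 1, c + 2) if 0 <= j < n]
        if sum(neighbours) >= 3:
            recovered[c] = True
    # Keep only runs of at least min_block consecutive surviving columns.
    keep = [False] * n
    start = None
    for c in range(n + 1):
        if c < n and recovered[c]:
            if start is None:
                start = c
        else:
            if start is not None and c - start >= min_block:
                for j in range(start, c):
                    keep[j] = True
            start = None
    return keep
```

In the test below, column 2 is recovered because all four of its neighbors passed, which completes a block of five columns; the trailing block of three is too short and is removed.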
Strictplus method
This approach is very similar to the strict method. However, the final step of the algorithm is slightly different. In this case, the block size is defined as 1% of the alignment size with a minimum value of 3 and a maximum size of 12.
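The block-size rule can be written as a one-line clamp. In this sketch, strictplus_block_size is a hypothetical name, and truncating 1% of the length to an integer is an assumption, since the rounding rule is not stated here.

```python
def strictplus_block_size(alignment_length: int) -> int:
    """1% of the alignment length, clamped to the range 3..12."""
    one_percent = alignment_length // 100  # truncation is an assumption
    return max(3, min(12, one_percent))
```

Under this rule, alignments shorter than 400 columns use the minimum block size of 3, and alignments of 1200 columns or more use the maximum of 12.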
This method is optimized for neighbor joining phylogenetic tree reconstruction.
Automated1 method
Based on our own benchmarks with simulated alignments (see benchmarking) we have designed a heuristic approach, denoted as automated1, to determine the optimal automatic method for trimming a given alignment. This heuristic is specifically fine-tuned for trimming alignments intended for maximum likelihood phylogenetic analyses.
Using a decision tree (Fig. 3), this heuristic dynamically selects between the gappyout and strict methods. In making this choice, trimAl considers the average identity score among all the sequences in the alignment, the average identity score for each most similar pair of sequences in the alignment, and the number of sequences in the alignment. We have observed that all these variables were important in deciding which method would provide the highest improvement on a given alignment.