DIAMOND database: Best representatives / All sequences

The search against a DIAMOND database is used to assign a gene to the correct homology group prior to phylogenetic placement. The default database (best representatives of homology groups) contains sequences selected using k-means clustering that best represent the sequence diversity of each group. In most cases this works well and the correct homology group for a query gene is easily identified. The alternative option is to use the database containing all sequences from all homology groups, which may work better in some cases.

Initial DIAMOND sensitivity: Default / Ultra-sensitive

By default SHOOT performs a normal-sensitivity DIAMOND search. If this finds no hits, it is followed by a slower "ultra-sensitive" search against the database of all sequences. The alternative option is to perform the slower ultra-sensitive search immediately.
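
The fallback behaviour described above can be sketched roughly as follows. This is an illustration only, not SHOOT's actual code: the function names, the database paths and the injectable `run` callable are hypothetical, and the assumption that an immediate ultra-sensitive search uses the currently selected database is ours. The DIAMOND options shown (`blastp`, `-q`, `-d`, `--outfmt 6`, `--ultra-sensitive`) are standard DIAMOND options.

```python
import subprocess


def diamond_cmd(query, db, ultra_sensitive=False):
    """Build a DIAMOND blastp command that writes tabular hits to stdout."""
    cmd = ["diamond", "blastp", "-q", query, "-d", db, "--outfmt", "6"]
    if ultra_sensitive:
        cmd.append("--ultra-sensitive")
    return cmd


def run_diamond(cmd):
    """Run DIAMOND and return its non-empty output lines (one per hit)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]


def search(query, db, all_seqs_db, ultra_first=False, run=run_diamond):
    """Two-stage search (hypothetical helper).

    db          -- database selected by the user (assumed path)
    all_seqs_db -- database of all sequences, used for the fallback search
    ultra_first -- skip the normal-sensitivity search entirely
    run         -- callable that executes a command and returns hit lines
    """
    if ultra_first:
        # Assumption: the immediate ultra-sensitive search uses the selected db.
        return run(diamond_cmd(query, db, ultra_sensitive=True))
    hits = run(diamond_cmd(query, db))
    if hits:
        return hits
    # No hits at normal sensitivity: retry against all sequences.
    return run(diamond_cmd(query, all_seqs_db, ultra_sensitive=True))
```

Passing the runner in as an argument keeps the decision logic separate from the DIAMOND call itself, so the logic can be checked without DIAMOND installed.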

MAFFT Options

The alignment of the query sequence against the MSA of all the genes in its homology group is one of the most computationally expensive steps in a SHOOT query. For this reason MAFFT is used with options giving high performance: "--retree 1 --maxiterate 0 --nofft". You can instead choose for MAFFT to run using its default options, which is slower but may produce more accurate results.
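
As a rough illustration of the two settings, the sketch below builds a MAFFT command line for either mode. The fast options are the ones quoted above. The use of MAFFT's `--add` option to add the query to the existing MSA, the function name and the file names are assumptions made for this example and may differ from what SHOOT runs internally.

```python
import shlex

# Fast options quoted in the text above.
FAST_OPTIONS = ["--retree", "1", "--maxiterate", "0", "--nofft"]


def mafft_cmd(query_fasta, msa_fasta, fast=True):
    """Build a MAFFT command that adds a query sequence to an existing MSA.

    fast=True  -- use the high-performance options
    fast=False -- let MAFFT use its default options
    The --add usage is an assumption for illustration.
    """
    cmd = ["mafft"]
    if fast:
        cmd += FAST_OPTIONS
    cmd += ["--add", query_fasta, msa_fasta]
    return cmd


if __name__ == "__main__":
    # Print both variants as shell-ready strings (file names are hypothetical).
    print(shlex.join(mafft_cmd("query.fa", "group_msa.fa")))
    print(shlex.join(mafft_cmd("query.fa", "group_msa.fa", fast=False)))
```

MAFFT writes the resulting alignment to standard output, so in practice the command would be run with its output redirected to a file.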

Large Trees

The largest gene trees in the databases (>2500 genes) have been split into sub-trees, since phylogenetic placement into these large trees can be very slow. This affects at most the 40 largest trees in any database. For these trees DIAMOND is used to assign the query sequence to a sub-tree, and phylogenetic methods (MSA & tree inference) are used to place the gene in its correct position within the sub-tree. The sub-tree is then grafted back into the original super-tree, which is returned to the user, thus giving a view of the gene in the complete gene family at lower computational cost. There is, however, a risk of an incorrect placement using this method, because a DIAMOND search is not as accurate as phylogenetic methods. In the worst case the best hit could be in a different sub-tree within this large homologous family of genes. The subsequent phylogenetic analysis will then be incorrect, because an attempt has been made to place the gene in the wrong sub-tree. The option 'Phylogenetics on full tree' uses DIAMOND to assign the sequence only to the level of the full tree, and phylogenetic methods are used to determine its position within the tree. While this method is less risky, it could take 3-30 minutes or more.
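
The difference between the two modes comes down to which tree the best DIAMOND hit selects for phylogenetic placement. The sketch below illustrates that choice only. The function, the dictionary-based lookups and the tree identifiers are all hypothetical and are not taken from SHOOT's code.

```python
def placement_tree(best_hit, gene_to_subtree, gene_to_tree, full_tree=False):
    """Return the ID of the tree in which the query will be placed.

    best_hit        -- gene ID of the best DIAMOND hit
    gene_to_subtree -- gene ID -> sub-tree ID (only genes from split trees)
    gene_to_tree    -- gene ID -> full gene tree ID (all genes)
    full_tree       -- True for the 'Phylogenetics on full tree' option
    """
    if full_tree:
        # DIAMOND only decides the gene family; placement uses the whole tree.
        return gene_to_tree[best_hit]
    # Default: trust the best hit to pick the sub-tree. If the best hit lies
    # in the wrong sub-tree, the later phylogenetic placement cannot fix it.
    return gene_to_subtree.get(best_hit, gene_to_tree[best_hit])


# Hypothetical example: tree "T1" was split into two sub-trees, "T2" was not.
gene_to_tree = {"g1": "T1", "g2": "T1", "g3": "T2"}
gene_to_subtree = {"g1": "T1.a", "g2": "T1.b"}

print(placement_tree("g1", gene_to_subtree, gene_to_tree))
print(placement_tree("g1", gene_to_subtree, gene_to_tree, full_tree=True))
```

Genes from trees that were never split have no sub-tree entry, so both modes select the same tree for them.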