Getting started
I want a quick example
Warning
This section gives a copy-paste for experienced users who understand the meaning of all parameters. If you are not one of them, you may want to read the docs instead.
Note
We assume that you want to use RAxML-ng as the external ancestral reconstruction tool. If you want to use PhyML instead, change the command accordingly.
$ python ipk.py build -w workdir \
-r alignment.fasta \
-t tree.newick \
-m GTR \
-a 0.42 \
-k 10
Temporarily disabled in this version of IPK.
I want to read the docs
First, you must ensure that your data satisfies the assumptions that IPK makes. Then, you need to ensure that specific default parameters fit your case.
Here is a checklist for any phylo-k-mer-based analysis. This section will talk about all of these points in detail below.
-r: Reference sequences are short and aligned in a .fasta file-t: Reference tree is inferred with ML, rooted, and in accordance with the alignment-mor--ar-parameters: Evolutionary model set matches the one used to infer the tree-kand-o: The values fit the data type and size(optional)
-u: desired k-mer filtering rate is set.
Danger
Failing to meet any of these requirements (but the last one) will lead to meaningless results or incorrect operation.
Reference multiple alignment
Phylo-k-mers are applied for phylogenies inferred from either short genetic markers common in metabarcoding (such as 16S and others) or short viral genomes. Therefore, we do not know how phylo-k-mers in general and IPK in particular will work with data representing much longer sequences, e.g., hundreds of kilobases.
-r or --refalign define the path to the reference sequences. Your reference sequences should be aligned
and stored in fasta format.
Note
Depending on the gap percentage of your data, you may want to change the default value
of --reduction-ratio (see command line reference for detail).
Reference phylogenetic tree
The reference phylogenetic tree provided with -t or --reftree should satisfy
the following criteria:
1. The tips of the tree are in one-to-one correspondance with the reference sequences defined
by -r.
2. It was inferred with a maximum likelihood method. You can use PhyML, RAxML-ng, or IQ-TREE for these purposes.
Tip
If the tree is too large for ML inference, you can infer it with any other method. Then, keep its topology but reoptimize the branch lengths with maximum likelihood.
The tree should be rooted.
Ancestral reconstruction
Ancestral reconstruction is a required step of phylo-k-mer computation.
To complete it, IPK runs either PhyML or RAxML-ng.
By default, it searches for raxml-ng in your PATH. Should it be absent
(or you want to use a specific version of the software),
you must provide its path via -b or --ar.
Besides that, you must provide the evolutionary model used to infer the tree, as long as necessary parameters for that model. There are two ways of doing it.
Passing parameters via command line
Pass the evolutionary model via -m or --model.
We follow the RAxML-ng’s naming notation, i.e., it must be one of the
following ones:
Data type |
Model name |
DNA |
JC, K80, F81, HKY, TN93ef, TN93, K81, K81uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1, TIM1uf, TIM2, TIM2uf, TIM3, TIM3uf, TVMef, TVM, SYM, GTR |
Proteins |
Blosum62, cpREV, Dayhoff, DCMut, DEN, FLU, HIVb, HIVw, JTT, JTT-DCMut, LG, mtART, mtMAM, mtREV, mtZOA, PMB, rtREV, stmtREV, VT, WAG, LG4M, LG4X, PROTGTR |
Command line supports among-site rate heterogeneity parameters for the GAMMA model:
-a (--alpha) and -c (--categories). Those should be
set accordingly to their values. In case another model was used, see the next section
to pass the parameters via config file.
Passing parameters via config file
Alternatively, you can create a .json-formatted config file to pass arbitrary arguments to the ancestral reconstruction tool. For example, let us say the tree was inferred with RAxML-ng under the GTR model as follows:
$ raxml-ng --all --msa msa.fasta --model GTR+G4+FC --tree pars{20} --bs-trees 200
resulting in the best model of
$ cat msa.fasta.raxml.bestModel
``GTR{1.229596/5.824854/2.865999/2.632363/9.525945/1.000000}+FC+G4m{0.492429}, noname = 1-10262``
Then, the following config file should be created and passed via --ar-config to
IPK:
{
"arguments": {
"data-type": "DNA",
"model": "GTR{1.229596/5.824854/2.865999/2.632363/9.525945/1.000000}+FC+G4m{0.492429}",
"opt-branches": "on",
"opt-model": "on",
"blopt": "nr_safe"
}
}
Note
The evolutionary model should be reoptimized after the extention of the tree done by IPK. This
is why we set opt-model: on in this example.
K-mer size and score threshold
K-mer size and score threshold parameter (-k and --omega)
have tremendous performance impact on IPK and tools using
resulting phylo-k-mers, both in running time and memory. It is important to
understand what those parameter mean.
K-mer size
K-mer size determines how long considered k-mers are, or, in other words,
how many hypothetical strings we consider. The number of considered k-mers, as well as
running time of IPK, grows exponentially with k. For these reasons,
working values of k should be small (see discussion below).
Score threshold
Score threshold determines how low probability of a considered k-mer can be, or how picky we are considering k-mers. All k-mers that get a score below the specific threshold value are excluded from consideration. The threshold is computed according to the following formula:
where \(\omega\) is omega, and \(\sigma\) is the alphabet size (4 for DNA).
Thus, increasing omega leads to excluding low-scored k-mers from consideration.
Note
Note that the number of k-mers considered decreases non-linearly with increasing omega.
Recommendations
You may need to experiment with tweaking these parameters for your data. The instruction below may be helpful.
-k:1. For short genetic markers (<2Kbp), leave it to be default 10. If the tree is large (e.g., >10K taxa), the analysis may be too slow. Then, the value of 8 can be an acceptable trade-off between accuracy and speed.
2. For long viral genomes (>10Kb), use the value of 12. If it takes too long, consider using higher values of
--omega.For any kind of data, do not use values lower than 8.
-o(--omega):For large trees, use the default value of 1.5.
For small trees (<1K taxa), consider using values 1.25 and 1.0 for higher accuracy.
Temporarily disabled in this version of IPK.