Welcome to the BTE documentation!

BTE (Big Tree Explorer) is a Python extension for analysis and traversal of extremely large phylogenetic trees. It's based on the Mutation Annotated Tree (MAT) library which underlies UShER and matUtils, premier software for pandemic-scale phylogenetics.

BTE is intended as a replacement for ETE3, Biopython.Phylo, and similar Python phylogenetics packages when working with Mutation Annotated Trees (MATs). Using standard packages with MATs requires converting the tree to Newick and maintaining the mutation annotations in a separate data structure, which is inconvenient and slows both development and runtime. BTE streamlines this process by exposing the heavily optimized MAT library underlying UShER and matUtils to Python, allowing efficient and convenient use of MATs in a Python development environment.

UCSC maintains a repository, updated daily, containing the latest complete publicly available global SARS-CoV-2 phylogenetic tree in MAT protobuf format:

http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/

To try out this tool, download the latest tree, build or install the extension, and jump straight to your Python analysis!

Installation

On Mac or Linux, you can install BTE by running the following command:

conda install -c bioconda bte

See the README if you encounter installation difficulties or need to build a local version of the extension.

Basic Logic

BTE contains two primary classes, the MATree and the MATNode.

The MATree class represents a single mutation-annotated phylogenetic tree. It can be created from a MAT protobuf (.pb) file, from a newick and a vcf together, or from an Auspice-format JSON like that used by Nextstrain. The MATree class has many useful functions for manipulation, subsetting, traversal, and summarization of the tree it represents.
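As a rough sketch of the three loading routes described above, the helper below picks MATree constructor keywords from a file name. The keyword names (pb_file, nwk_file, vcf_file, json_file) come from the MATree signature in the API section; the helper names and the file extensions it recognizes (.nwk, .newick, .json) are assumptions made for illustration.

```python
from pathlib import Path


def matree_kwargs(path, vcf=None):
    """Choose MATree constructor keywords from a file name (hypothetical helper)."""
    name = Path(path).name.lower()
    if name.endswith((".pb", ".pb.gz")):
        return {"pb_file": str(path)}
    if name.endswith(".json"):
        return {"json_file": str(path)}
    if name.endswith((".nwk", ".newick")):  # assumed Newick extensions
        kwargs = {"nwk_file": str(path)}
        if vcf is not None:
            kwargs["vcf_file"] = str(vcf)
        return kwargs
    raise ValueError(f"unrecognized tree file extension: {path}")


def load_tree(path, vcf=None):
    """Build a MATree from a file; requires BTE to be installed."""
    import bte  # imported lazily so matree_kwargs works without BTE

    return bte.MATree(**matree_kwargs(path, vcf))
```

With BTE installed, load_tree("tree.pb.gz") would then be equivalent to calling bte.MATree(pb_file="tree.pb.gz").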

The MATNode class represents a single node member of the Mutation Annotated Tree. MATNode class attributes include the following:

node.id: the node's unique identifier string

node.level: the level of the node in the tree (root is 0)

node.parent: the parent node of the node

node.children: a list of child nodes

node.mutations: a list of mutations associated with the node

node.annotations: a list of annotations associated with the node
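A minimal sketch of reading these attributes follows. The describe_node helper is hypothetical, and the SimpleNamespace object is only a stand-in with the same attribute names so that the example runs without BTE; with a real tree, the node would come from MATree.get_node().

```python
from types import SimpleNamespace


def describe_node(node):
    """Summarize a node using only the attributes listed above."""
    return {
        "id": node.id,
        "level": node.level,
        "n_children": len(node.children),
        "n_mutations": len(node.mutations),
        "annotations": list(node.annotations),
    }


# Stand-in for a MATNode (not a real BTE object).
leaf = SimpleNamespace(
    id="sample_1", level=3, children=[], mutations=["A123G"], annotations=[]
)
summary = describe_node(leaf)
```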

Additionally, BTE represents amino acid changes with the AAChange class, whose instances are generated by the MATree.translate() method. This class has attributes for the original and alternative nucleotides, amino acids, and codons. A string representation of an amino acid change, such as "S:N501Y", can be generated with AAChange.aa_string().

The full API documentation follows.

BTE API Documentation


class bte.AAChange(gene, aa, nuc, codons)¶

Class container for amino acid translation information. Generated by MATree.translate().

aa_string(self)¶: Return a string representation of the mutation in the form "gene:refindexalt" e.g. "S:D614G".

is_synonymous(self)¶: Return True if the mutation is synonymous.
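The "gene:refindexalt" format returned by aa_string() can be split back into its parts with a regular expression. This is a pure-Python sketch that does not use BTE; the function names are hypothetical, and accepting "*" as a stop-codon symbol is an assumption.

```python
import re

# gene name, reference residue, 1-based index, alternative residue
AA_PATTERN = re.compile(
    r"^(?P<gene>[^:]+):(?P<ref>[A-Z*])(?P<index>\d+)(?P<alt>[A-Z*])$"
)


def parse_aa_string(text):
    """Split a string such as "S:D614G" into (gene, ref, index, alt)."""
    match = AA_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in gene:refindexalt form: {text!r}")
    return match["gene"], match["ref"], int(match["index"]), match["alt"]


def looks_synonymous(text):
    """True when the reference and alternative residues are identical."""
    _, ref, _, alt = parse_aa_string(text)
    return ref == alt
```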

class bte.MATNode(MATree tree: MATree = None, parent: Optional[str] = None, unicode identifier: str = u'', list mutations: list = [], list annotations: list = [], double branch_length: float = 0.0)¶

A wrapper around the MAT node class. Has an identifier, mutations, parent, and child attributes.

get_mutation_information(self) → list[dict[str, str]]¶

Return full attribute information for each mutation associated with this node, including chromosome, location, parent, reference, and alternative nucleotides.

Returns:: list[dict[str,str]]: List of dictionaries containing keyed attribute information.

is_leaf(self): Returns True if the node is a leaf node.

most_recent_annotation(self) → list[str]¶

Find the most recent clade annotations for the node in the node's ancestry.

Returns:: A list of annotation strings.

set_branch_length(self, double blen: float)¶

Set the branch length for this node to the input float value.

Args:: blen (float): Set the branch length to this value.

update_mutations(self, mutation_list: list[str], bool update_branch_length: bool = True)¶

Take a list of mutations as strings and replace any currently stored mutations on this branch with the new set. Mutation strings should be formatted as chro:reflocalt e.g. chr1:A234G. If chromosome is left off, assumes SARS-CoV-2 chromosome.

Args:

mutation_list (list[str]): List of mutations to store.

update_branch_length (bool): Update the branch length attribute to the new count of mutations. Default is True.
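The chro:reflocalt mutation format can be checked before strings are passed to update_mutations. The parser below is an illustrative sketch that does not call BTE; its name is hypothetical, and restricting bases to A, C, G, and T is an assumption.

```python
import re

# optional "chromosome:" prefix, reference base, location, alternative base
MUTATION_PATTERN = re.compile(
    r"^(?:(?P<chrom>[^:]+):)?(?P<ref>[ACGT])(?P<loc>\d+)(?P<alt>[ACGT])$"
)


def parse_mutation(text):
    """Split "chr1:A234G" or "A234G" into (chromosome, ref, location, alt).

    The chromosome is None when the prefix is left off.
    """
    match = MUTATION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in chro:reflocalt form: {text!r}")
    return match["chrom"], match["ref"], int(match["loc"]), match["alt"]
```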

class bte.MATree(pb_file: Optional[str] = None, bool uncondense: bool = True, nwk_file: Optional[str] = None, nwk_string: Optional[str] = None, vcf_file: Optional[str] = None, json_file: Optional[str] = None) → None¶

A wrapper around the MAT Tree class. Includes functions to save and load from parsimony .pb files or a newick. Includes numerous functions for tree traversal, including breadth-first, depth-first, and leaf-to-root traversal. Also includes numerous functions for subtree selection by choosing leaves that match regex patterns, contain specific mutations, or belong to a specific clade or lineage.

LCA(self, list node_ids: list) → str¶

Find the last common ancestor of the input node IDs.

Args:: node_ids (list[str]): A list of node IDs in string format. Must have a length of at least 2.
Returns:: str: The node_id of the last common ancestor.
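To illustrate the idea of a last common ancestor, the sketch below finds one on a toy tree stored as a child-to-parent dictionary. It is not BTE's implementation; the function names and the toy tree are assumptions made for illustration.

```python
def ancestors(parents, node_id):
    """Path from node_id up to the root, including node_id itself."""
    path = [node_id]
    while parents.get(path[-1]) is not None:
        path.append(parents[path[-1]])
    return path


def last_common_ancestor(parents, node_ids):
    """Deepest node that lies on every input node's path to the root."""
    if len(node_ids) < 2:
        raise ValueError("need at least two node IDs")
    shared = set(ancestors(parents, node_ids[0]))
    for nid in node_ids[1:]:
        shared &= set(ancestors(parents, nid))
    # Walking up from the first node, the first shared node is the deepest.
    for candidate in ancestors(parents, node_ids[0]):
        if candidate in shared:
            return candidate
    raise ValueError("nodes do not share a root")


# Toy tree: root -> (n1 -> (a, b), n2 -> (c))
parents = {"root": None, "n1": "root", "n2": "root", "a": "n1", "b": "n1", "c": "n2"}
```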

apply_mutations(self, mmap: dict[str, list[str]], bool update_branch_length: bool = True) → None¶

Pass a set of node:mutation mappings to place into the tree. Current mutations will be replaced.

Args:

mmap (dict[str,list[str]]): A dictionary of node:mutation list mappings in the form {"node_id":["chro:reflocalt","chro:reflocalt"]} (e.g. {"node_1":["chro1:A123G","chro3:T315G"]}).

update_branch_length (bool): Update the branch length to match the new count of mutations on each node. Defaults to True.

apply_node_annotations(self, annotations: dict[str, list[str]]) → None¶

Apply annotations to the tree. Replaces any annotations on nodes affected.

Args:: annotations (dict[str,list[str]]): A dictionary of annotations to apply to the tree, keyed on node_id with a list of annotation strings as the value.

breadth_first_expansion(self, unicode nid: str = u'', bool reverse: bool = False) → list[MATNode]¶

Perform a level order (breadth-first) expansion starting from the indicated node. Use reverse to traverse in reverse level order (all leaves, then all leaf parents, back to root) instead.

Args:

nid (str, optional): Node to begin the traversal at. Defaults to the root.

reverse (bool, optional): Perform the traversal in reverse. Defaults to False.

Returns:

list[MATNode]: List of MATNode wrappers representing the nodes in the traversal.
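One possible use of a level-order traversal is to count how many nodes sit at each depth. The sketch below relies only on breadth_first_expansion() and the node.level attribute described in this documentation; nodes_per_level is a hypothetical helper and ToyTree is a stand-in so the example runs without BTE.

```python
from collections import Counter
from types import SimpleNamespace


def nodes_per_level(tree):
    """Count nodes at each level using a breadth-first expansion."""
    return Counter(node.level for node in tree.breadth_first_expansion())


class ToyTree:
    """Stand-in exposing the same method name as MATree (not a real tree)."""

    def breadth_first_expansion(self, nid="", reverse=False):
        layout = [("root", 0), ("n1", 1), ("n2", 1), ("a", 2)]
        nodes = [SimpleNamespace(id=i, level=lvl) for i, lvl in layout]
        return nodes[::-1] if reverse else nodes


level_counts = nodes_per_level(ToyTree())
```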

clear(self) → None¶: Call this function to explicitly deallocate all tree memory. Use when the tree object is no longer necessary and high memory use is becoming problematic. Automatically called on garbage collection.

compute_nucleotide_diversity(self) → float¶

Compute the nucleotide diversity of the tree, defined as the mean number of pairwise nucleotide differences between any two leaves of the tree. The result is an unbiased estimator: the final mean is multiplied by n / (n - 1), where n is the total number of sequences.

Raises:: Exception: Can't be computed on an empty tree.
Returns:: float: The estimated nucleotide diversity.
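The estimator can be illustrated on a handful of haplotypes written as sets of reflocalt mutation strings. This sketch is not BTE's implementation: it assumes the mean is taken over all ordered pairs of leaves (including each leaf paired with itself) before the n / (n - 1) correction, and all function names are hypothetical.

```python
from itertools import combinations


def to_sites(haplotype):
    """Map position -> alternative base for mutations such as "A123G"."""
    return {int(m[1:-1]): m[-1] for m in haplotype}


def pairwise_differences(hap_a, hap_b):
    """Number of positions at which two haplotypes carry different bases."""
    a, b = to_sites(hap_a), to_sites(hap_b)
    return sum(1 for pos in a.keys() | b.keys() if a.get(pos) != b.get(pos))


def nucleotide_diversity(haplotypes):
    """Mean pairwise difference, scaled by n / (n - 1)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    # Each unordered pair appears twice among ordered pairs; self-pairs add 0.
    total = 2 * sum(pairwise_differences(x, y) for x, y in combinations(haplotypes, 2))
    mean_all_pairs = total / (n * n)
    return mean_all_pairs * n / (n - 1)


pi = nucleotide_diversity([{"A1G"}, {"A1G", "C5T"}, set()])
```

Under that assumption the correction makes the result equal to the plain average over distinct pairs of leaves.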

count_clades_inclusive(self, unicode subroot: str = u'') → dict[str, int]¶

Count the total number of leaves belonging to each clade on the subtree. Counts are inclusive (e.g. samples belonging to a clade descended from another clade will count for the ancestor clade as well). By default, counts across the whole tree.

Args:: subroot (str, optional): Count members of clades descended from this node. Defaults to the root.
Returns:: dict[str,int]: Dictionary containing clade counts.
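The returned dictionary can be ranked to find the largest clades. In this sketch, largest_clades is a hypothetical helper, and ToyTree with its counts is invented purely so the example runs without BTE.

```python
def largest_clades(tree, n=3, subroot=""):
    """Return the n clades with the most leaves as (clade, count) pairs."""
    counts = tree.count_clades_inclusive(subroot)
    # Sort by descending count, breaking ties alphabetically by clade name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


class ToyTree:
    """Stand-in exposing the same method name as MATree (made-up counts)."""

    def count_clades_inclusive(self, subroot=""):
        return {"B.1": 10, "B.1.1.7": 6, "A": 2, "B.1.617.2": 6}


top = largest_clades(ToyTree(), n=2)
```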

count_haplotypes(self) → dict[tuple, int]¶

Count unique haplotypes from the tree.

Returns:: dict[tuple,int]: haplotype counts.

count_leaves(self, subroot: Optional[str] = None) → int¶

Return the number of leaves descended from the indicated node. By default, counts all leaves on the tree.

Args:: subroot (Optional[str], optional): Count leaves descended from the indicated node. Defaults to the root.
Returns:: int: The count of leaves.

count_mutation_types(self, subroot: Optional[str] = None) → dict[str, int]¶

Compute the counts of individual mutation types across the tree. If a subtree root is indicated, it only counts mutations descended from that node. By default, this counts across the entire tree.

Args:: subroot (Optional[str], optional): Count mutations descended from the indicated node. Defaults to the root.
Returns:: dict[str,int]: Dictionary containing mutation counts.

create_node(self, unicode identifier: str, unicode parent_id: str, mutations: list[str] = [], annotations: list[str] = [], double branch_length: float = 0.0)¶

Create a new node and place it in the tree without generating a wrapper. This does not return a MATNode object, so access to the created node will require a subsequent get_node call or using the MATNode constructor method to add the node to the tree.

Args:

identifier (str): The identifier of the new node.

parent_id (str): The identifier of the parent node.

mutations (list[str]): A list of mutations to add to the new node. Mutation strings should be formatted as chro:reflocalt e.g. chr1:A234G. If chromosome is left off, assumes SARS-CoV-2 chromosome.

annotations (list[str]): A list of annotations to add to the new node. Currently limited to two or fewer.

branch_length (float): The length of the branch between the new node and its parent. If not specified, the branch length is equal to the number of mutations.

depth_first_expansion(self, nid: Optional[str] = None, bool reverse: bool = False) → list[MATNode]¶

Perform a preorder (depth-first) expansion of the tree, starting from the indicated node. By default, traverses the whole tree. Set reverse to true to traverse in postorder (reverse depth-first) instead.

Args:

nid (Optional[str], optional): Node to begin the traversal at. Defaults to the root.

reverse (bool, optional): Traverse in reverse order. Defaults to False.

Returns:

list[MATNode]: List of nodes in depth-first order.
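A postorder traversal visits children before their parents, which suits bottom-up accumulation. The sketch below counts the leaves beneath every node and relies only on that ordering property of reverse=True, plus the is_leaf(), id, and children members documented above. The helper names and the ToyTree stand-in (which simply reverses a preorder list) are assumptions for illustration.

```python
from types import SimpleNamespace


def leaves_below(tree):
    """Map each node ID to the number of leaves in its subtree."""
    counts = {}
    for node in tree.depth_first_expansion(reverse=True):
        if node.is_leaf():
            counts[node.id] = 1
        else:
            counts[node.id] = sum(counts[child.id] for child in node.children)
    return counts


def make_node(node_id, children=()):
    """Build a stand-in node with id, children, and is_leaf()."""
    children = list(children)
    return SimpleNamespace(id=node_id, children=children, is_leaf=lambda: not children)


class ToyTree:
    """Stand-in exposing the same method name as MATree (not a real tree)."""

    def __init__(self, root):
        self.root = root

    def depth_first_expansion(self, nid=None, reverse=False):
        order, stack = [], [self.root]
        while stack:
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(node.children))
        return order[::-1] if reverse else order


toy = ToyTree(make_node("root", [make_node("n1", [make_node("a"), make_node("b")]), make_node("c")]))
counts = leaves_below(toy)
```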

dump_node_annotations(self) → dict[str, list[str]]¶

Return a dictionary of internal node ids with corresponding annotation root labels. Formatted for compatibility with apply_node_annotations(). If a node is not included, it does not have any associated annotation roots.

Returns:: dict[str,list[str]]: A dictionary of nodes with the annotation roots they are associated with.

from_json(self, unicode jsonf: str) → None¶

Load a MAT from a JSON file compatible with the Auspice.us visualization web tool.

Args:: jsonf (str): Path to a json file.

from_newick(self, unicode nwk_file: str) → None¶

Load from a newick file only. The resulting tree will lack mutation information, preventing some functions from being applied.

Args:: nwk_file (str): Path to a text file containing a newick representation of the tree.

from_newick_and_vcf(self, unicode nwk: str, unicode vcf: str) → None¶

Load from a newick and a vcf. The vcf must contain sample entries (genotype columns) for every leaf in the newick.

Args:

nwk (str): Path to a text file containing the tree to load in Newick format.

vcf (str): Path to a text file containing leaf/sample genotype information in VCF format.

from_newick_string(self, unicode nwk: str) → None¶

Load from a Python string newick. The resulting tree will lack mutation information, preventing some functions from being applied.

Args:: nwk (str): A Python string containing a newick representation of the tree.

from_pb(self, unicode file: str, bool uncondense: bool = True) → None¶

Load from a protobuf into the initialized wrapper. Includes both tree and mutation information.

Args:

file (str): Path to a .pb or .pb.gz file.

uncondense (bool, optional): Uncondense the tree after loading (split identical samples into individual leaves). Defaults to True.

get_annotations(self) → dict[str, str]¶

Return a dictionary keyed on all annotations with values of the internal node they are defined by.

Returns:: dict[str,str]: A dictionary of annotations with the root node they are associated with.

get_clade(self, unicode clade_id: str) → MATree¶

Return a subtree representing the selected clade.

Args:: clade_id (str): The clade to retrieve.
Returns:: MATree: the subtree representing that clade.

get_clade_samples(self, clade_id) → vector[string]¶

Return samples from the selected clade.

Args:: clade_id (str): Clade to find.
Returns:: list[bytes]: List of sample IDs which are members of the indicated clade.
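Because the sample-listing functions are documented as returning list[bytes], it can be convenient to normalize IDs to str before comparing them with other data. The helper below is a hypothetical sketch, and UTF-8 is an assumed encoding.

```python
def as_text(sample_ids, encoding="utf-8"):
    """Decode any bytes entries to str, leaving str entries unchanged."""
    return [
        s.decode(encoding) if isinstance(s, bytes) else s
        for s in sample_ids
    ]
```

For example, as_text(tree.get_clade_samples(clade_id)) would give a list of ordinary strings under these assumptions.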

get_haplotype(self, unicode nid: str) → set[str]¶

Return the complete set of mutations (haplotype) the indicated node has with respect to the reference.

Args:: nid (str): The target node to get the haplotype for.
Returns:: set[str]: the haplotype of the node, represented as a set of mutations formatted in reflocalt (e.g. A123G) format.

get_leaves(self, unicode nid: str = u'') → list[MATNode]¶

Create a list of MATNode objects representing each leaf descended from the indicated node. By default, returns all leaves on the tree.

Args:: nid (str, optional): Node to get leaves descended from. Defaults to the root.
Returns:: list[MATNode]: List of MATNode wrappers representing all leaves.

get_leaves_ids(self, unicode nid: str = u'') → list[str]¶

Return a list of leaf name strings containing all leaves descended from the indicated node. By default, returns all leaves on the tree.

Args:: nid (str, optional): Node to get the descendants of. Defaults to the root.
Returns:: list[str]: List of leaf names.

get_mutation(self, unicode mutation: str) → MATree¶

Return a subtree containing samples with genotypes containing the indicated mutation.

Args:: mutation (str): string representation of the mutation in reflocalt format (e.g. "A123C").
Returns:: MATree: subtree containing samples with the mutation.

get_mutation_samples(self, mutation) → vector[string]¶

Return samples with genotypes containing the selected mutation.

Args:: mutation (str): string representation of the mutation in reflocalt format (e.g. "A123C").
Returns:: list[bytes]: samples with the mutation.

get_newick(self, subroot: Optional[str] = None, bool print_internal: bool = True, bool print_branch_len: bool = True, bool retain_original_branch_len: bool = True, bool uncondense_leaves: bool = True) → str¶

Extract a newick string from the tree.

Args:

subroot (Optional[str], optional): Return a newick representing the subtree descended from this node. Defaults to the root.

print_internal (bool, optional): Include internal node names in the newick output. Defaults to True.

print_branch_len (bool, optional): Include branch lengths in the newick output. Defaults to True.

retain_original_branch_len (bool, optional): Retain the original branch length attribute, if one was provided. Defaults to True.

uncondense_leaves (bool, optional): Uncondense nodes before returning the newick. Defaults to True.

Returns:

str: A newick string representation of the tree.

get_node(self, unicode name: str) → MATNode¶

Create a MATNode class object representing the indicated node.

Args:: name (str): ID of the node to fetch.
Returns:: MATNode: MATNode class object representing the indicated node.

get_parsimony_score(self) → int¶

Compute the parsimony score of the complete tree.

Returns:: int: The parsimony score of the tree.

get_random(self, size: int, list current_samples: list = [], bool lca_limit: bool = False) → MATree¶

Select a random subtree of the selected size. Optionally, pass a list of samples to include. If the list of samples to include is larger than the target size, random samples will be removed from the list. Set lca_limit to True to limit random selection to below the common ancestor of the current selection. Selects as many as possible if not enough are available.

Args:

size (int): The size of the subtree to select.

current_samples (list, optional): List of samples to include in the set. Defaults to [].

lca_limit (bool, optional): Limit randomly selected samples to below the LCA of the input samples. Defaults to False.

Returns:

MATree: A subtree containing the selected samples.

get_regex(self, unicode regexstr: str) → MATree¶

Return a subtree representing all samples matching the regular expression.

Args:: regexstr (str): The regular expression to use to query the tree.
Returns:: MATree: Subtree containing samples matching the regex.

get_regex_samples(self, unicode regexstr: str) → vector[string]¶

Return a list of sample IDs on the tree which match the regular expression.

Args:: regexstr (str): The regex pattern to match.
Returns:: list[bytes]: List of sample IDs.

ladderize(self) → None¶: Sort the branches of the tree according to the size of each partition.

list_clades(self) → set[str]¶

Return a set of all valid clade annotations in the tree that can be used with get_clade and other functions.

Returns:: set[str]: Set of all valid clade annotations.

move_node(self, unicode to_move: str, unicode new_parent: str) → None¶

Move a node from its current parent to a new parent.

Args:

to_move (str): The identifier of the node to move.

new_parent (str): The identifier of the new parent of the node.

mutation_set(self, unicode nid: str) → set[str]¶

Return the complete set of mutations (haplotype) the indicated node has with respect to the reference. DEPRECATED in favor of get_haplotype, which is equivalent functionally.

Args:: nid (str): The target node to get the haplotype for.
Returns:: set[str]: the haplotype of the node, represented as a set of mutations formatted in reflocalt (e.g. A123G) format.

remove_node(self, unicode to_remove: str) → None¶

Remove a node from the tree. This is a destructive operation. WARNING: It can cause segmentation faults if children are left orphaned.

Args:: to_remove (str): The identifier of the node to remove.

reverse_strand(self, genome_size: int = 29903) → None¶

Inverts the tree representation of mutations such that all mutations are with respect to the reverse strand of the reference. All bases are complemented and indices are reversed. The tree structure itself and parsimony scores are unaffected.

Args:: genome_size (int): The size of the genome. Assumes SARS-CoV-2 if not specified.
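The effect on a single mutation can be sketched in pure Python. This is an illustration rather than BTE's code: it assumes 1-based coordinates, so a location becomes genome_size - location + 1, and the function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_strand_mutation(mutation, genome_size=29903):
    """Rewrite a reflocalt mutation (e.g. "A123G") for the reverse strand."""
    ref, loc, alt = mutation[0], int(mutation[1:-1]), mutation[-1]
    # Assumes 1-based positions: the first base maps to the last and vice versa.
    return f"{COMPLEMENT[ref]}{genome_size - loc + 1}{COMPLEMENT[alt]}"
```

Applying the function twice returns the original string, which is a quick sanity check on the coordinate arithmetic.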

root¶

Retrieve the root of the tree.

Returns:: MATNode: MATNode wrapper representing the root node of the tree.

rsearch(self, unicode nid: str, bool include_self: bool = False, bool reverse: bool = False) → list[MATNode]¶

Return a list of MATNode objects representing the ancestors of the indicated node back to the root in order from the node to the root.

Args:

nid (str): ID of the node to get the ancestry of.

include_self (bool, optional): Include the indicated node on the path. Defaults to False.

reverse (bool, optional): Return the path in reverse order. Defaults to False.

Returns:

list[MATNode]: A list of ancestors of the indicated node.
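As a sketch of how rsearch might be used, the helper below lists the mutations on each branch from the root down to a node by requesting the path with include_self=True and reverse=True. mutations_from_root is a hypothetical helper, and ToyTree with its mutations is a stand-in so the example runs without BTE.

```python
from types import SimpleNamespace


def mutations_from_root(tree, nid):
    """Return (node ID, mutations) pairs along the root-to-node path."""
    path = tree.rsearch(nid, include_self=True, reverse=True)
    return [(node.id, list(node.mutations)) for node in path]


class ToyTree:
    """Stand-in exposing the same method name as MATree (not a real tree)."""

    def __init__(self, parents, mutations):
        self.parents, self.mutations = parents, mutations

    def rsearch(self, nid, include_self=False, reverse=False):
        path = [nid] if include_self else []
        while self.parents[nid] is not None:
            nid = self.parents[nid]
            path.append(nid)
        if reverse:
            path.reverse()
        return [SimpleNamespace(id=i, mutations=self.mutations[i]) for i in path]


toy = ToyTree(
    {"root": None, "n1": "root", "a": "n1"},
    {"root": [], "n1": ["C241T"], "a": ["A23403G"]},
)
history = mutations_from_root(toy, "a")
```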

save_pb(self, unicode file: str, bool condense: bool = True) → None¶

Save the tree to a protobuf file. If the filename ends in '.pb.gz', it will be gzipped automatically.

Args:

file (str): Name for the .pb/.pb.gz file.

condense (bool, optional): Condense the tree before saving. Defaults to True.

simple_parsimony(self, leaf_assignments: dict[str, str]) → dict[str, str]¶

This function is an implementation of the small parsimony problem (Fitch algorithm) for a single set of states. It takes as input a dictionary mapping leaf names to character states and returns a dictionary mapping both leaf and internal node names to inferred character states.

Args:: leaf_assignments (dict[str,str]): Dictionary mapping leaf names to character states.
Returns:: dict[str,str]: Dictionary mapping node names to inferred character states.
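For readers unfamiliar with the Fitch algorithm, the sketch below runs it for one character on a small binary tree stored as a parent-to-children dictionary. It is an illustration of the technique, not BTE's implementation; the function name, the toy tree, and the tie-breaking rule (alphabetical) are assumptions.

```python
def fitch(children, root, leaf_states):
    """Infer one state per node on a binary tree given as {node: [children]}."""
    parent_of = {kid: node for node, kids in children.items() for kid in kids}

    # Build an order with children before parents (a reversed preorder).
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    order.reverse()

    # Bottom-up pass: intersect the child sets, or take the union if disjoint.
    candidates = {}
    for node in order:
        kids = children.get(node, [])
        if not kids:
            candidates[node] = {leaf_states[node]}
            continue
        common = set.intersection(*(candidates[k] for k in kids))
        candidates[node] = common or set.union(*(candidates[k] for k in kids))

    # Top-down pass: keep the parent's state when allowed, else pick the
    # alphabetically first candidate (an arbitrary but deterministic choice).
    assigned = {}
    for node in reversed(order):
        inherited = assigned.get(parent_of.get(node))
        options = candidates[node]
        assigned[node] = inherited if inherited in options else min(options)
    return assigned


children = {"root": ["n1", "n2"], "n1": ["a", "b"], "n2": ["c", "d"]}
states = fitch(children, "root", {"a": "X", "b": "X", "c": "Y", "d": "X"})
```

In this toy case the only state change falls on the branch leading to leaf "c".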

subtree(self, samples: list[Union[str, bytes]]) → MATree¶

Retrieve a subtree containing all samples in the input list.

Args:: samples (list[Union[str,bytes]]): List of sample names to include in the subtree. Can be bytes or str.
Returns:: MATree: the subtree containing all samples in the input list.

translate(self, unicode gtf_file: str, unicode fasta_file: str) → dict[str, list[AAChange]]¶

Translate amino acid changes across the tree and return the results as a dictionary of node IDs and class objects representing amino acid changes as returned from matUtils translate. The translation is representative of the tree at the time of this function being called only.

Args:

gtf_file (str): The path to the GTF file containing gene information.

fasta_file (str): The path to the FASTA file containing the reference genome.

tree_entropy(self, categorical: dict[str, str], unicode from_node: str = u'') → dict[str, float]¶

Calculate the absolute and relative entropy of each split in the tree with respect to a categorical tip trait map. If a node is specified, the entropy map of the subtree rooted at that node is returned. If no node is specified, the entropy map of the entire tree is returned.

Args:

categorical (dict[str,str]): A dictionary of categorical trait values with the sample IDs as keys.

from_node (str): The identifier of the node to calculate the entropy from. If not specified, the entropy map of the entire tree is returned.

write_json(self, unicode jsonf: str, samples: list[str] = [], unicode title: str = u'Tree', metafiles: list[str] = []) → None¶

Write a json compatible with the Auspice.us visualization web tool containing the indicated samples. Default behavior includes the whole tree. You can optionally pass a tsv or csv file or a list of tsv and csv files containing categorical metadata to decorate the json with (one sample per row).

Args:

jsonf (str): Name for the JSON output.

samples (list, optional): Samples to use. Defaults to all samples.

title (str, optional): Title of the JSON. Defaults to "Tree".

metafiles (list, optional): Metadata tsv and csv files to use. Defaults to no metadata.

write_newick(self, unicode file: str, subroot: Optional[str] = None, bool print_internal: bool = True, bool print_branch_len: bool = True, bool retain_original_branch_len: bool = True, bool uncondense_leaves: bool = True)¶

Print a newick string representing the tree/subtree to the target file.

Args:

file (str): Name of the file to write the newick to.

subroot (Optional[str], optional): Write a newick representing the subtree descended from this node. Defaults to the root.

print_internal (bool, optional): Include internal node names in the newick output. Defaults to True.

print_branch_len (bool, optional): Print branch lengths. Defaults to True.

retain_original_branch_len (bool, optional): Retain the original branch length attribute, if one was provided. Defaults to True.

uncondense_leaves (bool, optional): Uncondense nodes before writing the newick. Defaults to True.

write_vcf(self, unicode vcf_file: str, bool no_genotypes: bool = False, samples: list[str] = []) → None¶

Write a vcf representing the chosen samples to the indicated file. By default, writes a vcf including all samples.

Args:

vcf_file (str): Name of the output VCF file.

no_genotypes (bool, optional): Do not include individual genotype information in the output vcf. Defaults to False.

samples (list[str], optional): Samples to include. Defaults to all samples.