Substructure Search Methods¶

A set of algorithms for finding similar structures and substructures.

Inclusion Comparison Algorithms
- Direct Comparators
- Substructure Inclusion-based Algorithms
  - Partial Subgraph Walks
Common Substructure-based Algorithms
- Implementation Details of maximum_common_subgraph()
Treelets
- Implementation Details

Inclusion Comparison Algorithms ¶

Inclusion-based search asks whether one graph is completely contained in another sub-graph, using either Exact Matching or Topological Matching according to the chosen approach. These comparisons can be made fuzzy, using commutative_similarity() to evaluate matches between individual nodes in a graph.

Direct Comparators ¶

These direct algorithms compare two structures starting from the provided root nodes, using their particular matching strategy. They are useful for immediate comparisons of similarity, returning the maximal commutative_similarity() of their traversal, but they do not consider non-root-to-root comparisons.

glypy.algorithms.subtree_search.topological_inclusion(target, reference, substituents=True, tolerance=0, visited=None)¶

A generalization of topological_equality() which allows for target to be matched to reference, but for reference to include more. Consequently, this method is not commutative.

Parameters

target (Monosaccharide) – The monosaccharide to compare

reference (Monosaccharide) – The monosaccharide to compare against

substituents (bool, optional) – Whether or not to compare with the substituents of each node. Defaults to True.

tolerance (float, optional) – The maximum difference to permit between nodes for inclusion. Defaults to 0

Returns

The inclusion score. Greater than 0 indicates topological inclusion, though larger corresponds to better alignment.

Return type

float

See also

commutative_similarity_score_with_tolerance(), exact_ordering_inclusion

glypy.algorithms.subtree_search.exact_ordering_inclusion(target, reference, substituents=True, tolerance=0, visited=None)[source]¶

A generalization of exact_ordering_equality() which allows for target to be matched to reference, but for reference to include more. Consequently, this method is not commutative.

Parameters

target (Monosaccharide) – The monosaccharide to compare

reference (Monosaccharide) – The monosaccharide to compare against

substituents (bool, optional) – Whether or not to compare with the substituents of each node. Defaults to True.

tolerance (float, optional) – The maximum difference to permit between nodes for inclusion. Defaults to 0

Returns

The similarity score between the two structures. A score of 0 means no inclusion, and a non-zero score means inclusion, with larger scores indicating more node pairs matching.

Return type

float

See also

commutative_similarity_score_with_tolerance, topological_inclusion

Substructure Inclusion-based Algorithms ¶

Building off the Direct Comparators, these algorithms answer higher level questions about whether one structure is included in another.

glypy.algorithms.subtree_search.subtree_of(subtree, tree, exact=False, include_substituents=True, tolerance=0)[source]¶

Test to see if subtree is included in tree anywhere. Returns the node id number of the first occurence of subtree included in tree or None if it is not found.

Parameters

subtree (Glycan) – The structure to search for. The search attempts to match the complete structure of subtree.

tree (Glycan) – The sturcture to search in. The search iterates over each residue in tree and calls a comparator function, comparing the subtree to the substructure rooted at that residue.

exact (bool) – If True, use exact_ordering_inclusion() to compare nodes. Otherwise use topological_inclusion(). Defaults to False.

Return type

int or None if no match

glypy.algorithms.subtree_search.find_matching_subtree_roots(subtree, tree, exact=False, include_substituents=True, tolerance=0)[source]¶

Find the list of nodes where occurences of subtree included in tree are rooted.

Parameters

subtree (Glycan) – The structure to search for. The search attempts to match the complete structure of subtree.

tree (Glycan) – The sturcture to search in. The search iterates over each residue in tree and calls a comparator function, comparing the subtree to the substructure rooted at that residue.

exact (bool) – If True, use exact_ordering_inclusion() to compare nodes. Otherwise use topological_inclusion(). Defaults to False.

Return type

list of Monosaccharide

Partial Subgraph Walks ¶

glypy.algorithms.subtree_search.walk_with(query, reference, visited=None, comparator=<function commutative_similarity>, include_substituents=True)[source]¶

Walk the query along reference, yielding successive matched nodes along a subtree of reference using a Exact Matching traversal.

This function provides slightly more detail than subtree_of() as it exposes every step along the subgraph traversal, rather than just the root.

Parameters

query (Glycan) – The query structure to search with

reference (Glycan) – The reference structure to search in

visited (set, optional) – The set of node id pairs to ignore

comparator (Callable, optional) – The Callable object which can compare two Monosaccharide

Yields

(Monosaccharide, Monosaccharide) – Pairs of matched nodes along a path

Common Substructure-based Algorithms ¶

glypy.algorithms.subtree_search.maximum_common_subgraph(seq_a, seq_b, exact=True)¶

Find the maximum common subgraph between seq_a and seq_b.

Parameters

seq_a (Glycan) –

seq_b (Glycan) –

exact (bool) – Whether to use exact equality or fuzzy equality

Return type

MaximumCommonSubtreeResults

References

[1] K. F. Aoki, A. Yamaguchi, Y. Okuno, T. Akutsu, N. Ueda, M. Kanehisa, and H. Mamitsuka, “Efficient tree-matching methods for accurate carbohydrate database queries.” Genome Inform. Jan. 2003.

glypy.algorithms.subtree_search.n_saccharide_similarity(self, other, n=2, exact=False)[source]¶

Calculate n-saccharide similarity between two structures

Parameters

self (Glycan) –

other (Glycan) –

n (int) – Size of the fragment saccharide to consider. Defaults to 2

exact (bool) – Whether to use Glycan.exact_ordering_equality() or Glycan.topological_equality(). Defaults to exact_ordering_equality

Returns

score – How similar these structures are at the n-saccharide level. Ranges between 0 and 1.0 where 1.0 is exactly the same, while 0.0 means no shared n-saccharides.

Return type

float

References

[1] K. F. Aoki, A. Yamaguchi, Y. Okuno, T. Akutsu, N. Ueda, M. Kanehisa, and H. Mamitsuka, “Efficient tree-matching methods for accurate carbohydrate database queries.” Genome Inform. Jan. 2003.

glypy.algorithms.subtree_search.distinct_fragments(self, other, fragmentation_parameters=None)[source]¶

Compute the set of distinct masses observed between two structures

Parameters

self (Glycan or list) –

other (Glycan or list) – Glycan objects whose fragments will be compared or lists of Fragment to compare.

fragmentation_parameters (dict, optional) – If self and other are Glycan objects, these parameters will be used to call Glycan.fragments().

Returns

set (The distinct fragment masses of self)

set (The distinct fragment masses of other)

Implementation Details of `maximum_common_subgraph()`¶

class glypy.algorithms.subtree_search.MaximumCommonSubgraphSolver(seq_a, seq_b, exact=True)[source]¶

Find the maximum common subgraph between seq_a and seq_b.

Variables

seq_a (Glycan) –

seq_b (Glycan) –

exact (bool) – Whether to use exact equality or fuzzy equality

References

[1] K. F. Aoki, A. Yamaguchi, Y. Okuno, T. Akutsu, N. Ueda, M. Kanehisa, and H. Mamitsuka, “Efficient tree-matching methods for accurate carbohydrate database queries.” Genome Inform. Jan. 2003.

class glypy.algorithms.subtree_search.MaximumCommonSubtreeResults(score, tree, similarity_matrix)[source]¶

Holds a maximum common subgraph solution

Variables

score (float) – The alignment score between the trees compared

similarity_matrix (list of list of float) – A simple dynamic programming solution matrix describing the alignment at each position.

tree (Glycan) – The maximum common subgraph, extracted as a separate Glycan object.

Treelets ¶

glypy.algorithms.subtree_search.treelets(glycan, k, distinct=True)[source]¶

Iterator over all distinct \(k\)-treelets of tree, all unique sub-trees of tree with k monosaccharides.

Parameters

glycan (Glycan) – The glycan to extract reelets from

k (int) – The number monosaccharides per treelet

distinct (bool) – Whether or not to filter out duplicate treelets

glypy.algorithms.subtree_search.treelet_enrichment(cond1, cond2, k, distinct=True)¶

Perform a test for each treelet to determine if it is enriched in cond1 compared to cond2.

Parameters

cond1 (list of Glycan) – Glycans from the first condition, the condition to be tested for enrichment

cond2 (list of Glycan) – Glycans from the second condition, the background condition

k (int) – The size of the treelet to use

distinct (bool) – Whether or not to count redundant treelets. Defaults to True

Returns

A mapping from treelet to p value from Fisher’s Exact Test for that treelet being found with its observed frequency in cond1 if it is as common in cond1 as in cond2.

Return type

dict

References

Pevzner, P., & Shamir, R. (2011). Bioinformatics for Biologists. New York, NY, USA: Cambridge University Press.

Implementation Details ¶

Treelet methods are built atop a more complex object system

class glypy.algorithms.subtree_search.Treelet(subtree, frontier_ids)[source]¶

Represents a subgraph of a larger Glycan, with a frontier of node ids which are children of the current subgraph.

Variables

frontier_ids (set) – The id values of the nodes from the parent Glycan which are children of members of subtree

subtree (Glycan) – The subgraph defining the treelet

class glypy.algorithms.subtree_search.TreeletIterator(tree, k, distinct=True)[source]¶

Iterator over all distinct \(k\)-treelets of tree, all unique sub-trees of tree with k monosaccharides.

Variables

k (int) – The number monosaccharides per treelet

tree (Glycan) – The glycan to extract reelets from

distinct (bool) – Whether or not to filter out duplicate treelets