Cutting Trees & FAIR Software
Benjamin Hackl • June 27, 2022

From cutting trees to
maintaining FAIR software

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Motivation: a story about cutting trees

Joint work with Clemens Heuberger, Sara Kropf, Helmut Prodinger
  • Input: random plane tree of size $n$
  • Procedure: remove all leaves, repeat until tree is deleted
Question: expected tree size after $r\in\mathbb{Z}_{\geq 0}$ iterations?

Cutting a "larger" tree

Tools of the trade: generating functions

  • $\mathcal{T}$ ... class of plane trees; inner nodes $\rightsquigarrow z$, leaves $\rightsquigarrow w$
\[ \mathcal{T} = \text{leaf} + \text{inner node} \times \text{non-empty sequence of } \mathcal{T} \]
\[ T(z, w) = w + z\cdot \frac{T(z, w)}{1 - T(z, w)} \]
\begin{align*} T(z, w) &= \frac{1 - (z-w) - \sqrt{1 - 2(z+w) + (z-w)^2}}{2}\\ &\fragment{4}{= w + wz + wz^2 + w^2 z + wz^3 + 3w^2z^2 + w^3 z + ...} \end{align*}

$[z^n w^k] T(z,w)$ counts # of trees with $n$ inner nodes and $k$ leaves.

Trimming vs Growing Trees

  • From every current leaf: grow $\geq 1$ new leaf
    \[ w \longrightarrow z\cdot \frac{w}{1-w} \]
  • From every inner node: grow $\geq 0$ new leaves before/after every outgoing edge
    \[ z^n w^k \longrightarrow z^n \Big(\frac{1}{1-w}\Big)^{2n+k-1} \]
Combined: $\quad z^n w^k \longrightarrow z^{n+k} w^k \frac{1}{(1-w)^{2n + 2k - 1}}$

Tree Expansion Operator

Combined: $\quad z^n w^k \longrightarrow z^{n+k} w^k \frac{1}{(1-w)^{2n + 2k - 1}}$
  • Note: expansion does not depend on tree shape, only on number of inner nodes and leaves
  • Define linear operator $\Phi$ that expands a tree family, \[ \Phi(f(z, w)) := (1-w) f\bigg(\frac{z}{(1-w)^2}, \frac{zw}{(1-w)^2}\bigg) \]

Iterated tree expansions, and back to cutting

Theorem.
\begin{align*} G_r(z, v) &= \Phi^r(T(zv, wv))|_{w=z}\\ &= \frac{1 - u^{r+2}}{(1 - u^{r+1})(1+u)} T\bigg(\frac{u (1 - u^{r+1})^2}{(1 - u^{r+2})^2} v, \frac{u^{r+1} (1-u)^2}{(1 - u^{r+2})^2} v\bigg) \end{align*} with $z = u/(1+u)^2$ counts trees with respect to both their size ($\rightsquigarrow z$) and their size after $r$ cuts ($\rightsquigarrow v$).
  • Construction of linear, multiplicative $\Psi$, study of $\Psi^r(t)$, $\Psi^r(z)$
  • Recurrences involving Fibonacci polynomials, coordinate transformation $z = u/(1+u)^2$
  • $X_{n,r}$ ... RV for size of tree after $r$ cuts
\[ \mathbb{E} X_{n,r} = \frac{1}{C_{n-1}} [z^n] \frac{u^{r+1}}{(1+u) (1 - u^{r+1})} \]

What now?

  • Singularity analysis: singular expansion at dominant singularities $\rightsquigarrow$ asymptotic growth of coefficients
\[ f(z) = \frac{1}{1 - 42z} \longrightarrow [z^n]f(z) = 42^n \]
\[ f(z) = \frac{1 - \sqrt{1 - 4z}}{2z} \longrightarrow [z^n]f(z) = \frac{4^n}{\sqrt{n^3 \pi}} + O(4^n n^{-5/2}) \]
\[ f(z) = \frac{u^{r+1}}{(u+1)(1 - u^{r+1} )} \longrightarrow \ ? \]

Asymptotic Expansions in SageMath

Joint work with Thomas Hagelmayer, Clemens Heuberger, Daniel Krenn, ...
  • Want: framework for computations with generalized power series with rigorous control over error terms
sage: A.<n> = AsymptoticRing('n^QQ', SR, default_prec=4)
sage: (1 + 1/n)^n
e - 1/2*e*n^(-1) + 11/24*e*n^(-2) - 7/16*e*n^(-3) + O(n^(-4))
  • Initial implementation as a Google Summer of Code project in 2015 (as student)
  • Gradual improvement and refactors over the years
  • $B$-terms (explicitly bounded $O$-terms) in GSoC'21 (as mentor)

Basic Design Ideas

  • AsymptoticRing $\ni$ AsymptoticExpansion, sum of (partially ordered) terms
  • TermMonoid $\ni$ AsymptoticTerm, summands of the expansion, $42\log(n)^3$, $O(n^{42})$
  • GrowthGroup $\ni$ GrowthElement, holds only the growth

Development principles of SageMath:
  • every (new) class / function needs documentation ...
  • ... and tests!

The average size of $r$-fold cut trees

\[ \mathbb{E} X_{n,r} = \frac{1}{C_{n-1}} [z^n] \frac{u^{r+1}}{(1+u) (1 - u^{r+1})} \]

>>> catalan_shifted_asy = asymptotic_expansions.Binomial_kn_over_n(n, k=2, precision=3)
>>> catalan_shifted_asy = catalan_shifted_asy.subs(n=n-1) / n
>>> catalan_shifted_asy
1/4/sqrt(pi)*4^n*n^(-3/2) + 3/32/sqrt(pi)*4^n*n^(-5/2) + 25/512/sqrt(pi)*4^n*n^(-7/2) + O(4^n*n^(-9/2))
>>> z_asy = 1/4 * (1 - Z^-1)
>>> u_asy = upsilon(z_asy) + O(Z^-2); u_asy
1 - 2*Z^(-1/2) + 2*Z^(-1) - 2*Z^(-3/2) + 2*Z^(-2) - 2*Z^(-5/2) + O(Z^(-2))
>>> singular_expansion = fast_callable(u^(r+1) / (1 + u) / (1 - u^(r+1)), vars=(u, r))(u_asy, r)
>>> singular_expansion
(1/4/(r + 1))*Z^(1/2) + 1/4*((r + 1)^2 + r + 1)/(r + 1)^2 - 1/2 + (1/2*r - 1/2*((r + 1)^2 + r + 1)/(r + 1) + 1/12*(3*((r + 1)^2 + r + 1)^2/(r + 1)^2 - (2*(r + 1)^3 + 3*(r + 1)^2 + 4*r + 4)/(r + 1))/(r + 1) + 1/2)*Z^(-1/2) + O(Z^(-1))
>>> expectation = singular_expansion._singularity_analysis_(n, zeta=1/4, precision=3) / catalan_shifted_asy
>>> expectation.map_coefficients(lambda t: t.simplify_rational())
(1/(r + 1))*n - 1/6*(r^2 - r)/(r + 1) + O(n^(-1/2))
Theorem.

After $r$ cuts, the average number of remaining nodes is \[ \mathbb{E} X_{n,r} = \frac{n}{r+1} - \frac{r(r-1)}{6(r+1)} + O(n^{-1/2}). \]

Reproducible Research is gaining momentum!

  • Plan S, open access movement
  • arXiv: tight integration of paperswithcode
  • Ancillary code encouraged in journal submissions
  • Rise of services like binder for running code online in preconfigured environments
  • Zenodo for persistent identifiers of released software
  • Also inner-mathematical movements; computer-aided proofs, interactive theorem provers (Coq, Isabelle / HOL, LEAN, ...)

The FAIR Principles

Since 2019, all FWF projects need a data management plan to explain how project data is being made FAIR:

  • Findable (persistent identifier, rich metadata)
  • Accessible (retrieveable e.g. via persistent identifier)
  • Interoperable (can be combined with other resources with "minimal" effort)
  • Reuseable(appropriate license; satisfying relevant community standards)
Well-maintained open-source software is one of the most effective ways to adhere to FAIR principles!

FAIRness in my work

Maintainer tasks

  • writing and reviewing code
  • structuring development cycles
  • managing releases: editing changelogs, distribution
  • maintaining infrastructure for continuous integration

  • support and community management
  • forming relationship to scientific open-source umbrella organizations like NumFOCUS

An ideal medium-sized project

  • code split in modules of reasonable size, few dependencies between modules
  • well-documented classes and functions
  • infrastructure that allows for continuous testing, high test coverage
  • up-to-date installation instructions and contributing guidelines

Collaboration Opportunities

Thank you!