From cutting trees to maintaining FAIR software

Cutting Trees & FAIR Software

Benjamin Hackl • June 27, 2022

From cutting trees to
maintaining FAIR software

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Motivation: a story about cutting trees

Joint work with Clemens Heuberger, Sara Kropf, Helmut Prodinger

Input: random plane tree of size $n$
Procedure: remove all leaves, repeat until tree is deleted

Question: expected tree size after $r\in\mathbb{Z}_{\geq 0}$ iterations?

Cutting a "larger" tree

Tools of the trade: generating functions

$\mathcal{T}$ ... class of plane trees; inner nodes $\rightsquigarrow z$, leaves $\rightsquigarrow w$

\[ \mathcal{T} = \text{leaf} + \text{inner node} \times \text{non-empty sequence of } \mathcal{T} \]

\[ T(z, w) = w + z\cdot \frac{T(z, w)}{1 - T(z, w)} \]

\begin{align*} T(z, w) &= \frac{1 - (z-w) - \sqrt{1 - 2(z+w) + (z-w)^2}}{2}\\ &\fragment{4}{= w + wz + wz^2 + w^2 z + wz^3 + 3w^2z^2 + w^3 z + ...} \end{align*}

$[z^n w^k] T(z,w)$ counts # of trees with $n$ inner nodes and $k$ leaves.

Trimming vs Growing Trees

From every current leaf: grow $\geq 1$ new leaf
\[ w \longrightarrow z\cdot \frac{w}{1-w} \]
From every inner node: grow $\geq 0$ new leaves before/after every outgoing edge
\[ z^n w^k \longrightarrow z^n \Big(\frac{1}{1-w}\Big)^{2n+k-1} \]

Combined: $\quad z^n w^k \longrightarrow z^{n+k} w^k \frac{1}{(1-w)^{2n + 2k - 1}}$

Tree Expansion Operator

Combined: $\quad z^n w^k \longrightarrow z^{n+k} w^k \frac{1}{(1-w)^{2n + 2k - 1}}$

Note: expansion does not depend on tree shape, only on number of inner nodes and leaves
Define linear operator $\Phi$ that expands a tree family, \[ \Phi(f(z, w)) := (1-w) f\bigg(\frac{z}{(1-w)^2}, \frac{zw}{(1-w)^2}\bigg) \]

Iterated tree expansions, and back to cutting

Theorem.

\begin{align*} G_r(z, v) &= \Phi^r(T(zv, wv))|_{w=z}\\ &= \frac{1 - u^{r+2}}{(1 - u^{r+1})(1+u)} T\bigg(\frac{u (1 - u^{r+1})^2}{(1 - u^{r+2})^2} v, \frac{u^{r+1} (1-u)^2}{(1 - u^{r+2})^2} v\bigg) \end{align*} with $z = u/(1+u)^2$ counts trees with respect to both their size ($\rightsquigarrow z$) and their size after $r$ cuts ($\rightsquigarrow v$).

Construction of linear, multiplicative $\Psi$, study of $\Psi^r(t)$, $\Psi^r(z)$
Recurrences involving Fibonacci polynomials, coordinate transformation $z = u/(1+u)^2$

$X_{n,r}$ ... RV for size of tree after $r$ cuts

\[ \mathbb{E} X_{n,r} = \frac{1}{C_{n-1}} [z^n] \frac{u^{r+1}}{(1+u) (1 - u^{r+1})} \]

What now?

Singularity analysis: singular expansion at dominant singularities $\rightsquigarrow$ asymptotic growth of coefficients

\[ f(z) = \frac{1}{1 - 42z} \longrightarrow [z^n]f(z) = 42^n \]

\[ f(z) = \frac{1 - \sqrt{1 - 4z}}{2z} \longrightarrow [z^n]f(z) = \frac{4^n}{\sqrt{n^3 \pi}} + O(4^n n^{-5/2}) \]

\[ f(z) = \frac{u^{r+1}}{(u+1)(1 - u^{r+1} )} \longrightarrow \ ? \]

Asymptotic Expansions in SageMath

Joint work with Thomas Hagelmayer, Clemens Heuberger, Daniel Krenn, ...

Want: framework for computations with generalized power series with rigorous control over error terms

sage: A.<n> = AsymptoticRing('n^QQ', SR, default_prec=4)
sage: (1 + 1/n)^n
e - 1/2*e*n^(-1) + 11/24*e*n^(-2) - 7/16*e*n^(-3) + O(n^(-4))

Initial implementation as a Google Summer of Code project in 2015 (as student)
Gradual improvement and refactors over the years
$B$-terms (explicitly bounded $O$-terms) in GSoC'21 (as mentor)

Basic Design Ideas

AsymptoticRing $\ni$ AsymptoticExpansion, sum of (partially ordered) terms
TermMonoid $\ni$ AsymptoticTerm, summands of the expansion, $42\log(n)^3$, $O(n^{42})$
GrowthGroup $\ni$ GrowthElement, holds only the growth

Development principles of SageMath:

every (new) class / function needs documentation ...
... and tests!

The average size of $r$-fold cut trees

\[ \mathbb{E} X_{n,r} = \frac{1}{C_{n-1}} [z^n] \frac{u^{r+1}}{(1+u) (1 - u^{r+1})} \]


>>> catalan_shifted_asy = asymptotic_expansions.Binomial_kn_over_n(n, k=2, precision=3)
>>> catalan_shifted_asy = catalan_shifted_asy.subs(n=n-1) / n
>>> catalan_shifted_asy
1/4/sqrt(pi)*4^n*n^(-3/2) + 3/32/sqrt(pi)*4^n*n^(-5/2) + 25/512/sqrt(pi)*4^n*n^(-7/2) + O(4^n*n^(-9/2))
>>> z_asy = 1/4 * (1 - Z^-1)
>>> u_asy = upsilon(z_asy) + O(Z^-2); u_asy
1 - 2*Z^(-1/2) + 2*Z^(-1) - 2*Z^(-3/2) + 2*Z^(-2) - 2*Z^(-5/2) + O(Z^(-2))
>>> singular_expansion = fast_callable(u^(r+1) / (1 + u) / (1 - u^(r+1)), vars=(u, r))(u_asy, r)
>>> singular_expansion
(1/4/(r + 1))*Z^(1/2) + 1/4*((r + 1)^2 + r + 1)/(r + 1)^2 - 1/2 + (1/2*r - 1/2*((r + 1)^2 + r + 1)/(r + 1) + 1/12*(3*((r + 1)^2 + r + 1)^2/(r + 1)^2 - (2*(r + 1)^3 + 3*(r + 1)^2 + 4*r + 4)/(r + 1))/(r + 1) + 1/2)*Z^(-1/2) + O(Z^(-1))
>>> expectation = singular_expansion._singularity_analysis_(n, zeta=1/4, precision=3) / catalan_shifted_asy
>>> expectation.map_coefficients(lambda t: t.simplify_rational())
(1/(r + 1))*n - 1/6*(r^2 - r)/(r + 1) + O(n^(-1/2))

Theorem.

After $r$ cuts, the average number of remaining nodes is \[ \mathbb{E} X_{n,r} = \frac{n}{r+1} - \frac{r(r-1)}{6(r+1)} + O(n^{-1/2}). \]

Reproducible Research is gaining momentum!

Plan S, open access movement
arXiv: tight integration of paperswithcode
Ancillary code encouraged in journal submissions
Rise of services like binder for running code online in preconfigured environments
Zenodo for persistent identifiers of released software
Also inner-mathematical movements; computer-aided proofs, interactive theorem provers (Coq, Isabelle / HOL, LEAN, ...)

The FAIR Principles

Since 2019, all FWF projects need a data management plan to explain how project data is being made FAIR:

Findable (persistent identifier, rich metadata)
Accessible (retrieveable e.g. via persistent identifier)
Interoperable (can be combined with other resources with "minimal" effort)
Reuseable(appropriate license; satisfying relevant community standards)

Well-maintained open-source software is one of the most effective ways to adhere to FAIR principles!

FAIRness in my work

SageMath's asymptotic expansions module
Manim (open-source Python animation library)
Smaller, one-person projects: combpyter
Web projects:
- Onlinekreuze (Assignment administration system)
- Exercise catalogue
- math.ematical.fun, Lange Nacht der Forschung topic collection
Outreach activities via YouTube and Twitch

Maintainer tasks

writing and reviewing code
structuring development cycles
managing releases: editing changelogs, distribution
maintaining infrastructure for continuous integration

support and community management
forming relationship to scientific open-source umbrella organizations like NumFOCUS

An ideal medium-sized project

code split in modules of reasonable size, few dependencies between modules
well-documented classes and functions
infrastructure that allows for continuous testing, high test coverage
up-to-date installation instructions and contributing guidelines

Collaboration Opportunities

Recent work with Panholzer, Wagner: analysis and characterization of stochastic process behind uncovering a random tree
Machine learning in combinatorial contexts: counterexamples in extremal combinatorics (currently: supervision of a corresponding master's thesis)
High-Performance Computing: efficient, parallelized processing of large combinatorial structures usually required for guess-and-prove-type approaches