Architecture¶
This page documents the fork-specific kernels. For the upstream algorithm and its mathematical definitions, see the CAFA-evaluator paper.
Sparse confusion matrix kernel¶
Upstream evaluates one confusion matrix per threshold by scanning
a dense (n_prot, n_toi) mask of active predictions. For n_tau
thresholds this is O(n_tau · n_prot · n_toi) cell visits. At
th_step=0.01 with ~25 000 terms and ~8 000 proteins that is 20 B
visits per namespace.
The fork replaces the scan with a single scatter pass followed by one right-to-left cumulative sum:
Walk non-zero predictions once. For each prediction
(i, j, s), compute its highest active threshold binbin = np.searchsorted(tau_arr, s, side="right") - 1
Scatter
(row, bin)into a flat(n_prot · n_tau)-sizednp.bincount. This gives, for each protein and each threshold, the number of predictions whose score sits exactly in that bin.Run
np.cumsumfrom right to left along the threshold axis. A prediction active at binbis also active at every lower bin, so the right-to-left cumsum converts “count in binb” to “count of active predictions at thresholdtau[b]or lower”.Combine with the ground-truth vector to produce TP / FP / FN in
O(n_prot · n_tau)arithmetic.
Cost drops from O(n_tau · n_prot · n_toi) to
O(nnz + n_prot · n_tau). On real CAFA inputs that is a ~30×
reduction in arithmetic volume.
The PK variant (compute_confusion_matrix_exclude_sparse) extends
the same idea with two dense bool filters expressed as a single AND:
toi_mask # (n_terms,) global terms of interest
excluded_mask # (n_prot, n_terms) per-protein exclude
The scatter only sees non-zeros that survive both masks, so
toi_perprotein / gt_perprotein — the upstream per-protein
Python lists of valid columns — are never materialised.
Sparse push-up propagation¶
Upstream propagates scores along the DAG by sweeping terms in a
topological order: for each term, walk its children and take the
per-protein max. This is O(n_terms · n_prot · avg_children)
cell updates, and it pays for every term in the ontology even if only
a handful have predictions.
The fork walks the DAG differently:
Flat ancestor cache.
graph._ancestors_csrprecomputes the transitive ancestor set of every term once perGraphinstance, flattened into a CSR-style(indptr, indices)pair. Ancestor lookups are then constant-time index slicing.Scatter input non-zeros. Collect all input non-zeros
(row, col, score). For each non-zero, gather the flat ancestor list for itscolvia vectorisednp.repeatoffsets — no Python loop — producing an expanded(row, ancestor)triple stream.Stable sort + reduceat. Encode
(row, ancestor)as a singleint64key, stable-sort once, and reduce per-group withnp.maximum.reduceat. This replaces the per-term sweep with a single pass overRexpanded triples.Write group-maxes back in place.
mode='fill'semantics are preserved by snapshotting the input non-zeros and writing them back on top after the group-max.
Cost is O(nnz · avg_ancestors + R log R). On a sparse CAFA input
where nnz is a tiny fraction of n_terms · n_prot, this skips
over every term that has no predictions.
The Phase B6 triples= passthrough lets the caller (gt_parser
or the prediction parser) pass the non-zero coordinates it already
has, avoiding an np.nonzero scan of the full (n_prot, n_terms)
bool matrix on the hot path. On the BP ontology that scan was 488 M
cells to find ~27 000 non-zeros.
Vectorised prediction parser¶
parser._pred_parser_vectorised replaces the upstream per-line
loop with PyArrow read_csv. The high-level shape is:
pyarrow.csv.read_csvreads the full prediction file at native speed into three columns:pid,tid,score.pidandtidare dictionary-encoded. The Python-level lookups intons_dict/gts[ns].ids/term_indexrun once per unique value instead of once per row. On a 4.45M-row file with ~20 000 unique terms and ~8 000 unique proteins, the lookup cost drops by three orders of magnitude.Per-namespace filtering uses vectorised comparisons against the integer code arrays. No Python-level masking.
Duplicate
(protein, term)rows are collapsed to their max via the same sort +np.maximum.reduceatpattern as the propagation kernel. This matches the upstream loop’s “store the max if higher” semantics bit-exactly because the maximum is order-independent.Alt-id expansion is rare on CAFA inputs and would break the fast path’s vectorised structure. Unique terms whose column lookup resolves to the sentinel
-2are handed off to a short Python loop over just the affected rows.
Gate: CAFAEVAL_FAST_PARSER=1 (default). Falls back to the legacy
loop if PyArrow is not installed or the fast path raises. The
max_terms cap is order-sensitive and also routes to the legacy
loop.
toi_is_full short-circuit¶
compute_metrics and evaluate_prediction frequently run on
toi = np.arange(n_terms) — i.e. the caller wants metrics over the
entire namespace. In that case column slices of the form
gt_matrix[:, toi] are no-ops that copy the full matrix, and
toi_mask can be built as a single np.ones.
The fork detects this path once per compute_metrics call:
toi_is_full = (
isinstance(toi, np.ndarray)
and toi.ndim == 1
and toi.size == n_terms
and bool(np.array_equal(toi, np.arange(n_terms)))
)
When toi_is_full is true:
(gt_matrix[:, toi] != 0).any(axis=1)is replaced by(gt_matrix != 0).any(axis=1).g = gt_with_annots[:, toi]/p = pred[...][:, toi]are skipped entirely on the sparse path — those variables are dead when the sparse kernel takes the non-zeros directly.toi_maskis built asnp.ones(n_terms, dtype=bool)instead ofnp.zeros+ scatter.
On the benchmark corpus this alone removed ~3.4 s of BP-namespace
prep cost per compute_metrics call.
Triples passthrough API¶
Several kernels in the fork accept an optional
triples=(nz_rows, nz_cols, nz_scores) keyword argument. This lets
a caller that already has non-zero coordinates (e.g. because it just
scattered them to build the matrix) forward those coordinates through
instead of forcing the callee to rediscover them with
np.nonzero.
The convention is:
Callers pass
triples=...as an internal (underscore-prefixed) kwarg so the public API shape stays stable.Callees treat
triplesas a hint: they are free to ignore it on dense fallbacks, or use it to skip an expensive rediscovery.The fork’s dense fallback paths are never triples-aware — they stay as close to upstream as possible so the self-parity test remains a meaningful check.
Current users of the passthrough:
gt_parser→graph.propagate→_propagate_sparse_pushup(Phase B6).
Logging¶
All print() calls in the upstream library are replaced by
structured logging calls under four loggers:
cafaeval.parsercafaeval.propagatecafaeval.metricscafaeval.eval
Extras are passed via logger.info(..., extra={...}) so machine
consumers can read structured timing payloads without regex-ing log
strings.