Performance

Unless stated otherwise, the benchmarks below run on a real GOA-derived corpus (8 712 BP / 4 992 MF / 5 125 CC ground-truth proteins, ~4.45 M prediction rows) with th_step=0.01 and n_cpu=1. Upstream is unmodified claradepaolis/CAFA-evaluator-PK at commit 16a6a6d.

Headline

Variant	Upstream	Fork (B1–B7)	Speedup
NK / LK	92.96 s	4.08 s	22.8×
PK	418.53 s	10.33 s	40.5×

NK and LK share the same code path (gt_exclude is None); the fork speeds up both equally. The PK path (gt_exclude is not None) gets a larger win because its per-protein Python loops were heavier to begin with.

Phase-by-phase breakdown

Phase A — cherry-picked ideas from T0chka/CAFA-evaluator-PK-speedup (Antonina Dolgorukova). Bit-exact (atol=0, rtol=0) against the oracle.

Phase	Change
A.1	`weighted_only` fast path — skip the unweighted metric pass when the caller only needs `_w` columns.
A.2	`compute_metrics`: `fork`-pool with shared-state initializer, eliminates per-chunk pickling of `g/p/toi/n_gt`.
A.3	`graph.propagate`: cached children adjacency, fill-mode restricted to zero rows, optional `spawn` shared-memory parallel propagation.
A.4	`parser.pred_parser`: precomputed `term_index`, buffered file reads, single dict lookup per term.

Phase B — sparse + vectorised rewrites. Parity gate loosened to rtol=1e-6, atol=1e-9 because the rewrites reorder inner per-protein sums.
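To illustrate why a reordered sum cannot stay bit-exact, here is a minimal sketch that compares a zero-tolerance check with the loosened gate. The helper name `parity_ok` is hypothetical, and `np.allclose` is used only as one common way to apply `rtol`/`atol`; the project's actual parity harness is not shown in this document.

```python
import numpy as np

def parity_ok(candidate, reference, rtol=1e-6, atol=1e-9):
    """True when |candidate - reference| <= atol + rtol * |reference| everywhere."""
    return bool(np.allclose(candidate, reference, rtol=rtol, atol=atol))

vals = np.array([0.1, 0.2, 0.3])
# Same three addends, two association orders: the results differ in the last bit.
left = (vals[0] + vals[1]) + vals[2]
right = vals[0] + (vals[1] + vals[2])

bit_exact = parity_ok(left, right, rtol=0.0, atol=0.0)  # fails: the sums are not identical
within_gate = parity_ok(left, right)                    # passes under rtol=1e-6, atol=1e-9
```

The difference here is on the order of one unit in the last place, far below the gate, which is the kind of drift a reordered per-protein sum would be expected to produce.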

Phase	Area	Change
B1	NK kernel	Sparse scatter + right-to-left cumsum. Replaces per-threshold dense mask scan (`O(n_tau · n_prot · n_toi)`) with `O(nnz + n_prot · n_tau)`.
B2	PK kernel	Sparse scatter extended with per-protein exclude AND — the per-protein `toi_perprotein` / `gt_perprotein` Python lists are never materialised.
B3	Parser	PyArrow `read_csv` + dictionary-encoded `(pid, tid)` + vectorised per-namespace group-max. 2.3–2.7× faster than the per-line loop on multi-million-row inputs.
B4	Propagation	Sparse push-up kernel. Flat CSR ancestor cache + scatter over input non-zeros + `np.maximum.reduceat` for per-group max. Skips every term that has no predictions.
B6	gt_parser	Skip dead dense scans in `propagate` when the caller is on the sparse path; accept a `triples=` passthrough so `_propagate_sparse_pushup` reuses the caller’s scatter coordinates instead of calling `np.nonzero` on the full `(n_prot, n_terms)` bool matrix.
B7	Prep	Trim `compute_metrics` and `evaluate_prediction` prep work: detect `toi_is_full` once and collapse column slices to no-ops, defer `g`/`p`/`pred_sub` materialisation to the branch that actually uses them, replace the per-protein `setdiff1d` loop with a vectorised `_count_proteins_in_toi` helper.
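The B1 idea (scatter each non-zero once, then accumulate from the highest threshold downwards) can be sketched as follows. This is an illustration under assumptions, not the fork's kernel: the function name `counts_per_threshold`, its arguments, and the use of `np.bincount` for the scatter are all hypothetical choices made for this example.

```python
import numpy as np

def counts_per_threshold(rows, scores, is_tp, n_prot, tau):
    """Per protein and threshold, count predictions (and TPs) with score >= tau[k].

    rows, scores, is_tp hold one entry per non-zero prediction;
    tau must be sorted ascending. Returns two (n_prot, n_tau) arrays.
    """
    n_tau = len(tau)
    # Bin index = highest threshold the score still reaches (tau[k] <= score).
    k = np.searchsorted(tau, scores, side="right") - 1
    keep = k >= 0                      # scores below tau[0] never count
    flat = rows[keep] * n_tau + k[keep]
    tp = is_tp[keep]

    # Scatter: one increment per non-zero, O(nnz).
    size = n_prot * n_tau
    n_pred = np.bincount(flat, minlength=size).reshape(n_prot, n_tau)
    n_tp = np.bincount(flat[tp], minlength=size).reshape(n_prot, n_tau)

    # Right-to-left cumsum: a score binned at k also satisfies every tau[j], j < k.
    n_pred = n_pred[:, ::-1].cumsum(axis=1)[:, ::-1]
    n_tp = n_tp[:, ::-1].cumsum(axis=1)[:, ::-1]
    return n_pred, n_tp

rows = np.array([0, 0, 1, 1, 1])
scores = np.array([0.9, 0.3, 0.5, 0.75, 0.1])
is_tp = np.array([True, False, True, True, False])
tau = np.array([0.25, 0.5, 0.75])
n_pred, n_tp = counts_per_threshold(rows, scores, is_tp, n_prot=2, tau=tau)
```

The scatter touches each non-zero once and the cumsum touches each `(protein, threshold)` cell once, which is where the `O(nnz + n_prot · n_tau)` bound comes from; no per-threshold mask over the dense matrix is ever built.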

Phase B7 drilldown

Before B7, the PK prep block dominated BP namespace cost:

Line	Wall time
`gt_matrix[:, toi].sum(1) > 0`	1.15 s
`g = gt_with_annots[:, toi]`	1.07 s
`p = pred[proteins_has_gt, :][:, toi]`	2.29 s
`excluded_mask`, `valid_gt_mask`, `n_gt`	0.59 s
`pred_sub = pred[proteins_has_gt, :]`	0.61 s

After B7, end-to-end on the same corpus:

Measurement	Before B7	After B7
NK end-to-end	6.68 s	4.08 s
PK end-to-end	28.73 s	10.33 s
`compute_metrics` PK BP prep	5.443 s	0.599 s
`compute_metrics` PK BP kernel	1.805 s	0.746 s
`evaluate_prediction` PK total	5.68 s	2.05 s

Micro-benchmarks

Individual hot spots, isolated:

Kernel	Corpus	Before	After
`pred_parser` (NK)	4.45M rows	1.22 s	0.45 s (2.72×)
`pred_parser` (PK)	4.45M rows	1.44 s	0.63 s (2.28×)
`compute_confusion_matrix_exclude` (PK, n_cpu=1)	real PK	2.73 s	1.45 s (1.88×)
`gt_parser` (NK, cold cache)	real corpus	1.71 s	1.46 s
`gt_parser` (PK, cold cache)	real corpus	2.21 s	0.64 s
`gt_parser` (PK, hot ancestor cache)	real corpus	2.21 s	0.16 s
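The "ancestor cache" in the `gt_parser` rows presumably refers to the flat CSR ancestor cache of the B4 push-up kernel. A minimal sketch of that technique is below, assuming each term's cached ancestor list includes the term itself. The function name `push_up_max`, its signature, and the sort-based grouping are assumptions made for illustration; they are not taken from the fork's code.

```python
import numpy as np

def push_up_max(rows, terms, scores, anc_indptr, anc_indices, n_terms):
    """Propagate each (row, term, score) non-zero to all cached ancestors,
    keeping the maximum score per (row, ancestor) pair."""
    counts = anc_indptr[terms + 1] - anc_indptr[terms]
    total = int(counts.sum())
    if total == 0:
        empty = np.array([], dtype=np.int64)
        return empty, empty, np.array([], dtype=scores.dtype)

    # Gather every ancestor slice in one vectorised index computation.
    first = np.repeat(anc_indptr[terms], counts)
    within = np.arange(total) - np.repeat(np.cumsum(counts) - counts, counts)
    out_terms = anc_indices[first + within]
    out_rows = np.repeat(rows, counts)
    out_scores = np.repeat(scores, counts)

    # Group by (row, term) key, then take the max of each contiguous group.
    key = out_rows * n_terms + out_terms
    order = np.argsort(key, kind="stable")
    key, out_scores = key[order], out_scores[order]
    starts = np.flatnonzero(np.r_[True, key[1:] != key[:-1]])
    best = np.maximum.reduceat(out_scores, starts)
    ukey = key[starts]
    return ukey // n_terms, ukey % n_terms, best

# Toy DAG: 0 is the root, 1 and 2 are its children, 3 has parents 1 and 2.
anc_indptr = np.array([0, 1, 3, 5, 9])
anc_indices = np.array([0, 1, 0, 2, 0, 3, 1, 2, 0])
rows = np.array([0, 0, 1])
terms = np.array([3, 1, 2])
scores = np.array([0.4, 0.7, 0.2])
r, c, v = push_up_max(rows, terms, scores, anc_indptr, anc_indices, n_terms=4)
```

Only terms reachable from an input non-zero appear in the output, so terms without predictions cost nothing. Once the CSR arrays exist, repeated calls reuse them, which would be consistent with the gap between the cold and hot rows above.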

Phase C: memory

Phases A and B optimise time. Phase C removes the two dense allocations that dominate memory at full-GO scale, the O(n_terms²) adjacency matrix and the O(n_prot · n_terms) prediction matrix, with no loss of speed. Parity is bit-exact (atol=0, rtol=0) against the pre-C fork on every scenario and on a 24 000-term × 30 000-protein synthetic corpus, so the tolerance against upstream stays where Phase B left it (rtol=1e-6, atol=1e-9).

Phase	Area	Change
C1	Graph	Sparse DAG. The dense `(n_terms, n_terms)` bool adjacency is replaced by CSR parent/child index arrays + degree vectors built from the per-term `adj` / `children` sets. `top_sort` uses `collections.deque` (`O(1)` `popleft`) instead of `list.pop(0)`.
C3	Prediction	The dense `(n_prot, n_terms)` `float64` prediction matrix is never built. `graph.propagate_to_coo` propagates the parser’s COO sparse-natively; `Prediction.matrix` is a CSR; the sparse kernels read its `data`/`indices`/`indptr` directly.
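The C1 change can be sketched in a few lines: build CSR child arrays and an in-degree vector from per-term `children` sets, then run a queue-based topological sort with `collections.deque`. The function names and the list-based CSR layout are assumptions for this example; the fork's own `top_sort` may differ in detail.

```python
from collections import deque

def build_child_csr(children):
    """Flatten per-term child sets into CSR (indptr, indices) plus in-degrees."""
    indptr, indices = [0], []
    in_degree = [0] * len(children)
    for kids in children:
        for c in sorted(kids):         # sorted only to make the output deterministic
            indices.append(c)
            in_degree[c] += 1
        indptr.append(len(indices))
    return indptr, indices, in_degree

def top_sort(indptr, indices, in_degree):
    """Kahn's algorithm: parents always appear before their children."""
    remaining = list(in_degree)
    queue = deque(t for t, d in enumerate(remaining) if d == 0)
    order = []
    while queue:
        t = queue.popleft()            # O(1), unlike list.pop(0)
        order.append(t)
        for j in range(indptr[t], indptr[t + 1]):
            c = indices[j]
            remaining[c] -= 1
            if remaining[c] == 0:
                queue.append(c)
    if len(order) != len(remaining):
        raise ValueError("graph contains a cycle")
    return order

# Toy DAG: 0 is the root, 1 and 2 are its children, 3 is a child of both.
children = [{1, 2}, {3}, {3}, set()]
indptr, indices, in_degree = build_child_csr(children)
order = top_sort(indptr, indices, in_degree)
```

Storage is proportional to the number of edges rather than `n_terms²`. `list.pop(0)` shifts every remaining element on each call, so swapping it for `deque.popleft` removes a quadratic factor from the sort.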

Measured

Graph construction on the full GO (go-basic.obo, 24 547 BP / 10 123 MF / 4 069 CC terms):

Measurement	Before C1	After C1
Build time (3 namespaces)	1.87 s	0.61 s (~3×)
Peak RSS (DAG only)	312 MB	108 MB (~3×)

End-to-end cafa_eval on a 24 000-term × 30 000-protein synthetic corpus (IA-weighted, PK, th_step=0.01):

Measurement	Before C3	After C3
Peak RSS	11.9 GB	7.0 GB (−41 %)
Wall-clock	51 s	42 s

What is left on the table

Phase C4: sparse ground-truth / exclude matrices. Prototyped and bit-exact, but the sparse TP gather (g[r, c] membership) is slower than a dense fancy index, so the prototype saved ~1.2 GB of memory at the cost of ~2× wall clock. Not merged: the time regression outweighs the memory gain. To be worthwhile it would need a sparse-intersection TP gather (pred.multiply(gt)).
Phase B5: optional numba kernel on the per-namespace parser reduction. Not scheduled; the current PyArrow path already brings parser time below the confusion-matrix cost.
A numba-JIT legacy parser for environments where pyarrow is unavailable. Only worth building if profiling on a legacy-only install warrants it.
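For the C4 item, a sparse-intersection TP gather along the lines of `pred.multiply(gt)` might look like the sketch below. It is not the unmerged prototype and makes no performance claim. The function name `tp_per_protein`, the single-threshold interface, and the toy matrices are assumptions for illustration, using SciPy's CSR matrices.

```python
import numpy as np
from scipy import sparse

def tp_per_protein(pred, gt, tau):
    """Count, per protein (row), ground-truth terms predicted with score >= tau.

    pred: CSR matrix of prediction scores; gt: CSR 0/1 ground-truth matrix.
    """
    # Element-wise product keeps a score only where both matrices are non-zero.
    hits = pred.multiply(gt).tocsr()
    # Row index of every stored entry, recovered from the CSR row pointer.
    rows = np.repeat(np.arange(hits.shape[0]), np.diff(hits.indptr))
    return np.bincount(rows[hits.data >= tau], minlength=hits.shape[0])

pred = sparse.csr_matrix(np.array([[0.9, 0.0, 0.3],
                                   [0.0, 0.6, 0.2]]))
gt = sparse.csr_matrix(np.array([[1, 0, 1],
                                 [0, 0, 1]], dtype=np.int8))
tp_hi = tp_per_protein(pred, gt, tau=0.25)
tp_lo = tp_per_protein(pred, gt, tau=0.1)
```

The product is computed once and can be reused across thresholds, so the per-threshold work reduces to a comparison on `hits.data`; whether that recovers the 2× regression would need to be measured.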