Performance
All benchmarks below were run on a real GOA-derived corpus
(8 712 BP / 4 992 MF / 5 125 CC ground-truth proteins, ~4.45 M
prediction rows) with th_step=0.01 and n_cpu=1. Upstream is an
unmodified checkout of claradepaolis/CAFA-evaluator-PK at 16a6a6d.
Headline
| Variant | Upstream | Fork (B1–B7) | Speedup |
|---|---|---|---|
| NK / LK | 92.96 s | 4.08 s | 22.8× |
| PK | 418.53 s | 10.33 s | 40.5× |
NK and LK share the same code path (gt_exclude is None); the fork
speeds up both equally. The PK path (gt_exclude is not None) gets
a larger win because its per-protein Python loops were heavier to
begin with.
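The Speedup column is the ratio of upstream wall time to fork wall time. A minimal sketch that rechecks the headline figures from the table (the `speedup` helper is a hypothetical name used only for this illustration):

```python
def speedup(upstream_s: float, fork_s: float) -> float:
    """Return how many times faster the fork is than upstream."""
    return upstream_s / fork_s


# Wall times copied from the headline table, in seconds.
headline = {
    "NK / LK": (92.96, 4.08),
    "PK": (418.53, 10.33),
}

for variant, (upstream_s, fork_s) in headline.items():
    print(f"{variant}: {speedup(upstream_s, fork_s):.1f}x")
```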
Phase-by-phase breakdown
Phase A — cherry-picked ideas from T0chka/CAFA-evaluator-PK-speedup
(Antonina Dolgorukova). Bit-exact (atol=0, rtol=0) against the
oracle.
| Phase | Change |
|---|---|
| A.1 | |
| A.2 | |
| A.3 | |
| A.4 | |
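A bit-exact gate (atol=0, rtol=0) only passes when every value is identical to the oracle. The sketch below shows one way such a gate could be written with NumPy, and why a rewrite that reorders floating-point sums needs a small tolerance instead. The `assert_parity` helper is an assumption made for illustration, not the project's actual test code.

```python
import numpy as np


def assert_parity(candidate, oracle, rtol=0.0, atol=0.0):
    """Raise AssertionError unless candidate matches oracle within tolerance.

    With rtol=0 and atol=0 this is a bit-exact comparison.
    """
    np.testing.assert_allclose(candidate, oracle, rtol=rtol, atol=atol)


# The same three terms summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# Floating-point addition is not associative, so the two sums differ
# in the last bit and a bit-exact gate would reject them.
print(left == right)  # False

# A loosened gate accepts the reordered sum.
assert_parity(left, right, rtol=1e-6, atol=1e-9)
```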
Phase B — sparse + vectorised rewrites. Parity gate loosened to
rtol=1e-6, atol=1e-9 because the rewrites reorder inner per-protein
sums.
| Phase | Area | Change |
|---|---|---|
| B1 | NK kernel | Sparse scatter + right-to-left cumsum. Replaces per-threshold dense mask scan (… |
| B2 | PK kernel | Sparse scatter extended with per-protein exclude AND — the per-protein … |
| B3 | Parser | PyArrow … |
| B4 | Propagation | Sparse push-up kernel. Flat CSR ancestor cache + scatter over input non-zeros + … |
| B6 | gt_parser | Skip dead dense scans in … |
| B7 | Prep | Trim … |
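The B1 idea, sparse scatter followed by a right-to-left cumulative sum, can be sketched as follows. This is a simplified illustration under stated assumptions, not the fork's kernel: it counts only true positives per protein, and the function names, the flat `(protein_idx, scores, is_true)` layout, and the threshold grid are all hypothetical. Each prediction is scattered once into the bin of the highest threshold it still passes; a reverse cumsum along the threshold axis then yields the count at every threshold without rescanning the data per threshold.

```python
import numpy as np


def tp_per_threshold_dense(protein_idx, scores, is_true, n_proteins, taus):
    """Reference: one full mask scan per threshold."""
    out = np.zeros((n_proteins, taus.size), dtype=np.int64)
    for j, tau in enumerate(taus):
        mask = is_true & (scores >= tau)
        out[:, j] = np.bincount(protein_idx[mask], minlength=n_proteins)
    return out


def tp_per_threshold_sparse(protein_idx, scores, is_true, n_proteins, taus):
    """Scatter each prediction once, then cumsum from right to left."""
    # k = number of thresholds that the score reaches (taus must be sorted).
    k = np.searchsorted(taus, scores, side="right")
    keep = is_true & (k > 0)
    hist = np.zeros((n_proteins, taus.size), dtype=np.int64)
    # Unbuffered scatter-add into the highest threshold bin each row passes.
    np.add.at(hist, (protein_idx[keep], k[keep] - 1), 1)
    # A prediction in bin j also counts for every lower threshold.
    return hist[:, ::-1].cumsum(axis=1)[:, ::-1]


taus = np.array([0.25, 0.5, 0.75])
protein_idx = np.array([0, 0, 1, 1])
scores = np.array([0.9, 0.5, 0.1, 0.6])
is_true = np.array([True, True, True, False])

print(tp_per_threshold_sparse(protein_idx, scores, is_true, 2, taus))
# [[2 2 1]
#  [0 0 0]]
```

The dense reference does work proportional to the number of thresholds times the number of rows, while the sparse version touches each row once and then does a cumsum over the small per-protein histogram.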
Phase B7 drilldown
Before B7, the PK prep block dominated BP namespace cost:
| Line | Wall time |
|---|---|
| | 1.15 s |
| | 1.07 s |
| | 2.29 s |
| | 0.59 s |
| | 0.61 s |
After B7, end-to-end on the same corpus:
| Measurement | Before B7 | After B7 |
|---|---|---|
| NK end-to-end | 6.68 s | 4.08 s |
| PK end-to-end | 28.73 s | 10.33 s |
| | 5.443 s | 0.599 s |
| | 1.805 s | 0.746 s |
| | 5.68 s | 2.05 s |
Micro-benchmarks
Individual hot spots, isolated:
| Kernel | Corpus | Before | After |
|---|---|---|---|
| | 4.45M rows | 1.22 s | 0.45 s (2.72×) |
| | 4.45M rows | 1.44 s | 0.63 s (2.28×) |
| | real PK | 2.73 s | 1.45 s (1.88×) |
| | real corpus | 1.71 s | 1.46 s |
| | real corpus | 2.21 s | 0.64 s |
| | real corpus | 2.21 s | 0.16 s |
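Isolated timings like these are usually taken by calling one kernel repeatedly and keeping the best wall time. A minimal sketch of such a harness is shown here; it is an assumption about the general approach, not the script that produced the numbers above, and `best_of` and the toy workload are hypothetical.

```python
import time


def best_of(fn, repeats=5):
    """Call fn() `repeats` times and return the fastest wall time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def toy_kernel():
    # Stand-in workload; replace with the kernel under test.
    return sum(i * i for i in range(100_000))


elapsed = best_of(toy_kernel)
print(f"toy_kernel: {elapsed:.4f} s (best of 5)")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise, which matters when runs are pinned to a single CPU.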
What is left on the table
Phase B5 — optional numba kernel on the per-namespace parser reduction. Not scheduled; the current PyArrow path already brings parser time below the confusion-matrix cost.
A numba-JIT legacy parser for environments where pyarrow is unavailable. Only worth building if profiling on a legacy-only install warrants it.