13.3.1. Cython Acceleration#
The inner loop of histogram distribution matching (DyS, HDy,
HDx) is accelerated by an optional compiled Cython kernel. It is a
pure performance optimisation: the estimates are identical with or without it,
and if the compiled extension is not available the library transparently falls
back to a pure-Python implementation.
13.3.1.1. What is the “histogram sweep”?#
A histogram matching quantifier represents each class by a normalised histogram of classifier scores. For a binary problem it stores the positive and negative class histograms \(H_+\) and \(H_-\), and estimates the positive prevalence \(\alpha\) as the mixing proportion whose mixed histogram is closest to the test histogram \(H_U\):
Finding that \(\alpha\) means evaluating the distance \(D\) at many candidate values of \(\alpha\) — a sweep over the interval (a grid search, or a ternary search since the objective is unimodal). Each evaluation:
builds the mixture
alpha * Hpos + (1 - alpha) * Hneg,normalises it,
computes a distance (Hellinger / Topsøe / prob-symm) against the test histogram.
For HDy/DyS this sweep is repeated over several bin counts
and the per-bin estimates are aggregated by their median. So a single predict
performs hundreds to thousands of these tiny distance evaluations, each on
short histogram vectors (typically 8–110 bins).
13.3.1.2. What Cython does here#
Those thousands of evaluations are the hot path — and because each one works on a
tiny array, the cost is dominated by Python/numpy per-call overhead
(np.asarray, ufunc dispatch, temporary allocation), not by the actual
arithmetic.
The Cython kernel (mlquantify.matching._histogram_sweep) collapses the whole
mixture → normalise → distance → search into a single compiled C function:
typed, contiguous
float64memory views (no Python objects in the loop);the unimodal search (ternary or grid) runs inside the C function, so the thousands of distance evaluations never cross back into Python;
the inner loop is
nogil— pure C, no interpreter overhead.
The result is bit-for-bit the same \(\alpha\), computed without the per-iteration Python overhead.
13.3.1.3. Where it is more efficient — speed and memory#
Speed. The kernel removes the Python/numpy call overhead that dominates the
sweep over tiny arrays, and keeps the search in compiled C. In practice it roughly
halves DyS/HDy predict time relative to the generic Python solver
(more on a fine bin grid or large test sets), and the gap grows with the number of
bin counts swept.
Memory. This is where the kernel helps most relative to the pure-Python path:
Path |
Allocation behaviour |
|---|---|
generic Python solver |
Allocates a fresh mixture array (and normalised copies) on every
\(\alpha\) evaluation — thousands of small temporary arrays per
|
Cython sweep kernel |
Allocates one scratch buffer of length |
So the kernel turns a stream of thousands of tiny allocations into a single reused buffer, which both cuts allocator/GC pressure and keeps the working set in cache.
13.3.1.4. Enabling, checking and toggling#
The kernel is used automatically when the compiled extension is present (it ships in the binary wheels on PyPI). Everything still works without it — just slower — via the pure-Python fallback.
import mlquantify.matching._histogram as H
H._HAS_FAST_KERNEL # True if the compiled extension is built/imported
H.USE_SWEEP_KERNEL # runtime on/off switch (default True)
# force the generic Python solver (e.g. for debugging / benchmarking)
H.USE_SWEEP_KERNEL = False
# ... and back on
H.USE_SWEEP_KERNEL = True
If the compiled kernel is not built, the first histogram-matching predict
emits a one-time PerformanceWarning. To build it from a source checkout:
pip install -e . # compiles mlquantify/matching/_histogram_sweep.pyx
Note
The kernel only changes speed and memory, never the result: the estimated prevalence is identical (to floating-point tolerance) whether it runs the compiled kernel, the pure-Python sweep, or the generic solver.
See also
Distribution Matching for the methods that use the sweep, and
Representations for the HistogramRepresentation
it operates on.