Change-history (VCS) metrics

bca vcs ranks files by change-history risk — signals derived from version-control history rather than the source AST. It is the project's first language-agnostic, non-AST metric family. The goal is to surface the files most likely to harbour bugs or vulnerabilities, using the signals the empirical defect- and vulnerability-prediction literature most consistently backs.

The history is walked once per invocation (never per file), and the walk produces per-file signals over two configurable windows: a long window (default 12mo ≈ 365 days) and a recent window (default 90d).

Quick start

$ bca vcs --paths src --top 20
Change-history risk (long window 365d, recent 90d, formula v2)
 RANK      RISK  COMMITS rec/long  CHURN rec/long  AUTHORS long  FILE
    1       7.2             68/68     11634/11634             1  src/metrics/cyclomatic.rs
    2       6.9             68/68       7299/7299             1  src/metrics/npa.rs
    ...

With no --format, a human-readable ranked table is printed. Pass --format markdown|html for a rendered report page, or --format json|yaml|toml|cbor|csv for structured output. Unlike bca metrics / bca ops (whose --output-dir is a directory of per-file emissions), a change-history report is a single whole-repo document, so bca vcs --output <file> writes one file (CBOR, being binary, requires --output). The global --paths / --include / --exclude / --no-ignore filters are reused to pick which tracked files to report.

bca vcs errors clearly when run outside a git working tree.

File-type scope

By default bca vcs ranks only the files bca computes metrics for — the same set bca metrics would analyse. High-churn non-source files (CHANGELOG.md, Cargo.lock, generated config) carry no maintainability meaning yet maximise the churn / commit / author signals, so ranking them beside source code is noise; scoping to files-with-metrics also keeps the standalone ranking aligned with the AST hotspot tables in bca report --vcs.

--file-types <SCOPE> selects the scope:

Value	Meaning
`metrics` (default)	Only files whose extension maps to a language bca computes metrics for
`all`	Every tracked, non-binary, non-symlink text file
`rs,py,toml,…`	A comma-separated extension allow-list (leading dots optional, case-insensitive)

bca vcs                          # rank source files only (default)
bca vcs --file-types all         # rank every tracked text file
bca vcs --file-types rs,py       # rank only Rust and Python files

The check is extension-only (no file content is read) and ANDs with the --paths / --include / --exclude / --no-ignore filters — a file must pass both to be ranked. Extension-less files (Makefile, Dockerfile, LICENSE) and unknown extensions are out of the metrics scope; a custom list is a literal extension filter, so it can include a non-metrics type like toml. An empty or all-blank custom list is a clear error rather than a scope that silently ranks nothing.
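The extension-only check can be sketched as below. This is an illustration under stated assumptions, not bca's code: `METRICS_EXTENSIONS` is a hypothetical stand-in for the real set of supported languages, and the binary / symlink exclusion that `all` also applies is not modelled.

```python
from pathlib import PurePosixPath

# Hypothetical stand-in: the real list of extensions bca has metrics for is not given here.
METRICS_EXTENSIONS = {"rs", "py", "js", "ts", "c", "cpp", "java", "go"}


def parse_scope(value: str):
    """Turn a --file-types value into "all" or a set of lowercase extensions."""
    if value == "metrics":
        return set(METRICS_EXTENSIONS)
    if value == "all":
        return "all"
    # Custom list: leading dots optional, case-insensitive, blanks ignored.
    exts = {part.strip().lstrip(".").lower() for part in value.split(",")}
    exts.discard("")
    if not exts:
        raise ValueError("empty --file-types extension list")
    return exts


def in_scope(path: str, scope) -> bool:
    """Extension-only test; no file content is read."""
    if scope == "all":
        return True
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    # Extension-less files (Makefile, LICENSE) never match an extension set.
    return ext != "" and ext in scope
```

A file is ranked only if `in_scope` holds and it also passes the path filters, which are outside this sketch.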

Rendered report page

bca vcs --top 50 --format html --output vcs.html
bca vcs --top 50 --format markdown --output vcs.md

--format html produces a self-contained, sortable page styled exactly like bca report html (click any column header to re-sort); --format markdown produces the same ranked table as GitHub-Flavored Markdown. Both render every signal column (the complete, sortable view of the same data the structured formats carry). The column set is defined once and shared by both renderers, so they cannot drift.

To fold the ranking into the aggregated quality report instead of a standalone page, pass bca report --vcs, which appends a "Change-history risk" section to report markdown / report html.

Signals

Field	Type	Description
`commits_long` / `commits_recent`	u32	Distinct commits touching the file in each window
`churn_long` / `churn_recent`	u64	Σ(added + deleted) lines in each window
`authors_long` / `authors_recent`	u32	Distinct canonical author identities in each window
`ownership_top_share`	f64 ∈ [0,1]	Share of edits attributable to the top author (lower = more diluted)
`burst`	f64 ∈ [0,1]	`commits_recent / commits_long`
`bug_fix_commits`	u32	Long-window commits whose message matches a bug-fix keyword
`security_fix_commits`	u32	Long-window commits matching security keywords (`CVE-####`, `security`, `vuln`, `exploit`, `sanitize`, …)
`revert_commits`	u32	Long-window commits whose subject is a revert / rollback
`age_days`	u32	Days since the file's first in-window commit (capped at the long window)
`last_modified_days`	u32	Days since the file's most recent in-window commit
`change_entropy_long` / `change_entropy_recent`	f64	Change entropy in bits per window (see below)
`cochange_entropy_long` / `cochange_entropy_recent`	f64	Co-change graph entropy in bits per window (see below)
`risk_score`	f64	Composite, formula-versioned (see below) — ordinal, not cardinal
`hotspot_score`	f64?	`complexity × churn_recent`; present only when AST metrics are computed alongside
`risk_score_version` / `vcs_schema_version`	u32	Forward-compatibility version stamps. Carried once on the report envelope, alongside `long_window_days` / `recent_window_days` — not repeated inside each per-file `vcs` block (issue #635)

Author identities are canonicalised through the repository .mailmap and counted by lowercased email; Co-authored-by: trailers add participants. Bot identities (dependabot[bot], renovate[bot], github-actions[bot], …) are excluded by default. Binary files and symlinks are skipped; an untracked file has no record at all (distinct from a tracked file with zero in-window activity).

Change & co-change entropy

Two process-entropy signals (added in risk_score_version 2) capture how a file changes, not just how much:

Change entropy (Hassan, 2009 — Predicting Faults Using the Complexity of Code Changes). For each commit, the Shannon entropy (in bits) of its churn distribution across the files it touched measures how scattered that change was: a one-file commit is 0; a commit spreading churn evenly across n files approaches log₂(n). Each file is then credited its churn share pᵢ·H of every commit it took part in (Hassan's History Complexity Metric). Higher = the file is repeatedly caught up in diffuse, cross-cutting changes. Later work (arXiv 2504.18511, below) measured file-level change entropy at a Pearson correlation up to 0.54 with defect counts on eight Apache projects.
Co-change graph entropy (arXiv 2504.18511, 2025). Files that change in the same commit are joined by a weighted edge (weight = number of shared commits). A file's co-change entropy is the Shannon entropy of its edge-weight distribution: low when it always co-changes with the same partner, high when its changes ripple across many different files. Combined with change entropy it improved AUROC in 82.5% of cases over the v1 signal set on eight Apache projects.

Both are reported per window. A 0.0 is computed, not missing: the file only ever changed alone (no co-change neighbours, or single-file commits with zero change entropy). Bulk-import commits touching more than 1000 files are excluded from the co-change graph — its edge count grows O(width²) — but still contribute their O(width) change entropy.
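A minimal sketch of both entropy signals, assuming each commit is given as a `{path: churn}` mapping (a hypothetical input shape chosen for illustration). It follows the definitions above but is not the project's implementation; windowing is left out.

```python
import math
from collections import defaultdict


def shannon_bits(weights):
    """Shannon entropy, in bits, of a list of non-negative weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def change_entropy(commits):
    """Credit each file its churn share p_i * H of every commit it took part in."""
    credit = defaultdict(float)
    for churn_by_file in commits:
        total = sum(churn_by_file.values())
        if total <= 0:
            continue
        h = shannon_bits(list(churn_by_file.values()))
        for path, churn in churn_by_file.items():
            credit[path] += (churn / total) * h
    return dict(credit)


def cochange_entropy(commits, max_width=1000):
    """Entropy of each file's co-change edge weights (weight = shared commits)."""
    edges = defaultdict(lambda: defaultdict(int))
    for churn_by_file in commits:
        files = list(churn_by_file)
        if len(files) > max_width:
            continue  # bulk imports stay out of the O(width^2) graph
        for a in files:
            for b in files:
                if a != b:
                    edges[a][b] += 1
    return {path: shannon_bits(list(nbrs.values())) for path, nbrs in edges.items()}


commits = [{"a": 10, "b": 10}, {"a": 5}, {"a": 1, "c": 1}]
change = change_entropy(commits)
cochange = cochange_entropy(commits)
```

Here `a` takes part in two evenly split two-file commits, so it is credited 0.5 bits from each; it co-changes equally with `b` and `c`, giving one bit of co-change entropy. A file that only ever changed alone is absent from the `cochange_entropy` result, which corresponds to the computed 0.0 described above.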

Composite risk score

The default weighted formula is a log-scaled weighted sum with categorical multiplicative bumps:

recency_churn  = ln(1 + churn_recent)
recency_count  = ln(1 + commits_recent)
long_count     = ln(1 + commits_long)
long_churn     = ln(1 + churn_long)
author_factor  = ln(1 + authors_long)
dilution       = (1 - ownership_top_share).clamp(0, 1)
fix_factor     = ln(1 + bug_fix_commits + 2 * security_fix_commits)
size_factor    = ln(1 + sloc)^2 / 100              // full coefficient, not a tie-breaker
entropy_factor = 0.10 * change_entropy_recent + 0.05 * cochange_entropy_recent
new_file_bonus = 0.15 if age_days < recent_window_days else 0
dev_bonus      = 0.35 if authors_long >= 9 else 0.15 if authors_long >= 6 else 0

base = 0.30 * recency_churn
     + 0.25 * recency_count
     + 0.15 * long_count
     + 0.15 * author_factor * (1 + dilution)
     + 0.10 * fix_factor
     + 0.05 * long_churn
     + entropy_factor
     + size_factor

risk_score = base * (1 + dev_bonus + new_file_bonus)

The term weights are grounded in the literature: recent churn and commit frequency carry the highest weight (Nagappan & Ball relative churn; just-in-time defect prediction; Firefox NumChanges PD 86); the author factor is scaled by ownership dilution (Avelino DoA / truck-factor; Bird et al.); the categorical developer-count bumps encode the RHEL4 finding that files touched by ≥9 developers were ~16× more likely to harbour a vulnerability; security fixes are double-weighted (Sentence-Level VFC studies; PySecDB); and the recent-window change- and co-change-entropy terms enter additively (Hassan 2009; arXiv 2504.18511). The full derivation lives in src/vcs/score.rs.

The score is ordinal: only relative ranks have meaning. A single risk_score_version (currently 2) covers both formulas: any change to the weighted sum or to the --risk-formula percentile blend bumps it. Version 2 added the recent-window entropy pair to both formulas.
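The weighted formula above translates almost line for line into code. The sketch below is a transcription for experimentation, not the contents of src/vcs/score.rs; the dictionary input shape is hypothetical, and `ln(1 + sloc)^2` is assumed to mean the logarithm squared.

```python
import math


def weighted_risk(s: dict, recent_window_days: int = 90) -> float:
    """Transcribe the weighted formula for one file's signals (dict keys as in Signals)."""
    ln1p = math.log1p  # ln(1 + x)
    dilution = min(max(1.0 - s["ownership_top_share"], 0.0), 1.0)
    fix_factor = ln1p(s["bug_fix_commits"] + 2 * s["security_fix_commits"])
    # Assumption: ln(1 + sloc)^2 is read as (ln(1 + sloc)) squared.
    size_factor = ln1p(s["sloc"]) ** 2 / 100
    entropy_factor = (0.10 * s["change_entropy_recent"]
                      + 0.05 * s["cochange_entropy_recent"])
    new_file_bonus = 0.15 if s["age_days"] < recent_window_days else 0.0
    authors = s["authors_long"]
    dev_bonus = 0.35 if authors >= 9 else 0.15 if authors >= 6 else 0.0

    base = (0.30 * ln1p(s["churn_recent"])
            + 0.25 * ln1p(s["commits_recent"])
            + 0.15 * ln1p(s["commits_long"])
            + 0.15 * ln1p(authors) * (1 + dilution)
            + 0.10 * fix_factor
            + 0.05 * ln1p(s["churn_long"])
            + entropy_factor
            + size_factor)
    return base * (1 + dev_bonus + new_file_bonus)
```

Because the result is ordinal, the useful check is how two files compare under the same inputs, not what a single value means in isolation.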

--risk-formula percentile is an alternative: each signal is re-ranked to its percentile within the analyzed set, then averaged — the literature recommends relative triggers over hard thresholds for cross-project robustness.
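The percentile idea can be sketched as follows. Two points are assumptions made for illustration: a signal's percentile is taken as the fraction of files at or below the file's value, and the set of signals to blend is passed in by the caller (the exact set bca uses is not listed here).

```python
def percentile_risk(files: dict, signals: list) -> dict:
    """Average, per file, the percentile rank of each named signal.

    files maps path -> {signal name: value}; every file must carry every signal.
    """
    n = len(files)
    scores = {}
    for path, values in files.items():
        ranks = []
        for name in signals:
            # Assumed definition: share of files whose value is <= this file's value.
            at_or_below = sum(1 for other in files.values()
                              if other[name] <= values[name])
            ranks.append(at_or_below / n)
        scores[path] = sum(ranks) / len(ranks)
    return scores


files = {
    "a.rs": {"churn_recent": 10, "commits_recent": 1},
    "b.rs": {"churn_recent": 20, "commits_recent": 2},
}
scores = percentile_risk(files, ["churn_recent", "commits_recent"])
```

Since every value is relative to the analyzed set, adding or removing files changes the scores of the others, which is the intended trade-off against hard thresholds.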

Flags

Flag	Default	Meaning
`--long-window <DUR>`	`12mo`	Long window (`12mo`, `2y`, `8w`, `365d`, ISO 8601 `P1Y`)
`--recent-window <DUR>`	`90d`	Recent window
`--top <N>`	`50`	Show only the top N (`0` = all)
`--file-types <SCOPE>`	`metrics`	Files to rank: `metrics`, `all`, or an extension list (`rs,py`)
`--ref <REF>`	`HEAD`	Revision to analyse
`--full-history`	off	Walk the full DAG (default: first-parent only)
`--include-merges`	off	Include merge commits
`--no-follow-renames`	off	Stop following renames (default: follow)
`--no-exclude-bots` / `--bot-pattern <RE>`	exclude	Bot-author filtering
`--as-of <WHEN>`	wall clock	Reference "now" (RFC 3339 / `@unix` / git date) for reproducible snapshots
`--risk-formula {weighted|percentile}`	`weighted`	Composite formula
`--emit-author-details`	off	Emit SHA-256-hashed canonical author IDs
`--author-hash-key <KEY>`	unset	Harden the emitted author digests into a keyed HMAC (see Author-detail privacy); requires `--emit-author-details`
`--include-deleted`	off	Also rank files deleted at the target ref
`--no-cache`	off	Skip the persistent history cache (always walk fresh)
`--clear-cache`	off	Wipe this repo's cached history before running
`--cache-dir <DIR>`	platform cache	Override the cache directory

Caching

Ranking re-walks only the part of history inside the long window, but on a large, active repository that is still the dominant cost — and in CI the interesting deltas between runs are just the commits pushed since the last one. bca vcs therefore keeps a persistent cache of each walk, keyed by the resolved HEAD SHA and the repository's identity:

On an unchanged tree the prior result is replayed, no history walk.
When HEAD has advanced the walk visits only the new commits and splices them onto the cached history.
A force-push (the cached head is no longer an ancestor of the new one) falls back to a full walk.

The cache is a pure optimization: a hit is bit-identical to a fresh walk, and the time windows are recomputed against the current moment on every run, so a cached result is never stale. An entry is ignored — and the history recomputed — whenever the schema, the score-formula version, or the walk-affecting options differ; in particular changing a window forces a fresh walk. (Finalization-only knobs such as --risk-formula, --emit-author-details, --author-hash-key, and --include-deleted are applied on replay, so they reuse the same cached walk — a cached walk even re-finalizes under a different author-hash key without re-walking.)
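The replay / incremental / full-walk decision described above can be summarised in a few lines. This is a sketch of the decision logic only; `CacheEntry`, `plan_walk`, and the `is_ancestor` callback are hypothetical names, not bca's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    head: str              # HEAD SHA the cached walk ended at
    schema_version: int
    score_version: int
    walk_options: tuple    # hypothetical: windows and other walk-affecting options


def plan_walk(entry, head, schema_version, score_version, walk_options, is_ancestor):
    """Return "replay", "incremental", or "full" for the current invocation."""
    if entry is None:
        return "full"
    current = (schema_version, score_version, walk_options)
    if (entry.schema_version, entry.score_version, entry.walk_options) != current:
        return "full"          # stale entry: ignore it and recompute
    if entry.head == head:
        return "replay"        # unchanged tree: no history walk
    if is_ancestor(entry.head, head):
        return "incremental"   # walk only the new commits, splice onto the cache
    return "full"              # force-push: cached head is no longer an ancestor
```

Finalization-only options are deliberately absent from `walk_options`, mirroring the rule that they are applied on replay.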

By default the cache lives under $XDG_CACHE_HOME/big-code-analysis/vcs (%LOCALAPPDATA% on Windows, ~/.cache otherwise). Author identities are stored only as their SHA-256 digests — never plaintext — so the cache holds no raw author emails. Note this is pseudonymization, not anonymization: the digests are recoverable against a candidate email set (see --emit-author-details). The same cache transparently accelerates bca metrics --vcs and bca report --vcs.

# First run primes the cache; the second replays it.
bca vcs --paths .
bca vcs --paths .                 # reuses prior work

bca vcs --no-cache --paths .      # ignore the cache for this run
bca vcs --clear-cache --paths .   # rebuild from scratch
bca vcs --cache-dir /tmp/bca-cache --paths .

The REST (POST /v1/vcs) and Python (vcs.rank) surfaces expose the same behaviour through optional no_cache / cache_dir parameters.

The cache is specific to the file ranking. The trend and commit subcommands — and the /v1/vcs/trend and /v1/vcs/jit endpoints — do not use it, so the cache flags do not apply there: passing --no-cache / --cache-dir alongside a subcommand is a usage error, and the trend endpoint rejects a no_cache / cache_dir field rather than silently ignoring it (issue #961).

In `bca metrics`

Pass bca metrics --vcs to attach a vcs block (plus a hotspot_score computed from the file's cyclomatic sum) to each file's metrics:

$ bca metrics --vcs --paths src/parser.rs --format json
{ "name": "src/parser.rs",
  "metrics": { "cyclomatic": { ... },
    "vcs": { "commits_long": 15, "churn_recent": 211,
             "risk_score": 3.7, "hotspot_score": 7596.0, ... } } }

bca metrics --vcs uses the default windows and weighted formula; for window / formula tuning use bca vcs.

Per-function attribution

bca metrics --vcs-per-function (which implies --vcs) additionally attaches a vcs block to every nested function, method, and class space. It blames each file once with git blame and buckets the surviving lines into the AST function spans, so you can rank the risky function inside a risky file:

$ bca metrics --vcs-per-function --paths src/parser.rs --format json
{ "name": "src/parser.rs",
  "metrics": { "vcs": { "risk_score": 3.7, ... } },   // file-level block
  "spaces": [
    { "name": "parse", "kind": "function",
      "metrics": { "vcs": { "commits_long": 4, "churn_recent": 12,
                            "risk_score": 2.1, "hotspot_score": 144.0 } } } ] }

The per-function block is a current-blame snapshot and is not directly comparable to the file-level block: its churn counts surviving lines whose last touch falls inside the window (not historical added+deleted churn), and ownership is credited per touching commit. A function nobody has changed within the window reports zero counts. Lines whose last touch predates the long window contribute to the function's size but to none of the windowed counts.
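To make the snapshot semantics concrete, here is a minimal sketch of bucketing blamed lines into function spans. `BlamedLine`, the span mapping, and the inclusive window boundaries are assumptions for illustration; ownership and scoring are left out.

```python
from dataclasses import dataclass


@dataclass
class BlamedLine:
    line: int       # 1-based line number in the current file
    commit: str     # commit that last touched the line
    age_days: int   # days since that commit


def bucket(lines, spans, long_days=365, recent_days=90):
    """spans maps a function name to its inclusive (start, end) line range."""
    out = {}
    for name, (start, end) in spans.items():
        inside = [l for l in lines if start <= l.line <= end]
        in_long = [l for l in inside if l.age_days <= long_days]
        in_recent = [l for l in in_long if l.age_days <= recent_days]
        out[name] = {
            "lines": len(inside),                        # size counts every line
            "commits_long": len({l.commit for l in in_long}),
            "commits_recent": len({l.commit for l in in_recent}),
            "churn_long": len(in_long),                  # surviving lines, not added+deleted
            "churn_recent": len(in_recent),
        }
    return out
```

Each span is tested independently, so a line inside a method also counts toward the enclosing class span.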

Limitations. Blame follows file renames (so edits under a former path still attribute), but attributes a line moved between functions to its current position only. A function split into two has no record of its pre-split identity, and a deleted-then-recreated function attributes to the recreating commits. If a file cannot be blamed — untracked, or the rare gix-blame failure on pathologically repetitive content — its per-function blocks are simply omitted while the file-level block (and the AST metrics) still emit.

Just-in-time (commit-level) scoring

Where everything above ranks files at a ref, bca vcs commit <commit> scores a single commit for defect-induction risk — the unit a CI gate reviews at check-in. (The subcommand was renamed from bca vcs jit in 2.0; the old jit spelling keeps working as a hidden alias for one release cycle. "Just-in-time (JIT)" stays the literature term, below.) It is a static, rule-based scorer (no trained model, so nothing drifts as the project ages), with the feature groups and signs taken from the just-in-time defect-prediction literature: Kamei et al., A Large-Scale Empirical Study of Just-in-Time Quality Assurance, IEEE TSE 2013, with the open replications Commit Guru (FSE 2015) and McIntosh & Kamei, Are Fix-Inducing Changes a Moving Target? (IEEE TSE 2018).

$ bca vcs commit HEAD --pretty
{
  "jit_schema_version": 3,
  "jit_score_version": 1,
  "source": "commit",
  "risk_score": 4.40,
  "commit": { "id": "5176d3e…", "parent_count": 1, "is_merge": false,
              "purpose": { "is_fix": true, "is_security_fix": false,
                           "is_revert": false } },
  "features": {
    "size":       { "lines_added": 942, "lines_deleted": 60,
                    "files_touched": 19, "hunks": 78 },
    "diffusion":  { "subsystems": 5, "directories": 8, "entropy": 3.48 },
    "history":    { "prior_changes": 275, "prior_distinct_authors": 1,
                    "prior_bug_fix_commits": 237,
                    "prior_security_fix_commits": 21,
                    "file_risk_max": 10.97, "file_risk_mean": 3.87,
                    "new_files": 2 },
    "experience": { "author_prior_commits": 962,
                    "author_recent_commits": 962 }
  },
  "contributions": { "size": 2.74, "diffusion": 0.97, "history": 1.57,
                     "purpose": 0.15, "experience": -1.03 }
}

The five feature groups, and how each moves the score:

Group	Features	Direction
Size	lines added / deleted, files touched, diff hunks	larger ⇒ riskier
Diffusion	distinct subsystems & directories, within-commit change entropy	more scattered ⇒ riskier
History	the touched files' priors — prior changes, distinct authors, bug- and security-fix counts, and the composite `risk_score` — measured from history before the commit	turbulent file history ⇒ riskier
Experience	the author's prior commit count (long & recent)	more experience ⇒ less risky (this group subtracts)
Purpose	fix / security-fix / revert classification of the message	fixes add, reverts dampen

The contributions block reports each group's signed contribution to the ordinal risk_score, so a consumer can see why a commit ranked where it did. Like the file-level risk_score, the score is ordinal: rank commits by it, or compare a commit against the repository's own distribution, but do not read the magnitude as a probability. Any formula change bumps jit_score_version (separate from the file-level risk_score_version).
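A consumer can use the contributions block to explain a score. The sketch below reuses the numbers from the sample report above; `explain` is a hypothetical helper, and the observation that the contributions add up to risk_score is taken from that sample rather than from a stated guarantee.

```python
import math

# Values copied from the sample `bca vcs commit HEAD --pretty` output above.
report = {
    "risk_score": 4.40,
    "contributions": {"size": 2.74, "diffusion": 0.97, "history": 1.57,
                      "purpose": 0.15, "experience": -1.03},
}


def explain(report):
    """Return the groups ordered by absolute impact, plus their signed sum."""
    parts = report["contributions"]
    ranked = sorted(parts.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked, sum(parts.values())


ranked, total = explain(report)
for group, value in ranked:
    print(f"{group:<11}{value:+.2f}")
print("sum matches risk_score:", math.isclose(total, report["risk_score"], abs_tol=0.005))
```

In this sample the size group dominates and the experience group is the only one pulling the score down.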

The commit is scored against its first parent. A merge commit is flagged (is_merge, parent_count ≥ 2) and scored against that first parent. A root commit and any new files carry zero priors by construction — the score then leans on size and author experience, exactly as the literature prescribes for changes with no file history.

The window / --ref / bot / merge / rename flags are shared with the parent bca vcs command; the commit-only flags are the positional <commit> (default HEAD), --format json|yaml|toml|cbor (default json), --output, --pretty, and:

# CI gate: exit 2 when the commit scores at or above the threshold.
bca vcs commit HEAD --fail-above 6.0

--fail-above uses exit code 2 (the same "metric gate" convention as bca check; exit 1 stays reserved for tool errors). Because the score is ordinal, calibrate the threshold against your repository's own commit-score distribution rather than treating it as an absolute.
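One way to calibrate is to score a sample of past commits and pick a percentile of that distribution as the threshold. The sketch below assumes you have already collected those scores; the sample values and the choice of the 90th percentile are illustrative only.

```python
def nearest_rank_percentile(scores, pct: int) -> float:
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    # Integer ceiling of pct * n / 100, to avoid floating-point edge cases.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def would_fail(score: float, threshold: float) -> bool:
    """Mirror the gate: fail when the score is at or above the threshold."""
    return score >= threshold


# Hypothetical scores gathered from this repository's recent commits.
history = [1.2, 2.0, 2.5, 3.1, 3.3, 4.0, 4.4, 5.2, 6.1, 7.8]
threshold = nearest_rank_percentile(history, 90)
```

The resulting value is what would be passed to --fail-above; recomputing it periodically keeps the gate aligned with the repository's own distribution.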

Scoring an arbitrary diff (`--diff`)

bca vcs commit --diff <file> scores a git diff instead of a commit (use --diff - to read the diff from stdin). This is handy in a pre-commit hook or a code-review bot, where the change exists only as a diff and has not been committed yet.

git diff --cached | bca vcs commit --diff - --pretty

The input must be a git-style unified diff carrying diff --git file headers, as produced by git diff or git format-patch. Plain diff -u / diff -ru output (which has ---/+++ header lines but no diff --git header) parses to zero files, and combined / merge diffs (git diff --cc, with @@@ hunk headers) are rejected as a malformed diff — pipe a regular two-way git diff instead.

A bare diff carries no author, parent, or file history, so only the size and diffusion groups are computable. The output is therefore a deliberately partial report — a distinct shape from a commit report:

$ git diff | bca vcs commit --diff - --pretty
{
  "jit_schema_version": 3,
  "jit_score_version": 1,
  "source": "diff",
  "partial_risk_score": 1.83,
  "size":      { "lines_added": 42, "lines_deleted": 8,
                 "files_touched": 3, "hunks": 6 },
  "diffusion": { "subsystems": 2, "directories": 3, "entropy": 1.46 },
  "contributions": { "size": 1.18, "diffusion": 0.65 }
}

The source field is a permanent "diff" marker, and the history / experience / purpose groups are absent from the report entirely — not present as zero. Zero is a real value (a commit genuinely with no prior history scores those groups at zero); an absent group means "unavailable", so a consumer can never mistake an unscored group for "low risk". For the same reason the score field is named partial_risk_score, not risk_score.

A diff-only score is not comparable to a commit score. The partial score sums only size + diffusion, whereas the full commit score for the same change also folds in history, experience, and purpose. The full score is usually higher, but not always: the experience group subtracts, so an experienced author's commit to files with little history can score below its own partial score. Rank diffs against other diffs, never against commit scores. --diff and the positional <commit> are mutually exclusive; --fail-above works in both modes (calibrate the diff-mode threshold against your own diff-score distribution).

The parser understands git's default C-style path quoting (core.quotePath=true), so a diff touching a file with a non-ASCII or spaced name (which git emits as "a/na\303\257ve.txt") is grouped under its decoded path in the diffusion features, not the raw quoted string.
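For readers handling such diffs themselves, decoding a C-style quoted path looks roughly like this. It is a sketch of the general technique (backslash escapes plus three-digit octal bytes, decoded as UTF-8), not bca's parser.

```python
# Single-character escapes used by C-style quoting, mapped to byte values.
_ESCAPES = {"a": 7, "b": 8, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11,
            '"': 34, "\\": 92}


def unquote_git_path(raw: str) -> str:
    """Decode a C-style quoted path; unquoted input is returned unchanged."""
    if len(raw) < 2 or not (raw.startswith('"') and raw.endswith('"')):
        return raw
    body, out, i = raw[1:-1], bytearray(), 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
            continue
        nxt = body[i + 1:i + 2]
        if nxt in _ESCAPES:
            out.append(_ESCAPES[nxt])
            i += 2
        else:
            digits = body[i + 1:i + 4]
            if len(digits) != 3 or any(d not in "01234567" for d in digits):
                raise ValueError(f"bad escape at offset {i} in {raw!r}")
            out.append(int(digits, 8))  # one raw byte, e.g. \303 -> 0xC3
            i += 4
    return out.decode("utf-8", errors="replace")


print(unquote_git_path(r'"a/na\303\257ve.txt"'))
```

The two octal escapes `\303\257` are the UTF-8 bytes of `ï`, so the decoded path is the readable name rather than the quoted string.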

REST and Python parity

The JIT score is also available off the CLI:

REST: POST /v1/vcs/jit with { "id", "repo_path", "commit" } returns the commit JitReport JSON, or { "id", "diff" } returns the partial diff report. See Driving the REST API.
Python: vcs.commit(repo_path, commit=...) returns the commit report as a dict, and vcs.score_diff(diff) the partial diff report. See Change-history (VCS) metrics.

ML-based JIT models and server-side hook integration remain out of scope.

Historical trend (over time)

A single bca vcs run answers "what is risky now." bca vcs trend answers "is it getting better or worse" — the actionable question for a technical-debt programme — by sampling the metrics at several points in time and emitting a per-file time series.

$ bca vcs --top 20 trend --points 12 --span 24mo --pretty
{
  "trend_schema_version": 1,
  "vcs_schema_version": 2,
  "risk_score_version": 2,
  "long_window_days": 365,
  "recent_window_days": 90,
  "truncated_shallow_clone": false,
  "as_of_points": [ 1700000000, 1705259520, ... ],
  "files": {
    "src/parser.rs": [
      null,                       // did not exist at the oldest point
      { "as_of": 1705259520, "vcs": { "risk_score": 4.1, ... } },
      { "as_of": 1710519040, "vcs": { "risk_score": 6.8, ... } }
    ]
  },
  "deltas": {
    "improved":  [ { "path": "src/old.rs",    "delta": -3.2, ... } ],
    "regressed": [ { "path": "src/parser.rs", "delta":  2.7, ... } ]
  }
}

The series consists of --points N evenly spaced samples (inclusive of both endpoints) covering --span DURATION and ending at --as-of (or wall-clock now). as_of_points lists the sample timestamps oldest-first; every file's array aligns to it 1:1, with a null element marking a point where the file did not exist yet. deltas ranks the files whose risk_score fell the most (improved) and rose the most (regressed) between each file's earliest and latest present points; --top-deltas trims each list.
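The sampling and delta rules can be sketched as below. The integer rounding of intermediate timestamps and the helper names are assumptions; the 2 to 120 bound on the point count is the one documented for this subcommand.

```python
def sample_points(as_of: int, span_seconds: int, points: int) -> list:
    """Evenly spaced Unix timestamps, oldest first, both endpoints included."""
    if not 2 <= points <= 120:
        raise ValueError("points must be between 2 and 120")
    start = as_of - span_seconds
    # Assumption: intermediate points are rounded down to whole seconds.
    return [start + (span_seconds * i) // (points - 1) for i in range(points)]


def risk_delta(series):
    """Latest minus earliest risk_score over the points where the file exists."""
    present = [p["vcs"]["risk_score"] for p in series if p is not None]
    if len(present) < 2:
        return None  # not enough history to compare
    return present[-1] - present[0]


# The src/parser.rs series from the sample output: absent at the oldest point.
parser_series = [
    None,
    {"as_of": 1705259520, "vcs": {"risk_score": 4.1}},
    {"as_of": 1710519040, "vcs": {"risk_score": 6.8}},
]
delta = risk_delta(parser_series)
```

For the sample series this gives a delta of about 2.7, which is why src/parser.rs appears under regressed.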

Crucially, each point re-anchors at the mainline tip that existed at or before that moment — it does not just re-window today's HEAD tree. That is what makes a file born later show as null at older points (rather than leaking its present-day metrics backwards). Files kept in the series are the --top highest-risk by their most-recent sample.

Flags reused from the parent bca vcs command: the window (--long-window / --recent-window), --ref, --file-types, bot / merge / rename toggles, --as-of (the most-recent anchor), and --top. -O accepts json (default), yaml, or cbor; TOML is excluded because an absent point serializes as null, which TOML cannot represent. The point count is bounded (2–120) to keep the per-point history walks tractable on deep histories.

Rename caveat. Renames are followed within each sample's walk, but a file renamed between two samples appears as two separate path series (its old name, then its new name) rather than one continuous line. Cross-sample rename stitching is a deferred follow-up.

Bus factor (directory & repo level)

Where the per-file ownership_top_share measures concentration within a file, the bus factor (a.k.a. truck factor) measures it across a set of files: the minimum number of developers whose departure would leave more than half of a directory's files without a knowledgeable maintainer. bca vcs emits it as a top-level vcs_aggregate object alongside the ranked files:

{
  "vcs_aggregate": {
    "bus_factor": {
      "bus_factor_schema_version": 2,
      "coverage_threshold": 0.5,
      "doa_threshold": 0.75,
      "repo": { "bus_factor": 3, "files": 412, "authors": 11 },
      "by_directory": [
        { "directory": "src", "bus_factor": 2, "files": 180, "authors": 7 },
        { "directory": "src/vcs", "bus_factor": 1, "files": 24, "authors": 3 }
      ]
    }
  }
}

Each developer's authorship of each file is scored with the Avelino Degree-of-Authorship heuristic (Avelino, Passos, Hora & Valente, A Novel Approach for Estimating Truck Factors, ICPC 2016):

DoA(d, f) = 3.293 + 1.098·FA + 0.164·DL − 0.321·ln(1 + AC)

where FA is first authorship (1 if d created f), DL is d's deliveries (changes) to f, and AC is the changes made by other developers. A developer is an author of f when their DoA, normalised by the file's maximum, clears 0.75 (the paper's threshold). The truck factor is then a greedy removal: drop the developer who authors the most still-covered files, repeat until more than --bus-factor-threshold (default 0.5, per Avelino) of the files are orphaned, and report how many were removed. by_directory covers each top-level directory and each of its immediate subdirectories, computed over every file recursively beneath it.
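The DoA formula and the greedy removal can be sketched together. This is an illustration of the procedure as described here, not the project's code: the input shapes, the `>=` reading of "clears 0.75", and the alphabetical tie-break are assumptions.

```python
import math


def doa(first_author: bool, deliveries: int, other_changes: int) -> float:
    """Degree of Authorship of one developer for one file."""
    fa = 1 if first_author else 0
    return 3.293 + 1.098 * fa + 0.164 * deliveries - 0.321 * math.log(1 + other_changes)


def file_authors(doa_by_dev: dict, threshold: float = 0.75) -> set:
    """Developers whose DoA, normalised by the file's maximum, clears the threshold."""
    top = max(doa_by_dev.values())
    if top <= 0:
        return set()
    return {dev for dev, value in doa_by_dev.items() if value / top >= threshold}


def bus_factor(authors_by_file: dict, coverage_threshold: float = 0.5):
    """Greedy removal; returns (bus factor, removed developers in order)."""
    files = {f: set(a) for f, a in authors_by_file.items() if a}
    total = len(files)
    removed = []
    while total:
        orphaned = sum(1 for a in files.values() if not a)
        if orphaned / total > coverage_threshold:
            break
        counts = {}
        for authors in files.values():
            for dev in authors:
                counts[dev] = counts.get(dev, 0) + 1
        if not counts:
            break
        # Drop whoever authors the most still-covered files (ties: alphabetical).
        top_dev = min(counts, key=lambda d: (-counts[d], d))
        removed.append(top_dev)
        for authors in files.values():
            authors.discard(top_dev)
    return len(removed), removed
```

For example, `bus_factor({"a": {"x", "y"}, "b": {"x", "y"}, "c": {"z"}})` removes `x`, then `y`, at which point two of three files are orphaned, so the bus factor is 2.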

Caveats, by construction:

A repository (or directory) whose files are mostly written by the same single author reports a bus factor of 1: losing that one author orphans those files. This is the heuristic working as intended, not a bug; treat the number as a planning signal, not a guarantee.
Bot identities are filtered (like the per-file signals), and files with no in-window activity carry no authorship and are excluded from the denominator.
"First authorship" means the earliest commit observed within the long window, not necessarily a file's true creation.

The aggregate reflects the whole repository within the file-type scope (one history walk covers every in-scope file — by default the files-with-metrics set, see File-type scope — so --file-types all widens the bus factor to every tracked file). --paths / --include / --exclude scope only the ranked per-file list, not the bus factor. To focus on a subsystem, read its entry in by_directory rather than filtering the walk.

--emit-author-details adds a key_author_ids list to each group — the SHA-256-hashed identities of the removed key developers, in removal order (plaintext identities never leave the process). The aggregate is computed only for the dedicated bca vcs / bca report --vcs reports and the REST / Python endpoints; the per-file bca metrics --vcs injection path does not pay for it.

Author-detail privacy

The key_author_ids digests are a stable pseudonym, not anonymization. Hashing keeps plaintext emails out of the report and the cache and deters casual disclosure, but the hash is not cryptographically irreversible. The pre-image is an email — low-entropy and enumerable — and commit histories are public, so anyone with a candidate set of emails can recover which digest belongs to whom by hashing each candidate or with a precomputed email→hash table. This is the same weakness that broke Gravatar's email hashing.

Treat published key_author_ids (and the per-file author_ids) as pseudonymization that avoids emitting plaintext emails, not as a guarantee that authors cannot be re-identified by a determined attacker. If you need that guarantee, do not publish the digests.

Hardened mode: `--author-hash-key`

For stronger resistance, pass a secret key with --author-hash-key <KEY> (requires --emit-author-details). The emitted digests then become an HMAC-SHA256(key, SHA-256(email)) instead of a bare hash: an attacker without the key can no longer hash a candidate email to recognise its digest, nor use a precomputed email→hash table — both attacks need the secret key. Pick a high-entropy key and keep it secret; anyone who learns it can re-run the enumeration.
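The difference between the bare and the keyed digest can be shown with the standard library. The construction HMAC-SHA256(key, SHA-256(email)) is the one stated above; lowercasing the email, feeding the inner digest as raw bytes, and hex output are assumptions, so these values are not expected to match bca's emitted digests.

```python
import hashlib
import hmac


def bare_digest(email: str) -> str:
    """Unkeyed pseudonym: SHA-256 of the (assumed lowercased) email."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def keyed_digest(key: bytes, email: str) -> str:
    """Keyed pseudonym: HMAC-SHA256(key, SHA-256(email))."""
    inner = hashlib.sha256(email.strip().lower().encode("utf-8")).digest()
    return hmac.new(key, inner, hashlib.sha256).hexdigest()


def recover(digest: str, candidates):
    """Enumeration attack on a bare digest: hash each candidate and compare."""
    for email in candidates:
        if hmac.compare_digest(bare_digest(email), digest):
            return email
    return None


leaked = bare_digest("ada@example.com")
print(recover(leaked, ["bob@example.com", "ada@example.com"]))  # re-identified
print(recover(keyed_digest(b"secret", "ada@example.com"),
              ["bob@example.com", "ada@example.com"]))          # None without the key
```

The enumeration succeeds against the bare digest and fails against the keyed one, which is the property the hardened mode relies on as long as the key stays secret.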

The key is stable: the same key yields the same digests across every report and across a persistent-cache replay, so cross-report correlation and the cache still work. Different keys produce unrelated digests, so two teams sharing histories cannot cross-link authors unless they share the key.

Prefer the BCA_AUTHOR_HASH_KEY environment variable over the flag — a key on the command line is visible to other local users via the process list (ps) and is saved in shell history. The flag takes precedence when both are set:

export BCA_AUTHOR_HASH_KEY="$(cat ~/.config/bca/author-key)"
bca vcs --emit-author-details

What the key does not cover: the on-disk history cache (issue #334) deliberately stores the unkeyed inner SHA-256 digest, because the key is applied at finalization so a cached walk can be re-finalized under any key without re-walking. The cache is local-only and never published, but if your threat model includes an attacker reading your local cache directory, disable the cache (--no-cache) or clear it (--clear-cache). The same key option is available on the REST endpoint (author_hash_key) and in Python (vcs.Options(author_hash_key=…)).

Dogfooding in this repo

This project runs bca vcs on its own source. make vcs prints the ranked table (path selection and the .bcaignore deny-set come from the repo-root bca.toml manifest, the same config make self-scan and make report use; BCA_VCS_TOP overrides the row cap). The manifest's [vcs] file_types key sets the default scope (the --file-types CLI flag replaces it when given). On every push to main the Pages CI job folds the rendered ranking into the flagship report — bca report html --vcs / report markdown --vcs — so the published reports/index.html shows the change-history risk section side-by-side with the AST hotspots, and additionally publishes the full top-100 ranking as reports/vcs-report.json for tooling.

REST and Python

REST: POST /v1/vcs with a JSON body { "id": "...", "repo_path": "/path/to/repo", ... } returns the ranked report, and POST /v1/vcs/trend (same fields plus points / span / top_deltas) returns the historical time series. See Driving the REST API.
Python: big_code_analysis.vcs.rank(repo_path, …) returns the ranked report as a dict, vcs.trend(repo_path, points=…, span=…, …) returns the time series, and analyze(path, vcs=True) attaches a vcs block to a single file's metrics.

Both POST /v1/vcs and vcs.rank() (through vcs.Options) accept an optional file_types ("metrics" / "all" / "rs,py") to scope which files are ranked, mirroring the CLI --file-types.

Both include the vcs_aggregate bus factor in the result and accept a bus_factor_threshold (in (0, 1)) to tune the coverage fraction.
