Flat-record iteration

bca.flatten_spaces(result) walks the nested FuncSpace tree in pre-order and yields one flat, scalar-only dict per node — ready for sqlite3.executemany, pandas.DataFrame.from_records, or any other tabular consumer.

Metric keys use the same dotted convention as the CLI's CSV writer (cyclomatic.modified.sum, halstead.volume, loc.lloc_average, …). Identity keys (path, name, kind, start_line, end_line, parent_name, depth) are added on every record.

SQLite via executemany

The example below analyses one file and inserts one row per FuncSpace into a sqlite table whose columns are the union of all flattened keys.

"""Flatten a FuncSpace tree into scalar rows for sqlite / pandas.

Demonstrates ``bca.flatten_spaces`` + ``sqlite3.executemany``. The
pandas equivalent is shown in the book as a non-executed snippet so
this example stays dependency-free (sqlite ships with the stdlib).

Tied to the book's ``python/flat-records.md`` page.
"""

from __future__ import annotations

import sqlite3
from contextlib import closing
from pathlib import Path

import big_code_analysis as bca

# SQLite identifier names are case-insensitive, so the Halstead
# pair `N1` / `n1` (and `N2` / `n2`) collide on one column. Rewrite
# the uppercase totals to a distinct name before insertion. The
# explicit map (not a `.replace(".N", "...")` substring rewrite)
# means a hypothetical future `halstead.NN_metric` would not be
# silently mangled.
_RENAME_FOR_SQLITE: dict[str, str] = {
    "halstead.N1": "halstead.total_1",
    "halstead.N2": "halstead.total_2",
}


def _safe_column(key: str) -> str:
    return _RENAME_FOR_SQLITE.get(key, key)


def run(path: Path, db_path: Path) -> int:
    """Analyse ``path`` and insert one row per FuncSpace into ``db_path``.

    Returns the number of rows inserted so the test can assert on it.
    """
    result = bca.analyze(path)
    if result is None:
        msg = f"{path} was skipped (looks generated)"
        raise SystemExit(msg)

    records = [{_safe_column(k): v for k, v in r.items()} for r in bca.flatten_spaces(result)]
    if not records:
        return 0

    columns = sorted({k for r in records for k in r})
    cols_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(r.get(c) for c in columns) for r in records]

    # `closing(sqlite3.connect(...))` is the documented idiom — the
    # bare ``with sqlite3.connect(...)`` context manager only commits
    # / rolls back the transaction; it does NOT close the connection,
    # so a long-running consumer leaks file descriptors (and on
    # Windows holds an exclusive write lock on the db file).
    with closing(sqlite3.connect(db_path)) as db, db:
        db.execute(f"CREATE TABLE IF NOT EXISTS metrics ({cols_sql})")
        db.executemany(
            f"INSERT INTO metrics ({cols_sql}) VALUES ({placeholders})",
            rows,
        )

    return len(rows)


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 3:
        sys.exit("usage: python flat_records.py <source-file> <out.db>")
    inserted = run(Path(sys.argv[1]), Path(sys.argv[2]))
    print(f"inserted {inserted} rows into {sys.argv[2]}")
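
The column-name collision that motivates the rename map can be checked in isolation. The sketch below uses an in-memory database and made-up values; the helper `columns_collide` is hypothetical and interpolates the names directly into SQL, so it is only suitable for trusted names. It also shows that dotted column names must be double-quoted when querying.

```python
import sqlite3
from contextlib import closing


def columns_collide(first: str, second: str) -> bool:
    """Return True if SQLite rejects the two names as duplicate columns."""
    with closing(sqlite3.connect(":memory:")) as db:
        try:
            db.execute(f'CREATE TABLE t ("{first}", "{second}")')
        except sqlite3.OperationalError:
            return True
    return False


# Names differing only by case are treated as the same column.
collides = columns_collide("halstead.N1", "halstead.n1")
# After the rename the two columns can coexist.
renamed_ok = not columns_collide("halstead.total_1", "halstead.n1")

with closing(sqlite3.connect(":memory:")) as db, db:
    db.execute('CREATE TABLE metrics ("name", "halstead.total_1", "halstead.n1")')
    db.executemany(
        "INSERT INTO metrics VALUES (?, ?, ?)",
        [("parse", 40, 12), ("main", 10, 5)],  # made-up values
    )
    # Dotted identifiers need double quotes in queries as well.
    big = db.execute(
        'SELECT "name" FROM metrics WHERE "halstead.total_1" > 20'
    ).fetchall()
```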

flatten_spaces returns a lazy, single-use iterator: it walks the input once and never materialises the whole list. Iterating the same iterator a second time yields nothing, so wrap the call in list() once if you need more than one pass (the example above does this with a list comprehension).
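
The single-use behaviour is the same as that of any Python generator. The sketch below uses a hypothetical stand-in, `fake_records`, in place of `bca.flatten_spaces(result)` so that it runs without the bindings; the record contents are made up.

```python
from typing import Iterator


def fake_records() -> Iterator[dict]:
    """Hypothetical stand-in for bca.flatten_spaces(result)."""
    yield {"name": "lib.rs", "depth": 0}
    yield {"name": "parse", "depth": 1}


it = fake_records()
first_pass = [r["name"] for r in it]
second_pass = [r["name"] for r in it]  # iterator is already exhausted

# Materialise once when more than one pass is needed.
records = list(fake_records())
names = [r["name"] for r in records]
depths = [r["depth"] for r in records]
```

Materialising trades memory for convenience; for very large trees a single streaming pass may be preferable.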

Pandas

flatten_spaces is the natural input to pandas.DataFrame.from_records. Pandas is not a dependency of the bindings; install it separately if you want the DataFrame view.

import big_code_analysis as bca
import pandas as pd

result = bca.analyze("src/lib.rs")
if result is not None:
    df = pd.DataFrame.from_records(bca.flatten_spaces(result))
    print(df.head())
    # Group by space kind to inspect the average cyclomatic per
    # function vs. per class vs. per file.
    by_kind = df.groupby("kind")["cyclomatic.sum"].mean()
    print(by_kind)
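
If pandas is not installed, the same per-kind average can be sketched with the standard library. The helper name `mean_by_kind` and the sample records are made up for illustration; records that lack the metric key are skipped rather than counted as zero.

```python
from collections import defaultdict
from statistics import fmean
from typing import Any, Iterable


def mean_by_kind(records: Iterable[dict[str, Any]], metric: str) -> dict[str, float]:
    """Average one metric per space kind, ignoring records without it."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for record in records:
        if metric in record:  # absent key: skip instead of treating as 0
            buckets[record["kind"]].append(record[metric])
    return {kind: fmean(values) for kind, values in buckets.items()}


# Hand-written sample records with made-up values.
sample = [
    {"kind": "unit", "cyclomatic.sum": 5.0},
    {"kind": "function", "cyclomatic.sum": 2.0},
    {"kind": "function", "cyclomatic.sum": 4.0},
    {"kind": "function"},  # metric missing on this record
]
by_kind = mean_by_kind(sample, "cyclomatic.sum")
```

Because the helper makes a single pass, it can consume the lazy iterator directly without building a list first.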

Identity columns vs CLI CSV

The flat-record schema mostly matches the CLI's CSV writer, with two intentional differences:

  • Identity columns use name / kind here; the CSV writer uses space_name / space_kind. Flat records also add parent_name / depth; the CSV writer omits those.
  • tokens.* flattens to the JSON shape (tokens.tokens, tokens.tokens_average, …), while CSV renames those to tokens.sum / .average / .min / .max. Rename in the consumer if you need exact CSV alignment.
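
A consumer-side rename for CSV alignment could look like the sketch below. The pairing of `tokens.tokens` with `tokens.sum` and `tokens.tokens_average` with `tokens.average` is an assumption inferred from the order of the names above, and the min / max keys are left out because their flat-record names are not spelled out here. `to_csv_shape`, `CSV_RENAMES` and `FLAT_ONLY` are hypothetical names.

```python
from typing import Any

# Assumed pairings; extend once the remaining key names are confirmed.
CSV_RENAMES: dict[str, str] = {
    "name": "space_name",
    "kind": "space_kind",
    "tokens.tokens": "tokens.sum",
    "tokens.tokens_average": "tokens.average",
}

# Keys that exist only in flat records and have no CSV counterpart.
FLAT_ONLY: frozenset[str] = frozenset({"parent_name", "depth"})


def to_csv_shape(record: dict[str, Any]) -> dict[str, Any]:
    """Rename identity / token keys and drop flat-record-only keys."""
    return {
        CSV_RENAMES.get(key, key): value
        for key, value in record.items()
        if key not in FLAT_ONLY
    }


# Hand-written sample record with made-up values.
flat = {
    "name": "parse",
    "kind": "function",
    "parent_name": "lib.rs",
    "depth": 1,
    "tokens.tokens": 42,
    "halstead.volume": 10.5,
}
csv_like = to_csv_shape(flat)
```

An explicit map keeps unrelated keys untouched, in the same spirit as the explicit SQLite rename map in the example above.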

Anonymous spaces (Rust closures, JavaScript function expressions / arrows) keep their name == "<anonymous>" marker verbatim — flatten_spaces does not normalise.

Caveats

  • parent_name alone cannot disambiguate same-named siblings nested under different parents (e.g. two Inner classes under two different outer classes both surface as parent_name == "Inner" for their own children). Pair with depth and source-order position, or rebuild the qualified name in your consumer, if you need a fully-qualified path.
  • Do not mutate the input result while iterating: the walker keeps references into it, so mutations to not-yet-yielded subtrees will be observed in later records.
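  • Missing metric subtrees produce no keys: the key is absent from the record rather than present with a None value. This matches the "Halstead disabled" edge case for metric selection, so consumers should use r.get(key) or test membership instead of indexing directly.
  • flatten_spaces raises TypeError if the input is not a mapping. bca.analyze can return None (e.g. for generated files with skip_generated=True), so callers must filter those returns out before passing results in.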
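
For the first caveat, a qualified name can be rebuilt in the consumer from depth and source order alone, because records arrive in pre-order. The sketch below is one way to do it, assuming the root record has depth 0 and each child is exactly one level deeper than its parent; `with_qualified_names`, the `qualified_name` key and the `::` separator are hypothetical choices, not part of the bindings.

```python
from typing import Any, Iterable, Iterator


def with_qualified_names(
    records: Iterable[dict[str, Any]], sep: str = "::"
) -> Iterator[dict[str, Any]]:
    """Attach a qualified name built from the chain of enclosing spaces."""
    stack: list[str] = []
    for record in records:
        # Pre-order guarantees ancestors were seen first; drop anything
        # at this depth or deeper left over from a previous subtree.
        del stack[record["depth"]:]
        stack.append(record["name"])
        yield {**record, "qualified_name": sep.join(stack)}


# Hand-written records: two Inner classes under different outer classes.
sample = [
    {"name": "file.py", "depth": 0},
    {"name": "A", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "run", "depth": 3},
    {"name": "B", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "run", "depth": 3},
]
qualified = [r["qualified_name"] for r in with_qualified_names(sample)]
```

The helper copies each record instead of mutating it, and it stays lazy, so it can wrap the flatten_spaces iterator directly.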