Supporting a new language

This section explains how to add support for a new language to big-code-analysis.

To implement a new language, two steps are required:

  1. Generate the grammar
  2. Add the grammar to big-code-analysis

big-code-analysis supports a number of metrics; guidance on implementing them for a language is covered elsewhere in the documentation.

Generating the grammar

As a prerequisite for adding a new grammar, a tree-sitter grammar for the desired language must already exist, in a version that matches the tree-sitter version used in this project.

The grammars are generated by a project in this repository called enums. The following steps add the language support from the language crate and generate an enum file that is then used as the grammar in this project to evaluate metrics.

  1. Add the language specific tree-sitter crate to the enums crate, making sure the dependency is pinned with =X.Y.Z to the same version used in the root big-code-analysis Cargo.toml. For example, for the Rust support the following line exists in the /enums/Cargo.toml: tree-sitter-rust = "=0.24.2".
  2. Append the language to the enum crate in /enums/src/languages.rs. Keeping with Rust as the example, the line would be (Rust, tree_sitter_rust). The first parameter is the name of the Rust enum that will be generated, the second is the tree-sitter function to call to get the language's grammar.
  3. Add a case to the end of the match in the mk_get_language macro rule in /enums/src/macros.rs. The current convention uses the LANGUAGE constant exposed by modern grammar crates: for Rust that line is Lang::Rust => tree_sitter_rust::LANGUAGE.into().
  4. Lastly, we execute the /recreate-grammars.sh script that runs the enums crate to generate the grammar for the new language.

At this point we should have a new grammar file for the new language in /src/languages/. See /src/languages/language_rust.rs as an example of the generated enum.
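
To give a feel for what such a generated file provides, here is a minimal hand-written sketch. It assumes the generated enum has one variant per tree-sitter kind id plus conversions from the raw u16 id. The Toy enum, its variants, and the numeric ids are made up for illustration, and the real generated code may differ in its derives and helper impls.

```rust
// Simplified, hand-written stand-in for a generated language enum.
// Variant names and ids are hypothetical; the real file is produced by the
// enums crate from the grammar and is much larger.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u16)]
pub enum Toy {
    End = 0,
    Identifier = 1,
    Comment = 2,
    FunctionItem = 3,
    Error = 4,
}

impl From<u16> for Toy {
    // Map a raw kind id (the type tree-sitter's `Node::kind_id()` returns)
    // to a variant, falling back to `Error` for unknown ids.
    fn from(id: u16) -> Self {
        match id {
            0 => Toy::End,
            1 => Toy::Identifier,
            2 => Toy::Comment,
            3 => Toy::FunctionItem,
            _ => Toy::Error,
        }
    }
}

impl PartialEq<u16> for Toy {
    // Lets calling code compare a variant against a raw kind id directly.
    fn eq(&self, other: &u16) -> bool {
        *self as u16 == *other
    }
}

fn main() {
    let kind_id: u16 = 2; // pretend this came from `node.kind_id()`
    let kind = Toy::from(kind_id);
    println!("{:?}", kind);
    assert!(Toy::Comment == kind_id);
}
```

Working with named variants instead of bare integers is what lets the metric code use exhaustive match arms over node kinds.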

Adding the new grammar to big-code-analysis

  1. Add the language specific tree-sitter crate to the big-code-analysis workspace, with the same =X.Y.Z pin as the enums crate uses. For example, for the Rust support the line in the root Cargo.toml is tree-sitter-rust = "=0.24.2".
  2. Next, add the new tree-sitter language namespace to /src/languages/mod.rs, e.g.:
pub mod language_rust;
pub use language_rust::*;
  3. Lastly, add a definition of the language to the arguments of the mk_langs! macro in /src/langs.rs:
// 1) Name for enum
// 2) Language description
// 3) Display name
// 4) Empty struct name to implement
// 5) Parser name
// 6) tree-sitter function to call to get a Language
// 7) file extensions
// 8) emacs modes
(
    Rust,
    "The `Rust` language",
    "rust",
    RustCode,
    RustParser,
    tree_sitter_rust,
    [rs],
    ["rust"]
)
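
As a rough illustration of why the tuple is shaped this way, the sketch below shows how a macro could turn such entries into an enum and an extension lookup. It is not the project's mk_langs! implementation: mk_langs_sketch!, Lang, display_name, from_extension, and the Toy entry are all hypothetical, and only three of the eight fields are kept.

```rust
// Hypothetical, cut-down macro in the spirit of `mk_langs!`. Each entry is
// (enum variant name, display name, [file extensions]).
macro_rules! mk_langs_sketch {
    ($( ($name:ident, $display:expr, [ $($ext:ident),* ]) ),* $(,)?) => {
        #[derive(Clone, Copy, Debug, PartialEq, Eq)]
        pub enum Lang {
            $($name),*
        }

        impl Lang {
            /// Display name, taken from the entry's second field.
            pub fn display_name(self) -> &'static str {
                match self {
                    $(Lang::$name => $display),*
                }
            }

            /// Look a language up by file extension (without the dot).
            pub fn from_extension(ext: &str) -> Option<Lang> {
                $(
                    if [$(stringify!($ext)),*].contains(&ext) {
                        return Some(Lang::$name);
                    }
                )*
                None
            }
        }
    };
}

// `Rust` mirrors the entry shown above; `Toy` is a made-up second language.
mk_langs_sketch!(
    (Rust, "rust", [rs]),
    (Toy, "toy", [toy, ty]),
);

fn main() {
    let lang = Lang::from_extension("rs");
    println!("{:?}", lang.map(Lang::display_name));
}
```

Keeping every per-language fact in one macro argument list means a new language is registered in a single place rather than in several hand-maintained tables.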

Implementing traits and tests

Wiring the grammar is only the first step. The new <Lang>Code type must also implement the AST plumbing and every metric trait the workspace defines:

  • Checker in /src/checker.rs: comment, function, closure, call, string-literal, and else-if predicates over the grammar's kind_ids.
  • Getter in /src/getter.rs: get_space_kind plus the Halstead operator/operand classification table.
  • Alterator in /src/alterator.rs: usually only string-literal preservation; the default impl works for most languages.
  • All twelve metric traits: Abc, Cognitive, Cyclomatic, Exit, Halstead, Loc, Mi, NArgs, Nom, Npa, Npm, Wmc. Register each via the implement_metric_trait! macro invocation in /src/metrics/ to start with default (no-op) bodies, then replace with real impls for the metrics that have meaningful semantics for the language.
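
The "default bodies first, real impls later" pattern from the last bullet can be sketched in isolation. Everything below is a simplified stand-in: the Exit trait shown here, its compute signature, implement_metric_trait_sketch!, ToyCode, RealCode, and the kind id 7 are assumptions made for illustration, not the project's actual definitions.

```rust
// Stand-in metric trait. The default body is a no-op, so a language that
// has not implemented the metric yet simply contributes nothing.
trait Exit {
    fn compute(kind_id: u16, exits: &mut u32) {
        let _ = (kind_id, exits);
    }
}

// Hypothetical helper: registers the default (no-op) impl for each type.
macro_rules! implement_metric_trait_sketch {
    ($metric:ident, $($code:ident),+) => {
        $( impl $metric for $code {} )+
    };
}

struct ToyCode; // freshly wired language, still on the default body
struct RealCode; // language with a real implementation

implement_metric_trait_sketch!(Exit, ToyCode);

impl Exit for RealCode {
    fn compute(kind_id: u16, exits: &mut u32) {
        const RETURN_EXPRESSION: u16 = 7; // made-up kind id
        if kind_id == RETURN_EXPRESSION {
            *exits += 1;
        }
    }
}

// Run the metric over a flat list of node kind ids.
fn count_exits<T: Exit>(kind_ids: &[u16]) -> u32 {
    let mut exits = 0;
    for &id in kind_ids {
        T::compute(id, &mut exits);
    }
    exits
}

fn main() {
    let ids = [7, 1, 7];
    println!("toy:  {}", count_exits::<ToyCode>(&ids));
    println!("real: {}", count_exits::<RealCode>(&ids));
}
```

Starting from no-op bodies keeps the workspace compiling while the metrics are filled in one at a time.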

Audit aliased grammar variants

Tree-sitter grammars frequently emit several distinct kind_ids that map to the same node.kind() string (Identifier / Identifier2 / Identifier3 in Go, InvocationExpression / InvocationExpression2 in C#, QuotedContent … QuotedContent20 in Elixir). Every match node.kind_id() arm that touches an aliasable rule must either list every numbered variant or compare on the string node.kind() instead. Missing an alias silently drops nodes from the metric. See the add-lang skill for the mechanical audit procedure and lessons 2, 4, and 13 in docs/development/lessons_learned.md for the failure modes.
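
The failure mode and both remedies can be shown with a toy model. The Toy enum and its kind() method below are hypothetical stand-ins for a generated grammar enum and for tree-sitter's node.kind(); no real grammar is involved.

```rust
// Toy kind enum with an aliased rule: both identifier variants share the
// same kind string, as numbered variants do in real generated enums.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Toy {
    Identifier,
    Identifier2,
    Number,
}

impl Toy {
    // Stand-in for the string that `node.kind()` would return.
    fn kind(self) -> &'static str {
        match self {
            Toy::Identifier | Toy::Identifier2 => "identifier",
            Toy::Number => "number",
        }
    }
}

// Buggy: matches only the first variant, so aliased nodes are dropped.
fn count_identifiers_buggy(nodes: &[Toy]) -> usize {
    nodes.iter().filter(|n| matches!(n, Toy::Identifier)).count()
}

// Remedy 1: list every numbered variant in the match arm.
fn count_identifiers_by_id(nodes: &[Toy]) -> usize {
    nodes
        .iter()
        .filter(|n| matches!(n, Toy::Identifier | Toy::Identifier2))
        .count()
}

// Remedy 2: compare on the kind string, which all aliases share.
fn count_identifiers_by_kind(nodes: &[Toy]) -> usize {
    nodes.iter().filter(|n| n.kind() == "identifier").count()
}

fn main() {
    let nodes = [Toy::Identifier, Toy::Number, Toy::Identifier2, Toy::Identifier];
    println!("buggy:   {}", count_identifiers_buggy(&nodes));
    println!("by id:   {}", count_identifiers_by_id(&nodes));
    println!("by kind: {}", count_identifiers_by_kind(&nodes));
}
```

The buggy version compiles and runs without complaint, which is why the alias audit has to be a deliberate step rather than something the compiler catches.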

Tests

Add per-language tests under each src/metrics/*.rs test module, aiming for parity with the Rust coverage (≥ 34 tests total across the metric files). Every insta::assert_json_snapshot! call MUST be anchored in one of three ways: an inline expected block, a positive assert_eq! on the headline integer accessor above it, or an explanatory // expected: comment. make snapshot-anchors (run as part of make pre-commit) enforces this against .snapshot-anchor-baseline.txt.

End-to-end workflow

For an opinionated, end-to-end recipe — including the alias audit, test layout, snapshot anchoring, and code-quality post-passes — see the project's add-lang Claude Code skill. It is the canonical workflow used by recent language additions (Elixir, PHP, C#, Bash, Go).