Supporting a new language
This section helps developers implement support for a new
language in big-code-analysis.
To implement a new language, two steps are required:
- Generate the grammar
- Add the grammar to big-code-analysis
A number of metrics are supported; help with implementing them is covered elsewhere in the documentation.
Generating the grammar
As a prerequisite for adding a new grammar, a tree-sitter grammar for the desired language must exist in a version that matches the tree-sitter version used in this project.
The grammars are generated by a project in this repository called enums. The following steps add the language support from the language crate and generate an enum file that is then used as the grammar in this project to evaluate metrics.
- Add the language-specific tree-sitter crate to the enums crate, making sure the dependency is pinned with =X.Y.Z to the same version used in the root big-code-analysis Cargo.toml. For example, for the Rust support the following line exists in /enums/Cargo.toml: tree-sitter-rust = "=0.24.2".
- Append the language to the enums crate in /enums/src/languages.rs. Keeping with Rust as the example, the line would be (Rust, tree_sitter_rust). The first parameter is the name of the Rust enum that will be generated, the second is the tree-sitter function to call to get the language's grammar.
- Add a case to the end of the match in the mk_get_language macro rule in /enums/src/macros.rs. The current convention uses the LANGUAGE constant exposed by modern grammar crates: for Rust that line is Lang::Rust => tree_sitter_rust::LANGUAGE.into().
- Lastly, execute the /recreate-grammars.sh script, which runs the enums crate to generate the grammar for the new language.
At this point we should have a new grammar file for the new language in /src/languages/. See /src/languages/language_rust.rs as an example of the generated enum.
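The way a list of (enum name, grammar crate) pairs can drive a generated match may be easier to see in a small sketch. The following is not the project's actual macro: the Language type, the two tree_sitter_* modules, and the mk_get_language_sketch name are all stand-ins invented here so that the example compiles without any tree-sitter dependency.

```rust
// Stand-in for the tree-sitter `Language` type (assumption: simplified to a
// wrapper around a name so the sketch needs no external crates).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Language(pub &'static str);

impl From<&'static str> for Language {
    fn from(name: &'static str) -> Self {
        Language(name)
    }
}

// Hypothetical stand-ins for grammar crates that expose a LANGUAGE constant.
pub mod tree_sitter_rust {
    pub const LANGUAGE: &str = "rust-grammar";
}
pub mod tree_sitter_go {
    pub const LANGUAGE: &str = "go-grammar";
}

// Each `(Variant, grammar_crate)` pair becomes one enum variant and one
// match arm, mirroring the `Lang::Rust => tree_sitter_rust::LANGUAGE.into()`
// convention described above.
macro_rules! mk_get_language_sketch {
    ($(($variant:ident, $grammar:ident)),* $(,)?) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub enum Lang {
            $($variant),*
        }

        pub fn get_language(lang: Lang) -> Language {
            match lang {
                $(Lang::$variant => $grammar::LANGUAGE.into(),)*
            }
        }
    };
}

// Adding a language means appending one pair to this list.
mk_get_language_sketch!((Rust, tree_sitter_rust), (Go, tree_sitter_go));
```

Because the match is generated from the same list as the enum, a variant without a matching arm is not possible in this sketch; in the real crate the list and the match live in separate files, which is why both steps above are needed.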
Adding the new grammar to big-code-analysis
- Add the language-specific tree-sitter crate to the big-code-analysis workspace, with the same =X.Y.Z pin as the enums crate uses. For example, for the Rust support the line in the root Cargo.toml is tree-sitter-rust = "=0.24.2".
- Next, add the new tree-sitter language namespace to /src/languages/mod.rs, e.g.

    pub mod language_rust;
    pub use language_rust::*;

- Lastly, add a definition of the language to the arguments of the mk_langs! macro in /src/langs.rs.

    // 1) Name for enum
    // 2) Language description
    // 3) Display name
    // 4) Empty struct name to implement
    // 5) Parser name
    // 6) tree-sitter function to call to get a Language
    // 7) file extensions
    // 8) emacs modes
    (
        Rust,
        "The `Rust` language",
        "rust",
        RustCode,
        RustParser,
        tree_sitter_rust,
        [rs],
        ["rust"]
    )
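To make the eight-field entry more concrete, here is a reduced sketch of a macro that accepts entries of the same shape. It is an illustration only, not the project's mk_langs! implementation: mk_langs_sketch, display_name, and lang_from_extension are hypothetical names, the parser, grammar, and emacs-mode fields are accepted but ignored, and the Go entry is made up for the example.

```rust
// Accepts entries shaped like the mk_langs! arguments shown above and
// generates an enum, one empty struct per language, and two lookups.
macro_rules! mk_langs_sketch {
    ($(
        (
            $variant:ident,
            $description:expr,
            $display:expr,
            $code:ident,
            $parser:ident,
            $grammar:ident,
            [ $($ext:ident),* ],
            [ $($emacs_mode:expr),* ]
        )
    ),* $(,)?) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub enum Lang {
            $($variant),*
        }

        impl Lang {
            // Field 2: human-readable description.
            pub fn description(self) -> &'static str {
                match self { $(Lang::$variant => $description,)* }
            }

            // Field 3: display name.
            pub fn display_name(self) -> &'static str {
                match self { $(Lang::$variant => $display,)* }
            }
        }

        // Field 4: the empty struct that the language traits get implemented on.
        $(pub struct $code;)*

        // Field 7: each bare extension identifier is turned into a string arm.
        pub fn lang_from_extension(ext: &str) -> Option<Lang> {
            match ext {
                $($(stringify!($ext) => Some(Lang::$variant),)*)*
                _ => None,
            }
        }
    };
}

mk_langs_sketch!(
    (
        Rust,
        "The `Rust` language",
        "rust",
        RustCode,
        RustParser,
        tree_sitter_rust,
        [rs],
        ["rust"]
    ),
    // Illustrative second entry, not copied from the project.
    (
        Go,
        "The `Go` language",
        "go",
        GoCode,
        GoParser,
        tree_sitter_go,
        [go],
        ["go"]
    )
);
```

The extensions are written as bare identifiers (rs, not "rs") because the sketch stringifies them inside the macro, which matches the [rs] form used in the entry above.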
Implementing traits and tests
Wiring the grammar is only the first step. The new <Lang>Code type
must also implement the AST plumbing and every metric trait the
workspace defines:
- Checker in /src/checker.rs: comment, function, closure, call, string-literal, and else-if predicates over the grammar's kind_ids.
- Getter in /src/getter.rs: get_space_kind plus the Halstead operator/operand classification table.
- Alterator in /src/alterator.rs: usually only string-literal preservation; the default impl works for most languages.
- All twelve metric traits: Abc, Cognitive, Cyclomatic, Exit, Halstead, Loc, Mi, NArgs, Nom, Npa, Npm, Wmc. Register each via the implement_metric_trait! macro invocation in /src/metrics/ to start with default (no-op) bodies, then replace with real impls for the metrics that have meaningful semantics for the language.
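The "default body first, real impl later" pattern can be sketched as follows. This is a simplified model under stated assumptions: the Exit trait here takes a node-kind string instead of a tree-sitter node, ExitStats, NewLangCode, implement_default_metric!, and count_exits are hypothetical names, and the project's real implement_metric_trait! macro may take different arguments.

```rust
// Hypothetical, simplified metric state.
#[derive(Debug, Default, PartialEq)]
pub struct ExitStats {
    pub exits: usize,
}

// Simplified stand-in for a metric trait. The default body is a no-op, so a
// language that only registers the trait contributes nothing to the metric.
pub trait Exit {
    fn compute(_node_kind: &str, _stats: &mut ExitStats) {}
}

pub struct RustCode;
pub struct NewLangCode;

// Registers the default (no-op) body for every listed language type.
macro_rules! implement_default_metric {
    ($metric:ident, $($code:ident),+ $(,)?) => {
        $(impl $metric for $code {})+
    };
}

// Step 1: a freshly added language starts with the default body.
implement_default_metric!(Exit, NewLangCode);

// Step 2: a language with meaningful semantics replaces it with a real impl.
impl Exit for RustCode {
    fn compute(node_kind: &str, stats: &mut ExitStats) {
        if node_kind == "return_expression" {
            stats.exits += 1;
        }
    }
}

// Runs the metric over a flat list of node kinds (a stand-in for an AST walk).
pub fn count_exits<T: Exit>(kinds: &[&str]) -> usize {
    let mut stats = ExitStats::default();
    for &kind in kinds {
        T::compute(kind, &mut stats);
    }
    stats.exits
}
```

With the default body, count_exits::<NewLangCode> reports zero regardless of input, which is why a metric left on its default should be a deliberate choice rather than an oversight.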
Audit aliased grammar variants
Tree-sitter grammars frequently emit several distinct kind_ids that
map to the same node.kind() string (Identifier /
Identifier2 / Identifier3 in Go,
InvocationExpression / InvocationExpression2 in C#,
QuotedContent ⋯ QuotedContent20 in Elixir). Every match node.kind_id() arm that touches an aliasable rule must either list
every numbered variant or compare on the string node.kind()
instead. Missing an alias silently drops nodes from the metric. See
the add-lang skill for the mechanical audit procedure and
lessons 2, 4, and 13 in
docs/development/lessons_learned.md
for the failure modes.
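A minimal sketch of the failure mode and the two fixes, using a hand-written enum in place of a generated one. The variant names follow the Go example above, but the numeric layout, the kind() helper, and the counting functions are assumptions made for illustration.

```rust
// Hypothetical slice of a generated enum: three ids share one kind string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Go {
    Identifier,
    Identifier2,
    Identifier3,
    CallExpression,
}

impl Go {
    // Stand-in for node.kind(): aliased variants report the same string.
    pub fn kind(self) -> &'static str {
        match self {
            Go::Identifier | Go::Identifier2 | Go::Identifier3 => "identifier",
            Go::CallExpression => "call_expression",
        }
    }
}

// Buggy: matches only the first alias, so the other two are silently dropped.
pub fn count_identifiers_buggy(nodes: &[Go]) -> usize {
    nodes
        .iter()
        .filter(|&&n| matches!(n, Go::Identifier))
        .count()
}

// Fix 1: list every numbered variant in the match arm.
pub fn count_identifiers_by_id(nodes: &[Go]) -> usize {
    nodes
        .iter()
        .filter(|&&n| matches!(n, Go::Identifier | Go::Identifier2 | Go::Identifier3))
        .count()
}

// Fix 2: compare on the kind string, which is shared by all aliases.
pub fn count_identifiers_by_kind(nodes: &[Go]) -> usize {
    nodes.iter().filter(|&&n| n.kind() == "identifier").count()
}
```

The buggy version compiles and runs without any warning; only the resulting count is wrong, which is what makes a missed alias hard to notice without a targeted test.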
Tests
Add per-language tests under each src/metrics/*.rs test module —
aim for parity with the Rust coverage (≥ 34 tests total across the
metric files). Every insta::assert_json_snapshot! call MUST be
anchored: either with an inline expected block, a positive
assert_eq! on the headline integer accessor above it, or an
explanatory // expected: comment. make snapshot-anchors (run as
part of make pre-commit) enforces this against
.snapshot-anchor-baseline.txt.
End-to-end workflow
For an opinionated, end-to-end recipe — including the alias audit,
test layout, snapshot anchoring, and code-quality post-passes — see
the project's
add-lang
Claude Code skill. It is the canonical workflow used by recent
language additions (Elixir, PHP, C#, Bash, Go).