diff options
Diffstat (limited to 'admin/notes/tree-sitter/starter-guide')
-rw-r--r-- | admin/notes/tree-sitter/starter-guide | 455 |
1 files changed, 455 insertions, 0 deletions
diff --git a/admin/notes/tree-sitter/starter-guide b/admin/notes/tree-sitter/starter-guide new file mode 100644 index 00000000000..123dabd9f29 --- /dev/null +++ b/admin/notes/tree-sitter/starter-guide @@ -0,0 +1,455 @@ +STARTER GUIDE ON WRITING MAJOR MODE WITH TREE-SITTER -*- org -*- + +This document guides you on adding tree-sitter support to a major +mode. + +TOC: + +- Building Emacs with tree-sitter +- Install language definitions +- Setup +- Naming convention +- Font-lock +- Indent +- Imenu +- Navigation +- Which-func +- More features? +- Common tasks (code snippets) +- Manual + +* Building Emacs with tree-sitter + +You can either install tree-sitter by your package manager, or from +source: + + git clone https://github.com/tree-sitter/tree-sitter.git + cd tree-sitter + make + make install + +Then pull the tree-sitter branch (or the master branch, if it has +merged) and rebuild Emacs. + +* Install language definitions + +Tree-sitter by itself doesn’t know how to parse any particular +language. We need to install language definitions (or “grammars”) for +a language to be able to parse it. There are a couple of ways to get +them. + +You can use this script that I put together here: + + https://github.com/casouri/tree-sitter-module + +You can also find them under this directory in /build-modules. + +This script automatically pulls and builds language definitions for C, +C++, Rust, JSON, Go, HTML, Javascript, CSS, Python, Typescript, +and C#. Better yet, I pre-built these language definitions for +GNU/Linux and macOS, they can be downloaded here: + + https://github.com/casouri/tree-sitter-module/releases/tag/v2.1 + +To build them yourself, run + + git clone git@github.com:casouri/tree-sitter-module.git + cd tree-sitter-module + ./batch.sh + +and language definitions will be in the /dist directory. You can +either copy them to standard dynamic library locations of your system, +eg, /usr/local/lib, or leave them in /dist and later tell Emacs where +to find language definitions by setting ‘treesit-extra-load-path’. + +Language definition sources can be found on GitHub under +tree-sitter/xxx, like tree-sitter/tree-sitter-python. The tree-sitter +organization has all the "official" language definitions: + + https://github.com/tree-sitter + +* Setting up for adding major mode features + +Start Emacs and load tree-sitter with + + (require 'treesit) + +Now check if Emacs is built with tree-sitter library + + (treesit-available-p) + +* Tree-sitter major modes + +Tree-sitter modes should be separate major modes, so other modes +inheriting from the original mode don't break if tree-sitter is +enabled. For example js2-mode inherits js-mode, we can't enable +tree-sitter in js-mode, lest js-mode would not setup things that +js2-mode expects to inherit from. So it's best to use separate major +modes. + +If the tree-sitter variant and the "native" variant could share some +setup, you can create a "base mode", which only contains the common +setup. For example, there is python-base-mode (shared), python-mode +(native), and python-ts-mode (tree-sitter). + +In the tree-sitter mode, check if we can use tree-sitter with +treesit-ready-p, it will error out if tree-sitter is not ready. + +* Naming convention + +Use tree-sitter for text (documentation, comment), use treesit for +symbol (variable, function). + +* Font-lock + +Tree-sitter works like this: You provide a query made of patterns and +capture names, tree-sitter finds the nodes that match these patterns, +tag the corresponding capture names onto the nodes and return them to +you. The query function returns a list of (capture-name . node). For +font-lock, we use face names as capture names. And the captured node +will be fontified in their capture name. + +The capture name could also be a function, in which case (NODE +OVERRIDE START END) is passed to the function for fontification. START +and END are the start and end of the region to be fontified. The +function should only fontify within that region. The function should +also allow more optional arguments with (&rest _), for future +extensibility. For OVERRIDE check out the docstring of +treesit-font-lock-rules. + +** Query syntax + +There are two types of nodes, named, like (identifier), +(function_definition), and anonymous, like "return", "def", "(", +"}". Parent-child relationship is expressed as + + (parent (child) (child) (child (grand_child))) + +Eg, an argument list (1, "3", 1) could be: + + (argument_list "(" (number) (string) (number) ")") + +Children could have field names in its parent: + + (function_definition name: (identifier) type: (identifier)) + +Match any of the list: + + ["true" "false" "none"] + +Capture names can come after any node in the pattern: + + (parent (child) @child) @parent + +The query above captures both parent and child. + + ["return" "continue" "break"] @keyword + +The query above captures all the keywords with capture name +"keyword". + +These are the common syntax, see all of them in the manual +("Parsing Program Source" section). + +** Query references + +But how do one come up with the queries? Take python for an example, +open any python source file, type M-x treesit-explore-mode RET. Now +you should see the parse-tree in a separate window, automatically +updated as you select text or edit the buffer. Besides this, you can +consult the grammar of the language definition. For example, Python’s +grammar file is at + + https://github.com/tree-sitter/tree-sitter-python/blob/master/grammar.js + +Neovim also has a bunch of queries to reference: + + https://github.com/nvim-treesitter/nvim-treesitter/tree/master/queries + +The manual explains how to read grammar files in the bottom of section +"Tree-sitter Language Definitions". + +** Debugging queries + +If your query has problems, use ‘treesit-query-validate’ to debug the +query. It will pop a buffer containing the query (in text format) and +mark the offending part in red. + +** Code + +To enable tree-sitter font-lock, set ‘treesit-font-lock-settings’ and +‘treesit-font-lock-feature-list’ buffer-locally and call +‘treesit-major-mode-setup’. For example, see +‘python--treesit-settings’ in python.el. Below I paste a snippet of +it. + +Note that like the current font-lock, if the to-be-fontified region +already has a face (ie, an earlier match fontified part/all of the +region), the new face is discarded rather than applied. If you want +later matches always override earlier matches, use the :override +keyword. + +Each rule should have a :feature, like function-name, +string-interpolation, builtin, etc. Users can then enable/disable each +feature individually. + +#+begin_src elisp +(defvar python--treesit-settings + (treesit-font-lock-rules + :feature 'comment + :language 'python + '((comment) @font-lock-comment-face) + + :feature 'string + :language 'python + '((string) @font-lock-string-face + (string) @contextual) ; Contextual special treatment. + + :feature 'function-name + :language 'python + '((function_definition + name: (identifier) @font-lock-function-name-face)) + + :feature 'class-name + :language 'python + '((class_definition + name: (identifier) @font-lock-type-face)) + + ...)) +#+end_src + +Then in ‘python-mode’, enable tree-sitter font-lock: + +#+begin_src elisp +(treesit-parser-create 'python) +(setq-local treesit-font-lock-settings python--treesit-settings) +(setq-local treesit-font-lock-feature-list + '((comment string function-name) + (class-name keyword builtin) + (string-interpolation decorator))) +... +(treesit-major-mode-setup) +#+end_src + +Concretely, something like this: + +#+begin_src elisp +(define-derived-mode python-mode prog-mode "Python" + ... + (cond + ;; Tree-sitter. + ((treesit-ready-p 'python-mode 'python) + (treesit-parser-create 'python) + (setq-local treesit-font-lock-settings python--treesit-settings) + (setq-local treesit-font-lock-feature-list + '((comment string function-name) + (class-name keyword builtin) + (string-interpolation decorator))) + (treesit-major-mode-setup)) + (t + ;; No tree-sitter + (setq-local font-lock-defaults ...) + ...))) +#+end_src + +* Indent + +Indent works like this: We have a bunch of rules that look like + + (MATCHER ANCHOR OFFSET) + +When the indentation process starts, point is at the BOL of a line, we +want to know which column to indent this line to. Let NODE be the node +at point, we pass this node to the MATCHER of each rule, one of them +will match the node (eg, "this node is a closing bracket!"). Then we +pass the node to the ANCHOR, which returns a point, eg, the BOL of the +previous line. We find the column number of that point (eg, 4), add +OFFSET to it (eg, 0), and that is the column we want to indent the +current line to (4 + 0 = 4). + +Matchers and anchors are functions that takes (NODE PARENT BOL &rest +_). Matches return nil/non-nil for no match/match, and anchors return +the anchor point. Below are some convenient builtin matchers and anchors. + +For MATHCER we have + + (parent-is TYPE) => matches if PARENT’s type matches TYPE as regexp + (node-is TYPE) => matches NODE’s type + (query QUERY) => matches if querying PARENT with QUERY + captures NODE. + + (match NODE-TYPE PARENT-TYPE NODE-FIELD + NODE-INDEX-MIN NODE-INDEX-MAX) + + => checks everything. If an argument is nil, don’t match that. Eg, + (match nil nil TYPE) is the same as (parent-is TYPE) + +For ANCHOR we have + + first-sibling => start of the first sibling + parent => start of parent + parent-bol => BOL of the line parent is on. + prev-sibling => start of previous sibling + no-indent => current position (don’t indent) + prev-line => start of previous line + +There is also a manual section for indent: "Parser-based Indentation". + +When writing indent rules, you can use ‘treesit-check-indent’ to +check if your indentation is correct. To debug what went wrong, set +‘treesit--indent-verbose’ to non-nil. Then when you indent, Emacs +tells you which rule is applied in the echo area. + +#+begin_src elisp +(defvar typescript-mode-indent-rules + (let ((offset typescript-indent-offset)) + `((typescript + ;; This rule matches if node at point is "}", ANCHOR is the + ;; parent node’s BOL, and offset is 0. + ((node-is "}") parent-bol 0) + ((node-is ")") parent-bol 0) + ((node-is "]") parent-bol 0) + ((node-is ">") parent-bol 0) + ((node-is "\\.") parent-bol ,offset) + ((parent-is "ternary_expression") parent-bol ,offset) + ((parent-is "named_imports") parent-bol ,offset) + ((parent-is "statement_block") parent-bol ,offset) + ((parent-is "type_arguments") parent-bol ,offset) + ((parent-is "variable_declarator") parent-bol ,offset) + ((parent-is "arguments") parent-bol ,offset) + ((parent-is "array") parent-bol ,offset) + ((parent-is "formal_parameters") parent-bol ,offset) + ((parent-is "template_substitution") parent-bol ,offset) + ((parent-is "object_pattern") parent-bol ,offset) + ((parent-is "object") parent-bol ,offset) + ((parent-is "object_type") parent-bol ,offset) + ((parent-is "enum_body") parent-bol ,offset) + ((parent-is "arrow_function") parent-bol ,offset) + ((parent-is "parenthesized_expression") parent-bol ,offset) + ...)))) +#+end_src + +Then you set ‘treesit-simple-indent-rules’ to your rules, and call +‘treesit-major-mode-setup’: + +#+begin_src elisp +(setq-local treesit-simple-indent-rules typescript-mode-indent-rules) +(treesit-major-mode-setup) +#+end_src + +* Imenu + +Not much to say except for utilizing ‘treesit-induce-sparse-tree’ (and +explicitly pass a LIMIT argument: most of the time you don't need more +than 10). See ‘js--treesit-imenu-1’ in js.el for an example. + +Once you have the index builder, set ‘imenu-create-index-function’ to +it. + +* Navigation + +Mainly ‘beginning-of-defun-function’ and ‘end-of-defun-function’. +You can find the end of a defun with something like + +(treesit-search-forward-goto "function_definition" 'end) + +where "function_definition" matches the node type of a function +definition node, and ’end means we want to go to the end of that node. + +Tree-sitter has default implementations for +‘beginning-of-defun-function’ and ‘end-of-defun-function’. So for +ordinary languages, it is enough to set ‘treesit-defun-type-regexp’ +to something that matches all the defun struct types in the language, +and call ‘treesit-major-mode-setup’. For example, + +#+begin_src emacs-lisp +(setq-local treesit-defun-type-regexp (rx bol + (or "function" "class") + "_definition" + eol)) +(treesit-major-mode-setup) +#+end_src> + +* Which-func + +If you have an imenu implementation, set ‘which-func-functions’ to +nil, and which-func will automatically use imenu’s data. + +If you want an independent implementation for which-func, you can +find the current function by going up the tree and looking for the +function_definition node. See the function below for an example. +Since Python allows nested function definitions, that function keeps +going until it reaches the root node, and records all the function +names along the way. + +#+begin_src elisp +(defun python-info-treesit-current-defun (&optional include-type) + "Identical to `python-info-current-defun' but use tree-sitter. +For INCLUDE-TYPE see `python-info-current-defun'." + (let ((node (treesit-node-at (point))) + (name-list ()) + (type nil)) + (cl-loop while node + if (pcase (treesit-node-type node) + ("function_definition" + (setq type 'def)) + ("class_definition" + (setq type 'class)) + (_ nil)) + do (push (treesit-node-text + (treesit-node-child-by-field-name node "name") + t) + name-list) + do (setq node (treesit-node-parent node)) + finally return (concat (if include-type + (format "%s " type) + "") + (string-join name-list "."))))) +#+end_src + +* More features? + +Obviously this list is just a starting point, if there are features in +the major mode that would benefit from a parse tree, adding tree-sitter +support for that would be great. But in the minimal case, just adding +font-lock is awesome. + +* Common tasks + +How to... + +** Get the buffer text corresponding to a node? + +(treesit-node-text node) + +BTW ‘treesit-node-string’ does different things. + +** Scan the whole tree for stuff? + +(treesit-search-subtree) +(treesit-search-forward) +(treesit-induce-sparse-tree) + +** Move to next node that...? + +(treesit-search-forward-goto) + +** Get the root node? + +(treesit-buffer-root-node) + +** Get the node at point? + +(treesit-node-at (point)) + +* Manual + +I suggest you read the manual section for tree-sitter in Info. The +section is Parsing Program Source. Typing + + C-h i d m elisp RET g Parsing Program Source RET + +will bring you to that section. You can also read the HTML version +under /html-manual in this directory. I find the HTML version easier +to read. You don’t need to read through every sentence, just read the +text paragraphs and glance over function names. |