Diffstat (limited to 'admin/notes/tree-sitter/starter-guide')
-rw-r--r-- admin/notes/tree-sitter/starter-guide 455
1 files changed, 455 insertions, 0 deletions
diff --git a/admin/notes/tree-sitter/starter-guide b/admin/notes/tree-sitter/starter-guide
new file mode 100644
index 00000000000..123dabd9f29
--- /dev/null
+++ b/admin/notes/tree-sitter/starter-guide
@@ -0,0 +1,455 @@
+STARTER GUIDE ON WRITING MAJOR MODE WITH TREE-SITTER -*- org -*-
+
+This document guides you on adding tree-sitter support to a major
+mode.
+
+TOC:
+
+- Building Emacs with tree-sitter
+- Install language definitions
+- Setting up for adding major mode features
+- Tree-sitter major modes
+- Naming convention
+- Font-lock
+- Indent
+- Imenu
+- Navigation
+- Which-func
+- More features?
+- Common tasks (code snippets)
+- Manual
+
+* Building Emacs with tree-sitter
+
+You can install tree-sitter either with your package manager or from
+source:
+
+ git clone https://github.com/tree-sitter/tree-sitter.git
+ cd tree-sitter
+ make
+ make install
+
+Then pull the tree-sitter branch (or the master branch, if it has
+merged) and rebuild Emacs.
+
+* Install language definitions
+
+Tree-sitter by itself doesn’t know how to parse any particular
+language. We need to install language definitions (or “grammars”) for
+a language to be able to parse it. There are a couple of ways to get
+them.
+
+You can use this script that I put together here:
+
+ https://github.com/casouri/tree-sitter-module
+
+You can also find it under this directory in /build-modules.
+
+This script automatically pulls and builds language definitions for C,
+C++, Rust, JSON, Go, HTML, JavaScript, CSS, Python, TypeScript,
+and C#. Better yet, I pre-built these language definitions for
+GNU/Linux and macOS; they can be downloaded here:
+
+ https://github.com/casouri/tree-sitter-module/releases/tag/v2.1
+
+To build them yourself, run
+
+ git clone git@github.com:casouri/tree-sitter-module.git
+ cd tree-sitter-module
+ ./batch.sh
+
+and language definitions will be in the /dist directory. You can
+either copy them to standard dynamic library locations of your system,
+eg, /usr/local/lib, or leave them in /dist and later tell Emacs where
+to find language definitions by setting ‘treesit-extra-load-path’.
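+
+As a minimal sketch of the second option, assuming the libraries were
+left in a hypothetical ~/src/tree-sitter-module/dist directory, you
+could point Emacs at them and then ask whether a language loads. This
+requires an Emacs built with tree-sitter.
+
+#+begin_src elisp
+;; The path is an assumption; use wherever batch.sh wrote its output.
+(setq treesit-extra-load-path
+      (list (expand-file-name "~/src/tree-sitter-module/dist")))
+
+;; Non-nil if the Python language definition can be found and loaded.
+(treesit-language-available-p 'python)
+#+end_src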
+
+Language definition sources can be found on GitHub under
+tree-sitter/xxx, like tree-sitter/tree-sitter-python. The tree-sitter
+organization has all the "official" language definitions:
+
+ https://github.com/tree-sitter
+
+* Setting up for adding major mode features
+
+Start Emacs and load tree-sitter with
+
+ (require 'treesit)
+
+Now check whether Emacs was built with the tree-sitter library:
+
+ (treesit-available-p)
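+
+Code that must also load in an Emacs without tree-sitter can guard on
+this check. The sketch below is only an illustration; the message
+texts are made up.
+
+#+begin_src elisp
+;; `treesit-available-p' may be undefined in older Emacs versions, so
+;; check that the function exists before calling it.
+(if (and (fboundp 'treesit-available-p)
+         (treesit-available-p))
+    (message "Tree-sitter support is compiled in")
+  (message "This Emacs was built without tree-sitter"))
+#+end_src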
+
+* Tree-sitter major modes
+
+Tree-sitter modes should be separate major modes, so that other modes
+inheriting from the original mode don't break when tree-sitter is
+enabled. For example, js2-mode inherits from js-mode. If we enabled
+tree-sitter inside js-mode, js-mode would no longer set up the things
+that js2-mode expects to inherit. So it's best to use separate major
+modes.
+
+If the tree-sitter variant and the "native" variant could share some
+setup, you can create a "base mode", which only contains the common
+setup. For example, there is python-base-mode (shared), python-mode
+(native), and python-ts-mode (tree-sitter).
+
+In the tree-sitter mode, check whether we can use tree-sitter with
+‘treesit-ready-p’. It emits a warning and returns nil if tree-sitter
+is not ready.
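+
+A rough sketch of this layout for a hypothetical language "foo"
+follows. All foo-* names are made up, and the call to
+‘treesit-ready-p’ mirrors the two-argument form used later in this
+guide; check its docstring in your build, since the signature may
+differ.
+
+#+begin_src elisp
+(require 'treesit)
+
+;; Shared setup only; nothing specific to either variant goes here.
+(define-derived-mode foo-base-mode prog-mode "Foo"
+  "Hypothetical base mode holding setup common to both variants."
+  (setq-local comment-start "# "))
+
+(define-derived-mode foo-mode foo-base-mode "Foo"
+  "Hypothetical native variant."
+  ;; Regexp-based font-lock, indentation, etc. would go here.
+  )
+
+(define-derived-mode foo-ts-mode foo-base-mode "Foo"
+  "Hypothetical tree-sitter variant."
+  (when (treesit-ready-p 'foo-ts-mode 'foo)
+    (treesit-parser-create 'foo)
+    ;; Set font-lock, indent, etc. variables here, then:
+    (treesit-major-mode-setup)))
+#+end_src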
+
+* Naming convention
+
+Use “tree-sitter” in text (documentation, comments), and “treesit” in
+symbols (variables, functions).
+
+* Font-lock
+
+Tree-sitter works like this: You provide a query made of patterns and
+capture names. Tree-sitter finds the nodes that match these patterns,
+tags them with the corresponding capture names, and returns them to
+you. The query function returns a list of (capture-name . node). For
+font-lock, we use face names as capture names, and each captured node
+is fontified with the face named by its capture name.
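+
+To see the shape of that return value, here is a minimal sketch that
+runs a query by hand. It assumes the Python language definition is
+installed; the capture name "name" is arbitrary.
+
+#+begin_src elisp
+(require 'treesit)
+
+(with-temp-buffer
+  (insert "def foo():\n    return 1\n")
+  (treesit-parser-create 'python)
+  ;; Each element of the query result is (CAPTURE-NAME . NODE).
+  ;; Replace the node with its text to make the result readable.
+  (mapcar (lambda (capture)
+            (cons (car capture)
+                  (treesit-node-text (cdr capture) t)))
+          (treesit-query-capture
+           (treesit-buffer-root-node 'python)
+           '((function_definition name: (identifier) @name)))))
+#+end_src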
+
+The capture name could also be a function, in which case (NODE
+OVERRIDE START END) is passed to the function for fontification. START
+and END are the start and end of the region to be fontified. The
+function should only fontify within that region. The function should
+also allow more optional arguments with (&rest _), for future
+extensibility. For OVERRIDE check out the docstring of
+treesit-font-lock-rules.
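+
+A sketch of such a function follows. The name my-fontify-comment is
+made up, and the sketch assumes that your build provides
+‘treesit-fontify-with-override’ taking START, END, FACE, OVERRIDE and
+two optional bounds; check its docstring before relying on it.
+
+#+begin_src elisp
+(defun my-fontify-comment (node override start end &rest _)
+  "Fontify comment NODE, staying within the region START to END.
+OVERRIDE is passed through unchanged."
+  (treesit-fontify-with-override
+   (treesit-node-start node) (treesit-node-end node)
+   'font-lock-comment-face override start end))
+
+;; In a query, name the function where a face would normally go:
+;;   '((comment) @my-fontify-comment)
+#+end_src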
+
+** Query syntax
+
+There are two types of nodes, named, like (identifier),
+(function_definition), and anonymous, like "return", "def", "(",
+"}". Parent-child relationship is expressed as
+
+ (parent (child) (child) (child (grand_child)))
+
+Eg, an argument list (1, "3", 1) could be:
+
+ (argument_list "(" (number) (string) (number) ")")
+
+Children can have field names in their parent:
+
+ (function_definition name: (identifier) type: (identifier))
+
+Match any of the list:
+
+ ["true" "false" "none"]
+
+Capture names can come after any node in the pattern:
+
+ (parent (child) @child) @parent
+
+The query above captures both parent and child.
+
+ ["return" "continue" "break"] @keyword
+
+The query above captures all the keywords with capture name
+"keyword".
+
+This is the most common syntax; the manual describes all of it
+("Parsing Program Source" section).
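+
+As a small illustration, the pieces above can be combined into one
+query in s-expression form. The variable name and capture names below
+are made up, and the node and field names assume the Python grammar.
+
+#+begin_src elisp
+(require 'treesit)
+
+(defvar my-python-demo-query
+  '(;; Named node with a field; capture the child in that field.
+    (function_definition name: (identifier) @def-name)
+    ;; Alternation over anonymous nodes is written as a vector.
+    ["return" "continue" "break"] @keyword
+    ;; Capture both a parent and one of its children.
+    (call function: (identifier) @callee) @call)
+  "Hypothetical query combining the syntax shown above.")
+
+;; Convert the s-expression form into the string form of the query.
+(treesit-query-expand my-python-demo-query)
+#+end_src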
+
+** Query references
+
+But how does one come up with the queries? Take Python as an example:
+open any Python source file and type M-x treesit-explore-mode RET. Now
+you should see the parse-tree in a separate window, automatically
+updated as you select text or edit the buffer. Besides this, you can
+consult the grammar of the language definition. For example, Python’s
+grammar file is at
+
+ https://github.com/tree-sitter/tree-sitter-python/blob/master/grammar.js
+
+Neovim also has a bunch of queries to reference:
+
+ https://github.com/nvim-treesitter/nvim-treesitter/tree/master/queries
+
+The manual explains how to read grammar files at the bottom of the
+section "Tree-sitter Language Definitions".
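+
+When writing a pattern it also helps to know which node types enclose
+point. The helper below is a hypothetical sketch (the name is made up)
+that assumes the current buffer already has a parser.
+
+#+begin_src elisp
+(defun my-node-ancestry-at-point ()
+  "Return the node types from the node at point up to the root."
+  (let ((node (treesit-node-at (point)))
+        (types ()))
+    (while node
+      (push (treesit-node-type node) types)
+      (setq node (treesit-node-parent node)))
+    ;; Innermost type first, root type last.
+    (nreverse types)))
+#+end_src
+
+Evaluate (my-node-ancestry-at-point) with M-: in a source buffer to
+see the chain of types.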
+
+** Debugging queries
+
+If your query has problems, use ‘treesit-query-validate’ to debug the
+query. It will pop a buffer containing the query (in text format) and
+mark the offending part in red.
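+
+For instance, assuming the Python language definition is installed,
+the second query below misspells a node type on purpose, so it should
+be reported as invalid:
+
+#+begin_src elisp
+;; A query using node and field names from the Python grammar.
+(treesit-query-validate
+ 'python
+ '((function_definition name: (identifier) @name)))
+
+;; Deliberately wrong: "function_definitionn" is not a node type.
+(treesit-query-validate
+ 'python
+ '((function_definitionn name: (identifier) @name)))
+#+end_src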
+
+** Code
+
+To enable tree-sitter font-lock, set ‘treesit-font-lock-settings’ and
+‘treesit-font-lock-feature-list’ buffer-locally and call
+‘treesit-major-mode-setup’. For example, see
+‘python--treesit-settings’ in python.el. Below I paste a snippet of
+it.
+
+Note that like the current font-lock, if the to-be-fontified region
+already has a face (ie, an earlier match fontified part/all of the
+region), the new face is discarded rather than applied. If you want
+later matches to always override earlier matches, use the :override
+keyword.
+
+Each rule should have a :feature, like function-name,
+string-interpolation, builtin, etc. Users can then enable/disable each
+feature individually.
+
+#+begin_src elisp
+(defvar python--treesit-settings
+  (treesit-font-lock-rules
+   :feature 'comment
+   :language 'python
+   '((comment) @font-lock-comment-face)
+
+   :feature 'string
+   :language 'python
+   '((string) @font-lock-string-face
+     (string) @contextual) ; Contextual special treatment.
+
+   :feature 'function-name
+   :language 'python
+   '((function_definition
+      name: (identifier) @font-lock-function-name-face))
+
+   :feature 'class-name
+   :language 'python
+   '((class_definition
+      name: (identifier) @font-lock-type-face))
+
+   ...))
+#+end_src
+
+Then in ‘python-mode’, enable tree-sitter font-lock:
+
+#+begin_src elisp
+(treesit-parser-create 'python)
+(setq-local treesit-font-lock-settings python--treesit-settings)
+(setq-local treesit-font-lock-feature-list
+            '((comment string function-name)
+              (class-name keyword builtin)
+              (string-interpolation decorator)))
+...
+(treesit-major-mode-setup)
+#+end_src
+
+Concretely, something like this:
+
+#+begin_src elisp
+(define-derived-mode python-mode prog-mode "Python"
+  ...
+  (cond
+   ;; Tree-sitter.
+   ((treesit-ready-p 'python-mode 'python)
+    (treesit-parser-create 'python)
+    (setq-local treesit-font-lock-settings python--treesit-settings)
+    (setq-local treesit-font-lock-feature-list
+                '((comment string function-name)
+                  (class-name keyword builtin)
+                  (string-interpolation decorator)))
+    (treesit-major-mode-setup))
+   (t
+    ;; No tree-sitter
+    (setq-local font-lock-defaults ...)
+    ...)))
+#+end_src
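+
+The :override keyword mentioned earlier could be used as in the sketch
+below. The variable name and the feature name docstring-demo are made
+up, and the pattern assumes that the Python grammar represents a
+docstring as a string inside an expression_statement.
+
+#+begin_src elisp
+(require 'treesit)
+
+(defvar my-python-extra-font-lock
+  (treesit-font-lock-rules
+   :feature 'docstring-demo
+   :language 'python
+   ;; Apply this face even where an earlier rule already put one.
+   :override t
+   '((expression_statement (string) @font-lock-doc-face)))
+  "Hypothetical extra rules that override earlier string fontification.")
+
+;; In the mode body, before calling `treesit-major-mode-setup':
+;;   (setq-local treesit-font-lock-settings
+;;               (append python--treesit-settings
+;;                       my-python-extra-font-lock))
+;; The feature docstring-demo must also appear in
+;; `treesit-font-lock-feature-list' for the rule to be enabled.
+#+end_src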
+
+* Indent
+
+Indent works like this: We have a bunch of rules that look like
+
+ (MATCHER ANCHOR OFFSET)
+
+When the indentation process starts, point is at the BOL of a line, and
+we want to know which column to indent this line to. Let NODE be the
+node at point. We pass this node to the MATCHER of each rule until one
+of them matches the node (eg, "this node is a closing bracket!").
+Then we pass the node to the ANCHOR of that rule, which returns a
+point, eg, the BOL of the previous line. We find the column number of
+that point (eg, 4), add OFFSET to it (eg, 0), and that is the column
+we want to indent the current line to (4 + 0 = 4).
+
+Matchers and anchors are functions that take (NODE PARENT BOL &rest
+_). Matchers return nil/non-nil for no match/match, and anchors return
+the anchor point. Below are some convenient builtin matchers and anchors.
+
+For MATCHER we have
+
+ (parent-is TYPE) => matches if PARENT’s type matches TYPE as regexp
+ (node-is TYPE) => matches NODE’s type
+ (query QUERY) => matches if querying PARENT with QUERY
+ captures NODE.
+
+ (match NODE-TYPE PARENT-TYPE NODE-FIELD
+ NODE-INDEX-MIN NODE-INDEX-MAX)
+
+ => checks everything. If an argument is nil, don’t match that. Eg,
+ (match nil TYPE) is the same as (parent-is TYPE)
+
+For ANCHOR we have
+
+ first-sibling => start of the first sibling
+ parent => start of parent
+ parent-bol => BOL of the line parent is on.
+ prev-sibling => start of previous sibling
+ no-indent => current position (don’t indent)
+ prev-line => start of previous line
+
+There is also a manual section for indent: "Parser-based Indentation".
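+
+Beyond the builtin ones, you can write your own matcher and anchor
+functions with the calling convention described above. The sketch
+below is only an illustration: the my-* names and the language symbol
+mylang are made up.
+
+#+begin_src elisp
+(defun my-indent-closing-brace-p (node _parent _bol &rest _)
+  "Matcher: return non-nil if NODE is a closing brace.
+NODE can be nil, eg, on an empty line, so guard against that."
+  (and node (equal (treesit-node-type node) "}")))
+
+(defun my-indent-parent-line-start (_node parent _bol &rest _)
+  "Anchor: return the first non-blank position of PARENT's line."
+  (save-excursion
+    (goto-char (treesit-node-start parent))
+    (back-to-indentation)
+    (point)))
+
+(defvar my-indent-rules
+  '((mylang
+     ;; (MATCHER ANCHOR OFFSET), with functions in place of presets.
+     (my-indent-closing-brace-p my-indent-parent-line-start 0)))
+  "Hypothetical value for `treesit-simple-indent-rules'.")
+#+end_src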
+
+When writing indent rules, you can use ‘treesit-check-indent’ to
+check if your indentation is correct. To debug what went wrong, set
+‘treesit--indent-verbose’ to non-nil. Then when you indent, Emacs
+tells you which rule is applied in the echo area.
+
+#+begin_src elisp
+(defvar typescript-mode-indent-rules
+  (let ((offset typescript-indent-offset))
+    `((typescript
+       ;; This rule matches if node at point is "}", ANCHOR is the
+       ;; parent node’s BOL, and offset is 0.
+       ((node-is "}") parent-bol 0)
+       ((node-is ")") parent-bol 0)
+       ((node-is "]") parent-bol 0)
+       ((node-is ">") parent-bol 0)
+       ((node-is "\\.") parent-bol ,offset)
+       ((parent-is "ternary_expression") parent-bol ,offset)
+       ((parent-is "named_imports") parent-bol ,offset)
+       ((parent-is "statement_block") parent-bol ,offset)
+       ((parent-is "type_arguments") parent-bol ,offset)
+       ((parent-is "variable_declarator") parent-bol ,offset)
+       ((parent-is "arguments") parent-bol ,offset)
+       ((parent-is "array") parent-bol ,offset)
+       ((parent-is "formal_parameters") parent-bol ,offset)
+       ((parent-is "template_substitution") parent-bol ,offset)
+       ((parent-is "object_pattern") parent-bol ,offset)
+       ((parent-is "object") parent-bol ,offset)
+       ((parent-is "object_type") parent-bol ,offset)
+       ((parent-is "enum_body") parent-bol ,offset)
+       ((parent-is "arrow_function") parent-bol ,offset)
+       ((parent-is "parenthesized_expression") parent-bol ,offset)
+       ...))))
+#+end_src
+
+Then you set ‘treesit-simple-indent-rules’ to your rules, and call
+‘treesit-major-mode-setup’:
+
+#+begin_src elisp
+(setq-local treesit-simple-indent-rules typescript-mode-indent-rules)
+(treesit-major-mode-setup)
+#+end_src
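+
+The debugging aids mentioned earlier can be tried with M-: in a buffer
+that uses your rules. In this sketch, typescript-mode stands for any
+existing mode whose indentation you trust; whether such a mode is
+installed is an assumption.
+
+#+begin_src elisp
+;; Report in the echo area which rule fired for each indented line.
+(setq treesit--indent-verbose t)
+
+;; Compare the current buffer's indentation with what the given
+;; major mode would produce; the differences are shown as a diff.
+(treesit-check-indent 'typescript-mode)
+#+end_src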
+
+* Imenu
+
+Not much to say, except to use ‘treesit-induce-sparse-tree’ (and
+explicitly pass a LIMIT argument: most of the time you don't need more
+than 10). See ‘js--treesit-imenu-1’ in js.el for an example.
+
+Once you have the index builder, set ‘imenu-create-index-function’ to
+it.
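+
+A minimal sketch of such an index builder for Python follows. It is
+not the code in js.el or python.el: the my-* names are made up, only
+function definitions are indexed, and nested definitions are
+flattened into one list.
+
+#+begin_src elisp
+(require 'treesit)
+
+(defun my-python--imenu-flatten (tree)
+  "Turn sparse TREE into a flat list of (NAME . POSITION).
+TREE has the form (NODE . CHILDREN); NODE may be nil at the root."
+  (let ((node (car tree))
+        (children (mapcan #'my-python--imenu-flatten (cdr tree))))
+    (if node
+        (cons (cons (treesit-node-text
+                     (treesit-node-child-by-field-name node "name")
+                     t)
+                    (treesit-node-start node))
+              children)
+      children)))
+
+(defun my-python-imenu-index ()
+  "Return an imenu index of the function definitions in the buffer."
+  (my-python--imenu-flatten
+   (treesit-induce-sparse-tree
+    (treesit-buffer-root-node 'python)
+    "function_definition" nil 10)))
+
+;; In the mode body:
+;;   (setq-local imenu-create-index-function #'my-python-imenu-index)
+#+end_src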
+
+* Navigation
+
+Mainly ‘beginning-of-defun-function’ and ‘end-of-defun-function’.
+You can find the end of a defun with something like
+
+(treesit-search-forward-goto "function_definition" 'end)
+
+where "function_definition" matches the node type of a function
+definition node, and 'end means we want to go to the end of that node.
+
+Tree-sitter has default implementations for
+‘beginning-of-defun-function’ and ‘end-of-defun-function’. So for
+ordinary languages, it is enough to set ‘treesit-defun-type-regexp’
+to something that matches all the defun struct types in the language,
+and call ‘treesit-major-mode-setup’. For example,
+
+#+begin_src emacs-lisp
+(setq-local treesit-defun-type-regexp (rx bol
+                                          (or "function" "class")
+                                          "_definition"
+                                          eol))
+(treesit-major-mode-setup)
+#+end_src
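+
+Before relying on such a regexp, you can check it against a few node
+type names. The type names below are taken from the Python grammar,
+and the variable name is made up.
+
+#+begin_src emacs-lisp
+(defvar my-defun-type-regexp
+  (rx bol (or "function" "class") "_definition" eol)
+  "Hypothetical copy of the regexp above, for experimenting.")
+
+;; Matches at position 0, so this returns 0.
+(string-match-p my-defun-type-regexp "function_definition")
+;; Also matches.
+(string-match-p my-defun-type-regexp "class_definition")
+;; Does not match, so this returns nil.
+(string-match-p my-defun-type-regexp "decorated_definition")
+#+end_src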
+
+* Which-func
+
+If you have an imenu implementation, set ‘which-func-functions’ to
+nil, and which-func will automatically use imenu’s data.
+
+If you want an independent implementation for which-func, you can
+find the current function by going up the tree and looking for the
+function_definition node. See the function below for an example.
+Since Python allows nested function definitions, that function keeps
+going until it reaches the root node, and records all the function
+names along the way.
+
+#+begin_src elisp
+(defun python-info-treesit-current-defun (&optional include-type)
+  "Identical to `python-info-current-defun' but use tree-sitter.
+For INCLUDE-TYPE see `python-info-current-defun'."
+  (let ((node (treesit-node-at (point)))
+        (name-list ())
+        (type nil))
+    (cl-loop while node
+             if (pcase (treesit-node-type node)
+                  ("function_definition"
+                   (setq type 'def))
+                  ("class_definition"
+                   (setq type 'class))
+                  (_ nil))
+             do (push (treesit-node-text
+                       (treesit-node-child-by-field-name node "name")
+                       t)
+                      name-list)
+             do (setq node (treesit-node-parent node))
+             finally return (concat (if include-type
+                                        (format "%s " type)
+                                      "")
+                                    (string-join name-list ".")))))
+#+end_src
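+
+One way to hook the function above into which-func is sketched below.
+The wrapper name is made up. The wrapper exists because the function
+above returns an empty string outside any definition, whereas
+which-func expects nil when there is nothing to show.
+
+#+begin_src elisp
+(defun my-python-which-func ()
+  "Return the current defun name, or nil if point is outside any."
+  (let ((name (python-info-treesit-current-defun)))
+    (unless (equal name "") name)))
+
+;; In the mode body:
+;;   (setq-local which-func-functions (list #'my-python-which-func))
+#+end_src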
+
+* More features?
+
+Obviously this list is just a starting point. If there are features in
+the major mode that would benefit from a parse tree, adding tree-sitter
+support for them would be great. But in the minimal case, just adding
+font-lock is awesome.
+
+* Common tasks
+
+How to...
+
+** Get the buffer text corresponding to a node?
+
+(treesit-node-text node)
+
+BTW ‘treesit-node-string’ does something different: it returns the
+string representation of the node itself, not the buffer text.
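+
+A small sketch contrasting the two, assuming the Python language
+definition is installed:
+
+#+begin_src elisp
+(require 'treesit)
+
+(with-temp-buffer
+  (insert "x = 1\n")
+  (treesit-parser-create 'python)
+  (let ((node (treesit-buffer-root-node 'python)))
+    (list
+     ;; The buffer text that the node spans, without text properties.
+     (treesit-node-text node t)
+     ;; The node itself printed as a string, not the buffer text.
+     (treesit-node-string node))))
+#+end_src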
+
+** Scan the whole tree for stuff?
+
+(treesit-search-subtree)
+(treesit-search-forward)
+(treesit-induce-sparse-tree)
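+
+The first two functions could be called as in this sketch, which
+assumes the Python language definition is installed. The predicates
+are regexps matched against node types.
+
+#+begin_src elisp
+(require 'treesit)
+
+(with-temp-buffer
+  (insert "import os\n\ndef foo():\n    pass\n")
+  (treesit-parser-create 'python)
+  (let* ((root (treesit-buffer-root-node 'python))
+         ;; Search the subtree below ROOT for a matching node.
+         (defun-node (treesit-search-subtree
+                      root "function_definition"))
+         ;; Search forward through the tree from the first leaf node.
+         (next-id (treesit-search-forward
+                   (treesit-node-at (point-min)) "identifier")))
+    (list (treesit-node-type defun-node)
+          (treesit-node-text next-id t))))
+#+end_src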
+
+** Move to next node that...?
+
+(treesit-search-forward-goto)
+
+** Get the root node?
+
+(treesit-buffer-root-node)
+
+** Get the node at point?
+
+(treesit-node-at (point))
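+
+The last two snippets combine into a small inspection command. This
+is a hypothetical helper (the name is made up) that assumes the
+current buffer already has a parser.
+
+#+begin_src elisp
+(defun my-describe-node-at-point ()
+  "Show the type and extent of the node at point, and the root's type."
+  (interactive)
+  (let ((node (treesit-node-at (point)))
+        (root (treesit-buffer-root-node)))
+    (if node
+        (message "%s from %d to %d, under root %s"
+                 (treesit-node-type node)
+                 (treesit-node-start node)
+                 (treesit-node-end node)
+                 (treesit-node-type root))
+      (message "No node at point"))))
+#+end_src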
+
+* Manual
+
+I suggest you read the manual section for tree-sitter in Info. The
+section is Parsing Program Source. Typing
+
+ C-h i d m elisp RET g Parsing Program Source RET
+
+will bring you to that section. You can also read the HTML version
+under /html-manual in this directory. I find the HTML version easier
+to read. You don’t need to read through every sentence; just read the
+text paragraphs and glance over function names.