Spelunking CodeMirror’s Guts

As part of a recent project I’ve needed to integrate CodeMirror, the web-based text editor that powers Chrome Developer Tools (the Console and Sources tab), Obsidian, Replit, &c. I’ve always found text editor construction captivating—they’re the central tools of knowledge work—yet the existing technical literature does not seem commensurate. The latest major CodeMirror release (v6) is appreciably modular, so I couldn’t help peeling back a layer and seeing how a small part of it works.

I. Interface

CodeMirror is architected as several independent packages (@codemirror/state, @codemirror/view, @codemirror/language, @codemirror/commands, @codemirror/search, &c.) which are mix and matched to assemble a CodeMirror instance. The core data structures are found in @codemirror/state, with Text forming the base primitive, undergirding all other constructs—src/text.ts contains no import statements—and it’s what we’ll be looking at. Text’s public API:

class Text {
  length: number
  lines: number

  replace(from: number, to: number, text: Text): Text
  append(other: Text): Text 
  slice(from: number, to: number): Text 

  lineAt(pos: number): Line 
  line(n: number): Line 
  eq(other: Text): boolean 
  iter(): TextIterator 
  iterRange(from: number, to: number): TextIterator 
  iterLines(from: number, to: number): TextIterator 

  static of(text: Array<string>): Text 
}

The static method of is the entrypoint, the effective constructor. With replace being the primary operation. Substituting a subsequence, but also: inserting at an arbitrary interior position, when from and end are equal; deleting at an arbitrary interior position, when text.length < (to - from); and deleting the entire sequence, when to - from == length and text is empty.

II. Construction

Text is an abstract class that is concretely instantiated as either a TextLeaf containing lines of text, or a TextNode containing TextLeafs, forming an n‑ary tree. Expressed as types it’s the following, where length is the number of characters the subtree contains and is not used during construction. Naturally, it’s recursive.

type Text = TextNode | TextLeaf

type TextNode = {
  children: Array<Text>
  length: number
  lines: number
}

type TextLeaf = {
  text: Array<string>
  length: number
}

The entrypoint to a Text object is the Text.of function. Converting from a regular runtime string object (well, an implicitly newline-delimited array of strings) to a Text instance.

A Text object is structured in a manner similar to a B+ Tree: internal nodes store intervals (TextNode) and leaves store references to values (TextLeaf). Internal nodes act as a directory, representing some range of text. (Much like similarly-structured storage indices.) The branch factor is defined by constants Tree.BranchShift (5) and its accompanying Tree.Branch (1 << Tree.BranchShift, 32)—these dictate the topography of the tree.

As Text.of has ownership over the entire string, the construction of a balanced tree is relatively undemanding.

The static constructor Text.of receives an array of lines, passes this sequence to TextLeaf.split, which then constructs all the leaves of the eventual tree (signature Array<string> → Array<TextLeaf> ). Each leaf node is bound to contain at most Tree.Branch lines, resulting in 25 2^5-line leaf chunks. It’s then the job of TextNode.from to build the interior nodes, to construct the directory.


This is best understood by example. Let’s use @codemirror/state’s own package.json file. It’s 36 lines. To strain the full call graph and get a more representative example let’s also set Tree.BranchShift to 2, making Tree.Branch 4. The result of Text.of([lines]) is then the following.

The leaves of the tree being the result of TextLeaf.split, the internal structure TextNode.from. The dimensions of the tree descend from Tree.Branch. The root must encompass all 36 lines, and have at most Tree.Branch children. (Which in practice becomes five, as the conditional guard is i22 i \leq 2^2, and it’s zero indexed.) Following, each child can hold at most 36 >> 2 (9) lines. Advancing the same logic, this child’s children can then hold at most 9 >> 2 (2) lines—but this is then clamped by a lower bound of Tree.Branch, leaving us with 4. Which is where the recursion terminates, returning a TextLeaf with at most four lines.

This process repeats, left to right. Grabbing all leaves required for the construction of a full subtree, then assembling interior nodes in a bottom-up, post-order fashion, until all leaves are processed. The depth of the tree being bound by log436\lceil \log_{4}{36} \rceil .

Constructing the first interior node above looks like this.

Each child can hold at most nine lines, so we grab as many TextLeafs as possible and recurse. Now each child can hold at most four lines, so we do the same, but there’s only a single TextLeaf available this time, so it’s returned and the recursion terminates. These two TextLeafs then become the children of the interior TextNode. Which then becomes the child of the root TextNode.


The constrained package.json example is helpful but non-representative. What does a real, non-toy Text instance look like? A larger test case of /usr/share/dict/words with the true Tree.BranchShift of 5 is instructive.

$ cat /usr/share/dict/words | wc -l
  235976

It’s too large to plot in its entirety, there’s roughly 10,000 nodes. (Which didn’t stop me from trying. It took dot a while but it ultimately spat out a 100MB SVG.) So let’s look at a illustrative slice: the middle child of the first 33 children, and its middle three children. To grasp the full perspective picture 32 more outgoing edges from the root each pointing to a child with 30 more internal nodes than seen here, each holding seven TextLeafs, with 32 lines in each.

The depth of this tree is log32235 976=4 \lceil \log_{32}{235\ 976} \rceil = 4. The branch factor of 25 2^5 creates considerable fanout: A total of 235,976 lines creates 7375 TextLeafs (n25 \lceil \frac{n}{2^5} \rceil). The same calculation determines the first level of the tree, as there are 32 TextNodes holding at most 7374 lines (235976 >> 5). Each of these then holds at most 7374 >> 5 (230) lines. Once more: 230 >> 5 = 7, but this is now confined by the lower bound of 252^5, leaving us with seven TextLeafs containing 32 lines each.

This all ultimately produces the desired result of a shallow tree. The benefits of shallowness: (1) all operations are logarithmic in the size of the tree, (2) there are fewer allocations needed during construction, (3) consequently there are fewer dereferences needed during traversal, and (4) it’s thereby more cache friendly.

III. Dynamism

The purpose of Text is to form the basis of a text editor: to make inserting, traversing, deleting, updating, undoing, redoing, folding, &c. seamless. This is notoriously rich in edge cases, with most expected behaviour poorly specified.

Complexity is tamed by adopting the Functional Core, Imperative Shell architecture, simplifying the implementation. Text is the functional core, the immutable (“persistent”) data structure. Concretely, this means that all Text.of calls, and all internal nodes, are immutable: once constructed their values do not change. Methods like replace and append are the imperative shell. They do not mutate this instance, but return new Text objects reflecting the desired changes. Leaving the original objects untouched, and retained in their entirety.

This would be horribly space inefficient, each change needing a copy of the entire structure, were it not for structural sharing. We don’t need to copy the entire object, only the parts that have changed. All other references remain intact. In practice only a single root-to-leaf path is newly constructed, the TextLeaf terminating the path and containing the change, and all ancestor TextNode’s connected up to the root. Which is why maintaining a balanced, shallow tree is important. The 235,000-line example earlier would only require four allocations, out of the ten thousand or so nodes, to make a change.


Continuing with the package.json example let’s apply this patch inserting 15 lines, amounting to a Text.replace(450, 450, text) call.

--- a/package.json
+++ b/package.json
@@ -17,6 +17,21 @@
   },
   "type": "module",
   "main": "dist/index.cjs",
+  "dependencies": {
+    "package_one": "1.0.0",
+    "package_two": "1.0.0",

+    "package_twelve": "1.0.0",
+    "package_thirteen": "1.0.0"
+  },
   "exports": {
     "import": "./dist/index.js",
     "require": "./dist/index.cjs"

Due to structural sharing the same tree from above now looks like this. The only newly-constructed nodes are those directly in the root-to-leaf path, all others remain unmarred.

Internally, the replace call is run on the root TextNode. It first locates the subtree enclosed by from and to, recursing to the particular child containing the replace range, and invoking replace on that node. This will return a new TextLeaf or TextNode with the text concatenated. If the size of this element is an acceptable length (within ±1 of Tree.BranchShift) then it’s packaged in a new TextNode and returned up the call stack. If it under- or overflows this range then this updated node is passed to Text.replace.

Aggregation is then done by decomposing the tree into before-from parts, updated-node parts, and after-to parts. Handling the edge cases of an inclusive to range, an inclusive from range, and then merging all parts into a single buffer.

A crucial step is then—to neatly tie it all together—executing TextNode.from on this buffer. Thereby ensuring that this subtree is well-balanced, merging TextLeafs where possible, and returning fresh TextNodes.

July 2023


Appendix

A Python reproduction of Text.from.

from typing import List


TREE_BRANCH_SHIFT = 5
TREE_BRANCH = 1 << TREE_BRANCH_SHIFT


class Text:
    @staticmethod
    def of(text: List[str]) -> "Text":
        if len(text) == 0:
            raise ValueError("A document must have at least one line")
        elif len(text) == 1 and text[0] == "":
            return TextLeaf([""], 0)
        elif len(text) <= TREE_BRANCH:
            return TextLeaf(text)
        else:
            return TextNode.from_(TextLeaf.split(text, []))


class TextNode(Text):
    def __init__(self, children: List[Text], length: int):
        self.children, self.length = children, length
        self.lines: int = sum(child.lines for child in children)

    @staticmethod
    def from_(children: List[Text], length: int = None) -> Text:
        if length is None:
            length = sum(child.length + 1 for child in children) - 1

        def add(child: Text):
            nonlocal current_lines, current_length

            last = current_chunk[-1] if len(current_chunk) > 0 else None

            if child.lines > min_chunk and (
                current_lines > min_chunk or current_lines == 0
            ):
                flush()
                chunked.append(child)
            elif (
                isinstance(child, TextLeaf)
                and current_lines
                and isinstance(last, TextLeaf)
                and child.lines + last.lines <= TREE_BRANCH
            ):
                current_lines += child.lines
                current_length += child.length + 1
                current_chunk[-1] = TextLeaf(
                    last.text + child.text, last.length + 1 + child.length
                )
            else:
                if (current_lines + child.lines) > chunk:
                    flush()
                current_lines += child.lines
                current_length += child.length + 1
                current_chunk.append(child)

        def flush():
            nonlocal current_lines, current_length, current_chunk
            if current_lines == 0:
                pass
            else:
                if len(current_chunk) == 1:
                    result = current_chunk[0]
                else:
                    result = TextNode.from_(current_chunk, current_length)

                chunked.append(result)
                current_length = -1
                current_lines, current_chunk = 0, []

        lines = sum(child.lines for child in children)
        chunk = max(TREE_BRANCH, lines >> TREE_BRANCH_SHIFT)
        min_chunk = chunk >> 1

        chunked: List[Text] = []
        current_chunk: List[Text] = []
        current_lines, current_length = 0, -1

        for child in children:
            add(child)
        flush()

        if len(chunked) == 1:
            return chunked[0]
        else:
            return TextNode(chunked, length)


class TextLeaf(Text):
    def __init__(self, text: List[str], length: int = None):
        self.text, self.length = text, len(text) if length is None else length

    @staticmethod
    def split(text: List[str], result: List[str]) -> List[Text]:
        buffer, num_chars = [], -1

        for line in text:
            buffer.append(line)
            num_chars += len(line) + 1

            if len(buffer) == TREE_BRANCH:
                result.append(TextLeaf(buffer, num_chars))
                buffer, num_chars = [], -1
        if num_chars > -1:
            result.append(TextLeaf(buffer, num_chars))

        return result

    @property
    def lines(self) -> int:
        return len(self.text)

    def __repr__(self) -> str:
        return self.text[0] if len(self.text) > 0 else ""