Spelunking CodeMirror’s Guts
As part of a recent project I’ve needed to integrate CodeMirror, the web-based text editor that powers Chrome Developer Tools (the Console and Sources tab), Obsidian, Replit, &c. I’ve always found text editor construction captivating—they’re the central tools of knowledge work—yet the existing technical literature does not seem commensurate. The latest major CodeMirror release (v6) is appreciably modular, so I couldn’t help peeling back a layer and seeing how a small part of it works.
I. Interface
CodeMirror is architected as several independent packages
(@codemirror/state
, @codemirror/view
,
@codemirror/language
, @codemirror/commands
,
@codemirror/search
, &c.) which are
mix and matched to assemble a CodeMirror instance.
The core data structures are found in
@codemirror/state
, with Text
forming the base primitive, undergirding all
other constructs—src/text.ts
contains no import
statements—and it’s what we’ll be looking at.
Text
’s public API:
class Text {
: number
length: number
lines
replace(from: number, to: number, text: Text): Text
append(other: Text): Text
slice(from: number, to: number): Text
lineAt(pos: number): Line
line(n: number): Line
eq(other: Text): boolean
iter(): TextIterator
iterRange(from: number, to: number): TextIterator
iterLines(from: number, to: number): TextIterator
static of(text: Array<string>): Text
}
The static method of
is the entrypoint, the effective
constructor. With replace
being the primary operation.
Substituting a subsequence, but also: inserting at an arbitrary
interior position, when from
and end
are
equal; deleting at an arbitrary interior position, when
text.length < (to - from)
; and deleting the entire
sequence, when to - from == length
and text
is
empty.
II. Construction
Text
is an abstract class that is concretely instantiated as either a
TextLeaf
containing lines of text, or a
TextNode
containing TextLeaf
s, forming an
n‑ary tree. Expressed as types it’s the
following, where length
is the number of characters the subtree
contains and is not used during construction. Naturally, it’s recursive.
type Text = TextNode | TextLeaf
type TextNode = {
: Array<Text>
children: number
length: number
lines
}
type TextLeaf = {
: Array<string>
text: number
length }
The entrypoint to a Text
object is the
Text.of
function. Converting from a regular runtime
string
object (well, an implicitly newline-delimited array of
strings) to a Text
instance.
A Text
object is structured in a manner similar to a
B+ Tree: internal
nodes store intervals (TextNode
) and leaves store references to
values (TextLeaf
). Internal nodes act as a directory,
representing some range of text. (Much like
similarly-structured storage indices.) The branch factor is defined by
constants Tree.BranchShift
(5) and its accompanying
Tree.Branch
(1 << Tree.BranchShift
,
32)—these dictate the topography of the tree.
As Text.of
has ownership over the entire string, the construction of a
balanced tree is relatively undemanding.
The static constructor Text.of
receives an array of lines,
passes this sequence to TextLeaf.split
, which then
constructs all the leaves of the eventual tree (signature
Array<string> → Array<TextLeaf>
). Each
leaf node is bound to contain at most
Tree.Branch
lines, resulting in
-line leaf chunks. It’s then the job
of TextNode.from
to build the interior nodes, to construct
the directory.
This is best understood by example. Let’s use
@codemirror/state
’s own
package.json
file. It’s 36 lines. To strain the full call graph and get a more
representative example let’s also set Tree.BranchShift
to
2, making Tree.Branch
4. The result of
Text.of([lines])
is then the following.
The leaves of the tree being the result of TextLeaf.split
,
the internal structure TextNode.from
. The dimensions of the
tree descend from Tree.Branch
. The root must encompass all
36 lines, and have at most Tree.Branch
children. (Which in
practice becomes five, as the conditional guard is
,
and it’s zero indexed.) Following, each child can hold at most
36 >> 2
(9) lines. Advancing the same logic, this
child’s children can then hold at most 9 >> 2
(2) lines—but
this is then clamped by a lower bound of Tree.Branch
, leaving us with
4. Which is where the recursion terminates, returning a
TextLeaf
with at most four lines.
This process repeats, left to right. Grabbing all leaves required for the construction of a full subtree, then assembling interior nodes in a bottom-up, post-order fashion, until all leaves are processed. The depth of the tree being bound by .
Constructing the first interior node above looks like this.
Each child can hold at most nine lines, so we grab as many
TextLeaf
s as possible and recurse.
Now each child can hold
at most four lines, so we do the same, but there’s only a single
TextLeaf
available this time, so it’s returned and the
recursion terminates. These two TextLeaf
s then become the
children of the interior TextNode
. Which then becomes the
child of the root TextNode
.
The constrained package.json
example is helpful but non-representative.
What does a real, non-toy Text
instance look like? A
larger test case of /usr/share/dict/words
with the true
Tree.BranchShift
of 5 is instructive.
$ cat /usr/share/dict/words | wc -l
235976
It’s too large to plot in its entirety, there’s roughly 10,000 nodes.
(Which didn’t stop me from trying. It took dot
a while but it
ultimately spat out a 100MB SVG.) So let’s look at a illustrative slice:
the middle child of the first 33 children, and its middle three
children. To grasp the full perspective picture 32 more outgoing edges
from the root each pointing to a child with 30 more internal nodes than
seen here, each holding seven TextLeaf
s, with 32 lines in
each.
The depth of this tree is
.
The branch factor of
creates
considerable fanout: A total of 235,976 lines creates 7375
TextLeaf
s
().
The same calculation determines the first level of the tree, as
there are 32 TextNode
s holding at most 7374 lines (235976 >> 5
). Each of these then holds at most 7374 >> 5
(230)
lines. Once more: 230 >> 5 = 7
, but this is now
confined by the lower bound of
,
leaving us with seven TextLeaf
s containing 32 lines each.
This all ultimately produces the desired result of a shallow tree. The benefits of shallowness: (1) all operations are logarithmic in the size of the tree, (2) there are fewer allocations needed during construction, (3) consequently there are fewer dereferences needed during traversal, and (4) it’s thereby more cache friendly.
III. Dynamism
The purpose of Text
is to form the basis of a text
editor: to make inserting, traversing, deleting, updating, undoing, redoing,
folding, &c. seamless. This is notoriously rich in edge
cases, with most expected behaviour poorly specified.
Complexity is tamed by adopting the
Functional Core, Imperative Shell
architecture, simplifying the implementation.
Text
is the functional core, the immutable (“persistent”)
data structure. Concretely, this means that all
Text.of
calls, and all internal nodes, are immutable: once
constructed their values do not change. Methods like
replace
and append
are the imperative shell.
They do not mutate this instance, but return new
Text
objects reflecting the desired changes. Leaving the original
objects untouched, and retained in their entirety.
This would be horribly space inefficient, each change needing a copy of
the entire structure, were it not for structural sharing. We don’t need
to copy the entire object, only the parts that have changed. All other
references remain intact. In practice only a single root-to-leaf path
is newly constructed, the TextLeaf
terminating the path and containing the change, and all ancestor
TextNode
’s connected up to the root. Which is why maintaining
a balanced, shallow tree is important. The 235,000-line example earlier
would only require four allocations, out of the ten thousand or so
nodes, to make a change.
Continuing with the package.json
example let’s apply this
patch inserting 15 lines, amounting to a
Text.replace(450, 450, text)
call.
--- a/package.json
+++ b/package.json
@@ -17,6 +17,21 @@
},
"type": "module",
"main": "dist/index.cjs",+ "dependencies": {
+ "package_one": "1.0.0",
+ "package_two": "1.0.0",
…+ "package_twelve": "1.0.0",
+ "package_thirteen": "1.0.0"
+ },
"exports": {
"import": "./dist/index.js", "require": "./dist/index.cjs"
Due to structural sharing the same tree from above now looks like this. The only newly-constructed nodes are those directly in the root-to-leaf path, all others remain unmarred.
Internally, the replace
call is run on the root TextNode
.
It first locates the subtree enclosed by from
and
to
, recursing to the particular child containing the
replace range, and invoking replace
on that node. This will
return a new TextLeaf
or TextNode
with the
text
concatenated. If the size of this element is an
acceptable length (within ±1 of Tree.BranchShift
) then it’s
packaged in a new TextNode
and returned up the call stack.
If it under- or overflows this range then this updated node is passed to
Text.replace
.
Aggregation is then done by decomposing the tree into before-from
parts, updated-node parts, and after-to
parts. Handling the
edge cases of an inclusive to
range, an inclusive
from
range, and then merging all parts into a single
buffer.
A crucial step is then—to neatly tie it all together—executing
TextNode.from
on this buffer. Thereby ensuring that this
subtree is well-balanced, merging TextLeaf
s where possible,
and returning fresh TextNode
s.
July 2023
Appendix
A Python reproduction of Text.from
.
from typing import List
= 5
TREE_BRANCH_SHIFT = 1 << TREE_BRANCH_SHIFT
TREE_BRANCH
class Text:
@staticmethod
def of(text: List[str]) -> "Text":
if len(text) == 0:
raise ValueError("A document must have at least one line")
elif len(text) == 1 and text[0] == "":
return TextLeaf([""], 0)
elif len(text) <= TREE_BRANCH:
return TextLeaf(text)
else:
return TextNode.from_(TextLeaf.split(text, []))
class TextNode(Text):
def __init__(self, children: List[Text], length: int):
self.children, self.length = children, length
self.lines: int = sum(child.lines for child in children)
@staticmethod
def from_(children: List[Text], length: int = None) -> Text:
if length is None:
= sum(child.length + 1 for child in children) - 1
length
def add(child: Text):
nonlocal current_lines, current_length
= current_chunk[-1] if len(current_chunk) > 0 else None
last
if child.lines > min_chunk and (
> min_chunk or current_lines == 0
current_lines
):
flush()
chunked.append(child)elif (
isinstance(child, TextLeaf)
and current_lines
and isinstance(last, TextLeaf)
and child.lines + last.lines <= TREE_BRANCH
):+= child.lines
current_lines += child.length + 1
current_length -1] = TextLeaf(
current_chunk[+ child.text, last.length + 1 + child.length
last.text
)else:
if (current_lines + child.lines) > chunk:
flush()+= child.lines
current_lines += child.length + 1
current_length
current_chunk.append(child)
def flush():
nonlocal current_lines, current_length, current_chunk
if current_lines == 0:
pass
else:
if len(current_chunk) == 1:
= current_chunk[0]
result else:
= TextNode.from_(current_chunk, current_length)
result
chunked.append(result)= -1
current_length = 0, []
current_lines, current_chunk
= sum(child.lines for child in children)
lines = max(TREE_BRANCH, lines >> TREE_BRANCH_SHIFT)
chunk = chunk >> 1
min_chunk
= []
chunked: List[Text] = []
current_chunk: List[Text] = 0, -1
current_lines, current_length
for child in children:
add(child)
flush()
if len(chunked) == 1:
return chunked[0]
else:
return TextNode(chunked, length)
class TextLeaf(Text):
def __init__(self, text: List[str], length: int = None):
self.text, self.length = text, len(text) if length is None else length
@staticmethod
def split(text: List[str], result: List[str]) -> List[Text]:
buffer, num_chars = [], -1
for line in text:
buffer.append(line)
+= len(line) + 1
num_chars
if len(buffer) == TREE_BRANCH:
buffer, num_chars))
result.append(TextLeaf(buffer, num_chars = [], -1
if num_chars > -1:
buffer, num_chars))
result.append(TextLeaf(
return result
@property
def lines(self) -> int:
return len(self.text)
def __repr__(self) -> str:
return self.text[0] if len(self.text) > 0 else ""