How Diff Algorithms Work (Myers' Algorithm Explained)
Every time you run git diff, use a code review tool, or paste two texts into a text diff utility, a diff algorithm decides exactly what changed between the two versions. The most widely used algorithm — Myers' diff, published by Eugene W. Myers in 1986 — is elegant, efficient, and used in Git, most code review platforms, and most browser-based diff tools. Understanding how it works makes you a better user of diff tools and gives you insight into why diffs sometimes look different from what you expected.
The Problem Diff Algorithms Solve
The fundamental problem a diff algorithm solves is: given two sequences of lines (or characters), find the smallest set of additions and deletions that transforms one sequence into the other. This is called the Edit Distance problem or the Shortest Edit Script (SES) problem. A 'shortest edit script' is the minimum number of insertions and deletions needed to convert text A into text B. There are always many possible edit scripts — many ways you could describe the transformation — but the shortest one produces the cleanest, most readable diff output. Consider a simple example. Original text: ['The', 'cat', 'sat']. Modified text: ['The', 'dog', 'sat']. One edit script is: delete 'cat', insert 'dog'. Another is: delete all three words, insert all three. The first is the shortest edit script (2 operations vs 6 operations) and is clearly the right answer. The diff algorithm finds this automatically. For real documents with hundreds or thousands of lines, finding the shortest edit script is computationally expensive if done naively. Checking all possible ways to align two documents of N lines produces an exponential number of possibilities. Efficient diff algorithms solve this in polynomial time by exploiting the structure of the problem. The edit script produced by the algorithm is what you see in the diff output: lines marked as additions (+) were inserted, lines marked as deletions (-) were removed, and unchanged lines were kept. The algorithm's job is to find the assignment of + and - to lines that minimizes the total number of changes. Before Myers' algorithm, the standard approach was the Longest Common Subsequence (LCS) algorithm, which finds the longest sequence of lines that appears in both documents in the same order. The diff is everything that is not in the LCS. LCS-based algorithms work but are slower than Myers' approach for most practical inputs.
Myers' Algorithm: Core Intuition
Eugene Myers published his diff algorithm in 1986 in a paper titled 'An O(ND) Difference Algorithm and Its Variations', where N is the total length of the two sequences and D is the number of differences. The key insight is that the algorithm's runtime is proportional to the size of the diff, not the size of the documents — if two large documents have only a few differences, the algorithm runs very fast. Myers models the problem as finding the shortest path through a grid. Imagine a grid where the X axis represents positions in the original document and the Y axis represents positions in the modified document. A horizontal move means 'delete a line from the original'. A vertical move means 'insert a line from the modified'. A diagonal move means 'keep a matching line unchanged'. The shortest path from the top-left corner (start of both documents) to the bottom-right corner (end of both documents) corresponds to the shortest edit script. Diagonal moves are 'free' — they represent matched lines that do not count as an edit. Horizontal and vertical moves each cost one edit operation. Myers' algorithm finds this shortest path using a frontier-search strategy. It starts at the origin and explores all possible edit scripts of length 0 (no changes), then length 1 (one change), then length 2, and so on, extending diagonal moves as far as possible at each step. It stops when the frontier reaches the bottom-right corner. The 'V' array at the heart of the algorithm stores the furthest-reaching position on each diagonal at each edit distance. By extending diagonal moves (matching lines) greedily at each step, the algorithm makes maximum use of unchanged lines and minimizes the number of edits it needs to account for. The result is a diff that finds the optimal edit script — the cleanest, most natural-looking set of changes — in time proportional to the size of the changes, which is why it remains the standard algorithm almost forty years after publication.
Why Myers' Algorithm Produces Clean-Looking Diffs
One of the reasons Myers' algorithm has remained standard is that it produces diffs that match human intuition about what changed — the diff looks 'right' rather than technically correct but confusing. The algorithm tends to group related changes together rather than scattering them. When you add a paragraph in the middle of a document, Myers produces a diff that shows the paragraph as one cohesive addition rather than distributing it as individual-line insertions interleaved with context lines. The 'prefer later changes' bias in Myers' algorithm means that when there are multiple equally short edit scripts, it tends to place changes toward the end of a matching region rather than the beginning. This aligns with how programmers and writers think about edits — a change at the end of a function or paragraph is described more naturally as an addition after the existing content than as a modification of the beginning. The handling of duplicate lines is where Myers' output can sometimes surprise users. If the same line appears multiple times in a document — a common occurrence in code with repeated patterns — the algorithm may match different instances than you expect. For example, if you have three identical closing braces in a function and you add a new line before the last one, Myers might show the match aligned differently than you mentally expect, producing a diff that shows the same number of changes but aligns them differently. Git uses Myers' algorithm as its default diff algorithm. You can see this by running `git diff` — the output is produced by Myers. Git also offers alternatives: the `patience` algorithm (which produces more readable diffs for code by matching unique lines first) and the `histogram` algorithm (an improved patience algorithm). These are alternative heuristics on top of the same fundamental edit distance problem. For browser text diff tools, Myers' algorithm is the standard choice because it is fast, produces clean output, and is well-understood. The implementation in JavaScript runs efficiently for document-sized inputs (up to several thousand lines) without requiring optimization.
Edge Cases and Limitations of Text Diff
Understanding the edge cases in diff algorithms helps you interpret unexpected diff output and understand the limitations of text-based comparison. Moved blocks are the most common source of confusion. If you move a paragraph from section two to section five, the diff algorithm does not recognize this as a 'move' — it shows the text as deleted in section two and added in section five. This is technically correct — from the algorithm's perspective, the text disappeared in one location and appeared in another — but it makes the diff output harder to read than if the move were shown as a single operation. Some tools offer a 'moved blocks' detection feature that identifies and labels moved text, but standard Myers' diff does not include this. Whitespace and line ending differences create large-looking diffs for small actual changes. A file where every line had CRLF line endings changed to LF line endings will show every line as modified, even though the visible content is identical. The 'ignore whitespace' option in diff tools addresses this by normalizing whitespace before comparison. Large blocks of similar but changed text can produce diffs that look overwhelming even when the actual changes are minimal. If a paragraph was heavily revised, the diff may show 20 red lines followed by 20 green lines rather than a word-level comparison showing which specific words changed. Word-level or character-level diff modes address this by running the Myers algorithm at a finer granularity — comparing individual words or characters rather than complete lines. Binary files cannot be compared with text diff. A diff tool operates on text characters; binary files contain byte sequences that do not map to readable text. Attempting to diff binary files produces either an error or meaningless output. For binary files that contain text in a structured format (PDF, Word .docx, Excel .xlsx), extract the text content first. Very long single-line documents — some minified JavaScript files or minified CSS are a single very long line — will not produce useful diff output because the algorithm compares whole lines. Formatting the content with line breaks before comparison (using a JSON formatter, code beautifier, or CSS formatter) produces much more readable diff output.
Frequently Asked Questions
- What is the difference between Myers' algorithm and LCS-based diff?
- LCS (Longest Common Subsequence) and Myers' algorithm solve the same fundamental problem but with different approaches. LCS finds the longest sequence of lines common to both documents and marks everything else as changed. Myers' algorithm frames the problem as a shortest path problem and finds the minimum edit distance directly. Myers' is faster for practical inputs — especially when changes are sparse — because its runtime is O(ND) where D is the number of differences, while LCS is O(NM) where N and M are the full lengths of both documents. The output quality is similar, but Myers' tends to produce cleaner diffs for typical code and document editing patterns.
- Why does Git's diff sometimes look different from what I expected?
- Git uses Myers' algorithm by default, which sometimes aligns changes differently than your mental model of what you edited. This often happens with repeated lines, indentation changes, or added/removed blank lines. You can try different diff algorithms: `git diff --diff-algorithm=patience` often produces more intuitive diffs for code by matching unique lines first. The `--ignore-all-space` flag removes whitespace-only changes. For word-level diffs, `git diff --word-diff` shows changes at the word level rather than line level, which is often clearer for prose and config files.
- Can diff algorithms compare images or binary files?
- Standard text diff algorithms like Myers' cannot compare binary files or images — they operate on sequences of text lines or characters. For images, specialized pixel-diff tools compare the visual content pixel by pixel or use perceptual hashing. For binary files like executables or archives, binary diff tools compare byte sequences directly. For structured binary formats that encode text (PDF, Word, Excel), the standard approach is to extract the text content and then apply a text diff to the extracted content.