There are at least 3 fundamentally different kinds of diff:
* Single-dimensional. Diffs of text lines are just this.
* Multi-dimensional. Diffs of words or characters are usually going to be this since lines still matter, but there are multiple approaches (line-first? weighted tokens?).
* Tree-based. Unfortunately, these are woefully scarce and poorly documented.
For text diffs, it's nontrivial to get the "missing newline at end of file" logic working.
For tree diffs, consider that for HTML something like `<p>x</p><p>y</p>` should be unmergeable, whereas `<b>x</b><b>y</b>` should be mergeable.
(Aside: the blind promotion of `<em>` and `<strong>` did great harm to the notion of semantic HTML. Most things people use italics for (book titles, thoughts, foreign words) are explicitly things that `<em>` should not be used for.)
Another thing I’ve encountered with tree/structured diffs is a concept of identity. diff([{id:1,name:foo}], [{id:2,name:foo}]) should show the object with id:1 removed and the object with id:2 added, not id changed from 1 to 2. Tough, because then your diffing algorithm needs to be aware of the object structure (imo using convention and saying “no objects can contain this key” is pretty tough when you accept arbitrary user-generated data).
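A minimal sketch of that identity-aware behavior in Python, assuming by convention (hypothetical here) that a top-level "id" key is the identity and may never be reported as changed:

```python
def diff_by_identity(old, new, key="id"):
    """Diff two lists of dicts, treating `key` as an immutable identity:
    a disappearing identity is a removal and a fresh identity is an
    addition, never an in-place change of `key` itself."""
    old_by_id = {o[key]: o for o in old}
    new_by_id = {n[key]: n for n in new}
    ops = []
    for ident, obj in old_by_id.items():
        if ident not in new_by_id:
            ops.append(("remove", obj))
    for ident, obj in new_by_id.items():
        if ident not in old_by_id:
            ops.append(("add", obj))
        elif obj != old_by_id[ident]:
            ops.append(("change", old_by_id[ident], obj))
    return ops

# [('remove', {'id': 1, 'name': 'foo'}), ('add', {'id': 2, 'name': 'foo'})]
print(diff_by_identity([{"id": 1, "name": "foo"}], [{"id": 2, "name": "foo"}]))
```

(Order is ignored here; a real structured diff would also have to track moves.)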
I love this. I think you could simplify it by generalizing to something like immutability: these keys can’t be changed; an object can only be destroyed and another created. A case of that is a primary key (maybe that’s the only case).
You can always represent a change as a removal and an addition. It’s smart to actually consider when you should. “Never” and “whenever possible” don’t seem like the best answers.
Though I would say that a diff has to define the set of operations allowed on the thing being diffed.
E.g., in the example scenario of diffing JSON objects, if a possible operation is a change to a property value (such as the "id" field), then the diff correctly deduces that the smallest possible change is indeed a change to that field.
However, if you define the set of operations to be only the replacement of an entire object (with no changing of the id field), then surely you can create a diff that produces the desired object-structure change. It would be a custom diff algorithm of course... but it'd be quite a useful one tbh.
I implemented a tree-based diff for a JSON superset: https://github.com/gritzko/go-rdx
It boils down to the single-dimensional case, much as a JSON document or DOM tree is represented as linear text.
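That linearization idea is easy to illustrate. A toy sketch in Python (not the go-rdx implementation): flatten the tree into a token sequence, then run an ordinary one-dimensional diff over it:

```python
import difflib

def flatten(node, depth=0):
    """Linearize a nested dict/list into (depth, token) pairs so a plain
    one-dimensional sequence diff can operate on the tree."""
    if isinstance(node, dict):
        for k in sorted(node):
            yield (depth, f"key:{k}")
            yield from flatten(node[k], depth + 1)
    elif isinstance(node, list):
        for item in node:
            yield from flatten(item, depth + 1)
    else:
        yield (depth, f"val:{node!r}")

a = list(flatten({"user": {"id": 1, "name": "foo"}}))
b = list(flatten({"user": {"id": 2, "name": "foo"}}))
for op in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
    print(op)  # 'equal'/'replace' runs over the flattened token stream
```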
Can you explain why the `p` example is unmergeable whereas the `b` one isn't? I can't see any difference between the two examples other than the tag used.
The first is two separate paragraphs - merging them into one block would change the structure of the document.
The second is two bold letters, one after another in a single word.
However if the html is "an application" more than it is "a document" - a b-tag with two letters might be meaningfully different from two b-tags in sequence (for example with css:)
b { display: block }
So, I'd say as a fragment two bold tags might be mergeable - but not in the general case?
Ed: i.e. if diffing input from an html input field (rich editor), merging bold tags would probably be what you want - when the first edit bolds the first letter, and the second edit bolds the second letter.
I've also worked with probabilistic diffs - like tree-based, but tolerant of parsing errors.
The creator of the Myers algorithm is Gene Myers. He also helped create the BLAST algorithm, one of the fastest and most important DNA and protein search algorithms, and implemented most of the original human genome assembly done by Celera. He also helped invent and publish the suffix array.
It seems he is still active in the bioinformatics space: https://github.com/thegenemyers/FASTK
Out of curiosity, are there any algorithms faster than BLAST? (For DNA search).
For DNA? Not sure. https://pmc.ncbi.nlm.nih.gov/articles/PMC3197634/ claims it's reached parity on protein sequences.
I'd much rather have a slower, but more accurate searcher, or one that was easier to use as an API.
Mildly related: my favorite tool for viewing git diffs is diff2html - a CLI that opens the diff in your browser with one command.
https://diff2html.xyz/ -- https://github.com/rtfpessoa/diff2html
Mine is meld. (On my phone now, so I cannot compare which one seems superior.)
A while ago I discovered the lesser known Tichy diff algorithm that seems to preserve more context and is better suited for big code refactors compared to Myers: https://www.researchgate.net/publication/220439403_The_Strin... via https://bryanpendleton.blogspot.com/2010/04/more-study-of-di...
While work on pure algorithms is invaluable, I always feel work on knowledge-augmented algorithms has lots of untapped potential. Two examples. One is recording key events like move and delete on a finer-grained timescale, or directly from editors, and then storing those as mutable metadata in commits that is only allowed to be used for diff generation. Since it's provable whether diffs are technically correct, these do not weaken the consistency guarantees while adding helpful context; they are also highly compressible and pruneable. Another is optimizing diffs for consumption by LLMs and letting those generate output for optimal human readability.
Do you have examples of any of these ideas being implemented? In general I agree, there’s so much opportunity for these “knowledge augmented” algorithms
Great work. I was just recently dealing with creating diffs in Go and faced the same problem of finding a good library. Some diffmatchpatch APIs expect/return escaped text, and I was screaming: why would you do that??? Why doesn't the library just return raw strings? I ended up using diffmatchpatch to get patch objects and then produced unified diffs with some vibecoding. I'll definitely try this out when I revisit this.
PS regarding readability, I think VSCode put a lot of effort into creating nice-to-read diffs (e.g. https://code.visualstudio.com/updates/v1_81#_diff-editor), some of which is done in the algorithm itself (https://code.visualstudio.com/updates/v1_78#_diff-algorithm-...). But apparently that's in TypeScript, and not all heuristics done there for an editor are suitable for a generic diff algorithm. Still, there might be something worth exploring.
I wonder about the importance of minimality: it itself seems like a heuristic for “interest” or some other thing that users of diffs actually care about.
For example, a diff like:
+ x += 1
- x -= 1
Seems almost useless: it doesn’t provide any context about the meaning of x and, as a result, nearly every source review tool provides unchanged lines in addition to highlighting the change. And, even then, by preventing comments on arbitrary lines of the file, GitHub’s code review makes it pretty hard to call out other relevant code.
A minimal diff is one where the number of edits is minimal. The context around edited lines does not count as edits, because those are matching lines. That said, minimal is definitely only a proxy; that’s why it is a good property to relax.
I wish that diff/patch would be able to better take moved data into account (not only as a deletion+addition but with proper semantics indicating the moved block). This would lead to both smaller and more readable patches.
I noticed that some people have worked on such an algorithm, e.g. https://en.wikipedia.org/wiki/User:Cacycle/diff
If you squint hard enough, that's also what `git` does at the file level, when it detects renames even if the file changed.
Have you seen https://semanticdiff.com/ ?
and https://github.com/Wilfred/difftastic
Apart from source code versioning, what are the other most important real-world use cases of diff algorithms?
I encountered one about 17 years ago. It was for diffing IP packets, TCP segments, and network event payloads. At the time I worked at RSA Security on a network forensics and security analytics product, written in a mix of C and C++. In one of the projects I worked on, we needed to let users diff the packets, segments, and payloads. Back then we were very conservative about adding third-party libraries to the product. I have written more about that culture here: https://news.ycombinator.com/item?id=39951673
Long story short, due to the conservative culture, most data structures and algorithms were implemented in house. The diff algorithm for packets/segments/payloads was written in house too and I was the one to write it.
My implementation was based on a straightforward dynamic programming solution to the longest common subsequence problem. If I recall correctly, it ran in O(mn) time and O(min(m, n)) space in the worst case, where m and n are the lengths of the two sequences. I knew there were more efficient algorithms, but this code was not performance critical. I chose to keep the implementation simple so anyone could understand it, learn it quickly, and fix bugs if they arose. It served us well for the next seven years until the product was replaced with a new one.
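For reference, a sketch of that classic approach (in Python rather than the original C/C++): the LCS length table computed row by row, keeping only two rows so the space stays O(min(m, n)). Recovering the actual edit script within that space bound takes Hirschberg's divide-and-conquer refinement:

```python
def lcs_length(a, b):
    """Longest common subsequence length in O(m*n) time and
    O(min(m, n)) space, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer, store rows for the shorter
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

assert lcs_length("ABCBDAB", "BDCABA") == 4  # e.g. "BCBA"
```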
On a related note, I sometimes miss that older style of software development where we would dive deep into a problem domain, master it, and design solutions ourselves. I am not being naively nostalgic though. I am very well aware that modern development, with its reliance on well established libraries, usually delivers much greater productivity and reliability. Still, I think the slower and more deliberate approach of building things from the ground up had a certain charm.
Aside from the others already mentioned, it's very useful in infrastructure-as-code contexts like Kubernetes.
I also used diff at work today to compare two different 'docker history' outputs, to get a high-level overview of the changes made by a contractor tasked with hardening a base image.
"Approval" / "Golden Master" / "Snapshot" / "Characterization" testing can be very helpful.
They all seem to be names for more or less the same idea.
The first time a test runs successfully, it auto-captures the output as a file. This is the "approved" output and is committed with the code or saved in whatever test system you use.
The next time the test runs, it captures the new output and auto-compares it with the approved output. If identical, the test passes. If different, the test fails and a human should investigate the diff.
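A minimal hand-rolled sketch of the pattern in Python (the libraries linked below do much more; `generate_report` is a hypothetical stand-in for the code under test):

```python
from pathlib import Path

def generate_report():
    """Hypothetical stand-in for the code under test."""
    return "total: 42\nerrors: 0\n"

def approve(name, output, approved_dir=Path("approved")):
    """First run: save `output` as the approved snapshot (commit it).
    Later runs: fail if `output` differs from the snapshot."""
    approved_dir.mkdir(exist_ok=True)
    snapshot = approved_dir / f"{name}.approved.txt"
    if not snapshot.exists():
        snapshot.write_text(output)
        return
    # A failing assert lets the test runner show the diff to investigate.
    assert output == snapshot.read_text(), f"{name} differs from approved output"

def test_report():
    approve("report", generate_report())
```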
The technique works with many types of data:
* Plain text.
* Images of UI components / rendered web pages. This can check that your code change or a new browser version does not unexpectedly change the appearance.
* Audio files created by audio processing code.
* Large text logs from code that has no other tests. This can help when refactoring, hopefully an accidental side effect will appear as an unexpected diff.
See:
* https://approvaltests.com/
* https://cucumber.io/blog/podcast/approval-testing/
* https://en.wikipedia.org/wiki/Characterization_test
One use case where I never want to miss it is in tests: Understanding what the differences between the expectation and the actual result are is invaluable.
- Backup and restore
- Integrity checks from a security perspective
- NLP: finding the same tokens in text
Etc.
There are a bunch of more (and less) specialised ones used for contract red-lining.
Also in the legal space, sorting through discovery can be incredibly tedious. There are lots of diff-based and diff-like solutions in this space; most are completely proprietary and undocumented.
We diff construction schedules! These tend to be massive Gantt charts (400-700 pages is common).
Minimizing DOM mutation operations - in React, for instance, but not only there.
Interesting, didn’t think of it that way
There is this debate of virtual DOM vs no virtual DOM, and from time to time you see people on HN claim how great vanilla JS is. Won't get into the former debate, but for the latter, people who make such comments probably aren't aware how different it is to create a UI as complex as Outlook/reddit/Spotify vs their personal website or a simple demo. For complex sites with lots of widgets and data, being able to write JSX and also efficiently update the DOM makes a huge difference. It is almost impossible to build and maintain a complex site with vanilla JS.
As someone mostly familiar with non-web UIs - isn't the real question "why aren't you using MVC instead of a big lump of spaghetti?"
Diff algorithms are astoundingly widely applicable.
curses had the task of updating your 2400 baud terminal screen to look like a TUI program's memory image of what should be on it. (The TUI might be vi, nethack, a menu system, a database entry screen, ntalk, whatever.) The simple solution is to repaint the whole screen. But 80x24 is 2000 bytes, which is 8 seconds at 2400 baud. Users won't use a program that takes 8 seconds to respond after their latest keystroke. So curses uses a diff algorithm and the terminal capabilities database to figure out a minimal set of updates to send over the serial line. (Some terminals have an escape sequence to insert or delete a line or a character, which can save a lot of data transmission, but others don't.) React's virtual DOM is the same idea.
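The core of that idea fits in a few lines. A toy Python sketch that resends only the rows that changed between two screen images (real curses diffs at the cell level and consults the terminal capabilities database for insert/delete escape sequences):

```python
def screen_updates(old, new):
    """Yield (row, text) updates that turn screen image `old` into `new`,
    instead of retransmitting the whole 80x24 screen."""
    for row, (a, b) in enumerate(zip(old, new)):
        if a != b:
            yield row, b  # move cursor to `row`, rewrite just that line

old = ["hello world", "score: 10", "~"]
new = ["hello world", "score: 11", "~"]
print(list(screen_updates(old, new)))  # only row 1 is resent
```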
If you run `du -k > tmp.du` you can later `du -k | diff -u tmp.du -` to see which directories have changed in size. This can be very useful when something is rapidly filling up your disk and you need to stop it, but you don't know what it is.
If you `sha1sum $(find -type f) > good.shas` you can later use diff in the same way to see which files, if any, have changed.
rsync uses rolling hashes to compute diffs between two computers at opposite ends of a slow or expensive connection in order to efficiently bring an out-of-date file up-to-date without transmitting too much data. It's like curses, but for files, and not limited by terminal capabilities.
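The weak rolling checksum at the heart of that is the part worth sketching: it updates in O(1) as the window slides one byte, which is what lets the sender find blocks matching the receiver's at any offset. A rough Python rendering of the rsync-style checksum:

```python
def rolling_checksums(data, block):
    """rsync-style weak checksum of every `block`-sized window over
    `data` (bytes), updated in O(1) per byte as the window slides."""
    MOD = 1 << 16
    a = sum(data[:block]) % MOD
    b = sum((block - i) * x for i, x in enumerate(data[:block])) % MOD
    yield 0, (b << 16) | a
    for i in range(len(data) - block):
        out, new = data[i], data[i + block]
        a = (a - out + new) % MOD        # drop the byte leaving the window,
        b = (b - block * out + a) % MOD  # add the byte entering it
        yield i + 1, (b << 16) | a

# One side hashes its blocks; the other slides this window to find matches.
for offset, weak in rolling_checksums(b"the quick brown fox", 8):
    print(offset, hex(weak))
```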
rdiff uses the rsync algorithm to efficiently store multiple versions of a file, which does not in general have to contain source code. Deduplicating filesystems can use this kind of approach, but often they don't. But it's common in data backup systems like Restic.
It's common for unit tests to say that the result of some expression should be some large data structure, and sometimes they fail by producing some slightly different large data structure. diff algorithms are very valuable in understanding why the test failed. py.test does this by default.
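For instance, with pytest a plain assert on two data structures is enough to get a structural diff on failure:

```python
def test_user_record():
    expected = {"id": 1, "name": "foo", "roles": ["admin"]}
    actual = {"id": 1, "name": "foo", "roles": ["admin", "guest"]}
    # On failure, pytest prints a diff of the two dicts,
    # pointing straight at the extra "guest" entry.
    assert actual == expected
```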
Genomic analysis works by finding which parts of two versions of a genome are the same and which parts are different; BLAST, by Gene Myers, is a common algorithm for this. This is useful for an enormous variety of things, such as understanding the evolutionary history of species, identifying cancer-causing mutations in DNA from tumors, or discovering who your great-grandmother cheated on her husband with. It's maybe reasonable to say that all of bioinformatics is real-world use-cases of diff algorithms. They have to handle difficult pathological cases like transposition and long repeated sequences.
Video compression like H.264 works to a great extent by "motion estimation", where the compressor figures out which part of the previous frame is most similar to the current one and only encodes the differences. This is a more difficult generalization of the one-dimensional source-code-diff problem because while the pixel values are moving across the image they also can get brighter, dimmer, blurrier, or sharper, or rotate. Also, it's in two dimensions. This is pretty similar to the curses problem in a way: you want to minimize the bandwidth required to update a screen image by sending deltas from the image currently on the screen. Xpra actually works this way.
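A toy sketch of block-matching motion estimation (real encoders like x264 use much smarter search orders plus sub-pixel refinement): for each block of the current frame, find the offset into the previous frame that minimizes the sum of absolute differences:

```python
import numpy as np

def best_motion_vector(prev, curr, y, x, size=8, radius=4):
    """Exhaustive block matching: find the (dy, dx) offset at which a
    `size`x`size` block of `prev` best predicts the block of `curr`
    at (y, x), by minimal sum of absolute differences (SAD)."""
    block = curr[y:y + size, x:x + size].astype(int)
    best, best_sad = (0, 0), None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            py, px = y + dy, x + dx
            if py < 0 or px < 0 or py + size > prev.shape[0] or px + size > prev.shape[1]:
                continue
            sad = np.abs(block - prev[py:py + size, px:px + size].astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best, best_sad = (dy, dx), sad
    return best  # the encoder stores this vector plus the small residual
```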
.PDF data sheets from asshat chip companies that don't include detailed changelogs.