davidbau.com Innovation, Copying, and Eukaryotes

November 23, 2006

Innovation, Copying, and Eukaryotes

Is it bad to copy-and-paste code?

When you are just beginning to learn to program, you discover that it is not easy to start from scratch. The easier way to program is to copy somebody else's program and then modify it.

But as you get further along, you quickly learn that the copy-and-paste technique is verboten in programming...

Copies And Cognitive Dissonance

Any programmer will tell you why copy-and-paste is a bad thing.

The standard party line: "Don't copy code. Reuse the old code." - "Reused code is smaller, cleaner, and more efficient." - "Copied code is unmaintainable." - "If there is a bug in copied code, you will have to track down all the copies to fix it." - "Abstract your concepts, encapsulate your implementations, and reuse your libraries." Good heavens, don't copy your code.

Copied code is harder to understand than reused code. We humans think in terms of canonical platonic models. For example, once we know how to sort data, we would like to have a single concept for "sorted data," which deals with nulls and infinities and ties and edge cases in the same way everywhere. If there were a dozen different ways to sort things, it would drive us bananas.
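To make the "single concept" idea concrete, here is a minimal sketch of one authoritative sort that every caller shares. The ordering rule (None first, then numbers ascending, ties left in their original order) and the names canonical_key and canonical_sort are assumptions invented for this illustration, not a standard.

```python
def canonical_key(value):
    """Hypothetical house rule: None sorts before every number."""
    if value is None:
        return (0, 0.0)
    return (1, value)


def canonical_sort(values):
    # sorted() is stable, so tied items keep their original relative order.
    return sorted(values, key=canonical_key)


data = [3.0, None, float("inf"), float("-inf"), 1.0, None]
print(canonical_sort(data))
```

If a dozen callers each wrote their own key function, each could place None or the infinities somewhere different, which is exactly the drive-us-bananas situation.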

So we train programmers to go back to the original code and modify it and generalize it when we need it to do something new. We add features, add parameters, wrap things up as reusable functions, classes, templates, or languages. We prefer to build towers of abstractions that allow us to reuse and improve a single authoritative copy of the code. In programming, standardization rules supreme: we abhor the idea of duplicating any parts of a program.

And this conventional wisdom works well for small, low-level pieces of code.

But once you start working with larger pieces of code, I'm not so sure that the argument against copying is actually so smart. The bias against copies comes from the way our brains work, but the demands of the real world are not always so clean and logical.

Reuse and Risk

Our tendency to favor code reuse over copy-and-paste has a real disadvantage that we rarely give credence to. Code reuse hampers innovation, because code reuse makes it more expensive and riskier to try anything new.

For example, suppose you are making a cool new database at Microsoft and you want something that is just like a classical MS-DOS filesystem but slightly different. Maybe it deals with unlinking deleted files differently, and maybe that little change is the key trick for improving reliability of your easy-to-use database.

If you went and modified the actual filesystem that the operating system uses, there is a huge set of things that depend on that code, and you'd probably cause lots of programs that run on Windows to fall down, if not due to outright crashes, then certainly due to subtle bugs. While improving your database, you might break your web browser. You might never get the Norton antivirus program to ever work again. The whole "modify the one and only filesystem" approach would never get very far.

But why did you ever think the new filesystem was a cool idea in the first place? Because in the lab, you made a private copy of the filesystem code that you tweaked and only used for certain things. You broke certain assumptions, but in your own little copy, your database ran swimmingly well. Your private filesystem was better than the standard one - for the limited set of things that you wanted to do. It might work great in addition to the plain old filesystem.
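A toy sketch of what that private copy might look like, shrunk from a hundred thousand lines to a dozen. Everything here is hypothetical: the class names, the in-memory dictionary standing in for a disk, and the "keep deleted contents around" tweak are assumptions made up to illustrate the idea, not how any real filesystem works.

```python
class SharedFileSystem:
    """Toy stand-in for the one filesystem that every program depends on."""

    def __init__(self):
        self.files = {}

    def write(self, name, data):
        self.files[name] = data

    def exists(self, name):
        return name in self.files

    def delete(self, name):
        # Standard behavior: the entry is gone immediately.
        del self.files[name]


class DatabaseFileSystem:
    """A private copy of the class above with one assumption broken."""

    def __init__(self):
        self.files = {}
        self.tombstones = {}

    def write(self, name, data):
        self.files[name] = data

    def exists(self, name):
        return name in self.files

    def delete(self, name):
        # Changed behavior: keep the deleted contents so they can be recovered.
        self.tombstones[name] = self.files.pop(name)

    def undelete(self, name):
        self.files[name] = self.tombstones.pop(name)


db_fs = DatabaseFileSystem()
db_fs.write("table.dat", b"rows")
db_fs.delete("table.dat")
db_fs.undelete("table.dat")
print(db_fs.exists("table.dat"))
```

Nothing that relies on SharedFileSystem can be broken by the experiment, because the experiment never touches it. The price is the duplicated write and exists methods, which is the part that makes senior engineers scream.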

What would happen if you showed your team that you had done this by copying a hundred thousand lines of filesystem code and making a few changes here and there? Every senior engineer would be screaming at you to merge your new idea into the old one. Copying 100,000 lines of code is evil. It is inefficient, unmaintainable, and a huge cognitive dissonance. No matter that it might be impossible to actually incorporate your clever modification into the original system without killing it - the dislike of making a new copy runs far too deep for it to be considered practical. Better to generalize the old code than to copy it, even if it takes another hundred thousand lines of code to generalize.

And so you would never ship your product.

Copying is sometimes the best way to make a risky bet. But our bias against code copying means that it tends to be much harder to take risks.

And maybe that is a mistake.

Biological Innovation

Copy or Modify?

What made me think about this Innovator's Dilemma is that the same choice is faced by evolutionary systems in biology, and an interesting paper on the topic appeared in Nature today.

When it comes to copying, there are two different styles of genomes you find in biology:

On one side, we have some species that have genomes that are ruthlessly efficient, with only a single canonical copy of a gene for any given function. When you crack open this kind of genome, you are quickly struck by the elegance of the design - the DNA lays itself out almost like source code. In this sort of genome, every gene serves its essential and unique purpose, and you can even see at times that the genes seem to organize themselves into functional subdirectories of related code. Look at the lambda phage genome and you will realize that a master programmer wouldn't have done a better job than natural selection has somehow managed to do.
On the other side, we have some species that have flamboyantly messy genomes, with lots of disorganized, dead, and commented-out code, and many extra copies of almost-the-same code doing the same thing over and over. Reading these genomes is like reading an extremely bad book, with the same point being made repeatedly, with lots of the same paragraphs and sentences copied five or ten times over. A programmer who checked in code like this would be fired on the spot; it comes as no surprise that the code was generated by a random process.

Why are some genomes so cleanly efficient and others so sloppy? (Maybe intelligent design people will suggest that God must have been involved in the design of one type and not the other! - just joking...)

The answer has to do with recombination and meiosis. The tiny, boring single-celled prokaryotes and viruses are the creatures that have the efficient well-designed genomes. Bacteria avoid the enormous mess and cost of sexual reproduction; they duplicate their genomes with few errors, and they very rarely introduce a spurious copy of a gene. The result is a finely tuned, but simple program: the result of billions of years of continuous spot-bugfixing.

The beautiful, diverse, and innovative eukaryotes - all the trees and mammals and yeasts, with feathers and brains and chlorophyll and all sorts of amazing inventions - these are the creatures that have sloppy genomes. Somehow, the process of mixing up the genetic code of two parents results in a far greater degree of innovation than just spot-changing one line of code at a time. And it is apparent, when reading the DNA, that mixing up genes involves lots of unnecessary code duplication.

Copy-and-paste is messy and hard to understand. But - at least in biology - it seems that the occasional copy-and-paste is essential to large-scale innovation. You don't get flight, immune systems, hemoglobin, or human self-awareness with clean, minimal code. The most innovative designs come from the sloppiest code.

Measuring Prevalence of Copying

Biologists have long noticed that many of the genes in eukaryotic genomes are approximate copies of each other. They have long been aware that over the eons, copy-and-paste has sometimes resulted in evolutionary innovation.

But what has been mapped for the first time, in work published in today's Nature, is the amount of copying that is going on in our genes. And the result is a surprise: copying is no side-show, no rare event. It is stunningly common.

Previous analysis of single nucleotide polymorphisms - the one-at-a-time base pair mutations that we all learned about in biology class - suggested that about 0.1% of our base pairs vary between individuals in our species, and that even when compared to other species like chimpanzees, we might only be different by 1%. How can such miracles of evolution as oak trees and human beings emerge from such a low level of change through point mutations like this? "On enormous time scales," went the party line in biology, "small changes add up." It is like imagining bugfixing your way continuously from the ENIAC to OS X - maybe it would be possible over billions of years.

But today's report on copy-number variations has strikingly different numbers. There is so much copying going on in the human genome, and so much variation in how much copying appears within any individual, that about 12% of the human genome is subject to active copy-and-paste variations. In other words, I have different numbers of copies of many thousands of genes than you have in your DNA. More than one out of ten of the lines of our genetic code is actively in the process of being copied-and-pasted a different number of times, and natural selection is weeding through an enormous amount of variation in the presence of and absence of these copies - within us.

These dramatic numbers suggest that the spot-bugfixing model may not actually be the main way that evolution works at all. Copy-and-paste is not just a curious side-technique in genetic evolution. Copy-and-paste appears to be the main way genetic variation in eukaryotes works. Copy-and-paste is emerging as the very engine of genetic innovation.

In Defense of Copying

And so, as programmers, when somebody comes to you with a huge piece of copied-and-pasted code, I would make this suggestion.

Think about the difference between prokaryotes and eukaryotes, and hesitate for a moment before demanding that they merge their code into the original. Ask yourself whether messy, unmaintainable, hard-to-understand proliferation of copies might actually be the key to innovation.

Copying may be a very good thing.

Posted by David at November 23, 2006 06:39 AM

Comments

Reuse is good when the user/customer expects that if you change behaviour in one part of the system, the same behaviour always changes in the other part.

Copies don't exhibit this.

Posted by: RichB at November 23, 2006 08:45 AM

Great writing. Thanks for providing me with some things to think about...

Genetically, I think you're wrong. But hey, I'm in the Dawkins camp, so I would. But in programming terms, I think you're right - copying is just fine. It's all about how you manage your software development, and if you can't manage copies then you're probably not doing it properly anyway...

An expansion of these thoughts is available over here: http://www.not-so-rapid.com/philipstorry/dxblog/not-so-rapid.nsf/dx/23112006144318MDOKA5.htm

Posted by: Philip Storry at November 23, 2006 09:47 AM

Reuse a lot. As much as possible, *but no more*.

That's my motto. When you need to make something different from what the reuse would allow, do write new code. However, take a moment to see if some part of two implementations could be generalized and extracted into a single place.

The DNA analogy as every other analogy out there is not bullet-proof. DNA is protected against possible copy-paste mistakes both by chemical means (errors have not that big chance of being actually executed) and by the natural selection (killing the faulty creature is cheap for the evolution).

David, if you are talking about copying well-tested code (with a good set of unit tests run daily), in a situation where faulty code is not too big a failure, then yes, I would agree with your point - a reasonable amount of copy-pasting such code could improve the speed of innovation. Is this how code is handled at Google? :)

Posted by: Artem Marchenko at November 23, 2006 12:04 PM

This idea of copying lots of code on purpose is not a "best practice" or something done consciously anywhere I know; it is just my idle thinking over Thanksgiving about what makes innovation tick. By reusing code, are we programmers all missing something fundamental about innovation?

You mention in DNA, "errors have not that big chance of being actually executed."

Certainly, that was my own mental model of evolution until yesterday. But what struck me about the Nature article is how common DNA copy-and-paste is when measured! If you think about variation within the population as sort of the "step size" that natural selection can make in a single generation, it is amazing that 12% of our genome can be copy-and-pasted within a single generation. Evolution is far more rapid than I previously imagined.

Twelve percent variation must be pushing the limits of DNA changes that even result in a viable creature. While we humans might say "reuse as much as possible," biology seems to be trying to take the opposite tack: to me, 12% says, "copy as much as possible."

Does biology work to reduce errors? In eukaryotes, it is the opposite. Keep in mind that the whole point of meiosis is to increase the probability of new permutations. While point-mutations have a low probability, code-swapping is basically a sure thing on every generation. Eukaryotes reduce the probability of perfect cloning to near zero.

And the amazing thing about the recent work is the extent to which eukaryotic variation isn't even Mendelian.

If you do back-of-the-envelope calculations with reasonable guesstimates of recombination together with the recent measurements of copy number variations, it suddenly seems obvious that eukaryotes must also use recombination to copy-and-paste code wildly on every generation. Copy-and-paste "errors" are not low probability events. The probability that you do _not_ have a novel copy number somewhere in your own genes - a copy number that neither of your parents have - must be vanishingly small. You contain copy-and-pasted code that you didn't inherit from anybody, that possibly no human has ever tried duplicating before.
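For what it is worth, the arithmetic behind that claim can be sketched in a few lines. The region counts and per-region rates below are placeholders I made up to show the shape of the calculation; they are not figures from the Nature paper, and the sketch assumes each region changes independently of the others, which real genomes need not obey.

```python
def prob_no_novel_copy_number(regions, per_region_rate):
    """P(no new copy-number change in any region), assuming independence."""
    return (1.0 - per_region_rate) ** regions


# Hypothetical inputs, chosen only to illustrate the arithmetic.
scenarios = [(1000, 0.001), (1000, 0.01), (10000, 0.001)]
for regions, rate in scenarios:
    p = prob_no_novel_copy_number(regions, rate)
    print(f"regions={regions} rate={rate} P(no novel copy number)={p:.6f}")
```

The point is only that the probability of a perfectly faithful copy shrinks geometrically as either the number of variable regions or the per-region rate goes up.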

We are all editing mistakes.

Posted by: David at November 24, 2006 05:41 AM

Philip, in your blog post, you refer to space shuttle developers' practice of searching code to find similar bugs whenever a bugfix is made.

I think that when this sort of uniform bugfixing is what we want, we should probably be using code reuse instead of code copying. When we feel the need to grep the code for a certain kind of bug, then it would probably be best to find a way to express that idiom in a reusable-code way: abstract the subtlety as a function or class, or fix the language or make some tools so it's easier to do it right.

But biology doesn't "grep for similar bugs" after it copies code. If it did, I bet that the whole code-copying strategy would not result in any evolutionary innovation.

The idea and question behind the biological analogy is this: when should we let two alternative versions of a piece of code coexist and try to flourish side-by-side? When is it okay to copy a piece of code within a program and let the two copies take divergent evolutionary paths?

Is it ever okay?

Posted by: David at November 24, 2006 06:33 AM

I can think of one major class of "copy lots of code on purpose" events: code forks in open source projects.

(Open source in particular because of the transparency it offers for analysis: you're more likely to be able to see the arguments/rationale leading into the fork decision and the resulting consequences, as compared to closed projects)

It's funny how a frequently cited benefit of open source is in fact this very ability to fork: to copy the code and evolve down a divergent path... and how licenses that constrain the ability to do this (say, copyleft or other reciprocity clauses) impact decisions to go down that route (in addition to all the other factors that are at play)

BTW: Hi Dave -- long time no chat! :)

Posted by: Ken Tam at November 24, 2006 06:09 PM

Hi Ken! Good to hear from you! What are you up to these days?

Other than short-lived "released forks" it is interesting how divergent code forks seem to be looked down upon, and how rare successful code forks seem to be. Usually there is a golden cvs repository that is the single one where all the innovation happens.

Biology makes divergent copies much more often and much more aggressively.

Posted by: David at November 26, 2006 03:25 PM

David,

I'd agree that code re-use is useful in reducing bugs, but the same kinds of logic or bad maths can be used anywhere, so grep is still a useful tool in the bug-basher's armoury.

You're right that biology doesn't keep the improvements of one copy up to date with the improvements of another. That's possibly where computers copying code could be superior.

(I'm assuming that we use versioning to get around situations where old code is superior, by testing and regression analysis. Biology lacks those tools to a certain degree, so copying has benefits in biology that don't directly translate to computer science.)

That having been said, I think you're right that code copying begets innovation. In nature, innovation comes from random mutation giving an edge. In code copying, the programmer becomes the random mutation - except that the programmer isn't random. (We hope.)

I don't see code copying as bad if we're looking for innovation, and think you're right there. I just think that some people will raise spurious arguments against it, mostly relating to bug fixing etc., which have no real grounds when you look at an example like the Shuttle software development process.

But then, as I said, that process has a heck of a risk attached to failure... Not all software has to deal with that!

Posted by: Philip Storry at November 27, 2006 03:06 PM

"We are all editing mistakes."

David, that's the best demotivator I've ever read. :-)

As someone new to programming, I find it sort of difficult to understand the difference between code reuse and copy/paste. (I don't spend too much time reading about developing/coding outside of what's needed immediately.) I'm assuming that's because you're talking about code in modules because generally I'm writing PL/SQL and usually I have to start from scratch and there aren't that many existing modules I can call up. What starts out as a code example from the SQL Cookbook looks drastically different by the time I've tuned it up for performance.

Though I totally understand what you mean about watching code morph and the analogy to DNA. You know, there's a great need for people to work in bioinformatics...

Posted by: mapgirl at November 29, 2006 10:01 AM

That's right. Code in modules that you call over and over again is "code reuse", and code that you get from a book and tweak and change yourself is "code copying."

On bioinformatics - it is amazing how there is sort of a "Moore's law of genes" happening in biology, which is allowing small groups of people to analyze vastly larger amounts of data every year. The relatively small team that assembled the copy-number-variant map is just one example.

Posted by: David at December 1, 2006 06:15 AM