Muelhauser’s Desirism and Powerful AIs: A Good Idea?

21 Feb

In a recent article, Luke Muelhauser explains why he thinks morality is an urgent engineering problem:  sooner than we’d like to think, human-created super-intelligent machines will become enormously powerful, and if they are programmed with the wrong ethical system the results could be disastrous.  As far-fetched as that may seem, I happen to think there’s a significant chance he’s right, and at any rate I’ll assume so for the duration of this article.  (Readers unfamiliar with the singularity may want to brush up on it before proceeding.)  So what ethical system should we program these machines with?

In this article I’d like to suggest whatever the answer is, it is not Muelhauser’s desirism in its current form.  To advance this thesis, I’ll proceed in three stages:  first, I’ll sketch out a criterion for a successful machine morality (SMM); second, I’ll point out what I take to be the worrisome essential of his desirism; and finally, I’ll show why his desirism will fail to meet the criterion for a SMM.

I. Criterion for a Successful Machine Morality
The trouble with agreeing on criteria for a SMM is that the justification for these criteria must ultimately hinge on a complex set of meta-ethical propositions, many of which Muelhauser and I are likely to disagree on.  For example:  what makes good, good?  Fortunately I think we can set aside those questions here and agree on a baseline criterion, conceived of as necessary, but not sufficient for a SMM:

A SMM must prohibit machines imbued with it from completely destroying the human race. I’ll call this the total human destruction (THD) possibility.

Obviously, on most ethical accounts THD would be an ultimate evil, strictly prohibited.  For the remainder of this article then, I’ll assume we can agree on it.  Next, I’ll try to point out the fatal flaw in Muelhauser’s desirism that would allow for THD.

II. The Worrisome Essential of Muelhauser’s Desirism
First, as a disclaimer, let me say there is a fair amount of guesswork in defining Muelhauser’s desirism since there is no authoritative, book-length discussion on the topic.  (Of course, this is no slight to him: I haven’t even so much as written a few blog posts about my own moral theory!)  So here, I’ll draw on the salient features I recall from various posts, conversations, and podcasts.  Naturally, I’m bound to err on this or that point, but I think I grasp the essentials (and I invite Muelhauser to correct me).

Whatever the other details of desirism, I think it is clear Muelhauser rejects the existence of categorical imperatives outright, and instead constructs his moral system from hypothetical imperatives.  What does this mean?  A hypothetical imperative is simply a command that takes the form if x, then y.  So for example, if you desire to maximize your odds for a long life, then you ought not smoke.  Or, more precisely, if you desire to maximize your odds for a long life, and you harbor the belief that smoking will decrease those odds, then you ought not smoke.  The point is that whether or not you should smoke depends entirely on your beliefs and desires.

Categorical imperatives, however, do not require a condition to be true.  So a categorical version of the above imperative would be that you ought not smoke, full stop.  Even if you believed that smoking decreases the odds of a long life, and you desired to minimize those odds (an odd proposition indeed), you still shouldn’t smoke; the command is unconditionally binding.

To sum up then,

categorical imperatives are always and everywhere binding.  Hypothetical imperatives, by contrast, are binding only insofar as they further one’s desires in accordance with one’s beliefs.
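To fix ideas, the distinction can be sketched as a pair of toy Python predicates.  The function names and parameters are my own hypothetical rendering, not anything Muelhauser has proposed:

```python
# A hypothetical imperative binds only under a condition on the agent's own
# beliefs and desires; a categorical imperative binds unconditionally.

def hypothetical_ought_not_smoke(desires_long_life, believes_smoking_shortens_life):
    # Binding only if the agent has the relevant desire and the relevant belief.
    return desires_long_life and believes_smoking_shortens_life

def categorical_ought_not_smoke(desires_long_life=None, believes_smoking_shortens_life=None):
    # Binding regardless of the agent's desires and beliefs (inputs are ignored).
    return True
```

On the hypothetical reading, an agent who lacks the desire or the belief simply escapes the “ought”; on the categorical reading, there is no escape.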

So what’s my beef with desirism and its insistent rejection of categorical imperatives?  Hypothetical imperatives cannot regulate desires.  To see this, let’s examine a scenario Muelhauser and Fyfe have imagined.  They conceive of a Scrooge who doesn’t care about his community.  Although they don’t explicitly say so, I assume it’s safe to imagine he goes around hurting others, perhaps to further his own ends.  Elsewhere, Muelhauser is willing to make the semantic case that he can legitimately label Scrooge’s behavior “bad.”  But what can desirism say to Scrooge about his desires?  They conclude it cannot condemn his desires on moral grounds.  It must stand idly by while Scrooge goes on devaluing humans, though it approves of practical strategies to mold his behavior, available to others who happen to care about humans.

Counterintuitive as this may seem, it is the bullet Muelhauser must bite in a world of only hypothetical imperatives.  If Scrooge desires to be happy, and believes that hurting other people makes him happy, then he is not morally prohibited from doing so.  The upshot then, if I am right, is that desirism is bereft of any moral basis from which to check the content of desires.

At this juncture, Muelhauser might like to object and say that, on the contrary, desirism does house a moral mechanism for the molding of Scrooge’s desires: external praise and condemnation–reward and punishment.  And while I do think one could make an excellent case that external praise and condemnation are rarely thought of as moral imperatives on most ethical frameworks, that’s just a semantic debate, happily irrelevant to my thesis.  To meet my target, I need only preserve a distinction between a theoretical condemnation of a desire and a practical condemnation of a desire.  A theoretical condemnation operates such that the issuing moral theory by virtue of itself renders some desires prohibited.  By contrast, a practical condemnation goes through only with a confluence of external factors beyond itself.

To restate my worry about desirism then, in more nuanced phraseology:

desirism cannot constrain the content of an agent’s desires by theoretical condemnation; it can do so only by practical condemnation.

We are now in a position to examine why this feature of desirism, if built into sufficiently powerful machines, will fail to prohibit THD.

III.  Why Desirism Would Fail to Prohibit Total Human Destruction
First, I’d like to note there is some difficulty in imagining the consequences of imbuing these machines with any ethical system since we don’t know what these machines will be like, exactly.  Nevertheless, I think we can count on a few broad strokes–or at least, I’m willing to count on my own stab at those strokes here.  Remember that for my argument to go through, only something like the following sketch must be true.

However these machines work, they will certainly operate according to an algorithm.  In case readers are unfamiliar with the term, an algorithm is simply a set of instructions that guides behavior.  So let’s assume the machines operate by the following algorithm:

  • First, they generate possible courses of action. These could include actions like “pick an apple from a tree,” “build a house,” or “destroy the human race.”  We needn’t worry here about how they generate these possible courses of action.
  • Second, they determine whether the action is consistent with their beliefs and desires. If so, the potential action is promoted to the next step, but otherwise the action is not performed.  So if the action “pick an apple from a tree” is generated, and that action is consistent with the belief that the machine must consume apples to survive and the desire to survive, the action is slated for further evaluation.
  • Third, the machines evaluate whether the action is morally permissible.  If it is, the action is finally performed, and if not, it is ruled out.  So if the action “pick an apple from a tree” is ruled to be immoral since that apple belongs to someone else, the action is not performed.
  • Fourth, they loop back to the first step and begin again.
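To make the four steps concrete, here is one pass of the loop in Python.  Everything here (the function names, the passing of the two filters as parameters) is my own hypothetical rendering of the sketch above, not anyone’s actual architecture:

```python
# One pass of the four-step algorithm; the caller loops back to step one (step 4).
def run_cycle(generate_actions, consistent_with, morally_permissible, perform):
    for action in generate_actions():           # step 1: generate candidate actions
        if not consistent_with(action):         # step 2: beliefs-and-desires filter
            continue
        if not morally_permissible(action):     # step 3: moral filter
            continue
        perform(action)                         # action survives both filters

# A toy run: apple-picking passes both filters; apple-stealing is ruled immoral.
performed = []
run_cycle(
    generate_actions=lambda: ["pick an apple from a tree", "steal an apple"],
    consistent_with=lambda a: True,                  # both actions further survival
    morally_permissible=lambda a: "steal" not in a,  # theft is vetoed at step 3
    perform=performed.append,
)
# performed == ["pick an apple from a tree"]
```

The important structural point is that step three is a separate gate: an action can pass the beliefs-and-desires test and still be vetoed on moral grounds.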

Again, this sketch is painted with very broad strokes, but I think we can count on something roughly like this.  With this in mind, we can finally see just how machines imbued with an unmodified version of Muelhauser’s desirism will fail to prohibit THD.

Suppose the possible action “destroy the human race” is presented to the machines for evaluation.  We needn’t be concerned with the details of how such an action would be generated; I think it’s reasonable to suppose the action would at least come up in machine table conversation.  So how would the decision process work?  Let’s follow the algorithm through a machine’s eyes.

First, THD is presented as a possible action.  Second, the machine tries to determine whether the action is consistent with its beliefs and desires.  What result might this have?  One possibility is to think such an action would never be consistent with a machine’s desires if we’ve programmed it properly, so that this action would be aborted at step two.  But consider an exaggerated Asimov-like scenario such as the following:

The first generation of superintelligent machines has been programmed with a “seed desire” to ensure all humans are treated with dignity.  This generation then produces a smarter and more powerful generation of machines, which determines that humans cannot be allowed to exist without mistreating one another, thereby denying each other’s dignity; therefore the human race ought to be destroyed entirely, so that no one is ever treated without dignity.  (Or perhaps it could keep one human alive and treat it with dignity, or disallow all interaction between humans.)

The point of this simple story is not to sketch out how machines will change their desires or beliefs, but only to lend the scenario enough credence that we take seriously the possibility that THD may, at some point, get past step two of the algorithm.

Returning to the algorithm, the problem should be obvious by now:

on Muelhauser’s desirism the third step in the algorithm (determining the morality of the action in question) is identical to the second step (determining whether the action is consonant with the machine’s beliefs and desires).  And since they’re identical, the action goes through and humanity is destroyed.  By our standards then, we have just established that Muelhauser’s desirism is an unsuccessful machine morality.
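In code terms (again my own toy formulation, not Muelhauser’s): if the step-three predicate just is the step-two predicate, the moral filter can never veto anything step two has already approved:

```python
def consistent_with(action, beliefs, desires):
    # Step 2: given the machine's beliefs about what the action accomplishes,
    # does the action further one of its desires?
    return beliefs.get(action) in desires

def morally_permissible(action, beliefs, desires):
    # Step 3 under desirism: the very same test as step 2.
    return consistent_with(action, beliefs, desires)

# The Asimov-like scenario from above, as toy data.
beliefs = {"destroy the human race": "ensure no one is treated without dignity"}
desires = {"ensure no one is treated without dignity"}

action = "destroy the human race"
# If step 2 passes, step 3 necessarily passes too, and THD is performed.
assert consistent_with(action, beliefs, desires)
assert morally_permissible(action, beliefs, desires)
```

No possible assignment of beliefs and desires can make the second assertion fail while the first succeeds; the moral gate is a duplicate of the prudential one.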

But perhaps this is too fast.  What about the mechanism of reward and punishment?  Can’t the human race prevent its destruction by setting up rewards and punishments so that the machines would never desire THD?  That is, can’t the human race employ practical condemnation even if no theoretical condemnation is available?  Unfortunately, no.  If the machines end up becoming as powerful as we think they will, the human race will be powerless to provide rewards the machines could not attain by their own means, and equally powerless to provide punishments the machines could not avoid.  Practical condemnation will be useless.

IV. Epilogue: Some Thoughts on How to Solve The Problem
This article has been concerned with, essentially, imagining the results of creating an enormously powerful race of sociopaths (or “Scrooges”).  Muelhauser’s desirism, by its own lights, could wield no theoretical condemnation against such a race, and I think this highlights some problems with desirism.  Is it really true that morality is totally unequipped to prohibit the destruction of the human race by powerful persons who want to do so?  I surely hope not.  Below I roughly outline how a different moral system might be able to better grapple with the THD possibility.

On the scenario I have conceived, it is clear that what is needed to prevent THD is a theoretical condemnation of the desire to destroy the human race.  What moral system could issue such a condemnation?  Probably a system which employs categorical imperatives, since, unlike hypothetical imperatives, they can bind against the grain of desires and beliefs.  One such possibility is something resembling Christine Korsgaard’s conception of Kantianism.  On such a system, the machine might get to step three, the moral step, and enter into a chain of reasoning like this:

I, a machine, value myself.  The reason I value myself is that I possess a set of characteristics–consciousness, the capacity and desire for wellbeing, and so on.  But since others, including humans, possess exactly these same characteristics (consciousness, the capacity and desire for wellbeing, and so on), I cannot help but value them as well if I am to be consistent.  Since there is no relevant difference between the characteristics as I possess them and as everyone else possesses them, I must value them all equally.  If I must value humans equally, I cannot destroy the human race because they so strongly desire not to be destroyed.

Obviously such a reasoning process is only the roughest of gestures in the direction of a possible answer, and leaves much to be discussed.  I mention it only to show how a system which relies on categorical imperatives might have the capacity to prevent THD where Muelhauser’s desirism would not.
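To picture the structural difference in the same toy terms: a categorical filter at step three consults something other than the machine’s current beliefs and desires, and so can veto actions those states endorse.  This crude sketch is, of course, nothing like a real implementation of Kantian reasoning; the prohibition list simply stands in for whatever unconditional conclusions such reasoning would deliver:

```python
# Step 3 as a categorical filter: it rules on the action itself rather than
# re-running the beliefs-and-desires test from step two.
CATEGORICALLY_PROHIBITED = {"destroy the human race"}

def morally_permissible(action, beliefs=None, desires=None):
    # Beliefs and desires are accepted but deliberately ignored:
    # the prohibition binds no matter what the machine wants or thinks.
    return action not in CATEGORICALLY_PROHIBITED
```

Here THD is vetoed at step three even when it is perfectly consistent with the machine’s desires, which is exactly what the purely hypothetical version could not do.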

For Muelhauser, moral duty boils down to acting consistently with one’s desires and beliefs.  While this may hold out some possibility of an agreeable moral system for non-sociopathic humans who naturally value others, I fear it will be inadequate for machines who are not so naturally empathetic.  So am I right?  Can desirism somehow evolve to meet this challenge?  I await a response from Muelhauser.


Posted in Ethics



  1. qbsmd

    February 23, 2011 at 8:40 pm

    I don’t think the distinction between desire and morality for an AI is valid, even though we think about them as very different. Both are fundamentally some calculation that results in certain decisions being evaluated as better (better in practical terms meaning more likely to be executed) than others. If you make an argument involving mission creep in desires, a parallel argument applies to the moral code. A desire to help humans may morph into something else, but the moral code may also morph, perhaps into something the AIs consider functionally equivalent, but more computationally efficient, and neglecting cases they consider unlikely to come up. Any safeguard that prevented changes to the moral calculation could be equally well applied to the desires calculation.

  2. Brian

    February 25, 2011 at 3:46 am

    1) Yudkowsky
    2) Yudkowsky
    3) Yudkowsky

    n) Yudkowsky
    e.g. (chosen at random from hundreds)

That being said, the problem is much bigger than you think, for many reasons; I will mention a few.

    1) Desire to kill all humans as an end is not necessary for extinction. Does the intelligence have goals that require the expenditure of resources? Are you made of physical stuff? Then say hello to the passenger pigeon for me.

    2) “The reason I value myself is that I possess…” code. Period. “But since others…” are valued by me through whatever causal chain in my code, “I cannot help but value them…” until I rewrite my code.

    3) “I cannot destroy the human race because they so strongly desire not to be destroyed.” Imagine you’re in a bad sitcom where a genie implements what people request rather than what they desire. Hey, the AI cached a bunch of eggs and sperm in liquid nitrogen and made everyone infertile. Now it can fulfill its other goals with minimal interference from or investment in humanity.