Obfuscated code

From Wikipedia, the free encyclopedia

Jump to: navigation, search

Obfuscated code is source code in a computer programming language that has been made difficult to understand. Programmers may deliberately obfuscate code to conceal its purpose, to deter reverse engineering, or as a puzzle or recreational challenge for readers. Programs known as obfuscators transform human-readable code into obfuscated code using various techniques.

Contents

[edit] Explanation

Some languages may be more prone to obfuscation than others.[1][2] C,[3] C++,[4] and Perl[5] are most often cited as easy to obfuscate. Macro preprocessors are often used to create hard-to-read code by masking the standard language syntax and grammar from the main body of code. The term shrouded code has also been used.

Obfuscating code to prevent reverse engineering is typically done to manage risks that stem from unauthorized access to source code. These risks include loss of intellectual property, ease of probing for application vulnerabilities and loss of revenue that can result when applications are reverse engineered, modified to circumvent metering or usage control and then recompiled. Obfuscating code is, therefore, also a compensating control to manage these risks. The risk is greater in computing environments such as Java and Microsoft's .NET which take advantage of just-in-time compilation technology that allow developers to deploy an application as intermediate code rather than code which has been compiled into machine language before being deployed.

Obfuscators may be used to compact object code or interpreted code without affecting its behaviour when size is important. Common cases include MIDlets – in which the meaningful identifiers embedded in the Java class files are replaced with shorter ones – and Javascript code on the web.

Programs written in languages such as C or Pascal are compiled into the machine language of a given computer before they are run on it. This conversion is necessary because programmers write source code while computers run machine code. Generally, there is a one way transform from source code to machine code. Machine code is not encrypted and is easy for anyone to see, but the format is so tedious for humans that reverse engineering is slow and difficult. That being said, compilers tend to map source code to machine code in a predictable way, and although optimization obfuscates this, it is often possible to reconstruct a fair copy of the original source code.

Java and .NET languages (e.g., Oxygene, C#, Visual Basic) take a different approach to compilation. They are far easier to reverse engineer because they do not compile to machine code, they compile into intermediate code.

Microsoft recommends using the Script Encoder to obfuscate the ASP files, so in case the web server is compromised, the cracker would be unable to find out how your ASP applications work. The Script Encoder works also on JScript and VBScript files. Note that the encoded JScript is only functional in Internet Explorer. However, the documentation states, "Note that this encoding only prevents casual viewing of your code; it will not prevent the determined hacker from seeing what you've done and how."

[edit] Application hardening

Obfuscation is one technique used in a process called "application hardening", which also includes such techniques as tamper detection and response, application encryption, and custom virtual machines.

[edit] Recreational obfuscation

Writing and reading obfuscated code can be a brain teaser for programmers. A number of programming contests reward the most creatively obfuscated code: the International Obfuscated C Code Contest, Obfuscated Perl Contest, International Obfuscated Ruby Code Contest, and Obfuscated PostScript Contest.

Types of obfuscations include simple keyword substitution, use or non-use of whitespace to create artistic effects, clever self-generating or heavily compressed programs, and programs that are valid and operate similarly in multiple programming languages.

Short obfuscated Perl programs printing "Just another Perl hacker" or something similar are often found in signatures of Perl programmers. These are informally known as JAPHs, and the origin of this practice is generally credited to Randal Schwartz.

[edit] Examples

This is an example of a winning entry from the International Obfuscated C Code competition written by Ian Phillipps in 1988[6] and subsequently reverse engineered by Thomas Ball[7].

#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1)
  :0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}

Although unintelligible at first glance, it is a legal C program that when compiled and run will generate the 12 verses of The 12 Days of Christmas. It contains all the strings required for the poem in an encoded form inlined in the code. The code iterates through the 12 days displaying what it needs to.

Another example is this short program that generates mazes of arbitrary length:

char*M,A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf(M="%d",&C);
--            E;             J[              E]             =T
[E   ]=  E)   printf("._");  for(;(A-=Z=!Z)  ||  (printf("\n|"
)    ,   A    =              39              ,C             --
)    ;   Z    ||    printf   (M   ))M[Z]=Z[A-(E   =A[J-Z])&&!C
&    A   ==             T[                                  A]
|6<<27<rand()||!C&!Z?J[T[E]=T[A]]=E,J[T[A]=A-Z]=A,"_.":" |"];}

Modern C compilers don't allow constant strings to be overwritten, which can be avoided by changing the first line to

char M[2],A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf("%d",&C);

Versions of gcc (the GNU Compiler for C) prior to 4.0 can compile the program in its original form if the -fwritable-strings flag is used.

This example program in Java computes the days left in a year based on the day:

           return (int) ((((x - 2) * (x - 3) * (x - 4) * (x - 5) * (x - 6) *
                           (x - 7) * (x - 8) * (x - 9) * (x - 10) * (x - 11) *
                           (x - 12) * 31) /
                          ((x - 2) * (x - 3) * (x - 4) * (x - 5) * (x - 6) *
                           (x - 7) * (x - 8) * (x - 9) * (x - 10) * (x - 11) *
                           (x - 12) + .00001)) +
                         (((x - 3) * (x - 4) * (x - 5) * (x - 6) * (x - 7) *
                           (x - 8) * (x - 9) * (x - 10) * (x - 11) * (x - 12) *
                           (28 + z)) /
                          ((x - 3) * (x - 4) * (x - 5) * (x - 6) * (x - 7) *
                           (x - 8) * (x - 9) * (x - 10) * (x - 11) * (x - 12) +
                           .00001)) +
                         (((x - 4) * (x - 5) * (x - 6) * (x - 7) * (x - 8) *
                           (x - 9) * (x - 10) * (x - 11) * (x - 12) * 31) /
                          ((x - 4) * (x - 5) * (x - 6) * (x - 7) * (x - 8) *
                           (x - 9) * (x - 10) * (x - 11) * (x - 12) + .00001)) +
                         (((x - 5) * (x - 6) * (x - 7) * (x - 8) * (x - 9) *
                           (x - 10) * (x - 11) * (x - 12) * 30) /
                          ((x - 5) * (x - 6) * (x - 7) * (x - 8) * (x - 9) *
                           (x - 10) * (x - 11) * (x - 12) + .00001)) +
                         (((x - 6) * (x - 7) * (x - 8) * (x - 9) * (x - 10) *
                           (x - 11) * (x - 12) * 31) /
                          ((x - 6) * (x - 7) * (x - 8) * (x - 9) * (x - 10) *
                           (x - 11) * (x - 12
                           ) + .00001)) +
                         (((x - 7) * (x - 8) * (x - 9) * (x - 10) * (x - 11) *
                           (x - 12) * 30) /
                          ((x - 7) * (x - 8) * (x - 9) * (x - 10) * (x - 11) *
                           (x - 12) + .00001)) +
                         (((x - 8) * (x - 9) * (x - 10) * (x - 11) * (x - 12) *
                           31) /
                          ((x - 8) * (x - 9) * (x - 10) * (x - 11) * (x - 12) +
                           .00001)) +
                         (((x - 9) * (x - 10) * (x - 11) * (x - 12) * 31) /
                          ((x - 9) * (x - 10) * (x - 11) * (x - 12) + .00001)) +
                         (((x - 10) * (x - 11) * (x - 12) * 30) /
                          ((x - 10) * (x - 11) * (x - 12) + .00001)) +
                         (((x - 11) * (x - 12) * 31) /
                          ((x - 11) * (x - 12) + .00001)) +
                         (((x - 12) * 30) / ((x - 12) + .00001)) + 31 + .1) -
                   y;
      

An example of a JAPH:

@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print

This slowly displays the text "Just another Perl / Unix hacker", multiple characters at a time, with delays. An explanation can be found here.

Some Python examples can be found in the official Python programming FAQ.

[edit] Step by step

The following section gives an example of a developer's attempt to obfuscate a simple program. It is an example, not a set of rules.

This simple C program prints out the prime numbers less than 100, using the sieve of eratosthenes:

void primes(int cap) {
  int i, j, composite;
  for(i = 2; i < cap; ++i) {
    composite = 0;
    for(j = 2; j * j < i; ++j) 
      composite += !(i % j);
    if(!composite)
      printf("%d\t", i);
  }
}
 
int main() { 
  primes(100);
}

The following sequence of transformations reduces the primes program into a single, obfuscated line of code.

[edit] Rewrite for as while. Use special values.

A for loop can be transformed into a while loop followed by a series of cascading if-else statements (in fact in C, a for loop is a thin disguise for a while loop anyway—the differences are mainly in lexical scope.)

Also we can take advantage of prior knowledge. If we know that j is 0 we can use it instead of the literal 0— thus not revealing that we are using zero, but a variable that could take any value (but happens to be zero). 0 is a common enough constant, but no programmer ever adds zero or multiplies or divides or subtracts it, so it is an easy place to start for the reverse engineer. So, rewrite the primes function like this:

void primes(int cap) { 
  int i, j, composite, t = 0;
  while(t < cap * cap) {
    i = t / cap;
    j = t++ % cap;
    if(i <= 1);
    else if(!j)
      composite = j;
    else if(j == i && !composite)
      printf("%d\t",i);
    else if(j > 1 && j < i)
      composite += !(i % j);  
  }
}
 
int main() {
  primes(100);
}

[edit] Change iteration into recursion

Any loop can be replaced by recursion. Generally one strives to eliminate recursion because of the computational overhead, but it may be deliberately introduced for obfuscation. So, replace the while loop with recursion. Two new parameters are needed by the primes function. Let's also conjoin two statements with the comma operator in the third if block. An additional if-else block must be added to capture the situation in the iterative version where program flow fails all of the if-conditions and the while loop would continue with t incremented by 1:

void primes(int cap, int t, int composite) {
  int i,j;
  i = t / cap;
  j = t % cap;
  if(i <= 1)
    primes(cap,t+1,composite);
  else if(!j)
    primes(cap,t+1,j);
  else if(j == i && !composite)
    (printf("%d\t",i), primes(cap,t+1,composite));
  else if(j > 1 && j < i)
    primes(cap,t+1, composite + !(i % j));
  else if(t < cap * cap)
    primes(cap,t+1,composite);
}
 
int main() {
  primes(100,0,0);
}

[edit] Obfuscate constructs and meaningless variable names

No construct or name is of itself obfuscating, but context—or rather the lack of it—can make it so. Change the variable names to single letters and replace the if-else structure with the ternary ? conditional operator (e.g. if(A) B else if(C) D else E becomes A ? B : C ? D : E). Reorder expressions to the most counterintuitive that the language will allow:

void primes(int m, int t, int c) {
  int i,j;
  i = t / m;
  j = t % m;
  (i <= 1) ? primes(m,t+1,c) : (!j) ? primes(m,t+1,j) : (j == i && !c) ? 
  (printf("%d\t",i), primes(m,t+1,c)) : (j > 1 && j < i) ? 
  primes(m,t+1,c + !(i % j)) : (t < m * m) ? primes(m,t+1,c) : 0;
}
 
int main() {
  primes(100,0,0);
}

[edit] Remove intermediate variables and literals

Remove the intermediate variables i and j and replace them with the expressions they stood for, (t / m) and (t % m):

void primes(int m, int t, int c) {
  ((t / m) <= 1) ? primes(m,t+1,c) : !(t % m) ? primes(m,t+1, t % m) : 
  ((t % m)==(t / m) && !c) ? (printf("%d\t",(t / m)), primes(m,t+1,c)) : 
  ((t % m)> 1 && (t % m) < (t / m)) ? primes(m,t+1,c + !((t / m) % (t % m))) : 
  (t < m * m) ? primes(m,t+1,c) : 0;
}
 
int main() {
  primes(100,0,0); 
}

[edit] Obfuscate names again

Rename the function primes and the variables m, t, and c to _, __, ___, and ____, respectively:

void _(int __, int ___, int ____) {
  ((___ / __) <= 1) ? _(__,___+1,____) : !(___ % __) ? _(__,___+1,___ % __) : 
  ((___ % __)==(___ / __) && !____) ? (printf("%d\t",(___ / __)), 
  _(__,___+1,____)) : ((___ % __) > 1 && (___ % __) < (___ / __)) ? 
  _(__,___+1,____ + !((___ / __) % (___ % __))) : (___ < __ * __) ? 
  _(__,___+1,____) : 0;
} 
 
int main() {
  _(100,0,0); 
}

[edit] Remove literals

There are quite a few references to the literal 1. One could replace them by dividing a variable by itself each time, knowing it is nonzero, but for the sake of example instead we pass 1 as a (totally unnecessary) parameter from main (and, of course, call it _____). Let's leave the only remaining 0 as a teaser:

void _(int __, int ___, int ____, int _____) {
  ((___ / __) <= _____) ? _(__,___+_____,____,_____) : !(___ % __) ? _(__,___+_____,___ % __, _____) : 
  ((___ % __)==(___ / __) && !____) ? (printf("%d\t",(___ / __)), 
  _(__,___+_____,____,_____)) : ((___ % __) > _____ && (___ % __) < (___ / __)) ? 
  _(__,___+_____,____,_____ + !((___ / __) % (___ % __))) : (___ < __ * __) ? 
  _(__,___+_____,____,_____) : 0;
} 
 
int main() {
  _(100,0,0,1); 
}

[edit] Remove redundant text

Remove white space, type declarations, optional elements like the void and int on function return types, and unambiguous parentheses:

_(__,___,____,_____){___/__<=_____?_(__,___+_____,____,_____):!(___%__)?_(__,___+_____,___%__,_____):___%__==___/
__&&!____?(printf("%d\t",___/__),_(__,___+_____,____,_____)):___%__>1&&___%__<___/__?_(__,_____+
___,____+!(___/__%(___%__),_____)):___<__*__?_(__,___+_____,____,_____):0;}main(){_(100,0,0,1);}

It can be seen from this example that the steps are largely intuitive, not fixed rules, and often are applied more than once.

This program will run on most systems. The limiting factor generally will be time. This sort of obfuscation by program transformation is relatively easy to apply and can be performed on many simple programs.

In the example above, it can be deduced that the obfuscated version is likely to be far less efficient than the original, mainly because of the replacement of an iterative function with a recursive one which will incur much more computational overhead (though both are of order O((n log n)(loglog n)): see Big O notation). Intermediate results are also computed more often than necessary, though an optimizer may use loop-invariant code motion to avoid that. But the recursion is likely to defeat the optimizer.

[edit] Unintentional obfuscation

Obfuscation and optimization are not necessarily opposed, but sometimes unintentional obfuscation occurs when programmers attempt to optimize prematurely (i.e. second-guess the compiler). In fact, with modern optimizing compilers, attempts to do clever tricks are generally self-defeating because they may confuse the optimizer and prevent optimal code being produced. The simplest code is usually the best. Of course, the example above is deliberately contrived just to show the method.[citation needed]

[edit] Typography

In some common typefaces such as Courier New and Lucida Console, some letterforms (glyphs) at typical sizes on a computer screen appear identical: particularly l and 1, and O and 0. Even if not identical, they are close enough that if hidden in the middle of an identifier they are hard to spot. A very simple obfuscation is just to change these letters: while nobody is fooled for long, it wastes time (see below).

[edit] Obfuscation by code morphing

Obfuscation by code morphing refers to obfuscating machine language or object code rather than obfuscating the source code.

This is achieved by completely replacing a section of the compiled code with an entirely new block that expects the same machine state when it begins execution as the previous section, and will leave with the same machine state after execution as the original. However, a number of additional operations will be completed as well as some operations with an equivalent effect.

Code morphing makes disassembly of a distributed program more difficult. However, by adding unnecessarily complicated operations and hindering compiler-made optimizations, the execution time of the program is increased. For that reason, code morphing should be limited to critical portions of a program and not be used on an entire application.

Code morphing is often used in obfuscating the copy protection or other checks that a program makes to determine whether it is a valid, authentic installation, or a pirated copy, in order to make the removal of the copy-protection code more difficult than would otherwise be the case.

Code morphing is multilevel technology containing hundreds of unique code transformation patterns, as well as a special layer that transforms some commands into virtual machine commands. Code morphing turns binary code into an undecipherable mess that is not similar to normal compiled code.

[edit] Obfuscation in malicious software

Spammers frequently use obfuscated JavaScript or HTML code in spam messages. The obfuscated message, when displayed by an HTML-capable e-mail client, appears as a reasonably normal message—albeit with obnoxious JavaScript behaviors such as spawning pop-up windows. However, when the source is viewed, the obfuscations make it far more difficult for investigators to discern where the links go, or what the JavaScript code does. For this same reason, JavaScript obfuscation is also often used by malware authors to conceal parts of code that run browser exploits, or that redirect to pages containing exploits.

Automated JavaScript obfuscators are currently available on the market. Although the effectiveness of these tools may vary, and the channels through which they are exchanged range from the legal to the manifestly malicious, they are all created for the purpose of confounding casual viewers or inexperienced investigators.

The techniques use JavaScript's dynamic nature—a piece of code is stored as an encrypted string, which is decrypted and evaluated. This may be done several times. Other techniques include insertion of dummy code, as well as dummy HTML links to legitimate pages.

[edit] Obfuscation for VM migration

Third-party obfuscators are used to recompile old classes to meet new Java Development Kit definitions avoiding ClassFormatError while migrating from Microsoft Virtual Machine to Java Virtual Machine.

[edit] Advantages of obfuscation

[edit] Intellectual property protection

Obfuscation is typically used to protect the intellectual property that is present in a software. This includes protecting any trade secrets that may be present in the code as well as protecting licensing implementations to prevent unauthorized use.

[edit] Reduced security exposure

If an application has private information in the code – such as SQL, usernames, and passwords[8] – then obfuscating code with options such as string encryption can make this information harder to obtain.

[edit] Size reduction

As one of the core techniques of obfuscation is identifier renaming, size reductions are often gained by changing long, descriptive identifiers into, typically, one character identifiers. This can lead to substantial savings in program size, albeit with a resulting loss of maintainability of the code. Many obfuscators also have the ability to remove unused code, leading to further reductions in size.

[edit] Library linking

To fully hide the intent of an obfuscated program, it is common for standard library routines to be statically linked into the obfuscated program and those routines themselves are then obfuscated. This can be useful for avoiding problems like DLL hell.

[edit] Disadvantages of obfuscation

[edit] When used alone

At best, obfuscation merely makes it time-consuming, but not impossible, to reverse engineer a program. When security is important, measures other than obfuscation should be used. The same trade-offs are made in branches of cryptography: an algorithm may be known to be fast but weak, but if the information is very short-lived there is little incentive, except as an intellectual exercise, for anyone to break it: the information becomes useless before it is broken.

No-one can guarantee that obfuscation will present any particular level of difficulty to a reverse engineer.[9]

Obfuscators do not provide security of a level similar to modern encryption schemes. Even obfuscation with encryption can have flaws. Any program or data that is encrypted must be decrypted before it can be used by the computer. So it must exist, unencrypted, somewhere in memory; a reverse engineer can take a snapshot of that memory. Also, any strong encryption requires a key for decryption. For the program to be executable the key must be provided, leaving another avenue open for reverse engineering.[10]

Debuggers also provide powerful tools to look at the operation of a running program, and decompilers exist to help make bytecode and machine code human-readable. Reverse engineers can use these tools to crack systems with security dependant on obfuscation, including digital rights management systems.

Reverse engineering is partly a study in pattern recognition, and the good engineer quickly learns the quirks of a particular compiler, processor, or even programmer, and can make educated guesses about the original code.

[edit] Debugging

Obfuscated code is extremely difficult to debug. Variable names will no longer make sense, and the structure of the code itself will likely be modified beyond recognition. This fact generally forces developers to maintain two builds: One with the original, unobfuscated source code that can be easily debugged, and another for release. While both builds should be tested to make sure they perform identically, the second build is generally reliably constructed from the first by an obfuscator.

This limitation does not apply to intermediate language obfuscators (e.g. for .NET and Java), which generally work on compiled assemblies rather than on source code.

[edit] Portability

Obfuscated code often depends on the particular characteristics of the platform and compiler, making it difficult to manage if either change. This only applies to source code obfuscation. Obfuscation against intermediate languages does not have this limitation, though obfuscation can make it harder or impossible to decompile to a higher level language such as C# or Java.

[edit] Conflicts with Reflection APIs

Reflection is a set of APIs in various languages that allow an object to be examined or created just by knowing its classname at run-time. Many obfuscators allow specified classes to be exempt from renaming; and it is also possible to let a class be renamed and call it by its new name. However, the former option places limits on the dynamism of code, while the latter adds a great deal of complexity and inconvenience to the system.

[edit] Obfuscating software

A vast variety of tools exists to perform or assist with code obfuscation. These include experimental research tools created by academics, hobbyist tools, commercial products written by professionals, and Open-source software. Deobfuscators do the reverse.

Software obfuscation tools include specialized obfuscators to demonstrate a relatively limited technique, more general obfuscators which attempt a more thorough obfuscation, and combined-function tools which obfuscate code as part of a larger goal such as software licensing enforcement.

[edit] See also

[edit] Notes

[edit] References

[edit] External links

Personal tools