In comp.lang.perl.misc, berggren@hal.com writes:

> I read in the perl5 docs that "use Module" imports the Module.pm exported names at compile time, whereas the "require Module" loads the module only at run time. Can someone please explain the difference here? I thought that Perl(5) is an interpreted language, and is neither compiled nor run. When I'm running my script, when is the compile time, and when is the run time?
Charitably put, I fear you have an overly naïve definition of the technical terms ``compile'' and ``interpret'' as applied to programming languages and programs.

With the possible exception of the eval("string") facility, nothing in the de facto definition of the perl language will clue you in as to whether it's compiled or interpreted. In fact, these terms don't even make a great deal of sense once you look into the matter a bit. Let's do that now.

The perl executable you are using has two distinct stages. First comes the frontend, which is certainly a compiler of sorts. It compiles your perl program source into a parse tree. This compiler then performs various optimizations such as one would find in any other compiler, including throwing out unreachable code, reducing constant expressions to their results, and loading in certain library definitions. It is at this point that the use statements get run, since they are semantically equivalent to BEGIN{} blocks wrapping a require and an import() class-method call against the included module.

End of compilation.

Next comes the backend, which is certainly an interpreter of sorts; let's call it a PP interpreter for now, just because. While what it actually executes is a parse tree and not byte code per se, still we would not go wrong in classifying this backend as a byte-code interpreter (like java or python). This is useful in particular when it comes to distinguishing these languages from ``pure'' interpreters, such as most shell and tcl implementations you happen to run. This is where any requires not wrapped in BEGINs occur.

[The reason it's called a PP interpreter is that it's pretending to be a virtual machine implementing instructions whose names are things like pp_rv2gv, pp_chomp, pp_ge, pp_each, pp_split, and pp_backtick. See the perl source code itself for details. This virtual machine's language is defined in pp.h, and implemented in files like pp.c, pp_ctl.c, pp_hot.c, and pp_sys.c if you're curious (or even if you're not).]
From the frontend (the ``source-code to parse-tree'' compiler), you can get at the backend (the PP interpreter) via a BEGIN subroutine. Likewise, to go the other way (get back to the compiler from the interpreter), you can use an eval("string") or a s/foo/bar/ee notation. (By the way, despite appearances to the contrary, it turns out that an eval { BLOCK } and s/foo/bar/e are not actually hooks back to the compiler; it already handled them long ago and far away.)

Does that make sense? Think of every call to

    $ perl somescript

as being

    $ perl-compiler < somescript | perl-interpreter
At no point does the perl compiler described above engage in the actual generation of C code, assembly language, or machine code.

The seldom-(successfully)-used dump() function and the -u command-line flag provide a way to skip the first stage in that pipeline, but you still have the parse trees and the PP interpreter in that huge dumped file, which you must somehow massage into an a.out, usually using the undump program from the TeX distribution or linking against the C function unexec() from GNU emacs. This is normally never done, because the file is huge and it's darned hard to actually get it to work. As a mere starting point, you'd have to link your perl executable statically, not dynamically, and even then it's extremely hard to get working on a particular architecture. I'm not sure whether it's ever been done on anything but a Sun.

As a brief diversion, let's look at compilation on a more traditional system, one in which code generation is typically handled by a compiler backend. For example, if you were running a Convex system, you might well have a Fortran, a C, and an Ada compiler, all of which produce a common intermediary form. This intermediary code (whether it be a raw parse tree or some hypothetical virtual machine's assembly language, that is, byte code) is then fed into a backend code generator, which optimizes it and spits out assembly language, which is in turn fed to the assembler to produce machine code. Later on, when you ``execute'' this machine code, it is in turn fed to the firmware interpreter largely implemented in hardware.

Now, at the risk of further confusing you, permit me to tell you about something else. Currently in early public alpha release, a ``perl compiler'' exists, which will surely blur these distinctions even further. We throw out the traditional backend and substitute a new stage. This ``middle-end'' (as it were) takes the parse tree the frontend produces and emits as its output an intermediary byte-code form (perl byte code, or PBC).

Three different backends grok this PBC:

  1. A byte-code interpreter, or, if you would, a perl virtual machine for those indoctrinated into the java terminology. This still needs the old backend parse-tree interpreter sitting around somewhere.

  2. A code generator that produces compilable C code. This is essentially an unroller (or perhaps unraveller) of the parse-tree interpreter. That is, it traces the code path that the interpreter would execute.

  3. Another code generator that produces compilable C code, except that this one doesn't just trace the steps the interpreter would have followed, but actually produces optimized code (for example, it would work with raw integers directly rather than making the interpreter calls that would have done so).

If your code makes use of any dynamically-loaded modules (like POSIX, Socket, Fcntl, FileHandle, etc.), then you must keep those modules' binary forms (POSIX.so, Socket.so, etc.) around so they can be found when your executable gets run. Backends one and two both alleviate the need to store the original, pre-compiled source code anywhere. Backends two and three alleviate the need to keep the old PP interpreter lying about. Backend number three is the only one that is going to speed up execution when compared with the old PP interpreter.

There, I hope that's all clear now. :-)

--tom


PS:
For more information, see the alpha announcement that Malcolm made.

Copyright 1996 Tom Christiansen.
All rights reserved.