computer compilers: brief introduction


I am a Software Engineer and Clinical Social Worker based in San Francisco, CA.


On how a compiler works, using the GNU Compiler Collection (gcc) as an example

compiler: gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
environment: Vagrant virtual machine running Ubuntu 14.04.5 LTS (Linux)
language: C language

If you began to experiment with computer code and programming languages after the early 1980s, it is very likely that you started with higher-level languages and only much later learned about compilers. At least, this is how I have been learning to code. I have been scratching the surface of HTML, CSS, JavaScript, and PHP for almost 2 years, and I never knew what a compiler was. This is because these languages are not compiled ahead of time by the programmer: HTML, CSS, and JavaScript are read and interpreted by the browser, and PHP is interpreted by a program on the web server. The browser or other environment reads the instructions and uses its own logic and mechanisms to interpret and respond to the input code. Did you ever wonder what a browser such as Chrome, Firefox, or Safari is written in, and how it interacts with the silicon and metal of the actual, tangible machine?

The circuits of a computer only understand machine code, which is written in binary (2 characters): 0's and 1's; hence the title image of this post. Binary is the language that a computer circuit can read, and it all boils down to open and closed circuits; i.e. off and on; i.e. 0 and 1. Since humans rely on symbols so often to aid our memories, computer programmers have invented other languages to represent the 0's and 1's. But computer circuits still only interpret 0's and 1's, so we need a system to translate human-legible code into binary. That translation is done by a compiler. This post will attempt to explain part of that process, using the GNU Compiler Collection (gcc) on Linux. The manual for the compiler can be opened with the command:

$ man gcc
NAME
       gcc - GNU project C and C++ compiler

DESCRIPTION
       When you invoke GCC, it normally does preprocessing, compilation, assembly and linking.  The "overall options" allow you to stop this process at an intermediate stage.  For example, the -c option says not to run the linker.  Then the output consists of object files output by the assembler.  Other options are passed on to one stage of processing.  Some options control the preprocessor and others the compiler itself.  Yet other options control the assembler and linker; most of these are not documented here, since you rarely need to use any of them.

In order to invoke the compiler to compile (or transform) a "C" file (main.c), we would run the command:

$ gcc main.c

For more on what happens with input into the command line terminal, check out my other post, "how the terminal works command line input", which explains more about how bash interprets commands.

In this article, we will skip the explanation of that process and begin with what happens after the command is executed, in this instance the gcc command. Once "gcc" is entered into the terminal, the GNU C compiler begins its compilation process on the file that was specified with the "gcc" command. In the example that I am using, that file is "main.c".

There are 4 main stages of the compilation process, as listed above in the GNU C compiler manual: preprocessing, compilation, assembly, and linking. I have listed those stages below, along with the kind of file that each one outputs.

 

Compiler Steps:

source code (main.c you coded with a text editor, preferably emacs) -->

preprocessor --> preprocessed code (main.i) -->

compiler --> assembly code (main.s) -->

assembler --> object code (binary: main.o) -->

linker (with included libraries) --> executable code (a.out by default, or main if you pass -o main)

Moving forward, I will show examples of the code at each step of the compiler process, so let's reference all the code in terms of what is output when a .c program called bootcamp.c is compiled. "bootcamp" because I am a beginning software engineering student at bootcamp School. See the code below.

bootcamp.c code:

#include <stdio.h>
/**
 * main - Entry point
 * Return: Always 0 (Success)
 */
int main(void)
{
    printf("Hello, bootcamp!\n");
    return 0;
}

The preprocessor is the first step. In this step, the source code is transformed into an expanded version that is still plain text (ASCII), not binary. The output of this step can be printed with the -E option of gcc. In this step, the contents of any header files that you #include are copied into the preprocessor output, and macros are expanded. Comments are also removed, which you can see in the example that I have chosen. If we use the option -E and pipe the output to head, we can take a look at the first 10 lines of the preprocessor output:

Preprocessor output:

$ gcc -E bootcamp.c | head
# 1 "bootcamp.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "bootcamp.c"
# 1 "/usr/include/stdio.h" 1 3 4
# 27 "/usr/include/stdio.h" 3 4
# 1 "/usr/include/features.h" 1 3 4
# 374 "/usr/include/features.h" 3 4

The full output from the preprocessor has 849 lines in it, and terminates with this block:

int main(void)
{
 printf("Hello, bootcamp!\n");
 return 0;
}

The 844 lines above my main function are the code from <stdio.h> and the headers that it includes, which is mostly composed of type definitions and function prototypes (the macros have already been expanded at this point); then the C code from my file is listed last. If I needed to save this output into a file, I would add the -o option and specify a filename: gcc -E bootcamp.c -o bootcamp.i. In my above example of "compiler steps," I've shown the output file with a .i extension.

Then the compiler proper translates the preprocessed code into assembly code: another plain-text (ASCII) file, written in a language that the assembler can translate into binary. We can produce the assembly code by running gcc -S bootcamp.c. The default name for the output file is bootcamp.s. For our bootcamp example, the beginning of that file looks something like this:

Compiler stage output:

 .file "bootcamp.c"
 .section .rodata
.LC0:
 .string "Hello, bootcamp!"
 .text
 .globl main
 .type main, @function
main:
.LFB0:
 .cfi_startproc
 pushq %rbp
 .cfi_def_cfa_offset 16
 .cfi_offset 6, -16
 movq %rsp, %rbp
 .cfi_def_cfa_register 6

The next step is that the assembler converts the assembly code into binary machine code, called object code. We can stop after this step and store the output in a file with gcc -c bootcamp.c. The -c option means to compile or assemble, but do not link, and the default output is a .o file; in our example the output file would be bootcamp.o.

Finally, the linker combines the assembled object code with the compiled, binary code of the C library functions that the program uses (here, the code behind printf) to produce an executable file. During the linking phase, the libraries are either statically or dynamically linked. To learn more about dynamic and static libraries, you can read my other post here: what the f*lib.h?. When the executable file is run, the computer's processor can read the binary code directly, and will execute your intended program.
