arsine, a custom shell command line interpreter

:a step by step process of how the custom shell, arsine, processes the input command ‘ls -l‘ and returns output in a terminal emulator

Co-Authored with:
Bobby Yang, follow @glyif on twitter and github.
For reference to the code of arsine, it is hosted on github (@glyif , repo simple_shell): https://github.com/glyif/simple_shell

main while loop of arsine

These beginning steps until the custom _getline() function all occur for all processes and all inputs into arsine including ls -l, and so therefore, there is no specific explanation of what happens in the case of ls -l until the _getline() function explanations.

arguments inventory

The major component of the initialization of arsine occurs in a function that builds a struct termed the arguments inventory. This arguments inventory has almost all of the major variables utilized within arsine; also, many of the functions in arsine take the arguments inventory as input parameters. It was a concept that was implemented in the early stages of building arsine shell as a solution to improve the scalability, organization, and readability of the C language code of arsine. In this way for every feature that was added to the code or arsine, which required a new variable, the functions simply required that the new variable be added to the arguments inventory instead of having to modify multiple function prototypes and function calls to account for the new variable. As a side note, arsine is fully coded in C language, and designed to be compiled with the GNU compiler collection gcc 4.8.4, which converts the code of arsine into an executable file of binary machine code.

environmental variables

One of the main variables built in the arguments inventory is the environmental variables linked list. It is simply generated from the extern char **environ and passed around the arsine system through the arguments inventory.

history

History is also generated from the input file stored in the $HOME directory called .simple_shell_history. The arguments inventory automatically generates the history list and restores it at the end of the program.

fstat()

The first 2 relevant variables of the arguments inventory are the exit integer and an integer based on the first system call of the program, fstat(), which returns statistics on file descriptors. The exit integer is the integer that the main loop of the main() function uses as the conditional check to continue processing the input and output process. In the case of fstat(), arsine uses the standard input as the file descriptor, and so fstat() builds a struct called “stat” of a list of variables with statistics or information on the standard input. One of the variables of that struct has information on the field known as the: st_mode. The _filemode() function checks the integer that is stored in the st_mode struct to see if it is a match with either: S_IFCHR — 0020000 — character device, or S_IFIFO — 0010000 — FIFO. If st_mode is a character device, this indicates arsine that it is being executed in interactive mode through the terminal. Otherwise, if the st_mode is is a FIFO, this indicates that the program is being run through a FIFO special file, or a pipe. If arsine is executed through a pipe, we wrote instructions after checking the st_mode from the stat struct to skip the next process of arsine, which is the prompt string or PS1. In the case of the example for this article, ls -l could be executed in either interactive mode, or non-interactive mode, and the outputs will only differ based on whether the prompt string is output or not.

PS1

arsine is designed to run inside a loop that continues so long as the user has not input the built-in function “exit“, which exits arsine. In this loop, the prompt string is first output before we prompt users for input. The prompt string is only output when arsine is executed in a device like a terminal accepting user input from the custom getline function in arsine, as contrast to being executed through a FIFO or pipe (this is explained above). We use the simple system call, write() to write to standard output this prompt string: "$ ". After the prompt string is output, the user is then prompted for input from the custom _getline() function from arsine. Before the main loop of arsine inside the main() function, arsine has instructions of how to interpret the user input of ^C (control + C). The system call signal()handles this step with instructions on how to respond when ^C is input from standard input with a designated function. arsine designates the function sig_handle() to respond from input of ^C to instead of exiting the current process, to instead immediately output the prompt string on the next line. The prompt string is also output, immediately with no other output on the next line, if arsine does not find any instructions at all when the user inputs “enter” or a newline.

In the case of ls -l PS1 is output after the command is executed if run in interactive mode; if arsine is executed in non-interactive mode, then the PS1 is not output.

custom `_getline()`

The custom _getline() function uses the system call read() to read from standard input one character at a time. When the input character is '\n' (a newline), the read getline() function stops reading input, and exits with the entire input stored in a char * buffer string. There is a check to see if read() returns -1, which is what is returned from no input at all, which occurs when commands are piped into the shell. In this case, arsine exits, because there are no more commands being input from the user. In the case of ls -l arsine takes in 5 characters, and then the 6th character input is the \n character from the user inputing enter to officially enter the command. After the input string is copied into a buffer stored in the arguments inventory, it is then also added as a new node into the linked list of history variables.

In the case of ls -l the history list contains the C string: "ls -l\n", which is also demonstrated as the character array:
{ 'l', 's', ' ', '-', 'l', '\n', '\0' }

tokenizer

The tokenizer takes the input commands stored in the buffer string and creates a new string of characters. In this function, the buffer string is parsed one character at a time and has instructions on how to handle every character. For most all alpha characters and numbers, the character is simply copied into the new string, which is stored in the struct tokens. The tokens struct also has a variable, which is a token type array of structs of type token (no ‘s’). Each token in this array has a pointer to the beginning of the copy of the input string stored within this same struct tokens. The tokens struct also has an integer which represents the number of tokens in the tokens array or in other words the length of the array. The copied string in the tokens struct has a NULL byte char '\0' after every set of commands or words that are copied into it; therefore, the pointers in the token struct, also end with a ‘\0‘. When the tokenizing process occurs, all white space characters are skipped. Semicolons are input, but not repetitively; i.e. for input of ;;;;;, only one ; is copied. After the input string is copied and processed into the tokens struct, the struct is then handled in the main()function.

in the case of ls -l, the input string is copied to this string: "ls\0-l\n" or { 'l', 's', '\0', '-', 'l', '\n', '\0' }, and a pointer is stored containing the address of the beginning character of each string in the command.

expansion

Expansion is another process in our code in which the array of tokens inside the tokens struct is processed to check for the char '$'. If a '$' is found, the following char is checked with a switch case function to see how to handle the input. For case '$$', the process ID of the current instance of arsine is expanded, for '$!', the process ID of the latest background process is expanded, for '$?' the most recent return value is returned, and for '$0', the name of the program running is returned. If none of these characters are after the first ‘$’, then the list of environmental variables is checked for a match of the following characters after the ‘$’ character. If a match is discovered, then that environmental variable value is replaced by the name. For example: $USER would be replaced with the user from the environmental variables list. If no match is found then the entire string including the '$', is replace with nothing. For example, $TEST becomes "", an empty string.

aliases

During runtime, arsine stores a linked list of all aliases in memory. This linked list can be modified with user input commands of alias or unalias. Before arsine is executed, a file in the $HOME directory is opened to check for old aliases, and then when arsine shell ends, the generated list of aliases is then automatically written to the aliases file. When arsine is in the pipeline phase (described below), the alias list goes through a similar expansion process in how arsine processes environmental variables (described above). In this process, with the use of a custom string compare function, the input commands are checked to see if they match the stored aliases. If there is a match, the aliases are expanded out.
For ls -l no expansion occurs.

parser | pipeline | worker

Since the parser, pipeline, and worker are so complex, and are designed to respond to a variety of cases, this article discusses the following input command example, instead of ls -l:

$ echo "hi" > tmp.txt ; cat tmp.txt | grep "no" || echo "not found"

steps of parsing tree

input

The associated diagram is an example of how a complex command like the command written just above is parsed by arsine into a parse tree.
—

parser

Our parser takes the tokens provided by the tokenizer to construct a binary tree. The parser is based on the classic operator precedence parser which builds an abstract syntax tree depending on the precedence of a token.
—

expansion

described above.
—

pipeline

After the abstract syntax tree is constructed, all of the processes in the tree gets put into a pipeline from the bottom up.
—

worker

The worker takes the pipeline of processes and and feeds them to various execute functions.
—

breakdown of parser and pipeline:

$ echo "hi" > tmp ; cat tmp | grep "no" || echo "not found"

(1) arsine calls `tokenize()` function:

tokenize(&arginv->tokens, arginv->input_commands);

At this point, the arguments inventory struct, contains the tokens struct, (as described above), arginv->tokens->data, which contains a the copied buffer holding, all tokens and each token is ended with the NULL byte char: ‘\0’, and white spaces are ignored. For example:

"echo\0"hi"\0>\0tmp.txt\0;\0cat\0tmp.txt\0|\0grep\0"no"\0||\0echo\0"not found"\0"

The tokens array of token structs contains pointers to each string in the copied data string, which terminates with ‘\0’. These are some of the examples of the variables for the token struct:

struct token
{
    int id;
    int precedence;
    const char *str;
}
arginv->tokens.tokens[0]->str = "echo"
id = TOKEN_STRING
prec = 0
arginv->tokens.tokens[2]->str = ">"
id = TOKEN_REWRITE
prec = 4
arginv->tokens.tokens[3]->str = "tmp.txt"
id = TOKEN_STRING
prec = 0
arginv->tokens.tokens[4]->str = ";"
id = TOKEN_SEMICOLON
prec = 1
arginv->tokens.tokens[6]->str = "tmp.txt"
id = TOKEN_STRING
prec = 0
arginv->tokens.tokens[7]->str = "|"
id = TOKEN_PIPE
prec = 3
arginv->tokens.tokens[10]->str = "||"
id = TOKEN_OR
prec = 2

arginv->tokens->tokensN is the number of tokens tokenized. e.g. 13.

(2) expansion occurs, as explained above.

(3) arsine calls parse() function:

parse(&arginv->parser, &arginv->tokens)

parse tree completed

arginv->parser->tree points to the root node of an operator-precedence parser tree. Each tree node is either a set of commands, e.g. echo “hi” or a special token defined in the token_names, e.g. > The tree should resemble the beside image “parse tree”.
The internal nodes are original pointer nodes (defined in token_names), and the leaf nodes are the strings for command. Generally, higher precedence operators have deeper levels in the tree, so that they will be executed earlier in the execution phase.

echo "hi" > tmp.txt ; cat tmp.txt | grep "no" || echo "not found"

Parse Tree Stages

subtree A

(A)

The input is the first string command node, e.g. echo "hi". Lookahead points to >. arsine enters the outer while loop, since this is an operator token and precedence >= 0, the original pointer points to >, right side is tmp.txt, lookahead points to ;. Next, arsine does not go into the inner while loop since the precedence of ‘;‘ is less than the precedence of > (remember original pointer points to >). arsine creates a new node for >, and its left side is echo "hi" , right side is tmp.txt. At this stage arsine has created a subtree as displayed in the beside image “subtree”.

(B)

Left side points to >. arsine goes back to the start of the outer while loop, lookahead points to ; and its precedence is > 0. Therefore arsine continues the while loop. The original pointer points to ;, right side = cat tmp.txt, lookahead points to '|'. arsine enters the inner while loop since the precedence of | is greater than the precedence of ; (remember original pointer points to ;).

(C)

the original pointer points to ;, right side = cat tmp.txt, lookahead points to |, arsine goes into the inner while loop since the precedence of | is greater than the precedence of ; (remember original pointer points to ;)
Next arsine recursively calls the method where the inputs are: ntoken points to |, left side is cat tmp.txt (right side is in the outer while loop), min_prec is the precedence of |. In this recursive call, lookahead points to |, arsine goes into the outer while loop, since its precedence >= the precedence of |, original pointer points to |, the right side is grep "no". The lookahead points to ||. arsine does not go into the inner while loop since the precedence of || is less than the precedence of |.

(D)

subtree E

In the next stage there is another recursive call. Lookahead points to ||, arsine goes into the outer while loop. Since its precedence is >= the precedence of ||, original pointer points to ||. Right side is echo "not found", lookahead points to outside the token array, as we already processed all tokens. arsine does not go into the inner while loop since we don’t have more tokens to process. arsine creates a new node for ||, and its left side is |, right side is echo "not found", left side points to ||, arsine does not continue to the outer while loop since lookahead is out of the array bound (no more tokens to process), arsine returns the created new node ||. At this stage we have a subtree, shown in the beside image: “subtree E”.

(F)

parse tree completed

After the recursive call set lookahead point to outside the array. arsine moves out from the inner loop and creates a new node for the original pointer ;, and its left side is > , right side is ||. At this stage arsine has created the entire parse tree as shown in the beside image, “parse tree F”.

(4) last stage: pipeline:

In this stage, arsine calls, worker_execute(arginv);, which creates a pipeline if the node is a string, a pipe, or a redirection, and then arsine executes the pipeline. arsine first creates the pipeline on the “subtree A” (above). The pipeline has only one node, echo "hi" with filename tmp.txt. arsine executes this pipeline and waits for the process is finished. Then, arsine checks for the command, to see if it is an alias (described above). Then arsine creates the pipeline on the “subtree D.” The pipeline has two nodes cat tmp.txt and grep "no". Arsine executes this pipeline, processes the commands one by one, and feeds the previous result to the next process. arsine waits for the child process to finish and remembers the final status, which is false in this case. In the example, next there is an “OR” operator, since the left tree returns false, arsine runs the right subtree, “subtree E.”
execute built-ins
Before input commands are input into the execve() function, the command is checked with a custom string compare function to verify if the command is a built-in command. If the input command is a built-in, the built-in function is immediately called and executed, and returned with an indicated variable to inform the worker to skip processing of the execve function. The arsine shell then returns to the main loop, except for in the event of one special built-in, the_exit(). In the exit built-in function, the extra step is a modification of an argument from the arguments inventory exit integer, which is the conditional check for the main loop in the `main()` function.

for ls -l, ls is not a built-in command and therefore the worker continues on to call execve()

`execve()`

When the function execve() is called in arsine, the input arguments used for this function are: (1) the first command, (2) an array of char pointers to all the input commands that were tokenized in the tokenizer, and (3) the custom environmental variables list turned into a double pointer. The function execve()
In the case of ls -l, since ls is not a built-in function, ls will be used as an argument to the function execve() and this function will search the $PATH for the executable file ls. This process is outlined briefly, in this article linked here: https://www.davidjohncoleman.com/2017/how-the-terminal-works-command-line-input/.

output

In this phase of arsine, the output occurs either from the built-in functions, from the execve() call, or from standard error output stream. For input of ls -l the output is going to list out the file contents of the directory specified, which none is specified, and so this defaults to the current working directory. The flag -l means to output the files in long format including details about the size, date modified, and the mode of the file. The code for the instructions of ls is in the executable file ls that is referenced in the above stated article.

reset memory, repeat

After the input command is handled or an error message is output due to a flaw in the way the arsine system handles inputs, due to the command being unrecognized, or unable to be processed, the memory reset process occurs. In the memory reset phase, the main memory being reset are the bytes from the tokenizer, the parser, and the input commands, so that the next input commands can be handled. If the exit integer from the arguments inventory has not been modified then the while loop from main() function continues, and so this entire process is repeated as shown in the above image of the while loop processes.

exit & free memory

If the main loop exits, then arsine attempts to free all the rest of the allocated memory, including memory in the (1) alias list, (2) history list, (3) environmental variables list, and (4) the input commands buffer. Then arsine exits with the exit status that was stored from user input or 0.

by: David John Coleman II

I am a Software Engineer and Clinical Social Worker based in San Francisco, CA | contact me or follow me.

arguments inventory

environmental variables

history

fstat()

PS1

custom `_getline()`

tokenizer

expansion

aliases

parser | pipeline | worker

input

parser

expansion

pipeline

worker

breakdown of parser and pipeline:

(1) arsine calls `tokenize()` function:

(2) expansion occurs, as explained above.

(3) arsine calls parse() function:

Parse Tree Stages

(A)

(B)

(C)

(D)

(F)

(4) last stage: pipeline:

`execve()`

output

reset memory, repeat

exit & free memory

arguments inventory

environmental variables

history

fstat()

PS1

custom _getline()

tokenizer

expansion

aliases

parser | pipeline | worker

input

parser

expansion

pipeline

worker

breakdown of parser and pipeline:

(1) arsine calls tokenize() function:

(2) expansion occurs, as explained above.

(3) arsine calls parse() function:

Parse Tree Stages

(A)

(B)

(C)

(D)

(F)

(4) last stage: pipeline:

execve()

output

reset memory, repeat

exit & free memory

custom `_getline()`

(1) arsine calls `tokenize()` function:

`execve()`