2.0 The Assembly Programming Environment
2.1 Setting Up Your Development Environment
Before writing a single line of code, every programmer must establish a properly configured development environment. This is the first practical step on our journey. Unlike interpreted languages that can be run instantly, assembly language requires an explicit two-step process: assembling, which translates the symbolic assembly code into machine-readable object code, and linking, which combines the object code with necessary libraries to create a final executable file.
For this course, we will use a standard and accessible toolset:
- An IBM PC or compatible computer.
- A Linux operating system.
- The Netwide Assembler (NASM).
We have selected NASM as our assembler for several key reasons that make it an excellent choice for learning:
- Free and Open Source: NASM is freely available for anyone to download and use, removing any barrier to entry.
- Well-Documented: Extensive documentation and community resources are available online, making it easier to learn and troubleshoot.
- Cross-Platform: NASM works on both Linux and Windows, providing a consistent environment across different operating systems.
To install NASM on a Linux system, follow these steps. First, check if it is already installed by opening a terminal and typing whereis nasm. If a path like /usr/bin/nasm is returned, you are ready to go. If not, proceed with the installation:
- Download the latest Linux source archive (nasm-X.XX.tar.gz) from the official NASM website.
- Unpack the archive in a directory. This will create a new subdirectory (e.g., nasm-X.XX).
- Navigate into that new directory using the cd command.
- Run the configuration script: ./configure. This will prepare the Makefiles for your system.
- Build the binaries: make.
- Install the binaries and manual pages: make install (you may need sudo for this step).
With the development environment ready, the next step is to understand the fundamental structure and syntax of an assembly program.
2.2 Basic Program Structure and Syntax
Despite their low-level nature, assembly programs possess a clear and mandatory structure that logically separates different parts of the program. This organization is not arbitrary; it directly maps to how the operating system loads and manages a program in memory. This structure divides the program into three primary sections, each with a distinct purpose.
- The data section: This section is used for declaring initialized data or constants. Any values defined here, such as file names, string literals, or lookup tables, are allocated and set when the program is loaded. While constants do not change, other data in this section can be modified at runtime. The syntax is section .data.
- The bss section: This section is used for declaring uninitialized variables or buffers. The .bss section reserves space in memory for data that will be populated later by the program. This memory is typically zero-filled by the OS loader. The syntax is section .bss.
- The text section: This is where the actual executable code resides. It must contain the declaration global _start, an assembler directive that exports the _start label so that the linker (ld) can record it as the program's entry point, the address where execution begins once the kernel has loaded the program. The syntax is section .text.
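The three sections described above can be sketched as a minimal skeleton. This is an illustrative layout for 32-bit Linux, not a required template: the names greeting, greeting_len, and buffer are hypothetical, and the program does nothing except exit.

```nasm
; skeleton.asm - minimal three-section layout (NASM syntax, 32-bit Linux)
; Names such as greeting and buffer are illustrative assumptions.

section .data                       ; initialized data, set when the program loads
    greeting     db 'hi', 0xa       ; three bytes: 'h', 'i', line feed
    greeting_len equ $ - greeting   ; constant computed by the assembler

section .bss                        ; uninitialized storage, reserved but not stored in the file
    buffer       resb 64            ; reserve 64 bytes to be filled in at runtime

section .text                       ; executable code
    global _start                   ; export the entry point for the linker
_start:
    mov eax, 1                      ; system call number for sys_exit
    mov ebx, 0                      ; exit status 0
    int 0x80                        ; ask the kernel to terminate the program
```

The order of the sections in the source file is a matter of style; the linker places each one in the appropriate region of the executable.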
Within these sections, assembly programs consist of three types of statements:
- Executable Instructions: These are the mnemonics (like MOV or ADD) that tell the processor what to do. Each instruction assembles to exactly one machine instruction.
- Assembler Directives: Also known as pseudo-ops, these statements tell the assembler how to process the code (e.g., section .data or equ). They are not executed by the processor; most generate no machine code, although data directives such as db place bytes in the output.
- Macros: These are text substitution mechanisms that allow a programmer to define a reusable block of code with a single name, which the assembler expands in place wherever it is invoked.
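To make the distinction concrete, here is a small sketch that uses all three statement types in one file. The macro name exit_with and the constant SYS_EXIT are hypothetical names chosen for this illustration; the %define and %macro forms are NASM's preprocessor syntax.

```nasm
; statements.asm - one example of each statement type (NASM, 32-bit Linux)

%define SYS_EXIT 1          ; single-line macro: plain text substitution

%macro exit_with 1          ; multi-line macro taking one parameter (%1)
    mov eax, SYS_EXIT       ; system call number for sys_exit
    mov ebx, %1             ; exit status supplied by the caller
    int 0x80                ; call the kernel
%endmacro

section .text               ; directive: selects the code section
    global _start           ; directive: exports _start to the linker
_start:
    mov ecx, 2              ; executable instruction
    add ecx, 3              ; executable instruction
    exit_with 0             ; macro invocation, expands to the three lines above
```

Only the mov, add, and int lines end up as machine instructions in the object file; the directives and macro definitions are consumed by the assembler.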
The syntax for a single line or statement in assembly follows a consistent format, though most fields are optional:
[label] mnemonic [operands] [;comment]
- [label]: An optional identifier that marks a specific location in the code, often used as a target for jump instructions.
- mnemonic: The required name of the instruction to be executed (e.g., MOV).
- [operands]: The optional parameters or data that the instruction will operate on. Most instructions take zero, one, or two operands; a few x86 instructions take three.
- [;comment]: An optional comment, which begins with a semicolon (;) and is ignored by the assembler.
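Here are some examples illustrating this syntax. NASM requires square brackets around a memory operand and a size keyword when the operand size cannot be inferred, so the first two examples assume that COUNT and TOTAL are byte-sized variables:
INC BYTE [COUNT] ; Increment the memory variable COUNT (1 operand)
MOV BYTE [TOTAL], 48 ; Transfer the value 48 into the memory variable TOTAL (2 operands)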
ADD AH, BH ; Add the content of BH register into AH register
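The label field is easiest to see in a loop. The following sketch, with the hypothetical label countdown, counts a register down from 5 to 0 and shows statements with two, one, and zero explicit data operands side by side.

```nasm
; labels.asm - a label used as a jump target (NASM, 32-bit Linux)

section .text
    global _start
_start:
    mov ecx, 5          ; two operands: load the loop counter
countdown:              ; label: marks the address of the next instruction
    dec ecx             ; one operand: subtract 1 and update the zero flag
    jnz countdown       ; jump back to the label while ecx is not zero

    mov eax, 1          ; system call number for sys_exit
    mov ebx, ecx        ; exit status = final counter value (0)
    int 0x80            ; call the kernel
```

The label itself occupies no space in the program; the assembler simply replaces each reference to countdown with the address it marks.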
These structural and syntactic elements are the building blocks we will now combine to create our first complete, functioning program.
2.3 Your First Program: “Hello, World!”
The “Hello, World!” program is a foundational rite of passage for programmers in any language. In assembly, this simple exercise is particularly revealing, as it provides a clear window into how a program communicates directly with the operating system to perform a fundamental task like displaying text on the screen. It lays bare the mechanics of system calls, which we will explore in great detail.
Here is the complete assembly code for a “Hello, World!” program in NASM:
section .text
    global _start           ; Must be declared for linker (ld)
_start:                     ; Tells linker entry point
    mov edx, len            ; Message length
    mov ecx, msg            ; Message to write
    mov ebx, 1              ; File descriptor (stdout)
    mov eax, 4              ; System call number (sys_write)
    int 0x80                ; Call kernel
    mov eax, 1              ; System call number (sys_exit)
    int 0x80                ; Call kernel
section .data
msg db 'Hello, world!', 0xa ; String to be printed
len equ $ - msg             ; Length of the string
Now, let’s carefully deconstruct this program to understand how it works.
- section .data: Here we define our data.
- msg db 'Hello, world!', 0xa: Defines a sequence of bytes labeled msg. 0xa is the ASCII line feed (newline) character.
- len equ $ - msg: This is an assembler directive. $ represents the current address, which at this point is the address just past the end of the string, so $ - msg calculates the length of the msg string in bytes (14, including the line feed) and assigns it to the constant len.
- section .text: This is our code section.
- global _start: Makes the _start label visible to the linker.
- _start: The label that marks the entry point of our program.
- The sys_write System Call: The first block of mov instructions prepares a system call to write to the screen. Each register is loaded with a specific piece of information required by the kernel.
- mov edx, len: Moves the length of our message into the edx register.
- mov ecx, msg: Moves the memory address of our message into the ecx register.
- mov ebx, 1: Moves the file descriptor for standard output (stdout) into the ebx register.
- mov eax, 4: Moves the system call number for sys_write into the eax register.
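- int 0x80: This instruction triggers a software interrupt that transfers control to the kernel. The kernel reads the system call number from eax and the arguments from the other registers, performs the requested call, places the result in eax, and then resumes our program.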
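- The sys_exit System Call: The second block terminates the program.
- mov eax, 1: Moves the system call number for sys_exit into the eax register.
- int 0x80: Calls the kernel to terminate the program. sys_exit takes the exit status from ebx, which this program does not reload, so ebx still holds the 1 left over from the write call and the process exits with status 1. To report success, place mov ebx, 0 before this final int 0x80.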
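As a variation on the walkthrough above, the sketch below names the magic numbers with equ and sets the exit status explicitly, since sys_exit reads its status from ebx. The constant names SYS_EXIT, SYS_WRITE, and STDOUT are assumptions made for readability; they are not defined by NASM or by the kernel headers used here.

```nasm
; hello_exit0.asm - "Hello, World!" with named constants and exit status 0
; (NASM, 32-bit Linux). Constant names are illustrative assumptions.

SYS_EXIT  equ 1                     ; system call number for sys_exit
SYS_WRITE equ 4                     ; system call number for sys_write
STDOUT    equ 1                     ; file descriptor for standard output

section .data
    msg db 'Hello, world!', 0xa     ; string followed by a line feed
    len equ $ - msg                 ; length of the string in bytes

section .text
    global _start
_start:
    mov edx, len                    ; third argument: number of bytes to write
    mov ecx, msg                    ; second argument: address of the bytes
    mov ebx, STDOUT                 ; first argument: file descriptor
    mov eax, SYS_WRITE              ; select sys_write
    int 0x80                        ; call the kernel

    mov ebx, 0                      ; first argument: exit status 0 (success)
    mov eax, SYS_EXIT               ; select sys_exit
    int 0x80                        ; call the kernel
```

Because equ constants are resolved at assembly time, this version produces the same write sequence as the original; the only behavioral difference is the exit status that the shell sees.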
To assemble and link this program, save the code as hello.asm and run the following commands in your terminal:
- Assemble the code: nasm -f elf hello.asm
- This command tells NASM (nasm) to assemble the file hello.asm. The -f elf flag selects the 32-bit ELF (Executable and Linkable Format) object format, the standard for 32-bit Linux; in NASM, elf is a synonym for elf32. This creates an object file named hello.o.
- Link the object file: ld -m elf_i386 -s -o hello hello.o
- This command uses the linker (ld) to create a final executable. -m elf_i386 specifies the emulation for 32-bit Intel architecture. -s strips symbol information, and -o hello names the output executable hello.
- Execute the program: ./hello
- This runs your assembled and linked program, which should display Hello, world! on the screen.
The section keyword used in this program is directly related to how the system organizes memory into logical areas called segments.