Register renaming2 months ago
I haven't written a blogpost in a long time, although I really wanted to, but real life and a new project kept me busy. This is also the reason why there haven't been any Firedrake updates lately, but don't worry, the project is far from being dead and there is some progress (and there going to be more in the future). I'm going to write more blogposts about Firedrake in the near future, but in the meantime, I would like to write about register renaming. A few months ago, there was an article on, I think, #AltDevBlogADay about CPU optimizations, talking about caches and it also briefly touched the topic of register renaming. The comment section was full of people asking for a more in depth article about this, but as far as I'm aware, there never was an in-depth article, so that's why I decided to write about this.
Preface
I assume a certain degree of knowledge about CPUs. Don't worry, you don't have to be an expert, but if you have never heard of instruction pipelining or the fact that a CPU is using a clock to do its work, you might want to reconsider reading this. Also worth noting; This blogpost is written with x86 and x86-64 in mind! While some of the concepts are true for other architectures as well, not every architecture behaves this way. Please keep in mind that a modern CPU is an incredibly complex piece of hardware and that this blogpost doesn't aim to be a scientific paper, so a lot of things are simplified.
The problem
Modern CPUs use a variety of techniques to keep the performance high without increasing the clock speed at the same time. This is done because increasing the clock speed is virtually impossible nowadays, because power consumption and heat generation would be immense, and there is also the physical limitation of signal propagation and contamination delay. The most known technique is probably multi-core CPUs, which are nowadays standard on desktops and even mobile devices. By duplicating the CPU, the throughput is, in theory, duplicated as well, although it requires that the executed program can benefit from multiple cores. However, not everything can benefit from multithreading for various reasons, so maximizing the throughput even for single threaded programs is still desirable. One of the many ways to do this, is by duplicating some of the functional units on a single core, for example by introducing a second ALU, instead of having just one. This way the CPU can achieve so called instruction level parallelism (ILP), by fetching and decoding multiple independent instructions at the same time and dispatching them to the redundant functional units. In the best case this means that the CPU can complete two or more instructions in the time that it would usually take to complete one, depending on the program flow and which and how many functional units are made redundant. A CPU with such an architecture is called a superscalar CPU, and such an architecture is used in virtually every CPU made in the last decade (it's safe to assume that every desktop CPU is a superscalar CPU).
Data dependency
One of the biggest challenges with superscalar CPUs is keeping the functional units saturated at all times, because the moment it becomes impossible for the CPU to parallelize instructions, the throughput of completed instructions will drop significantly. Although there are ways to deal with an unsaturated CPU, for example by using the unused resources for speculative execution to predict the future program flow, it is often more desirable to keep the CPU saturated with the execution of the actual program, since speculative execution may waste resources by doing unneeded calculations, which in return rise the power consumption without any performance benefit. However, not everything can be parallelized, because an instruction may impose certain dependencies on a previous instruction, for example by using the result of a previous calculation. A simple example for this is the following pseudo code:
r1 = r1 + r2 // Add register r1 and r2 and store the result in r1
r1 = r1 + r3 // Add stored result in r1 and r3 and store the result in r1
In this case it's impossible for the CPU to parallelize the two instructions, because the second instruction depends on the result of the first one in a so called data hazard. Data hazards are problems that occur when two or more instructions access the same location (either by reading from or writing to it), and require the CPU to execute the program in order to avoid race conditions.
There are three possible types of data hazards that the CPU has to take care of in order to avoid race conditions:
Read after writeorRAW, demonstrated by the example above, requires that a read from a location returns the result of the last write to that location in the program order.RAWis a true dependency, meaning that the CPU has to execute the program in order.Write after readorWAR, requires that a value read from a location must not return the result of a future write.WARdependencies are also known as false dependencies.Write after writeorWAW, requires that, when multiple instructions write into a location, the location must hold the result of the last write.WAWdependencies are also known as output dependencies.
A huge problem with data hazards is the fact that x86 has only 8 general purpose registers (some of them not even being general purpose at all). Even x86-64 is limited to 16 registers, so the compiler has to reuse registers and thus introduce data dependency along the way.
There have been efforts of solving these problems offline using the compiler, by introducing more registers and giving the compiler better ways of telling the CPU about the program flow (for example the Itanium architecture has 128 general purpose registers and allows the compiler to do speculative execution, branch prediction and register renaming), but these efforts are limited to the high-end supercomputer and server market and introduce their own problems.
Register renaming
WAR and WAW dependencies can be resolved by using register renaming, a technique that allows the CPU to rename a register at runtime. To make this a bit easier to understand, try to not think of a register as a fixed location but merely a pointer to some location. The location the register names resolves to is decided by the CPU, and can differ from instruction to instruction, by "renaming" the register, the CPU changes the location the register name resolves to. Here is an example of a WAR dependency:
r4 = r1 + r2 // Add register r1 and r2 and store it in r4
r2 = r1 + r3 // Add register r1 and r2 and store it in r2
[r4] = r2 // Store the result of register r2 into the memory location pointed to by r4
As you can see, the first instructions reads from r2 while the second instruction writes to r2. Without register renaming, the CPU has to wait for the first instruction to complete before it can continue with the second and third instruction, because the first instruction depends on register r2, which would be overwritten by the second instruction. This is why WAR dependencies are called false dependencies, the dependency isn't the actual result of the previous instruction, but only its location. With register renaming however, the CPU can rename the r2 register in the second instruction, allowing the first two instructions to be computed in parallel:
r4 = r1 + r2
r2(r5) = r1 + r3
[r4] = r2(r5)
In this example, the CPU has remapped the location of the register r2 to the location of register r5 (indicated by the r5 in parentheses). Note that this is completely invisible to the executed program! The program itself is never modified in any way and for all intents and purposes it looks like the CPU is reading and writing to and from the register r2. Register renaming is always invisible to the executed program(s), if you stop your program in a debugger and look at the content of the registers, you are guaranteed to see the expected result (that is, if you stop it between the second and third instruction of the example program above, you will see the result of r1 + r3 in the r2 register).
Now, you may have noticed that I wrote earlier that the problem was that there are not enough registers available, so renaming one general purpose register to another general purpose register isn't going to work, because the program is most likely using all registers already. That's why there is a distinction between architectural and physical registers. The architectural registers are the ones provided by the instruction set of the CPU (for x86 this means the eax, ebx, ecx, edx, ebp, esp, edi and esi registers), and these are the only registers that you can visibly access from within your program. However, a CPU that uses register renaming doesn't actually offer the architectural registers but instead uses a set of physical registers (it usually has somewhere between 64 and 256 physical registers) to which the architectural registers resolve to. Each architectural register can resolve to an arbitrary physical register, and the CPU can remap architectural registers to physical registers for any instruction.
Here is the above example program again, this time with the physical register the architectural register resolves to written in parentheses:
r4(p4) = r1(p1) + r2(p2)
r2(p5) = r1(p1) + r3(p3)
[r4(p4)] = r2(p5)
When the CPU decodes an instruction, it does an implicit dependency check of the instruction against every pipeline that contains a previous instruction of the program stream. There are three possible outcomes of that dependency check:
- 1) The instruction doesn't depend on a previous instruction
- 2) The instruction depends on a previous instruction, but the dependency is either a
WARorWAWdependency and can be resolved using register renaming. - 3) The instruction depends on a previous instruction, but the dependency can't be resolved (eg a
RAWdependency).
In the first two cases, the CPU will hand the instruction over to the execution unit (after renaming the register in the second case), however, in the third case the CPU has to stall the pipeline until the dependency is resolved.
Like already mentioned, register renaming can also be used to resolve WAW dependencies, for example in this program:
r1 = r5 + r3
r1 = r2 + r4
By simply renaming the r1 register, the CPU can execute both instructions in parallel, without having to wait for the first instruction to complete before it can write the result of the second instruction.
(Okay, I lied, in this example the CPU can simply discard the first instruction since the result is never used, but for the sake of simplicity, let's assume that the first instruction either alters the state of one of the control flags in a way that isn't overwritten by the second instruction or that there is another instruction in between that depends on the result of r1).
Another advantage of register renaming is that it allows the CPU to execute certain types of loops in parallel, which would normally be impossible because the registers are reused in every iteration of the loop. However, register renaming can't resolve RAW dependencies, like the example posted in the Data dependency section.
Wrap up
Like already mentioned, this blogpost simplifies a lot things! For example the physical registers are much more complex and usually consist out of a list of integer and floating point registers (instead of one unified set of registers like I made it seem). If you are interested in these kinds of things, you should read this paper, which discusses this in much deeper detail and also covers a few more things.
Now that Firedrake 0.3.0 is released, I would like to write a few blogposts about its internal workings. In this blogpost I'm going to focus on the Kernel itself, leaving out things like libkernel and libio, but I want to come back to them in another blogpost and talk about them in detail as well. However, today I'm going to focus on what happens from the time that the machine starts until the end of the boot process, so a timespan of roughly 20 milliseconds (and there is a lot of stuff going on in this time).
Bootstrapping
When you power up your PC (or your emulator, it doesn't really matter), the first thing that's going to start is GRUB. Firedrake uses GRUB as bootloader because it makes certain things of the bootstrapping process much easier, for example normally when you start your system, you start in the legacy real mode and without any information about the system you are running on. GRUB gathers informations about the system and also switches into the protected mode, before loading the kernel into memory and passing execution to it. Now, after GRUB is done doing its magic, Firedrake is able to execute its first few instructions which can be found in /sys/bootstrap/bootstrap.S. For reference, here is the content of the first function that executes:
ENTRY(_start)
movl $stack_top, %esp
pushl %ebx
call sys_boot
jmp .
Now that's a bit disappointing, isn't it? ENTRY() is just a macro defined in /sys/asm.h which makes sure that the label is visible and has proper alignment. Otherwise, there isn't that much going on, we switch stacks so that there is a valid one to operate on and then we push the content of the ebx register onto the stack. Ebx contains a pointer to the multiboot header structure and is passed to the high level bootstrapping function sys_boot, which is called right afterwards. Just in case that we ever leave sys_boot, something that is not supposed to happen, we will just go into an infinite loop.
So the next point in the booting process is the sys_boot function, which isn't written in assembler anymore, but plain old C and which can be found in /sys/bootstrap/boot.c. The first thing that function does is clearing the video output from any garbage and then printing a happy welcome message (the good old "Here be dragons!", alongside with some version information and the kernels compilation date).
(This is how Firedrake looks like at this point of the bootstrapping process)
Now the real meat happens as Firedrake begins to initialize all its modules, each one providing one specific set of features to the kernel. There are eight modules as of now, so let's go over every single one of them! One bit of clarification though, while I will say the word module quite a few times, they are just a concept and not actually separate modules! Each module is part of Firedrakes source code and binary, the functions are just grouped in a more or less logical way and each group is called a module, with most of the modules providing a single initialization function that's called by sys_boot.
Memory
The first thing that Firedrake takes care of is memory. It's the one resource that is needed throughout the whole lifespan of the kernel, but as of now, it's not available to us. The only memory we have at hand is the stack and the global variables, which are part of Firedrakes process image. However, we need to make dynamic allocations, so that's why Firedrake is going to take care of memory right away.
Physical memory
First thing we need is physical memory, because at the end of the day, all memory operations need to be backed by physical memory. /sys/memory/pmemory.c is responsible for all this, it will iterate over the memory maps that are provided by GRUB and mark every page as free that is available (while in theory it's possible that the whole 4gb address range is backed by physical memory, it's more likely that there is a lower half and an upper half of available memory). Of course, now every page that is available is marked as free, which also means that we have marked the kernel, the stack and everything else as free. This is corrected in the next step, where Firedrake will mark every page that is already in use as used. It will also allocate pages for the kernel page directory and the trampoline area (I'm going to talk about both of these in a bit), which are both expected at certain physical addresses. So to make sure that they aren't allocated before the responsible modules have a chance to claim the memory, the physical memory manager is going to mark them as used.
Virtual memory
The next thing that Firedrake is taking care off, is virtual memory. Virtual memory is an essential component in modern operating systems, and while I wouldn't call Firedrake either modern or an operating system, it certainly requires virtual memory. The virtual memory is initialized in /sys/memory/vmemeory.c and is at this point rather simple; First, Firedrake is going to forge the initial kernel page directory, and then it maps the kernel, stack, video memory, and various multiboot related information 1:1 into the virtual address space. Once all this is done, Firedrake will enable the usage of virtual memory (by setting bit 31 of the cr0 register), and after this point direct usage of physical memory is going to fail.
Heap allocator
Now that physical and virtual memory are available to the system, we need one more memory related module; The heap allocator. The problem is that as of now, we are only able to allocate memory in page sized chunks, ie we need to allocate the memory in multiples of 4k, which is rather inconvenient given that most kernel structures are only a few bytes in size and we rarely need to allocate more than 200 bytes at a time. To avoid wasting a lot of memory, we need something that is able to allocate memory in finer granularities and that is exactly what the heap allocator does. The code is inside /sys/memory/heap.c, but the actual initialization routine isn't anything exciting, it just creates the default heap by calling heap_create().
I don't want to go into too much details about the inner workings of heaps, but what they basically do is to allocate memory out of so called zones. Each zone spans over at least one page of memory and keeps track of the allocations and free space inside of it. There are different zone types for differently sized allocations to minimize the wasted memory per allocation, and each heap can create new zones or purge old ones when needed.
Interrupts
After memory, the most essential bit required is proper interrupt handling. Like most modern kernels, so is Firedrake an interrupt driven kernel and thus it makes sense to enable interrupts as soon as possible. One could argue that it makes more sense to enable interrupts at the very beginning, instead of starting with memory, so that early memory bugs can be caught (instead of now where they would take down the whole system), but Firedrakes interrupt handling mechanism requires memory in order to work, so interrupts can't be enabled until after memory is working. The entry point for the interrupt initialization is inside /sys/interrupts/interrupts.c and the first thing that the function does is setting up some basic interrupt and exception handlers. Remark though, the ir_setInterruptHandler() function doesn't alter anything inside the IDT but just stores the function pointer into an internal array. When the internal interrupt handler is called, it will use the interrupt vector to lookup the actual handler in this array.
Trampolines
After the lookup array is populated, the ir_trampoline_init() function inside /sys/interrupts/trampoline.c is called. This function does most of the actual interrupt setup process, but before I dive into it, let me explain what the actual purpose of this is. Most™ kernels are so called higher half kernels, which means that the kernel resides in the upper half (usually just the upper 1gb) of every virtual address space. This is done because when there is an interrupt, the CPU will automatically switch execution to a predefined interrupt handler that must be visible in the current virtual address space, otherwise there would be just another interrupt (a page fault exception to be exact), which couldn't be handled either resulting in a double and then a triple fault, taking down the whole machine. This and the fact that the interrupt handlers are part of the kernel are the reasons for higher half kernels, because then the interrupt handlers are always visible and interrupt handling is no problem.
However, Firedrake doesn't work this way. The kernel resides in its own virtual address space, and each program that is executed gets it own virtual address space as well, completely for itself. Naturally this is a problem for interrupt handling, which Firedrake solves by using a trampoline area. The trampoline area is a small, 2 page sized area starting at address 0xFFAFF000 of every virtual address space, it contains a few crucial structures like the GDT, IDT and TSS as well as a basic set of interrupt handlers. These interrupt handlers are used for user space/kernel space boundary crossing, if you are interested in the code, take a look into /sys/interrupts/idt.S. What it does, roughly speaking, is first pushing the required information onto the stack (that is, the interrupt number, the current state of the eight general purpose registers, the ds, es, fs and gs segment registers) and then switching to the kernels virtual address space by loading the kernels page directory. The stack that the function operates on is visible in both address spaces (the one the interrupt occurred in and the kernels), so it's no problem to simply switch the address space. Now that the kernels virtual address space is visible, we can safely call the high level interrupt handler (ir_handleInterrupt) that resides inside the kernels address space. On the way out of the lower level interrupt handler, the process is reversed with one additional step; If there was a context switch (ie a new thread got scheduled by the interrupt handler), then the function will switch to a new stack before switching to the virtual address space of the currently scheduled thread.
Now that the theory of trampolines is clear, let's go back to the ir_trampoline_init() function. The first thing it does is mapping the trampoline structure into the virtual address space of the kernel. Like I already mentioned, the address of the trampoline area is hardcoded to 0xFFAFF000, and it is mapped into every virtual address space (normally this is done when the address space is created, but the kernels address space isn't created normally and thus we need to map the trampoline area into it at this point). Then, we have to move all interrupt handlers into the trampoline area, so that they are actually callable. This is done via a simple memcpy(), however, there is one last thing we need to do to make all of this work; The interrupt handler (idt_entry_handler) makes a call to the high level interrupt handler (ir_handleInterrupt), this call is an indirect call via an offset to the function, but since we have moved the function from it's normal address to the trampoline area, we need to fix this offset. ir_trampolineFixEntryCall is responsible for this.
IDT, GDT and the rest
So, we are almost done with setting up interrupts. The next thing that we are going to do is to setup the IDT (interrupt descriptor table) using the ir_idt_init function, which is responsible for telling the IDT where our interrupt handlers are located in memory. Then we make a call to gdt_init to setup the global descriptor table, which contains the code and data segments for Ring 0 and Ring 3 (the whole 4gb address range is marked here). The very last step is initializing the PIC (programmable interrupt controller) and then we are finally done setting up interrupts.
Time keeping
The next step in the boot process is setting up the programmable interrupt timer which is used to keep track of time. The PIT is set up to fire an interrupt roughly every 1000Hz (aka every millisecond), and the interrupts are later used to drive the scheduler as well as the time keeping code. The reason the timer is initialized so early in the boot process is because it is an asynchronous operation that takes some time to complete, so we kick it off as early as possible to avoid waiting for the timer to get fully operational at the end of the booting process. What's happening inside /sys/system/time.c is actually really simple, we initialize the PIT, set some CMOS flags which allow us to read the current time later on and then we set a temporary interrupt handler for the timer interrupt (which is later replaced once we got the initial boot time read from the CMOS RTC).
Scheduler
Next up on the list is getting the scheduler ready, which happens inside /sys/scheduler/scheduler.c. The initialization code will first forge the initial kernel process, which is a custom built process that inherits most of its properties from the current state of the system, the process never receives it's own entry point or "starts", because essentially it's already running and just a wrapper for the current thread of execution. The scheduler will also set up an interrupt handler for the yield command (interrupt vector 0x31), and it will then enable interrupts. Even though interrupts have already been initialized, they haven't been enabled yet because the scheduler wasn't ready to schedule anything up until this point, and any interrupt could have been fatal. Now that we have the initial process in place, we can turn interrupts on and be sure that we can continue execution of the boot process. This is also the point where any interrupt and exception that occurred so far will catch up with the system, so it might very well be that this is also the point where Firedrake goes down (it shouldn't though, if this happens for you, please file a bug report!).
There is one last thing that the sd_init function is going to do; Patching the spinlock code. Because spinlocks are used before threads can be yield using int $0x31, the intial spinlock code looks like this:
ENTRY(spinlock_lock)
pushl %edi
movl 0x8(%esp), %edi
movb $0x1, %cl
jmp spinlock_tryObtain
spinlock_wait:
nop
nop
spinlock_tryObtain:
xorb %al, %al
lock cmpxchgb %cl, (%edi)
jne spinlock_wait
popl %edi
ret
Notice the two nops in the waiting branch, now that we are able to yield the currently executing thread, the scheduler will replace them with the bytecode for int $0x31. It isn't really necessary to start with the two nops though, since up until now (and for a bit longer), there is only one thread running, so unless it deadlocks itself, the wait branch is never executed. However, just in case it deadlocks itself, the two nops will avoid that the system crashes (and thus it will just hang and hopefully display useful information on the screen).
Syscalls
Memory, interrupts, time and scheduler, what else could we possibly need? Well, means for user space programs to communicate with the kernel can't harm, can they? No, so that's why we are going to initialize the syscalls in the next step. This happens inside /sys/syscall/syscall.c and it set ups the handler for the syscall interrupt (interrupt vector 0x80), and then it will install all syscall handlers. Simple as that, not much else to say.
ioglue
Last, but not least, comes ioglue. Ioglue is the runtime dynamic link-editor of Firedrake, which allows it to load dynamic libraries into the kernel space. It's part of the kernels driver framework, and it has its entry inside /sys/ioglue/iostore.c. I don't actually want to go into too much detail here, because there will be another blogpost dedicated to ioglue and the driver framework in general, so just the basic concept of ioglue: It set ups the iostore, which is responsible for storing references to all loaded libraries (it's implemented as an Andersson tree). The next thing it does is to load the two essential kernel libraries (libkernel.so and libio.so) into the kernel, where they both are first relocated and then linked with each other and the kernel. Then it will initialize libio, a process which does an awful lot of stuff in the background and which I will discuss at a later point.
Outro
And then, we are done. The kernel is completely booted and ready to work, which it does by handing execution over to the kernel daemon (which can be found in /sys/kerneld/kerneld.c). The kernel daemon also does a bit of bootstrapping to get some more things up and running, but I won't go into any details of these at this point. Here is how Firedrake looks like once it finished the boot process:
Mentioned repositories:
Firedrake 0.3 release6 months ago
Almost exactly 4 months after the 0.2.1 release and exactly one year after I started working on Firedrake, I can finally present you Firedrake 0.3.0, nicknamed Jormungand. It's a huge update, there have been over 70 commits and the code size increased from 5131 to 15310 lines (actual code, excluding blank lines and comments)! Almost nothing has been left untouched and I did a lot of refactoring of old code which makes Firedrake much faster and more stable than the previous versions. I've already talked about some of the changes in Firedrake 0.3.0, so I won't talk about them again and instead talk a bit about some of the other stuff that made it into this release.
libio
Wait, I already talked about libio, didn't I? Yes, but it got some massive changes which I want to talk about. The first one is that libio got split up into two libraries; libkernel and libio. libkernel is a lower level C interface which doesn't add as much abstraction as libio does, but instead focuses on exporting raw kernel functionality and providing a thin layer over a few things. libio sits on top of libkernel and provides a wide range of functionality and abstractions to make driver development easier. There is a bit of documentation available for libio, which I'm going to heavily extend in the near future. Besides of that, if you are interested in playing around with libio, you should seriously take a look into /libkernel/libPCI and /libkernel/libRTL8139, which I used to test most of libios functionality (and which provide some insight into how it's supposed to work).
Thanks to the dynamic nature of libio (there is a shadow class for every libio class, which allows for dynamic class lookup and instantiation), it's now also possible for services (the backbone controller classes that provide the actual driver functionality) to find other services which match a set of certain characteristics and publish them at runtime. To give a more concrete explanation of this; The libPCI library consists out of two classes, PCIProvider and PCIDevice. The PCIProvider class scans the bus for available PCI devices and then creates a new PCIDevice instance for each found device. PCIDevice provides some basic abstraction for the actual PCI device and asks the libio runtime for a matching service for that particular device (the characteristics to match are the vendor id, device id, class and subclass of the PCI device). If a service is found, it will be instantiated and published by the PCIDevice, and the service becomes the controller of the device. An example for this is the RTL8139Controller class inside libRTL8139, which provides a service that matches the Realtek 8139 ethernet controller.
I'm planning to publish a more in depth post about libio, libkernel and ioglue (the in-kernel dynamic link editor), alongside with documentation and reference guides for various concepts of the driver interface.
Memory Management
Firedrake now provides two new ways to work with memory, the first one is an abstraction for DMA (/sys/memory/dma.c|h) which is also exported into libkernel and libio (via the IODMARequest class). It provides a simple way to request pages of memory, and also allows for non contiguous physical memory pages. It will try to use as less fragments as possible (unless you explicitly tell it to use contiguous memory pages, in which case it won't fragment the memory at all), and will then map all fragments as one contiguous block into the virtual memory. Not much else to say here, simple and straight forward.
The second new thing isn't actually "new" but a rewrite of an existing feature: Heaps. Heaps are much more sophisticated now! A heap now works with zones as memory backend for allocations, each zone can serve memory for one class of allocation, where the class is dependent on the size of the allocation. This allows the heap the group likewise sized allocations together and waste less memory on tiny allocations (which make up the majority of allocations inside the kernel). Heaps can also grow indefinitely now (until you run out of actual memory), because they simply add a new zone if they can't find one that can serve the requested allocation (likewise, unused zones get purged and their memory can be reused).
Firedrake 0.3.x and 0.4.0
If you look into the commit history for Firedrake 0.3.0, you will see that the number of commits increased significantly. Instead of creating one large commit for most of the changes, and working on this one commit for a long time, I created lots of small commits for Firedrake 0.3.0 and pushed them regularly. I intend to keep it this way in the future, and I also want to keep the way branches are organized right now. This means that the Firedrake 0.4.0 development will happen in the unstable branch, while patches for 0.3.x will be developed on a new branch called development (the master branch will be merged from time to time with the development branch, once the changes in there are stable enough).
As implied, I intend to develop Firedrake 0.4.0 alongside of 0.3.x. While 0.3.x will contain only patches and no new features, Firedrake 0.4.0 is going to see some major new additions like a virtual file system layer, a userspace dynamic linker as well as a partial standard C library implementation for the userspace. However, these changes are going to be split up into multiple point releases, so there won't be one big 0.4.0 release but multiple small 0.4.x releases with one larger feature per release.
Mentioned repositories:
Firedrake 0.37 months ago
Let's talk about Firedrake and the upcoming 0.3 release (nicknamed Jormungand). Although it's not completely finished yet, there already is a tremendous amount of changes that I'm really proud of (to put it into numbers, so far there have been made 8839 additions, adding up to a total of 18615 lines of code). Of course there are the usual suspects like the memory manager, scheduler and interrupt controller, but there are also lots of other changes and additions that made it into the repository so far.
Want to get your hands on the changes made so far? Head over to the Firedrake repository and checkout commit 00ccd5c84b!
ioglue and libio
The major new feature of Firedrake at this point is, without doubt, the in-kernel runtime link editor. It's able to load, relocate and link dynamic ELF binaries, inside of the kernel. As of now the the relocation isn't done lazily but as soon as the dynamic linker is invoked, however, the final release will contain lazy relocation through the PLT and GOT. Keep in mind though that ioglue is not a replacement for ld.so, it works inside the kernel only and loads libraries directly into the kernel space where they are also executed!
So, why exactly would one want to inject code into a running kernel (I mean, seriously, those things are fragile enough already)? Obviously to extend the kernel with new functionality and drivers. And that's exactly where libio kicks in. Libio is the API Firedrake provides to extend the kernel, it's written in C++ and if you are familiar with OS X you will see that it shares some ideas with IOKit and the KPI. Libio provides a couple of things that aren't normally found in a C++ environment and there are also things it doesn't provide which you are probably used to; No RTTI, no standard library, no exceptions. The goal is to provide a clean and object oriented interface to the kernel without much of the C++ bloat. The memory model is also different, for the most parts it's a reference counting environment you get there (without the use of shared pointers). There are very few exceptions to this (namely two classes, IODatabase and IOSymbol), but they are usually not encountered when working with libio, so it's safe to say that libio provides a reference counting environment.
Additionally you get a lot of abstraction from libio, for example there are three container classes (IOArray, IODictionary and IOSet), and in the final release probably more (planned are counted sets and andersson trees). It also provides you with means of locking and synchronizations (IOSpinlock and IORunLoop, mutexes and semaphores in the future), message and event dispatching, multithreading, symbol lookup and of course actual driver related stuff like interrupt handling.
Disclaimer
libio isn't finished yet. Not only is it's API subject to change, but some of the stuff I mentioned here isn't even in the repository yet.
Debugging and performance for 100
Debugging of kernels is a pain in the ass. Mostly because there is no underlying OS that catches you in case something goes wrong and then prints you a nice dump of everything the crashed process did. As of now you can use the syslog() function for printf style debugging, you can place hardware breakpoints and you can have the kernel dump a complete stacktrace of the current thread.
With Firedrake 0.3 you can still do all that, but you can also do more stuff.
One major change is the syslog implementation, up until now it wasn't multithreading safe, meaning that two or more threads were able to write to the video memory at the same time, resulting in strange artifacts (and crashes). Now it's completely multithreading safe, each message is queued inside a new process called syslogd which will then write out every message exactly in the order they were received. It also makes sure that every committed message is eventually written out, even if there is a kernel panic.
The stacktrace implementation has been extended quite a bit as well, for example you are now able to create stacktraces for threads different than the currently running one and it now also catches and symbolicates frames from libraries loaded via libio, which looks like this (the + 0x3000 is the relocation base of the library):
There is another new debug and performance feature called watchdogd inside Firedrake 0.3. It's a special process running inside the kernel which samples each kernel thread every 5 milliseconds and looks what they are currently doing. Then it puts each sample result in a counted set (okay, it's just a set, not a counted one…) and if needed, it prints a nicely formatted list with the 15 functions the thread spent the most time in. Example of the unit test thread:
As you can see, halloc() is a real performance hog, and if you wanted to improve performance, this would be the low hanging fruit to optimize. This is a very simple, yet efficient way to track down performance bottle necks. Keep in mind though that it's not a complete time based sampler, ie. it doesn't care about the call tree and just looks at the function the process is currently in!
New container types
Up until now, Firedrake only knew one type of container: Linked lists. While they work reasonably well for small things, they aren't the perfect data structure for every case. So in addition to greatly improved linked lists, Firedrake now also knows about Andersson Trees, Arrays and Hash Tables. All of them with their own set of unit tests to make sure that they work as intended. I also tried to make sure that all container have roughly the same API, ie. all of them follow the same naming conventions.
Here is some example code that shows arrays and lists:
// Search through the dependencies
array_t *dependencyTree = array_create();
array_addObject(dependencyTree, library->dependencies);
for(size_t i=0; i<array_count(dependencyTree); i++)
{
list_t *dependencies = array_objectAtIndex(dependencyTree, i);
struct io_dependency_s *dependency = list_first(dependencies);
while(dependency)
{
symbol = io_librarySymbolWithName(dependency->library, name, hash);
if(symbol)
{
*outLib = dependency->library;
array_destroy(dependencyTree);
return symbol;
}
if(list_count(dependency->library->dependencies) > 0)
array_addObject(dependencyTree, dependency->library->dependencies);
dependency = dependency->next;
}
}
array_destroy(dependencyTree);
There is also another new type called iterator_t, which can be used to iterate over the contents of a container. Currently only Andersson Trees can be enumerated using this API, but the final release will contain support for all container types. It's also trivial to add iterator support for other container types since the iterator API relies on callbacks instead of making assumptions about the container itself, so the actual implementations is part of the container.
Additional stuff and the future
Obviously those aren't all the changes that made it so far into the repository, for example there is also a new time module that is able to track time with millisecond accuracy, heaps are now mutable and can grow in size and much more. There are also lots and lots of bug fixes, for example a critical bug in the physical memory manager has been fixed, bugs in various libc functions are also fixed etc. This really is a huge release, not only when it comes to features but also the overall stability of the kernel. But hey, that's not all that's coming; There will also a complete virtual filesystem implementation that allows the creation of custom filesystems (and obviously they can be linked at runtime into the kernel, thanks to ioglue and libio).
There is also an incomplete libc implementation coming for the userland which will be used for a ld.so implementation, which will then allow the usage of dynamic libraries in third party userland programs. However, I'm not sure if these two will make it into 0.3 or if they are included in 0.4 since 0.3 is already such a huge release.
Mentioned repositories:
October update8 months ago
Jublog Update
Good news everyone, I finally found some time to update the software that runs this blog! It's still written in JavaScript, but this time said JavaScript runs on the server instead of the client. This means that the blog content is served as a static HTML site instead of being generated on the fly, which also means that less data has to be pulled of the server which not only reduces the used bandwith but also speeds up the time it takes to load the page. It also enables Google and other web crawlers to index the site properly since they usually tend to ignore client side JavaScript.
Sooo, how does the blog work now? Well, like I said, it's still JavaScript running here. only this time inside of a Node.js server. Articles are still written in Markdown, but this time stored inside of a MySQL database. Instead of a normal admin interface, the new blog uses a git repository as backend with post-receive hooks to fetch new articles once they are published. If you are interested in the raw data that feeds this blog, the repository is public on GitHub.
There is some stuff that still needs to be done, for example there is no searching and no RSS feed, and the code, quite honestly, is a mess. But I intend to clean the mess up and add some more features to the blog, and once this is done, I'll publish the source code on GitHub under the MIT license (not that the code would be useful to anyone, but hey, Open Source for the win!).
Firedrake Update
Well, tinkering with my blog isn't the only thing I've been up to in the last couple of months, I've also been working on Firedrake a bit. Most importantely on the dynamic link-editor which is now able to load, link and relocate most of the ELF dynamic libraries you throw at it. Apart from the dynamic link-editor I've also been working on the driver interface, however, that is currently a huge mess and I have no idea how to clean it. The most annyoing problem is that it's almost impossible to provide a generic interface which can be used by drivers for all sorts of hardware, for example PCI and USB work totally different… and let's not even talk about disk and network drivers. However, I hope to find a solution for that problem in the future.
Mentioned repository:
Firedrake 0.2 release10 months ago
I finally found some time to wrap up the Firedrake 0.2.0, nicknamed Sunrise Jade, release and push it to the public GitHub repository. There has been a crazy lot of changes, 5,242 additions and 1,206 deletions in 103 files (almost no file was unchanged!), and this blogpost is going to be an in-depth review of most of them. Just in case you are interested, the lines of code of just the kernel is now 6117, not counting empty lines.
If you haven't seen it yet, all Firedrake code is now available on GitHub, of course still under the MIT license, so feel free to visit the repository and clone, fork and break it! I will also accept pull request.
Also, I would like to change how the development is done, up until now I always made a huge number of changes and then pushed them at once to make sure that the pushed stuff builds cleanly and works stable (okay, stable is a strange word when you speak about an in-development Kernel, but you get the idea…). This is still going to be the case for the master branch, however, I plan to add a unstable branch which will receive code changes much more often, but might not build at all or include some unclean stuff.
So much for the introduction, lets dive straight into the changes…
Memory
The memory management has changed. A lot. Most changes were made in the virtual memory system though, the physical memory manager has only been touched a bit. Firedrake supported virtual memory since its first release, but so far it only supported it for one address space, the one of the kernel, which is a bit unusual and not that useful. Now with Firedrake 0.2.0, this has changed, there is one 4gb virtual address space for the kernel which includes every little bit of the kernel, and then there is one 4gb address space per process, meaning that the each program can use the whole address space just for itself. How exactly this works out is described a bit later, but for now just assume that it does work.
So, the virtual memory manager is now able to craft new and empty page directories, keep in mind though that those are passed as physical addresses only! This means that before you can access any page directorys memory, you first have to map it into the memory of the currently active address space, which for the most parts is going to be the kernels address space. Afterwards you also have to map each page table into memory, which was a cause for a lot of bugs in the previous release which often just assumed that the memory was already mapped. These bugs are now fixed, the virtual memory manager correctly creates temporary mappings before accessing potentially unmapped memory.
Together with those fixes came a few additions, for example all mapping related functions now take an extra argument for an lower limit, which makes sure that the mapping doesn't happen at an address lower than the limit. This is especially useful for things like stacks which are expected to be at an high address, but its also used in a hacky way for the kernels new mmap() implementation. There are also some other useful additions, for example like the addition of two sided vm_alloc function which allows to allocate pages in both the kernels page directory and an arbitrary other page directory, the functions makes sure that the returned address is the same for both page directores which is useful for things like the kernel stack which must be visible at the same address in both the kernels address space and the users address space.
Last but not least, there is also a new way to allocate memory in the kernel itself. Previously it was only possible to allocate memory in 4k chunks via the kalloc() function, this is now history. There is a new heap based memory allocator which allocates a larger chunk of memory at once and then allocates smaller chunks inside of it when requested. This way, no space is wasted if you only need space for a few bytes. You can create arbitrary heaps, but currently there is only one created with 10mb. Allocations can be done via the new halloc() and hrealloc() functions, which work essentially the same as malloc() and realloc(), but take an additional argument for the heap to use (if you pass NULL, the kernels default heap is used).
Interrupts
Lets talk about interrupts, they are essential for any kernel to work properly and have been completely rewritten in this release. The reason for this is simple, whenever an interrupt occurs, be it through hardware or software, the CPU automatically switches control over to the kernel to handle it. This introduces some problems in Firedrake, because every process has its own 4gb address space and the kernels code is not visible in it, so if there was an interrupt and the CPU would jump to the interrupt handler which of course isn't mapped into the currently active address space which would force the CPU to rise an page fault, which is also handled by the interrupt handler, which results in another page fault which makes the CPU to simply give up and halt. Not exactly the desired behavior, so Firedrake uses a trampoline area which only purpose it is to switch to the kernels address space and then call the high level kernel interrupt handler. This small area, which only consists out of 2 pages, is mapped into each and every page directory at the very same address; 0xFFAFF000.
The trampoline area includes the new interrupt entry points, so if an interrupt occurs, the CPU can jump to the trampoline area and start executing there. The entry points than make sure that the needed state information are pushed on the stack, this includes the eight general purpose registers and the four segment registers plus the information that the CPU pushes automatically (EFLAGS, EIP, ESP, cs and ss registers). Then the handler loads the kernels segment registers and finally loads the kernels address space before handing control over to the kernels high level interrupt handler. Afterwards the same process is reversed and the address space is switched to the one of the currently scheduled process. All this stuff is implemented inside Kernel/interrupts/idt.S, so feel free to take a look.
Of course there is a reason why this isn't done in all kernels (although there are a few), normally the virtual address space of a process is split 3:1 or similar, and the kernel is mapped into the upper half, which is useful because this trampoline stuff doesn't have to be implemented and the address space doesn't have to be switched for every interrupt. Switching the address space is kinda expensive, because it trashes the TLB of the MMU which in return means that every memory access after such a switch hits a cold cache (and in the worst case the MMU has to make up to 24 memory accesses to resolve the linear address). However, Firedrake is not a professional kernel, so those minor performance hits are completly acceptable. Besides, even kernels like XNU (used by Mac OS X and Darwin) use things like this (only the 32bit kernel though, the newer 64bit kernel uses the more convinient approach).
Another thing probably worth mentioning is that the kernel now first initializes the memory manager before creating anything needed for interrupts to work, because they now need access to the virtual memory system to have it map the trampoline area. So if you change the init routine of either of the three memory managers (physical, virtual or heap), keep in mind that interrupts are delivered later and if something really goes bad the CPU might even halt completly and refuse to do its job. This can be tricky do debug, so make sure you know what you are doing!
Syscalls
With the new interrupt handler and the new virtual memory stuff comes a new syscall handler. Previously it was really easy to handle syscalls because everything was in one address space and every memory bit could be accessed without any problems. Now, the syscall entry point has to do some extra steps before it can call the actual handler of the syscall, for example it has to map the stack of the process temporaly into memory so that the kernel can access the arguments that were pushed (and most syscall handlers expect some arguments). Also implemented now is errno, which is passed through the ecx register to the process.
However, if you implement a syscall handler, chances are that you also need to map some memory around, for example the print syscall handler gets passed a pointer to a string that should be print and of course this pointer is inside the process' address space. So it first hast to resolve the address into the physcial address and then map the chunk of memory into the kernel before it can access it. And of course at the end the mapping should be reversed because otherwise the kernels virtual address space fills up pretty fast.
Debugging
Up until now, the only way to debug Firedrake was through syslog(). That works for small things, but it doesn't for various other problems, and in the case of exceptions and a kernel panic up until now you had to either use an emulator to know what caused the exception or throw in some random syslog() calls in the hope you could track it down this way. This has changed, if Firedrake comes down now, it prints a nice callstack along witht he panic message, which lists the calls that were made inside the kernel which lead to the problem. This becomes really useful if you have to debug a problem in the kernel itself (otherwise you might only see that the kernel was entered through the interrupt handler, but nothing before because that happened in another process), however, there is one thing you need to keep in mind if you are watching a backtrace; If there is any interrupt that lead to the crash, the callstack is not 100% accurate. This is, because the function that actually caused the interrupt is not visible in the callstack! Keep that in mind when debugging by using the callstack which is automatically dumped when there is a kernel panic.
Another way to debug memory problems are hardware watchpoints. The x86 architecture allows you to define up to four watchpoints for any linear address, you can set some conditions for the watchpoint as well, for example "it must be a read access and happen anywhere from the address to the address + 4 bytes". Whenever the condition is met, the CPU will call the interrupt handler which will dump the current callstack and then resume operation. Look into Kernel/system/kernel.h to see how this stuff works and what conditions you can set. However, there is a small caveat when using this; Local watchpoints. A local watchpoint is automatically disabled once an interrupt handler is called, but not reenabled when it left. In theory you should be able to set a few watchpoints per process and have Firedrake enable them as local watchpoints for you when the process gets scheduled so that you can track memory bugs for one process. This however, is not yet implemented, so you should always use a global watchpoint and keep that in mind.
Whats next?
So, this covers the four really big topics I wanted to cover, of course there are many more things that changed, for example that the kernel is able to load ELF binaries now or that it can run unit tests. However, I won't go into any greater detail about them right now, because they aren't that important or just so small that its not worth to mention them. Feel free to read the changelog instead which lists far more changes and additions, although not in a great detail (and some changes aren't mentioned at all because they are way too minor).
But, whats next for Firedrake? I already said it a few times, I will work on Firedrake as long as I have fun doing this, without any greater goal but just implement more and more stuff. And I still have fun doing this, so here is what I have planned for the near future, in no order; A driver interface (which was originally planned for this release), a dynamic linker (which will run in a programs address space, basically like a simple libexec), some kind of shell and much more syscalls so that programs can actually do useful stuff.
Mentioned repository:
Why so serious?a year ago
This blogpost is about something that bugged me for quite some time and recently I stumbled about this again and thus decided to write this post. I'm talking about the AAA wannabe-ism that mostly catches young indie game developers.
AAA is not the holy grail!
So, what exactly am I talking about? I'm talking about the apparent urge of young game developers to copy bad ideas of big studios or companies just so their games look AAA-like. This already starts with small and little things, eg. the nightmare of UX concepts; A launcher. They have no use case whatsoever, because everything they offer could also be done ingame, the only use they have is to annoy the user. I don't want to view your fucking website when I double click awesomegame.exe (or .app) and I don't want to enter the options (something I need to do only once if ever!), so stay the hell away from me with this launcher crap! But hey, big titles like Mass Effect or Skyrim have them, they can't be that bad, can they? Oh by the way, did you know what they also have? Longcat like long loading screens for tiny levels and here is a small anecdote taken from my youth (aka just a few years ago); I was in the process of working on my first bigger project called Die Händler and one day I figured that my levels loaded way too fast compared to big games. But my project was a big game, it was epic, it was awesome, why was it so fast? I decided to pause the level loading for just five seconds to make it look like it was doing something really incredible complex (while it just wasted CPU cycles like a boss). A horrible idea in retrospect, ripping out something that should be considered awesome, fast level loading, to look like the standard bloatware from the shelf.
If only this was everything a indie developer could copy, but the worst part is this Shader fetish most people developed. If it doesn't look like the most artificial plastic, its considered not photorealistic and crappy looking. This starts with glossy looking bump mapping on every surface (most of the times with a non fitting bump map generated from a low resolution texture) and ends with post processing effects that just vomit into your face (depth of field that defines depth as anything further than 5cm away from your face for example). Why are you doing this? Why are you trying to drown everything in shaders? It almost looks like people forget that a good atmosphere isn't the same as great graphics, but most indie titles with superb graphics have a horrible atmosphere. Luckily there are great projects which are doing it right, Amnesia: The dark descent for example, or Superku. Those projects use shaders too (its impossible to not use them), but they use them to create a great atmosphere and not try to just add everything that the next best AAA title has. So don't get me wrong, shaders per se aren't bad! Depth of field can be a really great effect, as long as its not completely shoved into my face. But don't think that more equals better!
You are indie!
I could continue this post for quite some time, listing only the stuff that I think isn't worth copying, but I guess you too know that copying isn't the best idea. So instead, let me encourage you; Be different! You are indie for gods sake, you are free to do whatever you want, experiment with the genres, styles, graphics and everything you want! Don't copy, impress by being unique and telling your story the way it is great for you and your game, not the way its done over and over again in the industry. This is your chance to do something very impressive, it became unbeliveable easy to get started with creating games, even if you never programmed before. Even if you don't have access to any assets, all you have to do is to be creative. But being creative requires to think out of the box, so please don't try to hide yourself inside the big AAA box.
Beside that, you mostly don't have the resources a big company has. You can't afford high polyogn models with photorealistic textures and everything and you certainly can't do it alone. Its impossible to compete alone against a large studio with up to hundreds of professionals, each one having one very specific task. As you don't copy this aspect, don't bother with copying just the bad stuff they do.
Jublog updatea year ago
So I decided to update Jublog, aka this blog, a bit. And by "a bit", I actually meant "completely". There were two major changes, the most visible one comes with this blogpost, the not so visible one is already active since a few weeks, it included a complete rewrite of the backend which now supports plugins and most importantly uses JSON instead of hard coded arrays (I still have no idea what hold me up from using JSON in the first place). However, the frontend changed too now, the most visible thing should be the sidebar on the right instead of the left. Other changes are the slightly changed article layout, the new navigation bar and the new background image (yay!).
I also made some other changes, but they aren't really worth mentioning (this whole post isn't really worth anything, but I figured that I should write something after a month without a post)...
Tags: jublog
