This post is a rant: a piece of opinion about decisions that have been made, rather than a technical how-to. It arises from a technical discussion on another thread, and is a reaction to technical info posted there by Wonko the Sane and by Steve. Before you visit that link, please let me say I am not cross with either of them: I appreciate their friendly intention to help solve the technical point.
Who am I cross with, then? Well, I am a little annoyed by the numerous kludge koders who produced this huge pile of nonsense that lies in the history of booting the IBM PC platform, and absolutely furious with those who feel we should be prevented outright from booting anything that the bootloader writers did not think of first.
But first, let's play two games.
Game 1: Let us imagine a bootloader that is impossibly universal. It will boot anything bootable, including my souvenir 1/2" mag tape that booted the 1980s Interdata 8/16e, a paper tape that booted the PDP-8, and the 8" floppy disk that booted the BAS micro. In fact it is so flexible that it will boot these even without the relevant hardware, using a universal emulator. So, all we have to do to implement a bay-door-locking HAL is to wave one of HAL's memory cartridges at it. Even one of the props from the Stanley Kubrick film will do; after all, the bootloader is supposed to be Universal, isn't it?
In a way such a machine is almost possible: Turing showed that any algorithmic computer can be emulated by a finite-sized (universal) Turing machine, with just the minor detail of providing an infinitely long memory tape. It is somewhat easier than you might think, as the tape need not be random access.
Oh yes, silly me, I forgot to mention one final tiny detail: you do have to write the code for the tape.
So what we learn from this is that no finite-sized "Universal" machine is truly Universal: depending on its programmers, it has a finite repertoire.
Game 2: Let us imagine a bootloader that cares for your spiritual welfare. It will only work when the Moon is in Aries (there is a good theoretical reason for this: how can your code be in the RAM if the Moon isn't?).
Or there could be religious versions that obey the special days appointed by the designated religion...
We would think there was something over-protective about putting that kind of restriction into a bootloader, wouldn't we? Unless it was an April Fool joke.
There are some decisions that really should be left to the person who owns the machine. It is a matter of debate how far that "should" extends, but whenever people design a bootloader they have some idea of the level of that "should" in mind, unconsciously if not consciously. And my phrase "the person who owns the machine" raises some philosophical questions too. "Should" the bootloader prevent a terrorist loading software that sets off a bomb? How does the software determine whether the owner is a "terrorist" or a legitimate member of the armed forces of [insert country here]?
So, what I want to do in this post is to make this subconscious "should" explicit. According to me, a bootloader *should* provide the means to boot *any* runnable software for the designated range of hardware.
Other people will have other views, and I fully expect contrasting opinions to appear lower down this thread. Those opinions, like mine, are neither "right" nor "wrong", but are different philosophical starting points, which we can discuss as friends and stay friends.
UEFI is an example of what I would call a "stage zero" bootloader: it is the very first code that runs when a machine is powered up. Its job, like that of the BIOS that preceded it, is to do certain preparatory hardware tests, then load and run the next stage in the boot process.
GRUB is an example of what I would call a stage two bootloader: its job is to select and load a particular operating system, and to run it with whatever options the operator selects on the day. It is not the job of a stage 2 bootloader to validate those options but simply to pass them on to the OS. If it tries to validate the options it will inevitably reject some as-yet-unimagined option that someone will want to add in the future.
Between stage zero and stage two is (guess what) stage one. A stage one bootloader is a small piece of code, on the disk, that represents the bridge between the firmware and the bootloader proper. Now, to explain an important point about stage1 loaders, I need to digress for a bit about non-universal stage2 loaders.
Not all bootloaders claim to be universal. NTLDR, for example, is quite unashamedly specific to a small number of proprietary operating systems. (If you don't know which, a clue about one of them is in the first two letters...) Equally, not all operating systems are designed with universal loading in mind. XP, for example, is designed to be loaded by NTLDR and only by NTLDR. When the Win7 installer sets up a system to offer a boot choice between XP and 7, it does so by booting into the Win 7 loader (BOOTMGR). If XP is selected, BOOTMGR "boots" into NTLDR. See what is happening here: one stage2 bootloader is booting into another. This is known in boot theory as "chaining".
When you have a twin-boot system offering both Linux and Windows, the most usual implementation is that you boot into a Linux-friendly stage2 such as Lilo (Linux Loader). Lilo does not really know how to load MS Windows, nor does it care. The programmers who wrote Lilo knew that if you have Windows you will have the Windows stage1 and stage2 bootloaders. So Lilo gives you the option to boot into those bootloaders, leaving them to finish the job. In the case of Lilo this is going above and beyond the call of duty, of course: Lilo is only specified to load Linux. The clue is in the name.
Grub* does the same trick. (I am using Grub* to mean legacy grub, grub4dos, and grub2 collectively.) Does this stop it claiming to be a Universal bootloader? No, not in this case. It handles the specific case of booting into Windows by co-opting a low-level loader that its authors know comes with Windows. If you use the syntax
chainloader +1
or
chainloader 0+1
you are chaining into the stage1 loader; if you use the syntax
chainloader /ntldr
or
chainloader /bootmgr
then you are chaining directly to the stage2 loader. No problem so far.
Because you can chain into a stage1 loader, you can also chain from one bootloader to another. You can chain from syslinux to grub4dos, for example, and vice versa. When you boot an .iso in Easy2boot AUTO, for example, the grub4dos stage2 code boots you into the stage1 code inside the .iso, which quite often is syslinux.
These are all desirable qualities in a flexible bootloader. I would say, they are essential qualities if the loader is to deserve the adjective "Universal".
Now to the problem. Grub2 (in the version currently shipping with Ubuntu 12.04) will not chainload into the partition boot sectors of grub4dos. Grub2 complains that this is not a valid disk, or that there is an "invalid signature". My guess is that this is a deliberate feature of Grub2. Now then, Wonko says something like "not so fast, River. How do you know that the grub4dos sector is a valid sector?". "Because it runs from other bootloaders," I say. "Aha! But perhaps those other bootloaders are running it even though it is illegal" seems to be the gist of Wonko's argument.
OK, maybe I got Wonko's idea correctly, maybe I misunderstood. Either way, this goes directly to the core of my rant.
A boot sector can appear at the start of a disk or at the start of a partition, and the two formats are different. But one thing has been constant ever since IBM first made the specification: the first 16 bits must be runnable code on an 8086 chip. As the next so many bytes contain data, this means that in practice every boot sector carrying such data starts with a real-mode short jump instruction (usually followed by a NOP) that branches to the beginning of what normal people would think of as the true code.
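To make the "jump over the data" idea concrete, here is a minimal Python sketch that decodes the opening jump of a 512-byte sector. It assumes only the two encodings commonly seen at offset 0 (EB rel8, the short jump, and E9 rel16, the near jump); the function names are hypothetical, and the sketch only reports where the jump lands, not whether the code found there is sensible.

```python
import struct
from typing import Optional

SECTOR_SIZE = 512  # assumption: classic 512-byte sectors


def decode_boot_jump(sector: bytes) -> Optional[int]:
    """Return the sector offset that the opening jump lands on, or None.

    Only two encodings are recognised (an assumption, not an exhaustive list):
      EB rel8   short jump, 2 bytes (usually followed by a NOP, 0x90)
      E9 rel16  near jump, 3 bytes
    The result is relative to the start of the sector and may fall outside it.
    """
    if len(sector) < 3:
        return None
    opcode = sector[0]
    if opcode == 0xEB:
        (rel,) = struct.unpack_from("<b", sector, 1)  # signed 8-bit displacement
        return 2 + rel  # counted from the end of the 2-byte instruction
    if opcode == 0xE9:
        (rel,) = struct.unpack_from("<h", sector, 1)  # signed 16-bit, little-endian
        return 3 + rel  # counted from the end of the 3-byte instruction
    return None  # not one of the jumps this sketch knows about


def jump_stays_in_sector(sector: bytes) -> bool:
    """True if the opening jump is recognised and lands inside the sector."""
    target = decode_boot_jump(sector)
    return target is not None and 0 <= target < SECTOR_SIZE


if __name__ == "__main__":
    fat16_style = bytes([0xEB, 0x3C, 0x90]) + bytes(SECTOR_SIZE - 3)
    print(hex(decode_boot_jump(fat16_style)))  # 0x3e
    print(jump_stays_in_sector(fat16_style))   # True
```

A sector starting EB 3C 90, as FAT12/FAT16 boot sectors typically do, lands at offset 0x3E, just past the BPB. Note that a sector starting with some other real-mode instruction simply returns None here; under the "universal" view argued below, that alone would not be a reason to refuse to run it.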
OK, the data it jumps over may conform to the original IBM specification, or not. In that sense it may be "legal" or not. The disk may or may not be bootable in the terms of the original IBM intentions.
But I am asserting here, as a bold opinion, that from the point of view of a UNIVERSAL bootloader, if the first sector starts with a legal jump instruction that jumps to runnable code (or indeed if it starts with some other real-mode code), then the job of the UNIVERSAL bootloader is to load that code and run it.
Validity checks are great if, and *only* if, the operator wants them. Fine, for safety's sake we may want the safety / sanity checks to be the default mode. Perhaps we even want to ensure that only permitted operators are allowed to relax the checks, by using an "unlock" code or whatever. But if you want to call it a UNIVERSAL bootloader, it must somehow be possible to run any software that it is physically possible to run on that hardware.
So, when Grub2 tells me there is an invalid signature I don't care whether Grub2 is correct or incorrect; I don't care whether the BPB is valid or not; I just want to be able to say to Grub2 "don't bother with the check, just run the thing". I know the code is runnable (under my definition) as it runs from other bootloaders. End of story, according to me: Grub2 is being over-protective, and if there is no override provided we should call it GROB (O=overprotective) or GREB (E=enumerated, as it only runs specifically enumerated systems) or GRIB (I=IBM, who first specified the standard).
And yes, it might be that if I poke the characters "5.1" into some part of the sector, Grub2 will then accept it. We can argue about whether grub4dos "should" have written that, and if so whether grub2 "should" look for it in the default case. It may or may not be that grub4dos has misaligned some of the fields in the data area, or put in values that say "I am illegal" (as when you see the error message "BPB sectors must not be zero"). In the practical world that kind of workaround is often, unfortunately, necessary. But if Grub2 wants to be seen as a UNIVERSAL loader, I still want some way to say "override that warning and run it anyway" without having to poke the data area.
Another way to look at this is to ask, do we expect chainloader to load *only* Microsoft systems, or should it chainload any arbitrary runnable code? I say the latter. Your opinion may vary.
Now there is another point related to this. The syntax +1 in the first chainloader command above is an abbreviation of 0+1, and the more general case is something like
chainloader 4468+2,4472+472,...
where the numbers are arbitrary offset+length pairs on a disk. Chainloader is documented as loading those arbitrary sectors and running them. Why does it make sense for it to do special checks on a particular data area? Does it do those checks on the boot sector regardless of the offsets given? If so, then we can't use Grub2 to boot a disk that has been corrupted in sector 0: not very universal, if it can't be used as a recovery tool. Or does it only make those checks if we specify certain specific sectors? Again, not so universal. The command works as specified in grub legacy and in grub4dos; it becomes fussier in grub2. I say that extra fussiness is a mistake: regardless of whether it enforces the rules correctly, this particular command should not be enforcing rules at all.
Or, as a compromise, if you really want the extra safety, how about a chainloadernoverify command, along the lines of the grub4dos rootnoverify command (where noverify was provided for the same reason: to allow you to do things the author had not specifically coded for)?
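The offset+length notation is simple enough to sketch. The Python below is a hypothetical parser modelled only on the syntax described above (a comma-separated list of offset+length entries, with a missing offset defaulting to 0, so +1 means 0+1); it is not GRUB's own parser and ignores any other blocklist forms GRUB may accept. Note that nothing in it looks at the contents of the sectors: it only works out which ones to read.

```python
from typing import Iterator, List, Tuple


def parse_blocklist(spec: str) -> List[Tuple[int, int]]:
    """Parse 'offset+length[,offset+length...]' into (offset, length) pairs.

    Hypothetical parser for the notation in the post: a missing offset
    defaults to 0, so '+1' is the same as '0+1'.
    """
    extents = []
    for item in spec.split(","):
        offset_text, plus, length_text = item.strip().partition("+")
        bad_offset = offset_text and not offset_text.isdigit()
        if not plus or not length_text.isdigit() or bad_offset:
            raise ValueError(f"bad blocklist entry: {item!r}")
        offset = int(offset_text) if offset_text else 0
        extents.append((offset, int(length_text)))
    return extents


def sector_numbers(extents: List[Tuple[int, int]]) -> Iterator[int]:
    """Yield every sector number covered by the extents, in order."""
    for offset, length in extents:
        yield from range(offset, offset + length)


if __name__ == "__main__":
    print(parse_blocklist("+1"))               # [(0, 1)]
    print(parse_blocklist("4468+2,4472+472"))  # [(4468, 2), (4472, 472)]
    print(list(sector_numbers(parse_blocklist("4468+2"))))  # [4468, 4469]
```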
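The compromise boils down to making the check conditional. Here is a minimal Python sketch of that policy. `chainload`, its `verify` flag and `InvalidSignature` are hypothetical names, not GRUB's API, and the check shown (the conventional 0x55 0xAA signature in the last two bytes of a 512-byte sector) is an assumption about the kind of test that produces an "invalid signature" message, not a reading of Grub2's source.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # conventional signature in bytes 510-511


class InvalidSignature(Exception):
    """Raised when verification is on and the sector lacks the signature."""


def has_boot_signature(sector: bytes) -> bool:
    """True if the sector is full-sized and ends with 0x55 0xAA."""
    return len(sector) >= SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def chainload(sector: bytes, verify: bool = True) -> bytes:
    """Hypothetical chainloader front end with an operator override.

    verify=True  : the safe default; refuse a sector without the signature.
    verify=False : the 'noverify' mode; hand the sector on untouched.
    Returns the sector that a real loader would copy to 0x7C00 and jump to.
    """
    if verify and not has_boot_signature(sector):
        raise InvalidSignature("invalid signature")
    return sector


if __name__ == "__main__":
    good = bytes(510) + BOOT_SIGNATURE
    odd = bytes(SECTOR_SIZE)  # no signature at all
    chainload(good)               # accepted by the default, checked mode
    chainload(odd, verify=False)  # accepted: the operator said "run it anyway"
    try:
        chainload(odd)            # refused by the default mode
    except InvalidSignature as err:
        print(err)                # invalid signature
```

Keeping the check as the default and the override as an explicit flag is one way to express "safe by default, but overridable by the operator"; an "unlock" requirement could be layered on top of the same flag.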
Likewise UEFI. I will be brief here, but it does seem to me the same philosophical mistake is being made.
It is great that we can boot into verified kernels. In a large company setting this is a valid way of preventing a certain common class of viruses from spreading. In principle we could go further, and before any boot a firmware UEFI+ could do a complete virus scan of all connected disks. BUT, if it is impossible for the hardware owner to give the operator of the machine the credentials to override those checks, the system has no right to claim for itself the accolade of being UNIVERSAL.
OK, then, that's my opinion. Now, tell me I'm wrong: I can take it.
Edited by trueriver, 18 February 2013 - 11:31 AM.