This content was published by Andrew Tomazos and written by several hundred members of the former Internet Knowledge Base project.

Binary Data through to Source Code

Welcome to the third edition of the IKB newsletter.

This time we want to step back a bit and go through a few fundamental computing ideas and concepts.

This one is pretty basic, so if you are a programmer, feel free to skip ahead to the open source bit.

++ DATA ++

Binary data is a sequence of ones and zeros, a well-defined logical entity. All things inside a digital computer, from word-processing documents and pictures to sounds and video, are represented as binary data. In fact, everything inside a computer has to be binary data, because otherwise it simply doesn’t "fit" in there.

One of the biggest challenges to the computing industry is finding new things of value that can be represented, or at least approximated, as data. Once we can successfully convert things -- physical molecules, human sensory information or, more broadly, anything our imagination will hold -- into a string of ones and zeros, we can get it inside the machine and the computer can get its hands on it. Once that happens we can do cool things with it: catalog it, transform it, transmit it across the world, simulate it, (delete it), etc.

++ PROGRAMS ++

One type of binary data is very special, because it not only gets put into storage on the computer, but actually gets fed into the brain of the computer, taking control of it. We call this type of data "processor instructions" or "machine code". A large sequence of these instructions might be called a program, script or "software".

What can you do with software? Well that's also one of the questions the computing industry is trying to answer. Computers are a general-purpose construct, by man-made design. That wonderful cascading or recursive idea - that data can be used to control the computer, which in turn generates more data, which may end up being used to control it - leads to possibilities not seen in other man-made machines. Programs running other programs.

When you turn your computer on, it has a tiny little program hard-wired into it by the manufacturer, stored in its basic input/output system (BIOS). All this tiny little program does is look at a special place on the hard drive for another program called a bootstrap loader. The BIOS loads the bootstrap loader, hands control to it and goes away. The loader then goes looking for a program, called a kernel, stored on the hard drive. It starts the kernel and goes away. The kernel starts and doesn't go away until you turn your computer off.

++ KERNELS ++

The kernel's job is to manage your computer's basic resources and to run other programs. When the kernel runs a program, it shares processor time with that program. There are several schemes and methods that vary between operating systems, but the end result is the same: the kernel allows multiple programs to run on your computer at the same time by swapping the processor between them so quickly that the human operator doesn't notice.

The kernel starts many programs that you don't see, unless you go looking for them. One of the first programs a consumer desktop operating system like Windows XP, Mac OS X or Red Hat Linux loads has the job of showing a picture on your monitor. It doesn't do this because it wants to advertise the vendor's brand name (that's an afterthought). It does this because it is busy loading all those little programs.

Eventually the computer is ready to talk to you. It will load some kind of user interface program like a command-line interface or a graphical user interface (GUI). By interacting with that you can tell the kernel to load or stop (unload) yet more programs.

++ PROGRAMMING ++

The process of programming -- writing software -- is an interesting one, because all these programs were certainly not written by a human pressing the zero and one keys in different sequences.

We start by representing the binary sequences that enter the processor as hexadecimal numbers, which are like normal decimal numbers except that you count to sixteen before you wrap to the next column.

That is rather than counting like...

... 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ...

We count like...

... 7, 8, 9, A, B, C, D, E, F, 10, 11, 12 ...

We do it in this annoying way because we can group four lots of ones and zeros into one hexadecimal digit. Originally, when humans started counting, we decided to wrap at 10 because we have 10 fingers, but 10 doesn't line up as neatly with binary. For example, the binary sequence "1101" is exactly hex "D", whereas in decimal it would take up two characters, "13", overflowing into the next digit.

Restating: a binary sequence has exactly four times as many digits as the equivalent hex sequence. That is to say, "1101 1101 1101" is "DDD" in hex, whereas in normal decimal counting it doesn't fit as neatly (it's 3,549).

So now we write a program in hex and have another program convert that hex into binary data and feed it to the processor. Easier than typing ones and zeros.

++ ASSEMBLY ++

Next we give each hex instruction an easier name, called a mnemonic. MOV, JMP and INC are examples. MOV tells the computer to "MOVe" a piece of data from one place to the other. JMP tells it to "JuMP" to a different place in the program and continue from there. INC tells the computer to add 1 to (INCrement) a given number.

Armed with these mnemonics, and a program that maps them into the right numbers, we can then start writing longer programs without having to memorize which hex sequence does what.

We call the program that takes the mnemonics and converts them to the binary data an "assembler." The collection of mnemonics -- i.e., the "language" -- is called "assembly."

The set of instructions that even the most modern central processing unit (CPU) can handle is actually quite limited: there are only a few hundred. For example, there is no instruction to display a picture file "MyCat.jpg" on the monitor. Writing a program that does that in assembly would take a very long time and would be very difficult.

++ PROGRAMMING LANGUAGES ++

So we imagine a language that is a bit more powerful than what we can do with assembly. We then write a program in assembly that takes a text file in our new language and converts it into something the processor can run. This program is called a compiler (if it translates the whole program ahead of time) or an interpreter (if it executes the program as it reads it).

One of the oldest and most basic languages that is still in widespread use is C. For the most part the family of operating systems called Unix were written in C, as well as vast tracts of almost everything else. C makes it easier to manage more complex algorithms and data structures than with assembly code.

++ LANGUAGE MAYHEM ++

There are many, many other programming languages, some with compilers written in assembly, in C, or in one of many other languages.

Some of these are even self-hosting, which means that the compiler is written in the same language that it compiles. How is this possible? Well, to begin with, the compiler is written in a different language. Then that old compiler is used to compile a new compiler written in the language itself; the old compiler gets tossed out, and what you are left with is a self-hosting language. The process that makes this happen, called bootstrapping, is, shall we say, difficult.

Some of them have their own cross-platform bytecodes and a runtime environment, which means that they have an imaginary processor that they compile to, and then they write a program (the runtime) that takes those bytecodes and will run the same program on different operating systems or processors. This means that you can compile the program to bytecodes and then run it on any platform as long as the end user has the runtime environment.

Why would you need that, rather than just distributing the program in source form along with the compiler itself? Two reasons. The first is speed: it takes a long time to compile things, so it's faster to distribute pre-compiled bytecodes.

The other, possibly bigger, reason is that when people have the source they can make modifications to it, like say, removing copyright warnings, copy-protection, seeing how it works and making something better -- all of which is bad for business. It is a lot harder to do that when all you have is compiled bytecodes to work with, and not the human-written source files in the original language.

++ SOURCE CODE ++

Having said that, there has been an ever-present, and maybe even growing, community of people that simply refuse to use a program that they do not have the source code for. There are entire operating systems and a myriad of free tools for which the source is freely available. The source code is open. It is "open source".

Some of these projects even have very loose and relaxed licensing terms. They survive on little or no revenue through volunteer effort and weight of numbers. Because they are open source, one developer can stop working on a project and another can pick up the ball.

This doesn’t tend to happen with proprietary software: when development stops on a proprietary project for some reason, the commercial pressure is to keep it closed and extract as much revenue as possible from it until it becomes irrelevant.

Comparing common open source projects (e.g. FreeBSD, GNU, Apache) with common proprietary systems (e.g. Windows, Mac OS), the open source ones tend to be a lot more robust and flexible, because with proprietary software only the development team can look into the source for bugs, whereas with open source software both the development team and the rest of the world can.

The downside is that the user interface is usually a lot more complex, because there is less commercial incentive to make the technology accessible to non-programmers. Market forces from non-programmers' demand do not push on the project as much, and that means the user interface doesn't become as streamlined and weathered as in a proprietary project. The "too many cooks" effect.

++ OPEN SOURCE VS PROPRIETARY ++

In practice it is not as black-and-white as that. Not everything fits nicely into two categories. On one side of the spectrum you have Microsoft Windows and Office, with intentionally obfuscated proprietary document formats throughout.

++ MICROSOFT WINDOWS & OFFICE ++

For good business reasons Microsoft wants to keep you as a user, and wants to make it nearly impossible to leave once you have started using their products. Their products are very easy to use -- Office is by far best of class in this regard -- and they can afford to keep it that way because they have the revenue stream to pay an army of developers. They got that revenue stream by locking people in: proprietary software, hidden source and excruciatingly restrictive licensing agreements.

On the other extreme you have FreeBSD, an operating system whose community describes even Linux's licensing agreement as overly restrictive. It is supported entirely by volunteer effort. There is no revenue stream. It is completely open source from the ground up.

++ FREEBSD ++

FreeBSD has a ports collection, which means you can pick from a library of 10,000 programs and applications, and by issuing one command it will automatically connect over the Internet to a file repository, download the *source code* to your computer, compile it, test it and install it, with documentation. For free.

FreeBSD is renowned as one of the most stable operating systems in existence, with uptimes measured in years. The downside is that it is very difficult to operate. (Or put another way, it is very difficult to learn how to operate, but once you do, some argue, it is more powerful and flexible.)

++ IN THE MIDDLE ++

Between these two extremes you have many different licensing agreements and policies about source. Sun's Java project, for example, has a lot of its source code available, but not in a form you can compile yourself. Is it open source or proprietary? Apple released their kernel, called Darwin, as open source, but a lot of the user-level programs that run on it are proprietary.

Because of these differing trade-offs between robustness and ease of use, easy-to-use proprietary programs have made their way onto customer-facing desktop systems, and robust open source solutions have made their way onto heavily-trafficked servers.
