This content was published by Andrew Tomazos and written by several hundred members of the former Internet Knowledge Base project.

File Type, Standards and Control

Welcome to the fifth edition of the IKB newsletter.

We talked last time about “web sites as software” and took a quick romp through how computer programs talk to each other over the Internet and how web-distributed software fits into that picture.

An IKB member made an edit changing “XHTML programming language” to “XHTML markup language.” This might not seem like a big deal, but it actually opens the door to a very interesting topic.

++ ABSTRACT DATABASE ++

Data is stored on a digital computer in binary. Each position on a data store (eg the hard drive) is given a numbered address so that we can refer to it — like a big long row of numbered on/off switches. On a 100 gigabyte hard drive there are eight hundred thousand million of these on/off switches. If each were the size of a room-light switch they would blanket New Jersey.
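
The arithmetic behind that figure is easy to check. Here is a rough sketch, assuming a gigabyte means 10^9 bytes (the way drive makers count) and 8 binary digits per byte. The variable names are only for illustration.

```python
# Assumption: one gigabyte is 10**9 bytes (decimal), and one byte is 8 bits.
BYTES_PER_GIGABYTE = 10**9
BITS_PER_BYTE = 8

drive_gigabytes = 100
total_bits = drive_gigabytes * BYTES_PER_GIGABYTE * BITS_PER_BYTE

# Each bit is one on/off switch with its own numbered position.
print(f"{total_bits:,} switches")
```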

We can store lots of data there, but we need a way to keep track of it. We do that by arranging it into pieces and noting down which pieces are at which addresses in a table. Also included in that table we can put data about the data (meta-data) — like give each piece a title, a type name, an owner, who can access it, the date it was created, etc. The table of meta-data is itself stored as data, usually alongside the data it is talking about.
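
To make the idea of a table of addresses and meta-data concrete, here is a toy sketch. It is not how any real system lays out its storage. The names `store`, `table`, `save` and `load`, and the field names, are all made up for this example.

```python
# A toy data store: one flat run of bytes, addressed by position.
store = bytearray(64)

# The table of meta-data: for each piece, where it lives and facts about it.
# Every field name here is hypothetical.
table = {}

def save(title, data, owner, type_name):
    """Place data at the first free address and note it in the table."""
    start = sum(entry["length"] for entry in table.values())
    store[start:start + len(data)] = data
    table[title] = {
        "address": start,
        "length": len(data),
        "owner": owner,
        "type": type_name,
    }

def load(title):
    """Look up the address in the table, then read the bytes stored there."""
    entry = table[title]
    return bytes(store[entry["address"]:entry["address"] + entry["length"]])

save("greeting", b"hello", owner="alice", type_name="text")
save("answer", b"\x2a", owner="bob", type_name="number")
print(load("greeting"), table["answer"])
```

The sketch only ever appends and never deletes or overwrites a piece. Real systems have to handle both, which is where most of their complexity comes from.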

In general, a system of structuring data is called a database.

++ FILE SYSTEM DATABASE ++

A specialized type of database is called a file system. Most people that have used a desktop computer have interacted directly with this type of database. It’s where we get names like “files,” “folders” and “directories”. Generally pieces of data, aka “files,” are arranged so that they have one parent “directory,” and each directory has one parent, except for the top one, the root directory, which has no parent. In computer science this is called a tree.
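
The tree shape can be seen by walking upward from a file to the root. This is a small sketch using Python's `pathlib`. The path itself is invented, and `PurePosixPath` only works on the text of the path without touching a real disk.

```python
from pathlib import PurePosixPath

# A hypothetical file path, used only for illustration.
path = PurePosixPath("/home/alice/notes/todo.txt")

# Walk upward: each entry has exactly one parent until we reach the root.
chain = [str(parent) for parent in path.parents]
print(chain)

# The root has nothing above it; pathlib reports the root as its own parent.
root = PurePosixPath("/")
print(root.parent == root)
```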

There are many different kinds of file system — NTFS, ext2, UFS, FAT32, etc. Most people don’t usually hear those names, because the technology is usually wrapped in the brand name of the mother operating system that uses the given file system by default. System administrators have to know about it though. When you install an operating system on a computer for the first time one of the first choices you usually have to make is which type of file system database you are going to use.

They vary by such things as (a) what metadata is stored about each file in the file system, (b) how it handles giving two different names to the same piece of data, (c) how big a space it can organize, (d) how efficiently it handles different uses, etc.

Usually the requirements of a desktop computer are much simpler than those of a server. It’s on servers that the details of file systems become very important to reliability and efficiency.

++ FILE TYPE ++

One of the pieces of metadata a file system keeps track of for each file is the “file type.” This is more complex than it seems, and there are many different approaches — some compatible — some overlapping — some incompatible.

A file is just a sequence of binary data and the file type is just a tag. It doesn’t actually change the data. We only use the file type to decide how it is interpreted and displayed by the programs that interact with it. The operating system uses it to decide which program to launch to handle it.
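
A short sketch shows the same binary data read three different ways. The bytes never change. Only the interpretation we pick does. The three byte values are chosen arbitrarily for the example.

```python
# Three bytes; nothing in them says what they "are".
data = bytes([72, 105, 33])

as_text = data.decode("ascii")            # read as a plain text file
as_numbers = list(data)                   # read as three small numbers
as_integer = int.from_bytes(data, "big")  # read as one large number

print(as_text, as_numbers, as_integer)
```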

This is the cause of an interesting fight between competing software vendors. When you have two computer programs that can handle the same type of file — how does the operating system decide which to launch? To break this tie, user-friendly file systems attach a unique tag to each application and allow the application to tag every file it creates.

However, some primitive file systems are not that powerful. In such a case, does the operating system use the program that is made by the same company that made the operating system? This is one way a company can leverage the fact that it controls the operating system to give the other parts of their business an advantage. Another way is to hide system interfaces from competitors and leave the format of key system files such as keyboard and font layout definition resources undocumented so that foreign software cannot be as efficient as the vendor's.

The operating system keeps a database of file types and computer programs, called the desktop database. When a new computer program gets installed on a primitive file system, it usually tries to take over the entries for the file types it is interested in. Sometimes it asks. Sometimes it just does it. Some programs even check to see when they are launched if they have control of the file types, and if they don’t, they change them back. On the other hand, the new application installed on a user-friendly file system does not need to take over all existing files it supports; it can limit itself to the files it creates and offer such takeover to the user as an option when the user gets accustomed to their new software.
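
The tie-break described above can be sketched as a lookup. This is purely an illustration of the idea, not how any particular operating system implements it. The application names, the `installed` set and the `choose_application` function are all hypothetical.

```python
# Hypothetical set of installed applications.
installed = {"VendorEdit", "OtherEdit"}

# Hypothetical desktop database: file type -> default application.
default_app_for_type = {"text": "VendorEdit"}

def choose_application(file_type, creator=None):
    """Prefer the application that tagged the file, else the type's default."""
    if creator in installed:
        return creator
    return default_app_for_type.get(file_type)

print(choose_application("text", creator="OtherEdit"))  # file tagged by its creator
print(choose_application("text"))                       # untagged file
```

Without a creator tag there is only one entry per file type, which is why programs on the more primitive systems end up fighting over it.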

++ CHARACTER ENCODING ++

Let’s start with an example of the most basic type of file, an ASCII plain text file. ASCII stands for American Standard Code for Information Interchange. It is one of the oldest and most basic ways we give a sequence of binary data meaning.

Seven binary digits represent a number between 0 and 127. The ASCII chart maps those 128 numbers to symbols. The ASCII symbols are things like English letters, numbers, and the other symbols you find on a keyboard (@, #, $, %, etc). Also included is the space symbol “ ” and 33 other special symbols called control codes.
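
The mapping can be explored with Python's built-in `ord` and `chr`. This sketch also counts the control codes, taking them to be the numbers 0 through 31 plus 127.

```python
# Map between numbers and ASCII symbols.
print(ord("A"))  # 65
print(chr(64))   # @

# Control codes: 0 through 31, plus 127 (delete).
control_codes = [n for n in range(128) if n < 32 or n == 127]

# Everything else is a visible symbol or the space.
printable = [chr(n) for n in range(128) if 32 <= n < 127]

print(len(control_codes), len(printable))
```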

Most of the 33 control codes aren’t really used much anymore apart from the ones that represent a “return” or “enter.” They are referred to as a Carriage Return (for the old typewriters that moved the typing carriage back to the start of the line) and a Line Feed (to move down to the next line, so you don’t type over what you’ve just written).

The two really mean the same thing nowadays, a “Line Break,” but the standard varies from platform to platform. Unix uses just a Line Feed, older Macs (before Mac OS X) use just a Carriage Return and Windows uses both. You find this out the hard way if you ever transfer a text file between platforms with a dumb piece of software.
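
Here is a minimal sketch of what a less dumb piece of software might do: convert all three conventions to a single Line Feed. The function name `normalize_line_breaks` and the sample texts are made up for the example.

```python
# The same two lines of text under each convention.
unix_text = b"one\ntwo\n"           # Line Feed only
old_mac_text = b"one\rtwo\r"        # Carriage Return only
windows_text = b"one\r\ntwo\r\n"    # Carriage Return followed by Line Feed

def normalize_line_breaks(raw):
    """Convert any of the three conventions to a single Line Feed."""
    # Handle the two-character pair first so it does not become two breaks.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

for sample in (unix_text, old_mac_text, windows_text):
    print(normalize_line_breaks(sample))
```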

Using ASCII we can take a sequence of binary digits and turn each lot of seven binary digits into a character, leaving every eighth bit unused. This could then be used to represent a basic sequence of text like the one you are reading now. If stored as a file, it is generally referred to as a plain text file.

There are newer ways of encoding characters than the old ASCII. UTF-8 stands for Unicode Transformation Format, 8-bit. The 8 means it works in units of 8 binary digits (rather than the 7 of ASCII). Using it you can represent pretty much every symbol in every human language ever known. This is part of the big computing industry push towards international compatibility.

By setting the extra binary digit (the eighth one), you signal that a byte is part of a non-ASCII character. Such a character is spread over several bytes, so it can use many more binary digits than just eight. Plain ASCII text is already valid UTF-8, so UTF-8 is backward compatible with ASCII and we can start using it in parallel with ASCII as we try to phase ASCII out.
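
Both points can be seen by looking at the binary digits directly. This is a small sketch using Python's built-in encoders. The sample words are arbitrary.

```python
ascii_only = "foo"
mixed = "caf\u00e9"  # "cafe" with an accented final e, which is not in ASCII

# For pure ASCII text the two encodings produce identical bytes.
same = ascii_only.encode("ascii") == ascii_only.encode("utf-8")
print(same)

# In UTF-8 the accented letter takes two bytes, both with the eighth bit set.
utf8_bytes = mixed.encode("utf-8")
patterns = [format(byte, "08b") for byte in utf8_bytes]
print(patterns)
```

The first three patterns start with 0, just as they would in ASCII. The last two start with 1, marking them as pieces of a single longer character.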

++ PROGRAMMING LANGUAGE CHARACTER ENCODING ++

We’ve talked about source code before. It is what we use to create programs.

Almost all source code for computer programs is written as sets of files on the file system database, and those files are almost always written as human-readable plain text ASCII files.

Why don’t we use a special-purpose database and file format for source code? It’s largely for historical reasons. Most modern programming languages have ways to link one piece of the code to others—but we started out by building that linking based on ASCII text files and file names. Consequently any mainstream way of writing a program, or any new programming language, has to build on top of the old way or it is too radical to become adopted.

Java for example is a relatively new language. It uses an object metaphor where small reusable pieces of code are written as classes that contain instructions and data wrapped up together. Each public top-level “object class” is contained in its own file on the file system. The name of that class corresponds to the name of the file.

This is a step in the right direction, and definitely different from older programming languages where there is only voluntary mapping between the names of the files and what they contain. In many modern programming languages you can spread code around any way you want — and unfortunately, a lot of people do.

Currently still, almost all source code sits as unstructured plain text files for the compiler or interpreter to read. You can still edit almost any source code in a basic text editor. Programming languages are specified in terms of which character sequences mean what.

++ SGML AND XML ++

SGML stands for Standard Generalized Markup Language and XML stands for Extensible Markup Language.

When we say something is “extensible” in computing, it means simply that a technology is easy to extend to do more things than it can do at the moment. This is something a lot of people that design programs don’t think about enough. A lot of people are focused on getting today’s problems solved without thinking about tomorrow. In general there needs to be a balance between these two things.

So, we have sequences of binary digits that are now sequences of character symbols. The next step beyond that is to be able to give structure to the sequences of characters somehow.

In human language we have the concept of “words”: small sequences of characters separated by the “ ” character. Indeed, that concept is used in many programming languages. What we call “whitespace,” which is basically spaces, tabs and line breaks, is used to chop up sequences of characters into programming language words (tokens).
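
As a rough illustration, Python's `str.split` with no arguments chops a string on any run of spaces, tabs and line breaks. Real compilers do more than this (they also split on punctuation, for instance), so treat it as a sketch of the idea only. The sample line is invented.

```python
# One line of made-up source code with mixed spaces, a tab and a line break.
source_line = "total  =\tprice + tax\n"

# Splitting on whitespace yields the individual words (tokens).
tokens = source_line.split()
print(tokens)
```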

For other types of documents we need to go beyond that. In a word processing document for example, not all sequences of text are equal. Some are headings. Some are lists. While we can do that by indenting or using all capital letters, it becomes harder and harder for the computer to keep track of.

Originally all computer programs used their own way of doing this. The problem arises when we want to take data from one program and use it in a different program. We can write programs that do conversions—but it is messy and difficult.

We need a standard way of marking sequences of characters within a file and giving them relationships to the other things contained in the file, or even to things contained in different files.

Thus the SGML standard was introduced decades ago. SGML turned out to be too complex for many uses, so a simpler version, XML, was created. They are largely the same, but XML has fewer features.

SGML was the origin of HTML. Then XML was developed. And now we have XHTML which is XML-compliant HTML. It’s all the same stuff.

A document is given structure by marking character sequences with angle brackets like this “<foo>bar</foo>.” This gives “foo”ness to the character sequence “bar”. It marks “bar” as being “foo”.

The reason that we don’t simply write “bar is a foo” is that we also want a computer to understand it. The “<foo>bar</foo>” format is easier for a computer program to work with than human languages. Human languages are too ambiguous.

We call SGML, HTML, XHTML, and XML collectively "markup languages". Their chief goal is to give plain text files structure such that both a computer program and a human being can read it and understand it.
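
To see how a program reads marked-up text, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The tag names `note`, `heading` and `item` are invented for the example and do not come from any standard.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document: one heading and a list of two items.
document = (
    "<note>"
    "<heading>Shopping</heading>"
    "<item>milk</item>"
    "<item>eggs</item>"
    "</note>"
)

root = ET.fromstring(document)

# The program can now ask for pieces by their role, not their position.
heading = root.find("heading").text
items = [item.text for item in root.findall("item")]

print(heading, items)
```

A person can still read the document as plain text, and the program no longer has to guess which words are the heading and which are list entries.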

++ CONTROL, CONTROL, CONTROL ++

We have touched upon plain text files, source code, markup languages, and programming languages here and in previous articles.

A computer screen is made up of a grid of small colored dots called pixels. The computer displays what you see on the screen by controlling the color of those pixels.

Let’s say that we have a plain text sequence of characters, like the one you are reading now, stored in a file. If we open it with a text editor program, the contents of the file will determine which dots are which colors. The dots that are in the area of a character will be displayed as black. The ones that aren’t, are displayed as white. Or, if you do not have a graphic terminal, the other way round.

So when we write any text file and open it, to some extent it is taking control of the computer. Does this make any text file a computer program?

Can we write a text file such that when we open it with a text editor it deletes all the files on your hard drive? No. Absolutely not.

Can we write one such that it displays a picture? Well, yes. It’s called ASCII art. An ancient pastime from when displaying characters was all a computer could do with a screen.

The definition of programming in this case really comes down to the possibilities of control. The maximum control a sequence of binary data can have over a computer is when it is fed directly into the processor as machine instructions. This is the level of assembly programming. Once you get away from that, control starts getting taken away from the programmer.

Let’s break it down. The following four paragraphs are all similar.

You can do everything that you can do in C, in assembly. The reverse is not true. There are things you can only do in assembly, that you cannot do in C. For those things that you can do in both, C is easier.

You can do everything that you can do in Java, in C. The reverse is not true. There are things you can only do in C, that you cannot do in Java. For those things that you can do in both, Java is easier.

You can do everything that you can do with HTML displayed in a web browser, in Java. The reverse is not true. There are things you can only do in Java, that you cannot do in HTML and a web browser. For those things you can do in both, HTML is easier.

You can do everything that you can do with plain text displayed in a text editor, with HTML displayed in a web browser, but the reverse is not true. There are things you can only do with HTML that you cannot do with plain text. For those things you can do in both, plain text is easier.

These items are just an example. Every type of file fits into this spectrum of control somewhere. Some things, such as ActiveX, have too much control for what they are, and are the cause of a lot of the spyware and adware Internet Explorer users have problems with.

As discussed previously, compilers, interpreters, web browsers, and text editors are just computer programs. To some extent they act as proxies for the data that enters them, controlling the computer. Is it the data that enters them that is in control of the computer or the program that interprets the data?

++ THE POINT ++

In fact, the separation between computer programs, source code, document files, and other kinds of binary data is not as well defined as it first seems. There is a whole spectrum of different types of binary sequences acting on other binary sequences.

What is interesting is the potential for control that each is given. The computer you interact with controls what is shown on the monitor and goes into your eyes. Given that, the term “information” has to cover the whole realm of everything a person can see. It isn’t really a sufficient term.

The computer and network as a whole are in control of your experience. As an end user you dish out that control, first through the operating system and then through the other layers of programs and data.

Each binary sequence in the computer has a source. Many times this source is not you — but a third person or company. Interacting together, these control your experience when you sit at a computer.

How much control each one is allowed, by virtue of technical possibility and what is easy, needs to be considered. This is governed by which operating system you choose to install, which programs you install and which remote computers you interact with over the Internet.

This is where all the controversy over standards comes from. It is all about the brokering of this “control.”
