ASCII, NFO Art & Text Encodings

After last month’s successful rollout of the JavaScript DOS emulation throughout the site, there has been one other gripe I have been wanting to overcome on Defacto2: the accurate display of ASCII art and NFO text in a browser.

Surprisingly, this ability to display files created on a text standard from 1981 has been mostly out of reach for numerous reasons that I will get into later. But for now, the last missing piece of the puzzle, the need for proper bitmapped DOS fonts converted into a web friendly format, has been solved thanks to the marvellous work of Viler and his Ultimate Oldschool PC Font Pack.

Home of the world’s biggest collection of classic text mode fonts, system fonts and BIOS fonts from DOS-era IBM PCs and compatibles

You may recognise Viler as one member of the 2015 team that developed the technically amazing and competition winning 8088 MPH demo, which impossibly pulled off the feat of displaying an image using 1024 simultaneous colours on a 1981 era home/office PC.

Before the availability of Viler’s font pack, you couldn’t properly display ASCII/NFO art in a modern operating system without a specialised application to either view or convert the text into an image.

ANSILove is a set of tools to convert ANSi and artscene-related file formats into PNG images

But image conversion has its downsides. For one, text embedded into an image isn’t searchable, nor is it selectable, transferable or accessible. The text can only be read using a fixed colour and small font size. Plus web browsers themselves add further limitations by placing memory restrictions on the size of the images they are able to load.

Thanks to Viler’s fonts and some code page character conversions I created (really a hack), most of the text files on Defacto2 are now accurately displayed in the browser as HTML text, which eliminates all of the issues mentioned with using images.

Here are a few examples.

Mostly for novelty value, you can switch between these 4 sets of DOS era fonts while viewing the text files: 9 pixel VGA, high-resolution thin CGA, IBM PC BIOS and Tandy 1000 series BIOS. There are a number of colour combinations too: DOS grey on black, white on black, monochrome green on black, black on white and a gimmick black on white with CSS shadow effects. Not surprisingly, some files look better with different colour and font combinations.

[Image] Before: The original DEADLINE.NFO poorly rendered by Chrome
[Image] Now: In browser render of DEADLINE.NFO
[Image] CSS Shadow effects
[Image] All text is searchable and selectable
[Image] Text pasted into Notepad

For any web developers out there the basic implementation involved taking a DOS encoded text file and reading it using the Windows 1252 character set.

I looked up all the common ASCII art characters that are malformed by using this incorrect code page, pattern matched and replaced them with their UTF-8 coded equivalent.

For example, when using CP-437 the lower half block glyph ▄ is represented as decimal 220. With the Windows 1252 code page, which has no lower half block character, decimal 220 returns the Ü glyph.

After loading an ASCII art file using Windows 1252, I replace all the incorrect Ü glyphs with U+2584 or ▄ and then display the converted document in the Unicode compatible, web browser friendly UTF-8 encoding.
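As a rough illustration of that replacement step, here is a minimal Python sketch. It is not the code running on Defacto2: instead of a hand-picked list of art characters it builds the lookup table from Python's built-in `cp1252` and `cp437` codecs, and the sample bytes and the `build_fixups` name are made up for the example.

```python
# Hypothetical sample: bytes as they might sit in a DOS-authored NFO file.
raw = bytes([220, 220, 32, 78, 70, 79, 32, 220, 220])

# Read with the wrong code page, which turns each lower half block into Ü.
misread = raw.decode("cp1252")


def build_fixups():
    """Map each Windows-1252 character to the CP437 glyph sharing its byte."""
    table = {}
    for byte in range(0x80, 0x100):
        single = bytes([byte])
        try:
            wrong = single.decode("cp1252")
        except UnicodeDecodeError:
            # Python leaves a few byte values undefined in Windows-1252; skip them.
            continue
        table[ord(wrong)] = single.decode("cp437")
    return table


# str.translate swaps every character in a single pass, so fixes cannot cascade.
fixed = misread.translate(build_fixups())
print(misread)
print(fixed)
```

A real file may contain one of the byte values Python treats as undefined in Windows-1252, in which case the first `decode` call would need an error handler or a different starting encoding.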

The translated body of text is wrapped between a set of <pre></pre> tags on a HTML5 page rendered as UTF-8. A CSS font-face rule is inserted to remotely load Viler’s TrueType font and apply it to the content of the <pre> tags.
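The page wrapper described above could be generated along these lines. This is only a sketch under stated assumptions: the `render_page` helper, the font family name and the font URL are placeholders invented for the example, not the markup Defacto2 actually serves.

```python
import html


def render_page(text, font_url="fonts/dos-vga.ttf", font_name="DOS VGA"):
    """Wrap already-converted text in a minimal UTF-8 HTML5 page.

    font_url and font_name are hypothetical; point them at a real web font.
    """
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
@font-face {{
  font-family: "{font_name}";
  src: url("{font_url}") format("truetype");
}}
pre {{ font-family: "{font_name}", monospace; }}
</style>
</head>
<body>
<pre>{html.escape(text)}</pre>
</body>
</html>
"""


page = render_page("▄▄ <NFO> ▄▄")
print(page)
```

Escaping the text with `html.escape` matters here because NFO files often contain `<`, `>` and `&`, which would otherwise be read as markup inside the <pre> element.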

So why did this take so long?

There were a couple of problems that hampered this process.

First and foremost was the font issue. Even if Viler had done this font conversion in the 2000s, the fonts would have been useless for web developers such as myself. The ability to force browsers to download and use specific fonts was only introduced in CSS3 and wasn’t widely implemented in most browsers until a few years ago.

But the more frustrating complication was character encoding. The character sets used in DOS are non-standard and are not supported in modern operating systems. These encodings assign each character, letter, digit or glyph a unique numeric reference. Without the right character set, DOS text files will never display correctly in web browsers.

The common character set in use today is UTF-8, which is itself an encoding of Unicode. Unicode didn’t implement all the extended DOS characters until revision 3.2 released in 2002. Yet it took an extremely long time for that standard to be supported by modern web browsers. It was only after the widespread adoption of HTML5 that we saw progress with in-browser support for DOS era block and box characters.

Older browsers may not support all the HTML5 entities in the table below. Chrome has good support. But (currently) only IE 11+ and Firefox 35+ support all the entities.

http://www.w3schools.com/charsets/ref_utf_block.asp

ASCII explained in a long-winded historical context

Today people generally refer to text-based art as ‘ASCII’ but that is a misuse of the acronym. The first ASCII (American Standard Code for Information Interchange) standard that we associate with text encoding came about from a 1963 binary based standard known as the ASA (American Standards Association) standard X3.4-1963. It was severely limited in a number of ways, including the complete lack of lowercase lettering. But unlike earlier telegraph and teletype communications encoding schemes, it was designed from the ground up for computers and programs rather than human operators.

Just months after the release of the ASA standard, the ISO (International Organization for Standardisation) announced its intention to improve the obvious deficiencies in the encoding scheme. The result was ECMA-6 (European Computer Manufacturers Association), which was adopted as ISO/IEC 646. ASCII X3.4-1967 was the United States adoption of this 1965 standard, where “ASCII” became the common use name, and it is still today the basis of many modern character code sets including Unicode.

Unfortunately, there are numerous names for the identical standards depending on who published or adopted them. ASCII X3.4-1967 was later renamed to ANSI X3.4-1967 (American National Standards Institute) and again to US-ASCII, but can also be known under its ISO 646 classification. Still to this day people often shorten the names to either ANSI or ASCII and confusingly mean the same thing, or worse, interchange ANSI for Windows-1252 due to a historical Microsoft mislabelling. For simplicity I will refer to ASCII X3.4-1967 as ASCII for the remainder of this text.

ANSI: Acronym for the American National Standards Institute. The term “ANSI” as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. “ANSI applications” are usually a reference to non-Unicode or code page–based applications.

https://msdn.microsoft.com/en-gb/goglobal/bb964658.aspx#a

ASCII gave American English glyphs a unique binary code that could also be sequentially counted. In fact, the standard US keyboard layout today is still only able to output the same ASCII character code set standardised during the late 1960s. And I would imagine that this basic keyboard layout still greatly influences the syntax of many modern programming languages.

In ASCII X3.4-1967, the upper-case A is encoded as 100 0001 in 7-bit binary. For humans and web developers it is represented by decimal 65.

  • B is 100 0010, decimal 66.
  • C is 100 0011, decimal 67.
  • D is 100 0100, decimal 68.
  • And so on.
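The same values can be checked in a couple of lines of Python. This is just a small sketch; the `seven_bit` helper is a name made up for the example.

```python
def seven_bit(letter):
    """Return the 7-bit ASCII pattern of a character, grouped as 3 + 4 bits."""
    bits = format(ord(letter), "07b")
    return f"{bits[:3]} {bits[3:]}"


for letter in "ABCD":
    print(letter, seven_bit(letter), ord(letter))
```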

Despite being a 7-bit encoding scheme with 128 character possibilities, only 94 are used as display characters, comprising upper/lowercase letters, numerical digits, common punctuation marks, and mathematical and Fortran programming symbols.

The remainder are known as control characters (or control codes) and were designed to allow computers to share the text with other machines, or to control the formatting on output devices such as displays and printers. It was up to the devices themselves as to which control characters to support, as different types of machines had different requirements.

Many of these control characters are now redundant and do nothing in a modern computing sense but some are still in use.

SP spacebar, ESC escape key, HT tab key, SI and SO shift in and shift out, DEL to remove the character at the active position, BS backspace, CR and LF enter and return keys.
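To make the split between display and control characters concrete, the short Python sketch below counts both groups and lists the decimal values of the codes named above. The variable names are arbitrary, and the space character is counted on its own because it is neither a control code nor one of the 94 visible characters.

```python
# Decimal values of the codes named above.
NAMED = {
    "BS": 8, "HT": 9, "LF": 10, "CR": 13, "SO": 14,
    "SI": 15, "ESC": 27, "SP": 32, "DEL": 127,
}

# Control codes occupy 0-31 plus 127; visible characters occupy 33-126.
control = [code for code in range(128) if code < 32 or code == 127]
graphic = [code for code in range(128) if 33 <= code <= 126]

print(len(control), "control codes")
print(len(graphic), "display characters")
for name, code in NAMED.items():
    print(f"{name:>3} = {code}")
```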

Interestingly, adhering to the ASCII standard to start a new line requires the sending of two control characters, the CR and LF. This is to return the cursor back to the start of the line, then drop it onto the new line. Windows, DOS and a number of legacy computers still use this method, while Unix (and Linux, OSX, Amiga) dropped this two character requirement as it was unnecessary and wasteful. So on those systems a single LF will create the new line.
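Converting between the two conventions is a simple byte substitution. Here is a minimal Python sketch, with `to_unix` and `to_dos` as made-up helper names:

```python
dos_text = b"first line\r\nsecond line\r\n"
unix_text = b"first line\nsecond line\n"


def to_unix(data: bytes) -> bytes:
    """Collapse each CR LF pair into a single LF."""
    return data.replace(b"\r\n", b"\n")


def to_dos(data: bytes) -> bytes:
    """Expand each LF into CR LF, normalising first so pairs are not doubled."""
    return to_unix(data).replace(b"\n", b"\r\n")


print(to_unix(dos_text))
print(to_dos(unix_text))
```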

Back in 1981 when IBM introduced their IBM PC running PC-DOS, it was not fully ASCII compatible. Depending on the machine’s purpose and market, IBM gave its computers custom character encodings. It would designate these encodings as Code Page [number] with the original PC receiving Code Page 437 or CP437. The character glyphs associated with these code pages were stored in the computer’s ROM as easy access for programs.

The IBM PC used 8-bit character codes and so it could support a character set of up to 256 characters. CP437 mostly contains the ASCII standard but dropped support for the control codes. These were instead replaced with programmer friendly glyphs intended for use with text user interfaces. These same glyphs became the basis of the modern ASCII art scene despite not having anything to do with ASCII text per se.

There were a number of problems with IBM’s approach, though. The lack of control codes meant that text documents created on a PC had to use glyphs to simulate pseudo control functions and it was up to the text editor or viewer to decide how to format and display.

For example a right pointing arrow glyph in DOS is used to mark the end of file. And there is no proper tabbing support because an ASCII HT (horizontal tab) control code in DOS can also be used to display a glyph. This makes DOS text less portable and many of its non-standard glyphs will usually not display on other machines.

[Image] An ASCII document with control codes that in DOS CP437 should reference arrow glyphs
[Image] The same ASCII file in DOSBox, but the left arrow fails to show
[Image] FreeDOS Edit mostly fails with the file
[Image] Notepad in Windows 10 formats the file fine but the DOS specific arrow glyphs fail to show
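The double life of these low byte values also shows up in software. Python's built-in `cp437` codec, for instance, treats them as control codes, so a viewer that wants the DOS glyphs has to supply its own table. The sketch below uses a small hand-picked subset of those glyphs; `ROM_GLYPHS` and `as_screen_text` are names invented for the example.

```python
# Glyphs the PC character ROM draws for a few low byte values (partial list).
ROM_GLYPHS = {0x01: "☺", 0x03: "♥", 0x09: "○", 0x1A: "→", 0x1B: "←"}


def as_screen_text(data: bytes) -> str:
    """Decode CP437 bytes, preferring the ROM glyph over the control code."""
    return "".join(
        ROM_GLYPHS.get(byte, bytes([byte]).decode("cp437")) for byte in data
    )


sample = b"go \x1a here"
print(repr(sample.decode("cp437")))  # the codec keeps 0x1A as a control code
print(as_screen_text(sample))        # the table shows it as an arrow
```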

The original CP437 set had limited use for international languages. So numerous incompatible DOS code pages were created, each with nonmeaningful numeric references to target different groups of languages. CP-850 for Western Europe, CP-852 for Central Europe, CP-860 Portuguese, CP-865 Nordic, etc.

In each of these new code pages, many of the more frivolous glyphs were dropped or moved around. But there was no means of embedding in the file which code page was used to author the text. It was left to the reader to work it out themselves.

ECMA-94 is an 8-bit character code standard from early 1985 and was designed to add better internationalisation to the original 7-bit ASCII X3.4-1967 reference. Its key goal was communication interoperability and so all the unnecessary display characters were ignored or dropped. This is important to note as many of the block, shade and line characters associated with ASCII art were included in that rejection. The standard became known as ISO/IEC 8859-1 in 1987 with other groups of languages gaining support in subsequent releases.
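The ambiguity is easy to demonstrate: the same byte turns into different characters depending on which DOS code page the reader assumes. The Python sketch below relies on the codecs that ship with the standard library; the byte value 0x9B and the `decode_everywhere` name are arbitrary choices for the example.

```python
CODE_PAGES = ["cp437", "cp850", "cp852", "cp860", "cp865"]


def decode_everywhere(byte_value):
    """Decode one byte under each DOS code page and collect the results."""
    single = bytes([byte_value])
    return {name: single.decode(name) for name in CODE_PAGES}


results = decode_everywhere(0x9B)
for name, glyph in results.items():
    print(name, glyph)
```

Nothing in the file itself says which row of that output is the right one, which is why viewers either guess or ask the reader to pick.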

From its first release, Windows adopted the ISO 8859-1 standard but replaced some of the control codes with additional characters and renamed it Code Page 1252. It is more commonly known today as Windows-1252. ISO 8859-1 is compatible with Windows-1252 but not the other way around.

Because of this, text files created in DOS will not accurately display in Windows and vice versa without some kind of prior conversion.

Thankfully today much of the web has moved onto Unicode-based encodings that remove the issues with incompatible character code sets and legacy encodings, as Unicode gives each glyph a unique identifier and all languages now use the one common code page.
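That one-way compatibility can be seen by decoding the same bytes both ways. In this Python sketch (the byte values are picked only for illustration), 0xE9 agrees in both encodings, while 0x85 is a control code in ISO 8859-1 but a printable character in Windows-1252.

```python
shared = bytes([0xE9])    # same character in both encodings
differs = bytes([0x85])   # control code in ISO 8859-1, ellipsis in Windows-1252

print(shared.decode("latin-1"), shared.decode("cp1252"))
print(repr(differs.decode("latin-1")), differs.decode("cp1252"))
```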

  • A will always be represented as U+0041
  • À as U+00C0
  • █ as U+2588
  • ‖ as U+2016
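Code points like these, and the bytes UTF-8 uses to store them, can be looked up with a few lines of Python. This is a small sketch; the `describe` helper is a made-up name and `bytes.hex` with a separator needs Python 3.8 or later.

```python
def describe(ch):
    """Return a character's code point label and its UTF-8 bytes in hex."""
    return f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ")


for ch in ["A", "\u00c0", "\u2588", "\u2016"]:
    label, utf8_bytes = describe(ch)
    print(ch, label, utf8_bytes)
```

Plain ASCII characters still take a single byte in UTF-8, while the block and line characters used in text art take three.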

This is the main reason to use the Unicode compatible UTF-8 encoding to display ASCII art in the browser. It is also why, until now, it has been rather difficult to accurately display many text files created with MS-DOS in modern web browsers as text.

Additional sources.

  1. An annotated history of some character codes or ASCII: American Standard Code for Information Infiltration.
  2. Standard ECMA-6: 7-bit Input/Output Coded Character Set, 4th Edition, 1973
  3. KreativeKorp CP437
  4. KreativeKorp.com US-ASCII
  5. Code Page 1252 Windows Latin 1 (ANSI) with its misuse of the ‘ANSI’ acronym