Last month I wrote a long post titled ASCII, NFO Art & Text Encodings. That covered the complications and technical difficulties involved in accurately rendering ASCII art and NFO text in modern browsers. It also dealt with how I overcame those difficulties and developed a backend rendering engine to accurately display text files hosted on Defacto2.net in the browser as text.
RetroTxt is a handy little extension that takes old-fashioned text files and stylises them in a more pleasing visual format. Despite the web being built on text, web browsers are often incapable of accurately displaying texts written during the pre- and early-web eras. This is where RetroTxt comes in! It imports the text into a modern format, injects a font that mimics a chosen retro computer, then applies styling to improve the display and readability. It can also double up as a text viewer for your locally stored, offline files and even serve as an NFO text viewer.
The best way to explain RetroTxt is to see it in action in this short, 54-second muted video, where I browse a few NFO files hosted on textfiles.com, showing both before and after RetroTxt has been applied.
I’ve tried to keep the extension as light and hands-off as possible within the confines of the restrictive Chrome extension API, while still giving users the ability to easily customise fonts, colours and other theme settings. It offers a choice of 25 retro fonts, 11 colour themes and text alignment options, as well as some presets for MS-DOS, Commodore Amiga, Commodore 64, Apple II and even a DOS font, web hybrid.
Any web developers looking to browse the code would probably be most interested in functions.js and text.css, as they contain the core functionality: the text character conversions, font colours and CSS styling. The rest of the code is mostly for the extension functionality and UI.
Surprisingly, this ability to display files created on a text standard from 1981 has been mostly out of reach, for numerous reasons that I will get into later. But for now, the last missing piece of the puzzle, the need for proper bitmapped DOS fonts converted into a web-friendly format, has been solved thanks to the marvellous work of Viler and his Ultimate Oldschool PC Font Pack.
Before the availability of Viler’s font pack, you couldn’t properly display ASCII/NFO art in a modern operating system without a specialised application to either view or convert the text into an image.
But image conversion has its downsides. For one, text embedded into an image isn’t searchable, nor is it selectable, transferable or accessible. The text can only be read using a fixed colour and small font size. Plus, web browsers themselves add further limitations by placing memory restrictions on the size of the images they are able to load.
For mostly novelty value, you can switch between these four sets of DOS-era fonts while viewing the text files: 9-pixel VGA, high-resolution thin CGA, IBM PC BIOS and Tandy 1000 series BIOS. There are a number of colour combinations too: DOS grey on black, white on black, monochrome green on black, black on white and a gimmicky black on white with CSS shadow effects. Not surprisingly, some files look better with different colour and font combinations.
For any web developers out there, the basic implementation involved taking a DOS-encoded text file and reading it using the Windows-1252 character set.
I looked up all the common ASCII art characters that are malformed by this incorrect code page, then pattern-matched and replaced them with their UTF-8 equivalents.
For example, in CP-437 the lower half block glyph ▄ is represented as decimal 220. In the Windows-1252 code page, which has no lower half block character, decimal 220 instead returns the Ü glyph.
After loading an ASCII art file using Windows-1252, I replace all the incorrect Ü glyphs with U+2584, that is ▄, and then display the converted document in the Unicode-compatible, browser-friendly UTF-8 encoding.
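As a sketch of that replacement approach (not RetroTxt’s actual code, which lives in functions.js), a few of these swaps could look like this in JavaScript, assuming the input string has already been read using Windows-1252:

```javascript
// Each byte in 0xDB-0xDF is a block glyph in CP-437 but an accented
// letter in Windows-1252, so the misread letters map back one-to-one.
const cp437Fixes = new Map([
  ["\u00DC", "\u2584"], // Ü -> ▄ lower half block (byte 0xDC)
  ["\u00DB", "\u2588"], // Û -> █ full block       (byte 0xDB)
  ["\u00DF", "\u2580"], // ß -> ▀ upper half block (byte 0xDF)
  ["\u00DD", "\u258C"], // Ý -> ▌ left half block  (byte 0xDD)
  ["\u00DE", "\u2590"], // Þ -> ▐ right half block (byte 0xDE)
]);

function dosToUnicode(text) {
  let out = text;
  for (const [wrong, right] of cp437Fixes) {
    out = out.replaceAll(wrong, right);
  }
  return out;
}

// A Windows-1252 misreading of a CP-437 art fragment, corrected.
console.log(dosToUnicode("ÜÛÜ")); // → "▄█▄"
```

A full converter needs a mapping for every CP-437 glyph, but the mechanism is just this lookup-and-replace pass.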
The translated body of text is wrapped in a set of <pre></pre> tags on an HTML5 page rendered as UTF-8. A CSS @font-face rule is inserted to remotely load Viler’s TrueType font and apply it to the content of the <pre> tags.
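The @font-face wiring can be as small as this sketch; the font-family name, class and file path here are hypothetical stand-ins for illustration, not RetroTxt’s actual asset names from text.css:

```css
/* Hypothetical font name and path, for illustration only. */
@font-face {
  font-family: "DOS VGA";
  src: url("fonts/dos-vga.ttf") format("truetype");
}

pre.retrotxt {
  font-family: "DOS VGA", monospace;
  white-space: pre; /* preserve the art's exact spacing and line breaks */
}
```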
So why did this take so long?
There were a couple of problems that hampered this process.
First and foremost was the font issue. Even if Viler had done this font conversion in the 2000s, the fonts would have been useless for web developers such as myself. The ability to force browsers to download and use specific fonts was only introduced in CSS3 and wasn’t widely implemented in most browsers until a few years ago.
But the more frustrating complication was character encoding. The character sets used in DOS are non-standard and are not supported in modern operating systems. These encodings assign each character, letter, digit or glyph a unique numeric reference. Without the right character set, DOS text files will never display correctly in web browsers.
The common character set in use today is UTF-8, which is an encoding of Unicode. Unicode didn’t include all the extended DOS characters until revision 3.2, released in 2002. Yet it took an extremely long time for that standard to be supported by modern web browsers. It was only after the widespread adoption of HTML5 that we saw progress with in-browser support for DOS-era block and box characters.
Older browsers may not support all the HTML5 entities in the table below. Chrome has good support. But (currently) only IE 11+ and Firefox 35+ support all the entities.
ASCII explained in a long-winded historical context
Today people generally refer to text-based art as ‘ASCII’, but that is a misuse of the acronym. The first ASCII (American Standard Code for Information Interchange) standard that we associate with text encoding came from a 1963 binary-based standard known as ASA (American Standards Association) standard X3.4-1963. It was severely limited in a number of ways, including the complete lack of lowercase lettering. But unlike earlier telegraph and teletype communications encoding schemes, it was designed from the ground up for computers and programs rather than human operators.
Just months after the release of the ASA standard, the ISO (International Organization for Standardization) announced its intention to improve the obvious deficiencies in the encoding scheme. The result was ECMA-6 (from the European Computer Manufacturers Association), later adopted as ISO/IEC 646. ASCII X3.4-1967 was the United States adoption of this 1965 standard, where “ASCII” became the common-use name, and it is still today the basis of many modern character code sets, including Unicode.
Unfortunately, there are numerous names for these identical standards depending on who published or adopted them. ASCII X3.4-1967 was later renamed to ANSI X3.4-1967 (American National Standards Institute) and again to US-ASCII, but it can also be known under its ISO 646 classification. Still to this day, people often shorten the names to either ANSI or ASCII and confusingly mean the same thing, or worse, interchange ANSI for Windows-1252 due to a historical Microsoft mislabelling. For simplicity, I will refer to ASCII X3.4-1967 as ASCII for the remainder of this text.
ANSI: Acronym for the American National Standards Institute. The term “ANSI” as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. “ANSI applications” are usually a reference to non-Unicode or code page–based applications.
ASCII gave American English glyphs a unique binary code that could also be sequentially counted. In fact, the standard US keyboard layout today is still only able to output the same ASCII character code set standardised during the late 1960s. And I would imagine that this basic keyboard layout still greatly influences the syntax of many modern programming languages.
In ASCII X3.4-1967, the upper-case A is encoded as 100 0001 in 7-bit binary. For humans and web developers, it is represented by decimal 65.
B is 100 0010, decimal 66.
C is 100 0011, decimal 67.
D is 100 0100, decimal 68.
And so on.
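These values are easy to confirm in any language; in JavaScript, for instance:

```javascript
// charCodeAt returns the numeric character code, which for these
// letters matches the ASCII values listed above.
console.log("A".charCodeAt(0)); // → 65
console.log("D".charCodeAt(0)); // → 68
console.log((65).toString(2)); // → "1000001", i.e. 100 0001 in 7 bits
console.log(String.fromCharCode(66, 67)); // → "BC"
```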
Despite being a 7-bit encoding scheme with 128 possible characters, only 94 are used as display characters, comprising upper and lowercase letters, numerical digits, common punctuation marks, and mathematical and Fortran programming symbols.
The remainder are known as control characters (or control codes) and were designed to allow computers to share the text with other machines, or to control the formatting on output devices such as displays and printers. It was up to the devices themselves which control characters to support, as different types of machines had different requirements.
Many of these control characters are now redundant and do nothing in a modern computing sense, but some are still in use.
SP spacebar, ESC escape key, HT tab key ↹, SI SO shift keys ⇧, DEL to remove the character at the active position, BS backspace ←, CR and LF enter and return keys ↵.
Interestingly, adhering to the ASCII standard, starting a new line requires sending two control characters, CR then LF: return the cursor to the start of the line, then drop it onto the new line. Windows, DOS and a number of legacy computers still use this method, while Unix (and Linux, OS X and the Amiga) dropped the two-character requirement as unnecessary and wasteful, so on those systems a single LF creates the new line.
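A practical consequence is that code handling text files from both worlds usually normalises line endings before doing anything else; a minimal sketch:

```javascript
// Convert CR LF (DOS/Windows) and lone CR (some legacy systems) to a
// single LF, so the rest of the pipeline only sees one newline style.
function normaliseNewlines(text) {
  return text.replace(/\r\n|\r/g, "\n");
}

console.log(normaliseNewlines("one\r\ntwo\rthree")); // → "one\ntwo\nthree"
```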
Back in 1981 when IBM introduced its IBM PC running PC-DOS, it was not fully ASCII compatible. Depending on the machine’s purpose and market, IBM gave its computers custom character encodings. It designated these encodings as Code Page [number], with the original PC receiving Code Page 437, or CP437. The character glyphs associated with these code pages were stored in the computer’s ROM for easy access by programs.
The IBM PC was 8-bit, so it could support a character set of up to 256 characters. CP437 mostly contains the ASCII standard but dropped support for the control codes, which were instead replaced with programmer-friendly glyphs intended for use with text user interfaces. These same glyphs became the basis of the modern ASCII art scene, despite having nothing to do with ASCII text per se.
There were a number of problems with IBM’s approach, though. The lack of control codes meant that text documents created on a PC had to use glyphs to simulate pseudo control functions, and it was up to the text editor or viewer to decide how to format and display them.
For example, a right-pointing arrow → glyph in DOS is used to mark the end of a file. And there is no proper tabbing support, because an ASCII HT (horizontal tab) control code in DOS can also be used to display a ○ glyph. This makes DOS text less portable, and many of its non-standard glyphs will usually not display on other machines.
In each of these new code pages, many of the more frivolous glyphs were dropped or moved around. But there was no means of embedding in the file which code page was used to author the text; it was left to the reader to work it out themselves.
ECMA-94 is an 8-bit character code standard from early 1985, designed to add better internationalisation to the original 7-bit ASCII X3.4-1967 reference. Its key goal was communication interoperability, so all the unnecessary display characters were ignored or dropped. This is important to note, as many of the block, shade and line characters associated with ASCII art were included in that rejection. The standard became known as ISO/IEC 8859-1 in 1987, with other groups of languages gaining support in subsequent releases.
From its first release, Windows adopted the ISO 8859-1 standard but replaced some of the control codes with additional characters and renamed it Code Page 1252, more commonly known today as Windows-1252. ISO 8859-1 text is compatible with Windows-1252, but not the other way around.
Because of this, text files created in DOS will not accurately display in Windows, and vice versa, without some kind of prior conversion.
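The one-way compatibility comes down to the 0x80-0x9F range, which ISO 8859-1 reserves for control codes but Windows-1252 reassigns to printable characters. A sketch using a handful of those reassignments from the Windows-1252 code chart (the table here is deliberately partial, for illustration only):

```javascript
// Partial table: Windows-1252 reassigns most of 0x80-0x9F; only a few
// entries are shown here.
const win1252Extras = new Map([
  [0x80, "\u20AC"], // € euro sign
  [0x85, "\u2026"], // … horizontal ellipsis
  [0x93, "\u201C"], // “ left double quotation mark
  [0x94, "\u201D"], // ” right double quotation mark
  [0x96, "\u2013"], // – en dash
]);

function decodeByte(b) {
  // Outside 0x80-0x9F, ISO 8859-1 and Windows-1252 agree, and both
  // match the Unicode code point of the same numeric value.
  return win1252Extras.get(b) ?? String.fromCharCode(b);
}

console.log(decodeByte(0x41)); // → "A" (identical in both encodings)
console.log(decodeByte(0x93)); // → "“" (Windows-1252 only)
```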
Thankfully, today much of the web has moved on to Unicode-based encodings, which remove the issues of incompatible character code sets and legacy encodings: each glyph is given a unique identifier, and all languages now share the one common code page.
A will always be represented as U+0041
À as U+00C0
█ as U+2588
‖ as U+2016
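Those identifiers are directly inspectable in JavaScript; for example, for the block and double-bar glyphs above:

```javascript
// codePointAt gives the Unicode scalar value; toString(16) shows the
// familiar U+ hexadecimal form.
console.log("█".codePointAt(0).toString(16)); // → "2588"
console.log("‖".codePointAt(0).toString(16)); // → "2016"
console.log(String.fromCodePoint(0x2584)); // → "▄"
```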
This is the main reason to use the Unicode-compatible UTF-8 encoding to display ASCII art in the browser, and why, until now, it has been rather difficult to accurately display many text files created with MS-DOS in modern web browsers as text.