Composing Good HTML (section 2 of 4)

Physical versus logical character emphasis

Since HTML (and also SGML) is designed to be a device-independent language for describing the content of documents, most of its elements aren't intended to give the author direct control over how the final page layout will look. The major exceptions to this are the character highlighting elements.

There are two types of character highlighting elements -- physical and logical. The physical styles involve things like "italic font" and "boldface", while the logical styles are things like "emphasis", "citation", and "strong." It is strongly recommended that you employ the logical styles rather than the physical styles in your documents. Using the I element to render text in italics will only be effective on those browsers which are capable of displaying italics -- which not all browsers are guaranteed to be able to do. It is far better to encode semantic content -- to describe things in terms of logical styles -- and then allow the browser to display that semantic structure as best it can, given its display capabilities.

So, instead of

<I>italics</I>

you might use

<EM>emphasized</EM> or <CITE>citation</CITE>

and instead of

<B>bold</B>

you might use

<STRONG>strong</STRONG>

This also leaves the possibilities open in the future for more sophisticated uses of these semantic encodings, which have much more inherent meaning than font styles like bold or italic. For example, the Lycos indexing system can take advantage of semantic encoding to create abstracts of documents.

Note: Before you stop using B and I altogether, here's another viewpoint to consider. One argument against logical character styles is that defining them turns out to be a bottomless pit, a fruitless attempt to provide a logical style for every possibility. Physical styles, combined with the context of the text in which they are placed, seem to provide a much richer set without a huge number of tags. Consider the large space of context that can be implied with only the typographical conventions of bold or italic. The only problem is that this contextual space needs a human being to interpret it, which would make some kinds of computer-based rendering (e.g. speech synthesis) difficult, if not impossible.

A picture is worth a thousand words (which is why it takes a thousand times longer to load...)

The title of this section is somewhat facetious, but only somewhat. It's more and more obvious from current Web development efforts that the main attraction of the Web is not hypertext, and it's not an easy interface; the main attraction is the flashy graphics and the alluring promise of multimedia. We shall heroically refrain from commenting on whether this is a good or a bad thing, for the fact remains that online multimedia is here to stay. What we will comment on are the issues that must be considered in order to use multimedia to best effect.

The first set of issues revolves around the faux sense of page design one can get by using inline images. An early example of this came from one of the first commercial forays into the Web, a graphic design house which advertised professional layout services for online brochures. They spent quite a bit of time designing graphic images of the proper width so that they could achieve page-layout effects like right justification and centering, and created a page which was fairly well-designed. However, they got bitten because this design relied on the browser's window being the default width for X Mosaic. With a wider window, the carefully aligned logo in the upper right corner was immediately followed, on the same line, by the image that should have been left justified on the following line.

Current browsers implement some better forms of layout control for images. For example, an author can specify the way in which text will flow around an image with the ALIGN attribute of the IMG element. Figures 8 and 9 exemplify this; the former has no text-flow information, and the latter does. This is not perfect, as using ALIGN can cause strange stair-stepping effects if there is not enough text separating two images, as figure 10 illustrates. If the desired effect is of images with captions, a table is probably the best approach for layout purposes (figure 11).

[Figure 8: IMG without the ALIGN attribute (Netscape)]

[Figure 9: IMG with the ALIGN attribute (Netscape)]

[Figure 10: Stair-stepping due to ALIGN (Netscape)]

[Figure 11: Using TABLE for layout (Netscape)]
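
As a rough sketch of the two approaches just described: the file name and caption text below are made up for illustration, and both the text-flow values of ALIGN and the TABLE element are later additions to HTML 2.0, so support will vary from browser to browser.

```html
<!-- Text flows down the right-hand side of the image -->
<IMG SRC="cthulhu.gif" ALT="[Portrait of Cthulhu]" ALIGN="left">
Text placed after the image wraps alongside it, in browsers
which support this kind of alignment.
<BR CLEAR="all">

<!-- An image with a caption, laid out as a one-column table -->
<TABLE>
<TR><TD><IMG SRC="cthulhu.gif" ALT="[Portrait of Cthulhu]"></TD></TR>
<TR><TD>Cthulhu at rest</TD></TR>
</TABLE>
```

The BR with CLEAR="all" ends the text flow, so that whatever follows starts below the image rather than beside it.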

Another consideration is unnecessary duplication of effort. Many authors swear by colored bullets and colorful horizontal rules, implementing both effects by using inlined images rather than the structural markup. Doing this can leave the portion of your audience which is unable (or unwilling) to view inlined images out of the loop, and can also negate some of the benefits provided by structural markup. There is also an unexpected side effect to using many small images: the current way in which Web clients retrieve documents requires that a separate connection to a Web server be initiated for each image. The time involved in negotiating this connection may actually be larger than the time involved in retrieving the image itself. Consider whether the effect achieved by the "enhanced" layout justifies the cost.
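
To make the comparison concrete, here is a small sketch of the same list marked up both ways. The icon path and the list items are invented for the example.

```html
<!-- Image bullets: one extra connection per distinct image,
     and no list structure for the browser to work with -->
<IMG SRC="/icons/redball.gif" ALT="*"> Elder signs<BR>
<IMG SRC="/icons/redball.gif" ALT="*"> Forbidden tomes<BR>
<IMG SRC="/icons/rainbow-rule.gif" ALT="">

<!-- Structural markup: the browser chooses the bullet and the rule -->
<UL>
<LI>Elder signs
<LI>Forbidden tomes
</UL>
<HR>
```

The second form is less colorful, but it renders sensibly everywhere, including on browsers that never load an image.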

Another concern is the size of images. With the increasing home popularity of the Internet, more and more users are purchasing dial-up connections of one sort or another. These may be of the strict "shell-account" variety, which means that your readers will not see images at all, or they may be of the SLIP/PPP variety, which means that your readers will receive, on average, only 14,400 bits of information per second. This is not a large number, and huge images can take minutes to load (a 200-kilobyte image needs nearly two minutes at that rate, before any protocol overhead). Bear this in mind when selecting images: will the image take so long to load that your reader will go somewhere else rather than wait?

The image size issue can be alleviated in several ways. First, the increasing popularity of the JPEG format means that images can be compressed to much smaller sizes, which provides dramatic speed-up in image load time. Even better results can be achieved by using fewer colors (gray scale, rather than full 24-bit color, for example). Another approach is to use a small set of navigational icons which appear on every page in your Web. Most browsers now cache documents and images; using the same icons (and using the same URL to refer to them, perhaps by maintaining an /icons directory on your Web server) means that the reader will only incur the cost of downloading them once.
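
A shared navigation bar along these lines might look like the following sketch. The icon names and link targets are assumptions made for the example; the point is only that every page refers to the icons by exactly the same URLs.

```html
<!-- Repeated verbatim on every page; a caching browser
     fetches each icon only the first time it is seen -->
<A HREF="/"><IMG SRC="/icons/home.gif" ALT="[Home]"></A>
<A HREF="/search.html"><IMG SRC="/icons/search.gif" ALT="[Search]"></A>
<A HREF="/help.html"><IMG SRC="/icons/help.gif" ALT="[Help]"></A>
```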

Also, when using the IMG element, don't forget to use the ALT attribute. The ALT attribute allows alternate text to be specified for an inlined image. This is especially useful for images that have specific meaning (and provide a link to other documents), as that meaning can be lost on those who do not have images loaded. For example:

<IMG SRC="http://www.miskatonic.edu/icons/next.gif">

can be better represented with the addition of the following ALT attribute:

<IMG SRC="http://www.miskatonic.edu/icons/next.gif" ALT="[Next Page]">

as shown in figures 12 through 16.

[Figure 12: The Document As Expected (Netscape)]

[Figure 13: Inlined Images Off/No ALT Tag (Netscape)]

[Figure 14: Text Browser/No ALT Tag (Lynx)]

[Figure 15: Inlined Images Off/ALT Tag Supplied (Netscape)]

[Figure 16: Text Browser/ALT Tag Supplied (Lynx)]

Finally, don't rely entirely on image maps and graphic logos to build your site. There are a few sites which have almost no textual content whatsoever; when visited by readers who do not (or cannot) load images, there is no information available. This is not to say that image maps must be avoided altogether. Instead, provide alternative means of navigation which supplement the image map, such as explanatory text which follows your map.
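
One way to do this is sketched below. The image map URL, image name, and page names are all hypothetical; the ISMAP attribute marks the image as a server-side image map, and the text links underneath reach the same destinations.

```html
<A HREF="/cgi-bin/imagemap/campus"><IMG SRC="campus-map.gif"
   ALT="[Campus map]" ISMAP></A>
<P>
[ <A HREF="library.html">Library</A> |
  <A HREF="admissions.html">Admissions</A> |
  <A HREF="faculty.html">Faculty</A> ]
```

Readers with images turned off, or using a text browser, can ignore the map entirely and still get everywhere.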

Common Errors

This section details common errors in HTML composition that may lead to documents which are not fully device-independent. The behaviors of these errors are undefined, so certain browsers may render them as intended, but not all browsers are guaranteed to do so. Therefore, these mistakes should be avoided, even if your browser of choice renders your documents correctly.

These errors are, for the most part, artifacts of "raw" HTML authoring. Web development has suffered from a lack of good authoring tools, a situation which is only now beginning to be rectified. Many of these errors involve typos or simple mistakes, although others deal with more fundamental conceptual problems.

Paragraph element errors

The use of the paragraph element (P) can be confusing. When HTML was first introduced, <P> served as a paragraph separator, not as an end-of-paragraph marker; a confusion which originally prompted this document. However, the HTML 2.0 and later specifications have changed this behavior.

The currently recommended usage is to place the P start tag at the beginning of each paragraph; for example:

<P> In this paragraph, our hero discovers that he really likes
baloney sandwiches. He also listens to some disco, and has a
lovely beverage. Ah, if only all paragraphs were this exciting!

This is in contrast to previous usage, where the <P> was usually placed at the end of the paragraph.

Still, in certain contexts, use of <P> should be avoided, such as directly before any other element which already implies a paragraph break.

To wit, the P element should not be placed before the headings, HR, ADDRESS, BLOCKQUOTE, or PRE.

It should also not be placed immediately before a list element of any stripe. That is, a <P> should not be used to mark the end-of-text for <LI>, <DT> or <DD>. These elements already imply paragraph breaks.
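
The rules above can be illustrated with a short sketch; the heading and list contents are invented for the example.

```html
<!-- Avoid: each P start tag here sits next to an element
     which already implies a paragraph break -->
<P>
<H2>Sandwich preferences</H2>
<P>
<UL>
<LI>Baloney<P>
<LI>More baloney<P>
</UL>

<!-- Prefer: a P start tag only at the start of an actual paragraph -->
<H2>Sandwich preferences</H2>
<P>Our hero's favorites, in order:
<UL>
<LI>Baloney
<LI>More baloney
</UL>
```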

Caveats

Some clarifications on the above might be in order. One is the difficulty of rendering appropriate white space by a browser. While it is true that all of the elements mentioned above imply a paragraph break, this only occasionally means that they also imply white space between sections -- this depends on the browser. So, while you might feel inclined to add a <P> in order to fix white space problems, please think twice and avoid it if you can.

Also, when using the glossary list (DL), please try to avoid using multiple DDs (definitions of terms) in order to provide multiple entries for a term (DT). Instead, use a <P> tag between paragraphs in a definition.
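
A glossary entry with a two-paragraph definition might be sketched like this (the definition text is only an illustration):

```html
<DL>
<DT>Euphonium
<DD>A brass instrument pitched in B-flat, resembling a small tuba.
<P>Known to its admirers as the king of instruments.
</DL>
```

Both paragraphs live inside the single DD, so the term keeps exactly one definition.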

All clear now?

Character and entity reference errors

Simply put, a character reference and an entity reference are ways to represent information that might otherwise be interpreted as a markup tag. For example, consider the rendered HTML document in figure 17.

[Figure 17: Properly escaping character entities (Arena)]

The source which produces this document, which uses entities, looks like:

In order to represent the &quot;&lt;P&gt;&quot; in this text, I had to use &amp;lt;P&amp;gt; in my raw HTML.

In this example, the &lt; becomes "<", the &gt; becomes ">", the &quot; becomes a quotation mark, and the &amp; becomes "&" (which is needed in order to represent the text &lt; in the document without the text being turned into "<"). There are currently four entities for this purpose in HTML, as well as several entities which allow encoding of the ISO Latin-1 Character Set.

The most common error in the use of entities is to leave off the trailing semicolon. Also, no additional spaces are needed before or after the entity/character reference. Here are some examples of incorrect usage:

Doug &amp Chris went out for a walk.
A paragraph break can be represented with
&quote; &lt; P &gt; &quote;

Can you spot the errors in the above examples? They are:

In the first line, "&amp" needs to have a semicolon after it.
In the third line, "&quote;" should be "&quot;" (this is subtle and annoying, much like the Unix system call, creat())
There should be no spaces in the third line, which should read: "&quot;&lt;P&gt;&quot;".
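
With those fixes applied, the raw HTML for the three lines would read something like this:

```html
Doug &amp; Chris went out for a walk.
A paragraph break can be represented with
&quot;&lt;P&gt;&quot;
```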

URL errors

Another misunderstood aspect of Web document composition is in the creation of URLs.

Directory reference errors

One grey area involves references to directories. It is possible to request an index of a directory from an HTTP server. The typical response from the server is to either return a pre-generated index document (which is often the document "index.html" in the referenced directory), or to construct an HTML document on the fly which contains a listing of all files in the directory. However, when making such a directory reference, it is important to make sure to have a trailing slash on the URL. That is, if you were to request the index of my home page, you would want to refer to it as http://www.cs.cmu.edu/~tilt/, not as http://www.cs.cmu.edu/~tilt.

Many servers are able to catch these errors, and provide redirection to the proper URL, but it's best to get the URL right in the first place -- notably because not all browsers support transparent redirection. Also, getting this correct the first time means it will take less time for the page to be loaded; your readers won't have to wait through the time needed to open two (or more) HTTP connections.
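
In link form, the difference is a single character. The link text here is just a placeholder:

```html
<!-- Relies on the server to redirect, and on the browser to follow -->
<A HREF="http://www.cs.cmu.edu/~tilt">Home page</A>

<!-- Refers to the directory index directly -->
<A HREF="http://www.cs.cmu.edu/~tilt/">Home page</A>
```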

Not using fully qualified domain names

Problems can arise when the hostnames in URLs aren't fully qualified. Within a local network, a machine can often be simply referred to by its host name. For example, the domain miskatonic.edu might have in it a WWW server with the host name www. Readers within that domain can refer to the machine by this name. However, the server's fully qualified domain name is www.miskatonic.edu. This fully qualified domain name provides enough information that any host, anywhere on the Internet, can find this particular machine.

What happens is that an HTML author might construct a link that looks like this:

<A HREF="http://www/~tilt/metanoia/">Metanoia -- A Change In Spirit</A>

which produces a link to "Metanoia -- A Change In Spirit" that will only work for people on the same local network as that machine. A correct link would look like this, instead:

 <A HREF="http://www.cs.cmu.edu/~tilt/metanoia/">Metanoia -- A Change In Spirit</A>

which would allow all of the readers who are interested in Metanoia -- even those living in Freedonia -- to actually follow the link.

Along those same lines, be careful in using URLs of the scheme "file:". It's possible to have a reference to file://localhost/some/file/pathname. What this does is reference the named file on the local host of whoever is browsing the document, which is why a reference to <A HREF="file://localhost/etc/motd">the message of the day</A> will display the message of the day on your machine, not the message of the day on my machine. This makes several assumptions about your reader's local machine and network which you probably shouldn't be making. Unless you know what you are doing (and probably even then), references of this type will really mess up your Web.

Missing quotes in start tags

One common error, especially with the current lack of widely available and useful authoring tools, is to leave off a quote in the attributes of tags. For example, this reference to the euphonium, king of instruments, should look like:

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium/">

but people composing "raw" HTML from a text editor will often instead type

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium/>

It's likely that by the end of that huge URL, the author had forgotten it was supposed to be quoted. The behavior of browsers upon encountering this varies -- some display a proper link, but you can't follow it, while others actually eat up huge portions of the following text, thinking everything up until the next quotation mark to be part of the URL.

Missed end tags

Many of the HTML elements contain information within them. For example, <EM>emphasized text</EM> would be rendered as emphasized text. There is a start tag (<EM>), some content (which may include text, and in some cases, other nested elements), and an end tag (</EM>, indicated by the </). A common mistake is to miss the / in the end tag. All elements (except empty elements, below) must be terminated by an end tag -- otherwise, undefined behavior may occur.

Some HTML elements may be empty, such as <BR> and <HR> (the HTML 2.0 specification provides more information about element content). If this is the case, there is no need for an end tag.
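
A brief sketch of the mistake and its fix, using made-up sentence text:

```html
<!-- Wrong: the second tag is missing its slash,
     so the emphasis is never closed -->
This is <EM>very<EM> important.

<!-- Right: the start tag is matched by an end tag -->
This is <EM>very</EM> important.

<!-- An empty element stands alone, with no end tag -->
<HR>
```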

Using white space around element tags

In general, the use of white space around element tags should be avoided. For example, if white space immediately follows a start tag, the style changes implied by that element may be applied to the initial space as well. For instance,

You really should
<A HREF="http://www.cs.cmu.edu/~tilt/"> CZeCh THIZ 0uT </A> !

would be rendered in Netscape as shown in figure 18, and in Lynx as shown in figure 19.

[Figure 18: Improper use of whitespace (and spelling and punctuation, too) (Netscape)]

[Figure 19: Improper use of whitespace (Lynx)]

On some browsers, there may be white space around the anchor, which adds unwanted unsightliness to the rendering, and may lessen the impact of the document. (This comment really applies to white space immediately following start tags, and immediately preceding end tags.)
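
A tidier version of the example above, with the white space moved outside the anchor (and the spelling repaired while we're at it), might look like:

```html
You really should
<A HREF="http://www.cs.cmu.edu/~tilt/">check this out</A>!
```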
