Son-of-RFC 1036
News Article Format and Transmission

1. Introduction

Network news articles resemble mail messages but are broadcast to potentially-large audiences, using a flooding algorithm that propagates one copy to each interested host (or groups thereof), typically stores only one copy per host, and does not require any central administration or systematic registration of interested users. Network news originated as the medium of communication for Usenet, circa 1980. Since then Usenet has grown explosively, and many Internet sites participate in it. In addition, the news technology is now in widespread use for other purposes, on the Internet and elsewhere.

The earliest news interchange used the so-called "A News" article format. Shortly thereafter, an article format vaguely resembling Internet mail was devised and used briefly. Both of those formats are completely obsolete; they are documented in appendix A for historical reasons only. With publication of RFC 850 [rrr] in 1983, news articles came to closely resemble Internet mail messages, with some restrictions and some additional headers. RFC 1036 [rrr] in 1987 updated RFC 850 without making major changes.

In the intervening five years, the RFC 1036 article format has proven quite satisfactory, although minor extensions appear desirable to match recent developments in areas such as multi-media mail. RFC 1036 itself has not proven quite so satisfactory. It is often rather vague and does not address some issues at all; this has caused significant interoperability problems at times, and implementations have diverged somewhat. Worse, although it was intended primarily to document existing practice, it did not precisely match existing practice even at the time it was published, and the deviations have grown since.

This Draft attempts to specify the format of articles, and the procedures used to exchange them and process them, in sufficient detail to allow full interoperability. In addition, some tentative suggestions are made about directions for future development, in an attempt to avert unnecessary divergence and consequent loss of interoperability. Major extensions (e.g. cryptographic authentication) that need significant development effort are left to be undertaken as independent efforts.

NOTE: One question this all may raise is: why is there no News-Version header, analogous to MIME-Version, specifying a version number corresponding to this specification? The answer is: it doesn't appear to be useful, given news's backward-compatibility constraints. The major use of a version number is indicating which of several INCOMPATIBLE interpretations is relevant. The impossibility of orchestrating any sort of simultaneous change over news's installed base makes it necessary to avoid such incompatible changes (as opposed to extensions) entirely. MIME has a version number mostly because it introduced incompatible changes to the interpretation of several "Content-" headers. This Draft attempts no changes in interpretation and it appears doubtful that future Drafts will find it feasible to introduce any.

UNRESOLVED ISSUE: Should this be reconsidered? Only if the header has SPECIFIC IDENTIFIABLE uses today. Otherwise it's just useless added bulk.

As in this Draft's predecessors, the exact means used to transmit articles from one host to another is not specified. NNTP [rrr] is probably the most common transmission method on the Internet, but a number of others are known to be in use, including the UUCP protocol [rrr] extensively used in the early days of Usenet and still much used on its fringes today.

Several of the mechanisms described in this Draft may seem somewhat strange or even bizarre at first reading. As with Internet mail, there is no reasonable possibility of updating the entire installed base of news software promptly, so interoperability with old software is crucial and will remain so. Compatibility with existing practice and robustness in an imperfect world necessarily take priority over elegance.

2. Definitions, Notations, and Conventions

2.1. Textual Notations

Throughout this Draft, "MAIL" is short for "RFC 822 [rrr] as amended by RFC 1123 [rrr]". (RFC 1123's amendments are mostly relatively small, but they are not insignificant.) See also the discussion in section 3 about this Draft's relationship to MAIL. "MIME" is short for "RFCs 1341 and 1342" (or their updated replacements).

UNRESOLVED ISSUE: Update these numbers.

"ASCII" is short for "the ANSI X3.4 character set" [rrr]. While "ASCII" is often misused to refer to various character sets somewhat similar to X3.4, in this Draft, "ASCII" means X3.4 and only X3.4.

NOTE: The name is traditional (to the point where the ANSI standard sanctions it) even though it is no longer an acronym for the name of the standard.

NOTE: ASCII, X3.4, contains 128 characters, not all of them printable. Character sets with more characters are not ASCII, although they may include it as a subset.

Certain words used to define the significance of individual requirements are capitalized. "MUST" means that the item is an absolute requirement of the specification. "SHOULD" means that the item is a strong recommendation: there may be valid reasons to ignore it in unusual circumstances, but this should be done only after careful study of the full implications and a firm conclusion that it is necessary, because there are serious disadvantages to doing so. "MAY" means that the item is truly optional, and implementors and users are warned that conformance is possible but not to be relied on.

The term "compliant", applied to implementations etc., indicates satisfaction of all relevant "MUST" and "SHOULD" requirements. The term "conditionally compliant" indicates satisfaction of all relevant "MUST" requirements but violation of at least one relevant "SHOULD" requirement.

This Draft contains explanatory notes using the following format. These may be skipped by persons interested solely in the content of the specification. The purpose of the notes is to explain why choices were made, to place them in context, or to suggest possible implementation techniques.

NOTE: While such explanatory notes may seem superfluous in principle, they often help the less-than-omniscient reader grasp the purpose of the specification and the constraints involved. Given the limitations of natural language for descriptive purposes, this improves the probability that implementors and users will understand the true intent of the specification in cases where the wording is not entirely clear.

All numeric values are given in decimal unless otherwise indicated. Octets are assumed to be unsigned values for this purpose. Large numbers are written using the North American convention, in which "," separates groups of three digits but otherwise has no significance.

2.2. Syntax Notation

Although the mechanisms specified in this Draft are all described in prose, most are also described formally in the modified BNF notation of RFC 822. Implementors will need to be familiar with this notation to fully understand this specification, and are referred to RFC 822 for a complete explanation of the modified BNF notation. Here is a brief illustrative example:

     sentence  = clause *( punct clause ) "."
     punct     = ":" / ";"
     clause    = 1*word [ "(" clause ")" / "," 1*word ]
     word      = <any English word>

This defines a sentence as some clauses separated by puncts and ended by a period, a punct as a colon or semicolon, a clause as at least one <word> optionally followed by either a parenthesized clause or a comma and at least one more <word>, and a <word> as (informally) any English word. <> are used to enclose names when (and only when) distinguishing them from surrounding text is useful. The full form of the repetition notation is <m>"*"<n><thing>, denoting <m> through <n> repetitions of <thing>; <m> defaults to zero, <n> to infinity, and the "*" and <n> can be omitted if <m> and <n> are equal, so 1*word is one or more words, 1*5word is one through five words, and 2word is exactly two words.

The character "\" is not special in any way in this notation.
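As an illustration only, the repetition notation described above can be decoded mechanically. The following Python sketch is not part of the specification; the function name parse_repetition and the use of None to stand for "infinity" are assumptions made for the example.

```python
import re

# Hypothetical helper: splits a repetition prefix such as "1*5" or "2"
# from the name of the thing being repeated.
_REPEAT = re.compile(r"^(\d*)(\*?)(\d*)(.+)$")

def parse_repetition(item):
    """Return (minimum, maximum, thing); maximum None means no upper bound."""
    match = _REPEAT.match(item)
    if match is None:
        raise ValueError("not a repetition item: %r" % item)
    low, star, high, thing = match.groups()
    if star:
        # <m>"*"<n><thing>: <m> defaults to zero, <n> to infinity.
        minimum = int(low) if low else 0
        maximum = int(high) if high else None
    elif low:
        # "*" and <n> omitted: exactly <m> repetitions.
        minimum = maximum = int(low)
    else:
        # No prefix at all: a single occurrence.
        minimum = maximum = 1
    return minimum, maximum, thing
```

With this sketch, the three examples in the text decode as one or more, one through five, and exactly two repetitions of "word" respectively.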

This Draft is intended to be self-contained; all syntax rules used in it are defined within it, and a rule with the same name as one found in MAIL does not necessarily have the same definition. The lexical layer of MAIL is NOT, repeat NOT, used in this Draft, and its presence must not be assumed; notably, this Draft spells out all places where white space is permitted/required and all places where constructs resembling MAIL comments can occur.

NOTE: News parsers historically have been much less permissive than MAIL parsers.

2.3. Definitions

The term "character set", wherever it is used in this Draft, refers to a coded character set, in the sense of ISO character set standardization work, and must not be misinterpreted as meaning merely "a set of characters".

In this Draft, ASCII character 32 is referred to as "blank"; the word "space" has a more generic meaning.

An "article" is the unit of news, analogous to a MAIL "message".

A "poster" is a human being (or software equivalent) submitting a possibly-compliant article to be "posted": made available for reading on all relevant hosts. A "posting agent" is software that assists posters to prepare articles, including determining whether the final article is compliant, passing it on to a relayer for posting if so, and returning it to the poster with an explanation if not. A "relayer" is software which receives allegedly-compliant articles from posting agents and/or other relayers, files copies in a "news database", and possibly passes copies on to other relayers.

NOTE: While the same software may well function both as a relayer and as part of a posting agent, the two functions are distinct and should not be confused. The posting agent's purpose is (in part) to validate an article, supply header information that can or should be supplied automatically, and generally take reasonable actions in an attempt to transform the poster's submission into a compliant article. The relayer's purpose is to move already-compliant articles around efficiently without damaging them.

A "reader" is a human being reading news articles. A "reading agent" is software which presents articles to a reader.

NOTE: Informal usage often uses "reader" for both these meanings, but this introduces considerable potential for confusion and misunderstanding, so this Draft takes care to make the distinction.

A "newsgroup" is a single news forum, a logical bulletin board, having a name and nominally intended for articles on a specific topic. An article is "posted to" a single newsgroup or several newsgroups. When an article is posted to more than one newsgroup, it is said to be "cross-posted"; note that this differs from posting the same text as part of each of several articles, one per newsgroup. A "hierarchy" is the set of all newsgroups whose names share a first component (see the name syntax in section 5.5).

A newsgroup may be "moderated", in which case submissions are not posted directly, but mailed to a "moderator" for consideration and possible posting. Moderators are typically human but may be implemented partially or entirely in software.

A "followup" is an article containing a response to the contents of an earlier article (the followup's "precursor"). A "followup agent" is a combination of reading agent and posting agent that aids in the preparation and posting of a followup.

Text comparisons are "case-sensitive" if they consider uppercase letters (e.g. "A") different from lowercase letters (e.g. "a"), and "case-insensitive" if letters differing only in case (e.g. "A" and "a") are considered identical. Categories of text are said to be case-(in)sensitive if comparisons of such texts to others are case-(in)sensitive.

A "cooperating subnet" is a set of news-exchanging hosts which is sufficiently well-coordinated (typically via a central administration of some sort) that stronger assumptions can be made about hosts in the set than about news hosts in general. This is typically used to relax restrictions which are otherwise required for worst-case interoperability; members of a cooperating subnet MAY interchange articles that do not conform to this Draft's specifications, provided all members have agreed to this and provided the articles are not permitted to leak out of the subnet. The word "subnet" is used to emphasize that a cooperating subnet is typically not an isolated universe; care must be taken that traffic leaving the subnet complies with the restrictions of the larger net, not just those of the cooperating subnet.

A "message ID" is a unique identifier for an article, usually supplied by the posting agent which posted it. It distinguishes the article from every other article ever posted anywhere (in theory). Articles with the same message ID are treated as identical copies of the same article even if they are not in fact identical.
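As an illustration only, the rule that articles with the same message ID are treated as copies of one article can be sketched as a simple membership test. The class name History, its in-memory set, and the sample message IDs used to exercise it are assumptions made for the example; this Draft does not prescribe how a relayer records the message IDs it has seen.

```python
class History:
    """Hypothetical record of the message IDs a relayer has already filed."""

    def __init__(self):
        self._seen = set()

    def accept(self, message_id):
        """Return True if the article is new and should be filed.

        The comparison is on the message ID alone; the rest of the
        article is never examined, so a differing article carrying an
        already-seen ID is treated as a duplicate.
        """
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        return True
```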

A "gateway" is software which receives news articles and converts them to messages of some other kind (e.g. mail to a mailing list), or vice-versa; in essence it is a translating relayer that straddles boundaries between different methods of message exchange. The most common type of gateway connects newsgroup(s) to mailing list(s), either unidirectionally or bidirectionally, but there are also gateways between news networks using this Draft's news format and those using other formats.

A "control message" is an article which is marked as containing control information; a relayer receiving such an article will (subject to permissions etc.) take actions beyond just filing and passing on the article.

NOTE: "Control article" would be more consistent terminology, but "control message" is already well established.

An article's "reply address" is the address to which mailed replies should be sent. This is the address specified in the article's From header (see section 5.2), unless it also has a Reply-To header (see section 6.3).
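As an illustration only, the choice of reply address can be sketched as follows. The function name reply_address and the representation of headers as a dictionary keyed by lowercased header name are assumptions made for the example.

```python
def reply_address(headers):
    """Return the content of Reply-To if present, else the content of From.

    `headers` is assumed to map lowercased header names to header
    content. An empty Reply-To is ignored, on the principle that an
    empty header should not alter the semantics of an article.
    """
    return headers.get("reply-to") or headers["from"]
```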

The notation (e.g.) "(ASCII 17)" following a name means "this name refers to the ASCII character having value 17". An "ASCII printable character" is an ASCII character in the range 33-126. An "ASCII control character" is an ASCII character in the range 0-31, or the character DEL (ASCII 127). A "non-ASCII character" is a character having a value exceeding 127.

NOTE: Blank is neither an "ASCII printable character" nor an "ASCII control character".
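As an illustration only, the character categories defined above can be expressed as a small classifier. The function name classify and the returned strings are assumptions made for the example.

```python
def classify(ch):
    """Classify a single character by the categories of section 2.3."""
    code = ord(ch)
    if code > 127:
        return "non-ASCII"
    if code == 32:
        # Blank is deliberately in neither of the other ASCII categories.
        return "blank"
    if code <= 31 or code == 127:
        return "ASCII control"
    return "ASCII printable"  # the range 33-126
```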

2.4. End Of Line

How the end of a text line is represented depends on the context and the implementation. For Internet transmission via protocols such as SMTP [rrr], an end-of-line is a CR (ASCII 13) followed by an LF (ASCII 10). ISO C [rrr] and many modern operating systems indicate end-of-line with a single character, typically ASCII LF (aka "newline"), and this is the normal convention when news is transmitted via UUCP. A variety of other methods are in use, including out-of-band methods in which there is no specific character that means end-of-line.

This Draft does not constrain how end-of-line is represented in news, except that characters other than CR and LF MUST not be usurped for use in end-of-line representations. Also, obviously, all software dealing with a particular copy of an article must agree on the convention to be used. "EOL" is used to mean "whatever end-of-line representation is appropriate"; it is not necessarily a character or sequence of characters.

NOTE: If faced with picking an EOL representation in the absence of other constraints, use of a single character simplifies processing, and the ASCII standard [rrr] specifies that if one character is to be used for this purpose, it should be LF (ASCII 10).

NOTE: Inside MIME encodings, use of the Internet canonical EOL representation (CR followed by LF) is mandatory. See [rrr].
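As an illustration only, conversion between the two EOL representations named above (a single LF, and CR followed by LF) can be sketched as follows. The function names to_wire and to_local are assumptions made for the example, and the sketch assumes the article is held as a single string; it leaves any isolated CR untouched.

```python
def to_wire(text):
    """Convert LF line ends to CR LF, leaving existing CR LF pairs as they are."""
    # Normalizing first avoids turning an existing CR LF into CR CR LF.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")

def to_local(text):
    """Convert CR LF line ends to a single LF."""
    return text.replace("\r\n", "\n")
```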

2.5. Case-Sensitivity

Text in newsgroup names, header parameters, etc. is case-sensitive unless stated otherwise.

NOTE: This is at variance with MAIL, which is case-insensitive unless stated otherwise, but is consistent with news historical practice and existing news software. See the comments on backward compatibility in section 1.

2.6. Language

Various constant strings in this Draft, such as header names and month names, are derived from English words. Despite their derivation, these words do NOT change when the poster or reader employing them is interacting in a language other than English. Posting and reading agents SHOULD translate as appropriate in their interaction with the poster or reader, but the forms that actually appear in articles are always the English-derived ones defined in this Draft.

3. Relation To MAIL (RFC 822 etc.)

The primary intent of this Draft is to completely describe the news article format as a subset of MAIL's message format augmented by some new headers. Unless explicitly noted otherwise, the intent throughout is that an article MUST also be a valid MAIL message.

NOTE: Despite obvious similarities between news and mail, opinions vary on whether it is possible or desirable to unify them into a single service. However, it is unquestionably both possible and useful to employ some of the same tools for manipulating both mail messages and news articles, so there is specific advantage to be had in defining them compatibly. Furthermore, there is no apparent need to re-invent the wheel when slight extensions to an existing definition will suffice.

Given that this Draft attempts to be self-contained, it inevitably contains considerable repetition of information found in MAIL. This raises the possibility of unintentional conflicts. Unless specifically noted otherwise, any wording in this Draft which permits behavior that is not MAIL-compliant is erroneous and should be followed only to the extent that the result remains compliant with MAIL.

NOTE: RFC 1036 said "where this standard conflicts with [RFC 822], RFC-822 should be considered correct and this standard in error". Taken literally, this was obviously incorrect, since RFC 1036 imposed a number of restrictions not found in RFC 822. The intent, however, was reasonable: to indicate that UNINTENTIONAL differences were errors in RFC 1036.

Implementors and users should note that MAIL is deliberately an extensible standard, and most extensions devised for mail are also relevant to (and compatible with) news. Note particularly MIME [rrr], summarized briefly in appendix B, which extends MAIL in a number of useful ways that are definitely relevant to news. Also of note is the work in progress on reconciling PEM (Privacy Enhanced Mail, which defines extensions for authentication and security) with MIME, after which this may also be relevant to news.

UNRESOLVED ISSUE: Update the MIME/PEM information.

Similarly, descriptions here of MIME facilities should be considered correct only to the extent that they do not require or legitimize practices that would violate those RFCs. (Note that this Draft does extend the application of some MIME facilities, but this is an extension rather than an alteration.)

4. Basic Format

4.1. Overall Syntax

The overall syntax of a news article is:

     article         = 1*header separator body
     header          = start-line *continuation
     start-line      = header-name ":" space [ nonblank-text ] eol
     continuation    = space nonblank-text eol
     header-name     = 1*name-character *( "-" 1*name-character )
     name-character  = letter / digit
     letter          = <ASCII letter A-Z or a-z>
     digit           = <ASCII digit 0-9>
     separator       = eol
     body            = *( [ nonblank-text / space ] eol )
     eol             = <EOL>
     nonblank-text   = [ space ] text-character *( space-or-text )
     text-character  = <any ASCII character except NUL (ASCII 0),
                         HT (ASCII 9), LF (ASCII 10), CR (ASCII 13),
                         or blank (ASCII 32)>
     space           = 1*( <HT (ASCII 9)> / <blank (ASCII 32)> )
     space-or-text   = space / text-character

An article consists of some headers followed by a body. An empty line separates the two. The headers contain structured information about the article and its transmission. A header begins with a header name identifying it, and can be continued onto subsequent lines by beginning the continuation line(s) with white space. (Note that section 4.2.3 adds some restrictions to the header syntax indicated here.) The body is largely-unstructured text significant only to the poster and the readers.

NOTE: Terminology here follows the current custom in the news community, rather than the MAIL convention of (sometimes) referring to what is here called a "header" as a "header field" or "field".

Note that the separator line must be truly empty, not just a line containing white space. Further empty lines following it are part of the body, as are empty lines at the end of the article.

NOTE: Some systems make no distinction between empty lines and lines consisting entirely of white space; indeed, some systems cannot represent entirely empty lines. The grammar's requirement that header continuation lines contain some printable text is meant to ensure that the empty/space distinction cannot confuse identification of the separator line.

NOTE: It is tempting to authorize posting agents to strip empty lines at the beginning and end of the body, but such empty lines could possibly be part of a preformatted document.
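As an illustration only, locating the separator line can be sketched as follows. The function name split_article is an assumption made for the example, and the sketch assumes the article is held as a string using a single LF as its EOL representation. It does not validate the header lines against the grammar.

```python
def split_article(article):
    """Split an article into (header_lines, body) at the separator line.

    The separator is the first truly empty line, i.e. the first place
    where one EOL is immediately followed by another. A line holding
    only white space does not match, so it is not mistaken for the
    separator.
    """
    index = article.find("\n\n")
    if index < 0:
        raise ValueError("no separator line found")
    header_lines = article[:index].split("\n")
    # Everything after the separator is body, including any further
    # empty lines at its beginning or end.
    body = article[index + 2:]
    return header_lines, body
```

Note that the body is returned untouched, in keeping with the observation that leading and trailing empty lines might be part of a preformatted document.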

Implementors are warned that trailing white space, whether alone on the line or not, MAY be significant in the body, notably in early versions of the "uuencode" encoding for binary data. Trailing white space MUST be preserved unless the article is known to have originated within a cooperating subnet that avoids using significant trailing white space, and SHOULD be preserved regardless. Posters SHOULD avoid using conventions or encodings which make trailing white space significant; for encoding of binary data, MIME's "base64" encoding is recommended. Implementors are warned that ISO C implementations are not required to preserve trailing white space, and special precautions may be necessary in implementations which do not.

NOTE: Unfortunately, the signature-delimiter convention (described in section 4.3.2) does use significant trailing white space. It's too late to fix this; there is work underway on defining an organized signature convention as part of MIME, which is a preferable solution in the long run.

Posters are warned that some very old relayer software misbehaves when the first non-empty line of an article body begins with white space.

4.2. Headers

4.2.1. Names and Contents

Despite the restrictions on header-name syntax imposed by the grammar, relayers and reading agents SHOULD tolerate header names containing any ASCII printable character other than colon (":", ASCII 58).

NOTE: MAIL header names can contain any ASCII printable character (other than colon) in theory, but in practice, arbitrary header names are known to cause trouble for some news software. Section 4.1's restriction to alphanumeric sequences separated by hyphens is believed to permit all widely-used header names without causing problems for any widely-used software. Software is nevertheless encouraged to cope correctly with the full range of possibilities, since aberrations are known to occur.
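As an illustration only, the two levels of header-name strictness discussed above (the grammar of section 4.1, and the wider set that relayers and reading agents SHOULD tolerate) can be sketched with two patterns. The names STRICT_NAME, TOLERANT_NAME, is_strict_name and is_tolerable_name are assumptions made for the example.

```python
import re

# header-name = 1*name-character *( "-" 1*name-character )
STRICT_NAME = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

# Any ASCII printable character (33-126) other than colon (58).
TOLERANT_NAME = re.compile(r"[\x21-\x39\x3b-\x7e]+")

def is_strict_name(name):
    """True if the name satisfies the section 4.1 grammar."""
    return STRICT_NAME.fullmatch(name) is not None

def is_tolerable_name(name):
    """True if the name is one that software is encouraged to tolerate."""
    return TOLERANT_NAME.fullmatch(name) is not None
```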

Relayers MUST disregard headers not described in this Draft (that is, with header names not mentioned in this Draft), and pass them on unaltered.

Posters wishing to convey non-standard information in headers SHOULD use header names beginning with "X-". No standard header name will ever be of this form. Reading agents SHOULD ignore "X-" headers, or at least treat them with great care.

The order of headers in an article is not significant. However, posting agents are encouraged to put mandatory headers (see section 5) first, followed by optional headers (see section 6), followed by headers not defined in this Draft.

NOTE: While relayers and reading agents must be prepared to handle any order, having the significant headers (the precise definition of "significant" depends on context) first can noticeably improve efficiency, especially in memory-limited environments where it is difficult to buffer up an arbitrary quantity of headers while searching for the few that matter.

Header names are case-insensitive. There is a preferred case convention, which posters and posting agents SHOULD use: each hyphen-separated "word" has its initial letter (if any) in uppercase and the rest in lowercase, except that some abbreviations have all letters uppercase (e.g. "Message-ID" and "MIME-Version"). The forms used in this Draft are the preferred forms for the headers described herein. Relayers and reading agents are warned that articles might not obey this convention.

NOTE: Although software must be prepared for the possibility of random use of case in header names (and other case-independent text), establishing a preferred convention reduces pointless diversity, and may permit optimized software that looks for the preferred forms before resorting to less-efficient case-insensitive searches.
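As an illustration only, the preferred case convention and case-insensitive matching of header names can be sketched as follows. The names UPPER_WORDS, preferred_case and same_header are assumptions made for the example, and the set of all-uppercase abbreviations contains only the two that appear in the examples above; it is not claimed to be complete.

```python
# Abbreviations written entirely in uppercase (assumed, incomplete list).
UPPER_WORDS = {"id", "mime"}

def preferred_case(name):
    """Rewrite a header name in the preferred case convention."""
    words = name.split("-")
    return "-".join(
        word.upper() if word.lower() in UPPER_WORDS else word.capitalize()
        for word in words
    )

def same_header(a, b):
    """Compare two header names case-insensitively."""
    return a.lower() == b.lower()
```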

In general, a header can consist of several lines, with each continuation line beginning with white space. The EOLs preceding continuation lines are ignored when processing such a header, effectively combining the start-line and the continuations into a single logical line. The logical line, less the header name, colon, and any white space following the colon, is the "header content".
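As an illustration only, forming the logical line and extracting the header content can be sketched as follows. The function name header_content is an assumption made for the example, and the sketch assumes the header's lines have already been separated and stripped of their EOLs.

```python
def header_content(lines):
    """Return (name, content) for one header.

    `lines` holds the start-line followed by any continuation lines,
    each without its EOL. Joining them with nothing in between is
    equivalent to ignoring the EOLs; the white space that begins each
    continuation line stays in the logical line.
    """
    logical = "".join(lines)
    name, _, rest = logical.partition(":")
    # Drop only the white space immediately following the colon.
    return name, rest.lstrip(" \t")
```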

4.2.2. Undesirable Headers

A header whose content is empty is said to be an empty header. Relayers and reading agents SHOULD not consider presence or absence of an empty header to alter the semantics of an article (although syntactic rules, such as requirements that certain header names appear at most once in an article, MUST still be satisfied). Posting agents SHOULD delete empty headers from articles before posting them.

Headers that merely state defaults explicitly (e.g., a Followup-To header with the same content as the Newsgroups header, or a MIME Content-Type header with contents "text/plain; charset=us-ascii") or state information that reading agents can typically determine easily themselves (e.g. the length of the body in octets) are redundant, conveying no information whatsoever. Headers that state information which cannot possibly be of use to a significant number of relayers, reading agents, or readers (e.g., the name of the software package used as the posting agent) are useless and pointless. Posters and posting agents SHOULD avoid including redundant or useless headers in articles.

NOTE: Information that someone, somewhere, might someday find useful is best omitted from headers. (There's quite enough of it in article bodies.) Headers should contain information of known utility only. This is not meant to preclude inclusion of information primarily meant for news-software debugging, but such information should be included only if there is real reason, preferably based on experience, to suspect that it may be genuinely useful. Articles passing through gateways are the only obvious case where inclusion of debugging information appears clearly legitimate. (See section 10.1.)

NOTE: A useful rule of thumb for software implementors is: "if I had to pay a dollar a day for the transmission of this header, would I still think it worthwhile?".

4.2.3. White Space and Continuations

The colon following the header name on the start-line MUST be followed by white space, even if the header is empty. If the header is not empty, at least some of the content MUST appear on the start-line. Posting agents MUST enforce these restrictions, but relayers (etc.) SHOULD accept even articles that violate them.

NOTE: MAIL does not require white space after the colon, but it is usual. RFC 1036 required the white space, even in empty headers, and some existing software demands it. In MAIL, and arguably in RFC 1036 (although the wording is vague), it is technically legitimate for the white space to be part of a continuation line rather than the start-line, but not all existing software will accept this. Deleting empty headers and placing some content on the start-line avoids this issue... which is desirable because trailing blanks, easily deleted by accident, are best not made significant in headers.

In general, posters and posting agents SHOULD use blank (ASCII 32), not tab (ASCII 9), where white space is desired in headers. Existing software does not consistently accept tab as synonymous with blank in all contexts. In particular, RFC 1036 appeared to specify that the character immediately following the colon after a header name was required to be a blank, and some news software insists on that, so this character MUST be a blank. Again, posting agents MUST enforce these restrictions but relayers SHOULD be more tolerant.
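As an illustration only, the start-line checks that a posting agent must enforce can be sketched as follows. The function name check_start_line, its has_continuation parameter, and the wording of the messages are assumptions made for the example. A relayer would not apply such checks as grounds for rejection, since it is asked to be more tolerant.

```python
def check_start_line(line, has_continuation=False):
    """Return a list of problems found in a header start-line (no EOL).

    `has_continuation` says whether continuation lines follow; it is
    needed to tell an empty header apart from a header whose content
    begins only on a continuation line.
    """
    name, colon, rest = line.partition(":")
    if not colon:
        return ["no colon after header name"]
    problems = []
    if not rest.startswith(" "):
        problems.append("colon must be followed by a blank (ASCII 32)")
    if has_continuation and rest.strip(" \t") == "":
        problems.append("some content must appear on the start-line")
    return problems
```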

Since the white space beginning a continuation line remains a part of the logical line, headers can be "broken" into multiple lines only at white space. Posting agents SHOULD not break headers unnecessarily. Relayers SHOULD preserve existing header breaks, and SHOULD not introduce new breaks. Breaking headers SHOULD be a last resort; relayers and reading agents SHOULD handle long header lines gracefully. (See the discussion of size limits in section 4.6.)

4.3. Body

Although the article body is unstructured for most of the purposes of this Draft, structure MAY be imposed on it by other means, notably MIME headers (see appendix B).

4.3.1. Body Format Issues

The body of an article MAY be empty, although posting agents SHOULD consider this an error condition (meriting returning the article to the poster for revision). A posting agent which does not reject such an article SHOULD issue a warning message to the poster and supply a non-empty body. Note that the separator line MUST be present even if the body is empty.

NOTE: An empty body is probably a poster error except, arguably, for some control messages... and even they really ought to have a body explaining the reason for the control message. Some old reading agents are known to generate empty bodies for "cancel" control messages, so posting agents might opt not to reject body-less articles in such cases (although it would be better to fix the reading agents to request a body). However, some existing news software is known to react badly to body-less articles, hence the request for posting agents to insert a body in such cases.

NOTE: A possible posting-agent-supplied body text (already used by one widespread posting agent) is "This article was probably generated by a buggy news reader.". (The use of "reader" to refer to the reading agent is traditional, although this Draft uses more precise terminology.)

NOTE: The requirement for the separator line even in a bodyless article is inherited from MAIL, and also distinguishes legitimately-bodyless articles from articles accidentally truncated in the middle of the headers.

Note that an article body is a sequence of lines terminated by EOLs, not arbitrary binary data, and in particular it MUST end with an EOL. However, relayers SHOULD treat the body of an article as an uninterpreted sequence of octets (except as mandated by changes of EOL representation and by control-message processing) and SHOULD avoid imposing constraints on it. See also section 4.6.

4.3.2. Body Conventions

Although body lines can in principle be very long (see section 4.6 for some discussion of length limits), posters SHOULD restrict body line lengths to circa 70-75 characters. On systems where text is conventionally stored with EOLs only at paragraph breaks and other "hard return" points, with software breaking lines as appropriate for display or manipulation, posting agents SHOULD insert EOLs as necessary so that posted articles comply with this restriction.

NOTE: News originated in environments where line breaks in plain text files were supplied by the user, not the software. Be this good or bad, much reading-agent and posting-agent software assumes that news articles follow this convention, so it is often inconvenient to read or respond to articles which violate it. The "70-75" number comes from the widespread use of display devices which are 80 columns wide, and the desire to leave a bit of margin for quoting etc. (see below).
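
NOTE: A posting agent on a "hard return" system might insert EOLs along the following lines. This is only a sketch: the function name and the default width of 72 are assumptions, and the Python standard library's textwrap module is used for convenience rather than because this Draft requires any particular algorithm.

```python
import textwrap

def wrap_body(lines, width=72):
    """Break over-long body lines at white space; leave others alone."""
    out = []
    for line in lines:
        if len(line) <= width:
            out.append(line)  # includes empty lines between paragraphs
        else:
            out.extend(textwrap.wrap(
                line,
                width=width,
                break_long_words=False,  # never split a long word or URL
                break_on_hyphens=False,
            ))
    return out
```

Lines that are already short enough are passed through untouched, so text laid out by the poster is not rearranged.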

Reading agents confronted with body lines much longer than the available output-device width SHOULD break lines as appropriate. Posters are warned that such breaks may not occur exactly where the poster intends.

NOTE: "As appropriate" would typically include breaking lines when supplying the text of an article to be quoted in a reply or followup, something that line-breaking reading agents often neglect to do now.

Although styles vary widely, for plain text it is usual to use no left margin, leave the right edge ragged, use a single empty line to separate paragraphs, and employ normal natural-language usage on matters such as upper/lowercase. (In particular, articles SHOULD not be written entirely in uppercase. In environments where posters have access only to uppercase, posting agents SHOULD translate it to lowercase.)

NOTE: Most people find substantial bodies of text entirely in uppercase relatively hard to read, while all-lowercase text merely looks slightly odd. The common association of uppercase with strong emphasis adds to this.

Tone of voice does not carry well in written text, and misunderstandings are common when sarcasm, parody, or exaggeration for humorous effect is attempted without explicit warning. It has become conventional to use the sequence ":-)", which (on most output devices) resembles a rotated "smiley face" symbol, as a marker for text not meant to be taken literally, especially when humor is intended. This practice aids communication and averts unintended ill-will; posters are urged to use it. A variety of analogous sequences are used with less-standardized meanings [Sanderson].

The order of arrival of news articles at a particular host depends somewhat on transmission paths, and occasionally articles are lost for various reasons. When responding to a previous article, posters SHOULD not assume that all readers understand the exact context. It is common to quote some of the previous article to establish context. This SHOULD be done by prefacing each quoted line (even if it is empty) with the character ">". This will result in multiple levels of ">" when quoted context itself contains quoted context.

NOTE: It may seem superfluous to put a prefix on empty lines, but it simplifies implementation of functions such as "skip all quoted text" in reading agents.
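
NOTE: The quoting convention is simple enough to sketch in a few lines. The function names below are illustrative assumptions, not part of this Draft.

```python
def quote_lines(lines, prefix=">"):
    """Prefix every line, including empty ones, with the quote marker."""
    return [prefix + line for line in lines]

def is_quoted(line):
    """True for any quoted line; works for empty quoted lines too."""
    return line.startswith(">")
```

Because empty lines carry the prefix as well, quote_lines(["hi", "", ">old"]) yields [">hi", ">", ">>old"], and a "skip all quoted text" function needs only the single test in is_quoted.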

Readability is enhanced if quoted text and new text are separated by an empty line.

Posters SHOULD edit quoted context to trim it down to the minimum necessary. However, posting agents SHOULD not attempt to enforce this by imposing overly-simplistic rules like "no more than 50% of the lines should be quotes".

NOTE: While encouraging trimming is desirable, the 50% rule imposed by some old posting agents is both inadequate and counterproductive. Posters do not respond to it by being more selective about quoting; they respond by padding short responses, or by using different quoting styles to defeat automatic analysis. The former adds unnecessary noise and volume, while the latter also defeats more useful forms of automatic analysis that reading agents might wish to do.

NOTE: At the very least, if a minimum-unquoted quota is being set, article bodies shorter than (say) 20 lines, or perhaps articles which exceed the quota by only a few lines, should be exempt. This avoids the ridiculous situation of complaining about a 5-line response to a 6-line quote.
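
NOTE: For a posting agent that insists on a quota despite the cautions above, the short-article exemption might look like the sketch below. The function name, the 50% default, and the 20-line threshold are illustrative values taken from the discussion above; the "exceeds by only a few lines" exemption is not implemented here.

```python
def exceeds_quote_quota(body_lines, max_quoted_fraction=0.5, exempt_below=20):
    """Return True only when a long body is mostly quoted text."""
    if len(body_lines) < exempt_below:
        return False  # short articles are exempt from the quota
    quoted = sum(1 for line in body_lines if line.startswith(">"))
    return quoted > max_quoted_fraction * len(body_lines)
```

With these defaults, a 5-line response to a 6-line quote is never flagged, since the 11-line body falls under the exemption.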

NOTE: A more subtle posting-agent rule, suggested for experimental use, is to reject articles that appear to contain quoted signatures (see below). This is almost certainly the result of a careless poster not bothering to trim down quoted context. Also, if a posting agent or followup agent presents an article template to the poster for editing, it really should take note of whether the poster actually made any changes, and refrain from posting an unmodified template.

Some followup agents supply "attribution" lines for quoted context, indicating where it first appeared and under whose name. When multiple levels of quoting are present and quoted context is edited for brevity, "inner" attribution lines are not always retained. The editing process is also somewhat error-prone. Reading agents (and readers) are warned not to assume that attributions are accurate.

UNRESOLVED ISSUE: Should a standard format for attribution lines be defined? There is already considerable diversity... but automatic news analysis would be substantially aided by a standard convention.

Early difficulties in inferring return addresses from article headers led to "signatures": short closing texts, automatically added to the end of articles by posting agents, identifying the poster and giving his network addresses etc. If a poster or posting agent does append a signature to an article, the signature SHOULD be preceded with a delimiter line containing (only) two hyphens (ASCII 45) followed by one blank (ASCII 32). Posting agents SHOULD limit the length of signatures, since verbose excess bordering on abuse is common if no restraint is imposed; 4 lines is a common limit.

NOTE: While signatures are arguably a blemish, they are a well-understood convention, and conveying the same information in headers exposes it to mangling and makes it rather less conspicuous. A standard delimiter line makes it possible for reading agents to handle signatures specially if desired. (This is unfortunately hampered by extensive misunderstanding of, and misuse of, the delimiter.)

NOTE: The choice of delimiter is somewhat unfortunate, since it relies on preservation of trailing white space, but it is too well-established to change. There is work underway to define a more sophisticated signature scheme as part of MIME, and this will presumably supersede the current convention in due time.

NOTE: Four 75-column lines of signature text is 300 characters, which is ample to convey name and mail-address information in all but the most bizarre situations.
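
NOTE: A reading or posting agent might locate the signature as sketched below. The function names are assumptions, as is the choice to treat the last delimiter line in the body as the start of the signature; this Draft does not say which delimiter wins when several are present.

```python
SIG_DELIMITER = "-- "  # two hyphens followed by exactly one blank

def split_signature(body_lines):
    """Return (text_lines, signature_lines) for a list of body lines."""
    for i in range(len(body_lines) - 1, -1, -1):
        if body_lines[i] == SIG_DELIMITER:
            return body_lines[:i], body_lines[i + 1:]
    return body_lines, []  # no delimiter, so no recognisable signature

def signature_too_long(sig_lines, limit=4):
    """Apply the common 4-line limit on signature length."""
    return len(sig_lines) > limit
```

Note that a line holding only "--", with the trailing blank stripped somewhere along the way, is not recognised; this is the fragility mentioned above.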

4.4. Characters And Character Sets

Header and body lines MAY contain any ASCII characters other than CR (ASCII 13), LF (ASCII 10), and NUL (ASCII 0).

NOTE: CR and LF are excluded because they clash with common EOL conventions. NUL is excluded because it clashes with the C end-of-string convention, which is significant to most existing news software. These three characters are unlikely to be transmitted successfully.

However, posters SHOULD avoid using ASCII control characters except for tab (ASCII 9), formfeed (ASCII 12), and backspace (ASCII 8). Tab signifies sufficient horizontal white space to reach the next of a set of fixed positions; posters are warned that there is no standard set of positions, so tabs should be avoided if precise spacing is essential. Formfeed signifies a point at which a reading agent SHOULD pause and await reader interaction before displaying further text. Backspace SHOULD be used only for underlining, done by a sequence of underscores (ASCII 95) followed by an equal number of backspaces, signifying that the same number of text characters following are to be underlined. Posters are warned that underlining is not available on all output devices and is best not relied on for essential meaning. Reading agents SHOULD recognize underlining and translate it to the appropriate commands for devices that support it.

NOTE: Interpretation of almost all control characters is device-specific to some degree, and devices differ. Tabs and underlining are supported, to some extent, by most modern devices and reading agents, hence the cautious exemptions for them. The underlining method is specified because the inverse method, text and then underscores, is tempting to the naive... but if sent unaltered to a device that shows only the most recent of several overstruck characters rather than a composite, the result can be utterly unreadable.
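
NOTE: A reading agent might recognize the underlining convention roughly as follows. The function name and the (text, underlined) run representation are assumptions for illustration; sequences with unequal numbers of underscores and backspaces are left alone as ordinary text.

```python
import re

# One or more underscores followed by one or more backspaces (ASCII 8).
_UNDERLINE = re.compile(r"(_+)(\x08+)")

def parse_underlining(line):
    """Split a line into a list of (text, underlined) runs."""
    runs = []
    pos = 0
    for m in _UNDERLINE.finditer(line):
        n = len(m.group(1))
        if m.start() < pos or len(m.group(2)) != n:
            continue  # inside an earlier run, or not the standard form
        if m.start() > pos:
            runs.append((line[pos:m.start()], False))
        # The n characters after the backspaces are the underlined text.
        runs.append((line[m.end():m.end() + n], True))
        pos = m.end() + n
    if pos < len(line):
        runs.append((line[pos:], False))
    return runs
```

Each run can then be rendered with whatever underlining commands the output device supports, or shown plain where it supports none.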

NOTE: A common interpretation of tab is that it is a request to space forward to the next position whose number is one more than a multiple of 8, with positions numbered sequentially starting at 1. (So tab positions are 9, 17, 25, ...) Reading agents not constrained by existing system conventions might wish to use this interpretation.
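
NOTE: The common tab interpretation just described can be sketched as follows. The function name is an assumption, and the sketch assumes the line contains no other control characters (backspaces in particular would throw off the column count).

```python
def expand_tabs(line, interval=8):
    """Replace each tab with blanks up to the next stop (9, 17, 25, ...)."""
    out = []
    for ch in line:
        if ch == "\t":
            # len(out) is the 0-based column of the next character;
            # always advance by at least one position.
            pad = interval - (len(out) % interval)
            out.extend(" " * pad)
        else:
            out.append(ch)
    return "".join(out)
```

For lines without other control characters this gives the same result as Python's built-in str.expandtabs(8).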

NOTE: It will typically be necessary for a reading agent to catch and interpret formfeed, not just send it to the output device. The actions performed by typical output devices on receiving a formfeed are neither adequate for nor appropriate to the pause-for-interaction meaning.

Cooperating subnets which wish to employ non-ASCII character sets by using escape sequences (employing, e.g., ESC (ASCII 27), SO (ASCII 14), and SI (ASCII 15)) to alter the meaning of superficially-ASCII characters MAY do so, but MUST use MIME headers to alert reading agents to the particular character set(s) and escape sequences in use. A reading agent SHOULD not pass such an escape sequence through, unaltered, to the output device unless the agent confirms that the sequence is one used to affect character sets and has reason to believe that the device is capable of interpreting that particular sequence properly.

NOTE: Cooperating-subnet organizers are warned that some very old relayers strip certain control characters out of articles they pass along. ESC is known to be among the affected characters.

NOTE: There are now standard Internet encodings for Japanese [rrr] and Vietnamese [rrr] in particular.

Articles MUST not contain any octet with value exceeding 127, i.e. any octet that is not an ASCII character.

NOTE: This rule, like others, may be relaxed by unanimous consent of the members of a cooperating subnet, provided suitable precautions are taken to ensure that rule-violating articles do not leak out of the subnet. (This has already been done in many areas where ASCII is not adequate for the local language(s).) Beware that articles containing non-ASCII octets in headers are a violation of the MAIL specifications and are not valid MAIL messages. MIME offers a way to encode non-ASCII characters in ASCII for use in headers; see section 4.5.

NOTE: While there is great interest in using 8-bit character sets, not all software can yet handle them correctly. Hence the restriction to cooperating subnets. MIME encodings can be used to transmit such characters while remaining within the octet restriction.
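
NOTE: A posting agent might check a line against the rules on CR, LF, NUL, and octets above 127 as sketched below. The function name and the allow_8bit switch (standing in for a cooperating subnet's relaxed rule) are assumptions, and the line is assumed to have had its EOL removed already.

```python
FORBIDDEN = {0, 10, 13}  # NUL, LF, CR

def invalid_octets(line, allow_8bit=False):
    """Return (offset, octet) pairs for octets not permitted in a line."""
    bad = []
    for i, octet in enumerate(line):  # iterating bytes yields integers
        if octet in FORBIDDEN or (octet > 127 and not allow_8bit):
            bad.append((i, octet))
    return bad
```

An empty result means the line is acceptable under these rules; whether to reject the article or MIME-encode it otherwise is a matter for the posting agent.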

In anticipation of the day when it is possible to use non-ASCII characters safely anywhere, and to provide for the (substantial) cooperating subnets that are already using them, transmission paths SHOULD treat news articles as uninterpreted sequences of octets (except perhaps for transformations between EOL representations) and relayers SHOULD treat non-ASCII characters in articles as ordinary characters.

NOTE: 8-bit enthusiasts are warned that not all software conforms to these recommendations yet. In particular, standard NNTP [rrr] is a 7-bit protocol, and there may be implementations which enforce this rule. Be warned, also, that it will never be safe to send raw binary data in the body of news articles, because changes of EOL representation may (will!) corrupt it.

Except where cooperating subnets permit more direct approaches, MIME [rrr] headers and encodings SHOULD be used to transmit non-ASCII content using ASCII characters; see section 4.5, appendix B, and the MIME RFCs for details. If article content can be expressed in ASCII, it SHOULD be. Failing that, the order of preference for character sets is that described in MIME [rrr].

NOTE: Using the MIME facilities, it is possible to transmit ANY character set, and ANY form of binary data, using only ASCII characters. Equally important, such articles are self-describing and the reading agent can tell which octet-to-symbol mapping is intended! Designation of some preferred character sets is intended to minimize the number of character sets that a reading agent must understand in order to display most articles properly.

Articles containing non-ASCII characters, articles using ASCII characters (values 0 through 127) to refer to non-ASCII symbols, and articles using escape sequences to shift character sets SHOULD include MIME headers indicating which character set(s) and conventions are being used, and MUST do so unless such articles are strictly confined to a cooperating subnet which has its own pre-agreed conventions. MIME encodings are preferred over all these techniques. If it comes to a relayer's attention that it is being asked to pass an article using such techniques outward across what it knows to be the boundary of such a cooperating subnet, it MUST report this error to its administrator, and MAY refuse to pass the article beyond the subnet boundary. If it does pass the article, it MUST re-encode it with MIME encodings to make it conform to this Draft.

NOTE: Such re-encoding is a non-trivial task, due to MIME rules such as the prohibition of nested encodings. It's not just a matter of pouring the body through a simple filter.

Reading agents SHOULD note MIME headers and attempt to show the reader the closest possible approximation to the intended content. They SHOULD not just send the octets of the article to the output device unaltered, unless there is reason to believe that the output device will indeed interpret them correctly. Reading agents MUST not pass ASCII control characters or escape sequences, other than as discussed above, unaltered to the output device; only by chance would the result be the desired one, and there is serious potential for harmful side effects, either accidental or malicious.

NOTE: Exactly what to do with unwanted control characters/sequences depends on the philosophy of the reading agent, but passing them straight to the output device is almost always wrong. If the reading agent wants to mark the presence of such a character/sequence in circumstances where only ASCII printable characters are available, translating it to "#" might be a suitable method; "#" is a conspicuous character seldom used in normal text.
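
NOTE: The "#" translation might be done as in the sketch below. The function name and the default of preserving only tab are assumptions; a real reading agent would interpret formfeed and the underlining convention itself before reaching this step. Only individual control characters are masked, so the printable remainder of an escape sequence stays visible as ordinary text.

```python
def mask_controls(line, keep="\t", marker="#"):
    """Replace ASCII control characters, except those in keep, by marker."""
    def is_control(ch):
        return ord(ch) < 32 or ord(ch) == 127
    return "".join(
        marker if is_control(ch) and ch not in keep else ch
        for ch in line
    )
```

For example, a line containing ESC followed by "[2J" is displayed as "#[2J" rather than being allowed to clear the reader's screen.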

NOTE: Reading agents should be aware that many old output devices (or the transmission paths to them) zero out the top bit of octets sent to them. This can transform non-ASCII characters into ASCII control characters.

Followup agents MUST be careful to apply appropriate transformations of representation to the outbound followup as well as the inbound precursor. A followup to an article containing non-ASCII material is very likely to contain non-ASCII material itself.

4.5. Non-ASCII Characters In Headers

All octets found in headers MUST be ASCII characters. However, it is desirable to have a way of encoding non-ASCII characters, especially in "human-readable" headers such as Subject. MIME [rrr] provides a way to do this. Full details may be found in the MIME specifications; herewith a quick summary to alert software authors to the issues...

     encoded-word  = "=?" charset "?" encoding "?" codes "?="
     charset       = 1*tag-char
     encoding      = 1*tag-char
     tag-char      = <ASCII printable character except !()<>@,;:\"[]/?=>
     codes         = 1*code-char
     code-char     = <ASCII printable character except ?>

An encoded word is a sequence of ASCII printable characters that specifies the character set, encoding method, and bits of (potentially) non-ASCII characters. Encoded words are allowed only in certain positions in certain headers. Specific headers impose restrictions on the content of encoded words beyond that specified in this section. Posting agents MUST ensure that any material resembling an encoded word (complete with all delimiters), in a context where encoded words may appear, really is an encoded word.

NOTE: The syntax is a bit ugly, but it was designed to minimize chances of confusion with legitimate header contents, and to satisfy difficult constraints on use within existing headers.

An encoded word MUST not be more than 75 octets long. Each line of a header containing encoded word(s) MUST be at most 76 octets long, not counting the EOL.

NOTE: These limits are meant to bound the lookahead needed to determine whether text that begins "=?" is really an encoded word.
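
NOTE: The grammar and the 75-octet limit can be checked mechanically, as in the sketch below. It assumes that "ASCII printable character" means codes 33 through 126 (that is, excluding blank) and that the exclusion list for tag-char includes both backslash and double quote; the names used are illustrative. It tests only the syntax, not whether the charset or encoding is one that MIME defines.

```python
import re

_PRINTABLE = [chr(c) for c in range(33, 127)]  # assumption: blank excluded
_TAG_EXCLUDED = set('!()<>@,;:\\"[]/?=')

def _char_class(chars):
    """Build a regular-expression character class from a list of chars."""
    return "[" + "".join(re.escape(c) for c in chars) + "]"

_TAG = _char_class([c for c in _PRINTABLE if c not in _TAG_EXCLUDED])
_CODE = _char_class([c for c in _PRINTABLE if c != "?"])

# encoded-word = "=?" charset "?" encoding "?" codes "?="
ENCODED_WORD = re.compile(r"=\?(%s+)\?(%s+)\?(%s+)\?=" % (_TAG, _TAG, _CODE))

def is_encoded_word(token):
    """True if token has encoded-word syntax and is at most 75 octets."""
    return len(token) <= 75 and ENCODED_WORD.fullmatch(token) is not None
```

Because code-char excludes "?", the first "?=" after the third "?" always ends the word, so a scanner never needs to look more than 75 octets ahead.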

The details of charsets and encodings are defined by MIME [rrr]; the sequence of preferred character sets is the same as MIME's. Encoded words SHOULD not be used for content expressible in ASCII.

When an encoded word is used, other than in a newsgroup name (see section 5.5), it MUST be separated from any adjacent non-space characters (including other encoded words) by white space. Reading agents displaying the contents of encoded words (as opposed to their encoded form) should ignore white space adjacent to encoded words.

UNRESOLVED ISSUE: Should this section be deleted entirely, or made much more terse? The material is relevant, but too complex to discuss fully.

NOTE: The deletion of intervening white space permits using multiple encoded words, implicitly concatenated by the deletion, to encode text that will not fit within a single 75-character encoded word.

Reading-agent implementors are warned that although this Draft completely specifies where encoded words may appear in the headers it defines, there are other headers (e.g. the MIME Content-Description header) that MAY contain them.

4.6. Size Limits

Implementations SHOULD avoid fixed constraints on the sizes of lines within an article and on the size of the entire article.

Relayers SHOULD treat the body of an article as an uninterpreted sequence of octets (except as mandated by changes of EOL representation and processing of control messages), not to be altered or constrained in any way.

If it is absolutely necessary for an implementation to impose a limit on the length of header lines, body lines, or header logical lines, that limit shall be at least 1000 octets, including EOL representations. Relayers and transmission paths confronted with lines beyond their internal limits (if any) MUST not simply inject EOLs at random places; they MAY break headers (as described in 4.2.3) as a last resort, and otherwise they MUST either pass the long lines through unaltered, or refuse to pass the article at all (see section 9.1 for further discussion).

NOTE: The limit here is essentially the same minimum as that specified for SMTP mail in RFC 821 [rrr]. Implementors are warned that Path (see section 5.6) and References (see section 6.5) headers, in particular, often become several hundred characters long, so 1000 is not an overly generous limit.
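
NOTE: An implementation that cannot avoid a line-length limit might locate offending lines as sketched below, and then pass or refuse the article as a whole rather than inject EOLs. The names and the single-octet EOL default are assumptions; header logical lines (continuations joined together) are not handled here.

```python
MIN_LINE_LIMIT = 1000  # smallest permissible limit, EOL included

def overlong_lines(article, limit=MIN_LINE_LIMIT, eol=b"\n"):
    """Return 1-based numbers of lines longer than limit, counting the EOL."""
    if limit < MIN_LINE_LIMIT:
        raise ValueError("limit must be at least %d octets" % MIN_LINE_LIMIT)
    return [
        number
        for number, line in enumerate(article.split(eol), start=1)
        if len(line) + len(eol) > limit
    ]
```

A line of 999 octets plus a one-octet EOL is exactly at the limit and is accepted; one octet more and the line is reported.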

All implementations MUST be able to handle an article totalling at least 65,000 octets, including headers and EOL representations, gracefully and efficiently. All implementations SHOULD be able to handle an article totalling at least 1,000,000 (one million) octets, including headers and EOL representations, gracefully and efficiently. "Gracefully and efficiently" is intended to preclude not only failures, but also major loss of performance, serious problems in error recovery, or resource consumption beyond what is reasonably necessary.

NOTE: The intent here is to prohibit lowering the existing de-facto limit any further, while strongly encouraging movement towards a higher one. Actually, although improvements are desirable in some cases, much news software copes reasonably well with very large articles. The same cannot be said of the communications software and protocols used to transmit news from one host to another, especially when slow communications links are involved. Occasional huge articles that appear now (by accident or through ignorance) typically leave trails of failing software, system problems, and irate administrators in their wake.

NOTE: It is intended that the successor to this Draft will raise the "MUST" limit to 1,000,000 and the "SHOULD" limit still further.

Posters SHOULD limit posted articles to at most 60,000 octets, including headers and EOL representations, unless the articles are being posted only within a cooperating subnet which is known to be capable of handling larger articles gracefully. Posting agents presented with a large article SHOULD warn the poster and request confirmation.

NOTE: The difference between this and the earlier "MUST" limit is margin for header growth, differing EOL representations, and transmission overheads.

NOTE: Disagreeable though these limits are, it is a fact that in current networks, an article larger than 64K (after header growth etc.) simply is not transmitted reliably. Note also the comments above on the trauma caused by single extremely-large articles now; the problems are real and current. These problems arguably should be fixed, but this will not happen network-wide in the immediate future. Hence the restriction of larger articles to cooperating subnets, for now.

Posters using non-ASCII characters in their text MUST take into account the overhead involved in MIME encoding, unless the article's propagation will be entirely limited to a cooperating subnet which does not use MIME encodings for non-ASCII characters. For example, MIME base64 encoding involves growth by a factor of approximately 4/3, so an article which would likely have to use this encoding should be at most about 45,000 octets before encoding.
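
NOTE: The 4/3 growth can be made concrete with a little arithmetic. The sketch below estimates the size of a base64-encoded body; the function name, the 76-character output lines, and the one-octet EOL are assumptions, and header octets are not counted.

```python
def base64_size(raw_len, line_len=76, eol_len=1):
    """Estimate octets needed to carry raw_len octets in base64."""
    chars = 4 * ((raw_len + 2) // 3)  # 3 raw octets become 4 characters
    lines = (chars + line_len - 1) // line_len  # ceiling division
    return chars + lines * eol_len
```

Under these assumptions a 45,000-octet body becomes 60,000 base64 characters on 790 lines, or 60,790 octets once EOLs are counted, which is why the figure above is only approximate.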

Posters SHOULD use MIME "message/partial" conventions to facilitate automatic reassembly of a large document split into smaller pieces for posting. It is recommended that the content identifier used should be a message ID, generated by the same means as article message IDs (see section 5.3), and that all parts should have a See-Also header (see section 6.16) giving the message IDs of at least the previous parts and preferably all the parts.

NOTE: See-Also is more correct for this purpose than References, although References is in common use today (with less-formal reassembly arrangements). MIME reassemblers should probably examine articles suggested by References headers if See-Also headers are not present to indicate the whereabouts of the other parts of "message/partial" articles.

To repeat: implementations SHOULD avoid fixed constraints on the sizes of lines within an article and on the size of the entire article.

4.7. Example

Here is a sample article:

     From: jerry@eagle.ATT.COM (Jerry Schwarz)
     Path: cbosgd!mhuxj!mhuxt!eagle!jerry
     Newsgroups: news.announce
     Subject: Usenet Etiquette -- Please Read
     Message-ID: <642@eagle.ATT.COM>
     Date: Mon, 17 Jan 1994 11:14:55 -0500 (EST)
     Followup-To: news.misc
     Expires: Wed, 19 Jan 1994 00:00:00 -0500
     Organization: AT&T Bell Laboratories, Murray Hill

