WikiSyntaxTextMod: Syntax polishing → Step 1

Single characters

The first step on the syntax polishing lane normalizes the text at the level of single characters. This is a prerequisite for later analysis by this script, by bots and by other gadgets, so that strings can be detected in the expected manner.

Current implementation


1–31 ASCII control codes

10 – newline
Newline is the only control code that is kept.
All other control codes should be removed.
9 – horizontal tab
The horizontal tabulator is replaced by exactly one space character.
This has the same effect when rendering wikitext.
Tabs are often introduced by copy and paste, when existing rendered text (mainly enumerations and tables) is inserted into Wikipedia. Strange effects can occur when editing in a TEXTAREA.
The only exception is within syntaxhighlight, where tabs remain. Therefore the occurrence of a tab is only noted at first. If any tab was seen, cleanup is postponed until tag analysis has finished and syntaxhighlight areas have been protected.
11 – vertical tab, 12 – form feed
It seems that the database rejects these codes on an attempt to save. They cannot be entered by normal keystrokes, but they might result from copy and paste.
These characters indicate a separation between two sections. If actually found in edited text, they are replaced by a newline.
Any other code ≤ 31
As before: neither insertable nor stored in the database.
If they occur nevertheless, they are removed.
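
The rules for codes 1–31 can be summarized in a minimal Python sketch. It is not the script's actual implementation; the function name and the regular expression used to recognize syntaxhighlight areas are assumptions made for illustration only.

```python
import re

# Hypothetical pattern for protected areas; the real script uses its own tag analysis.
SYNTAXHIGHLIGHT = re.compile(
    r"(<syntaxhighlight\b.*?</syntaxhighlight>)", re.DOTALL | re.IGNORECASE
)


def clean_ascii_controls(text: str) -> str:
    """Apply the rules for ASCII control codes 1-31 described above."""
    # 11 (vertical tab) and 12 (form feed) separate sections: use a newline.
    text = re.sub(r"[\x0b\x0c]", "\n", text)
    # Every other code <= 31 except tab (9) and newline (10) is removed.
    text = re.sub(r"[\x01-\x08\x0d-\x1f]", "", text)
    # Tabs become one space each, but only outside syntaxhighlight areas.
    # re.split with a capture group puts the protected areas at odd indices.
    parts = SYNTAXHIGHLIGHT.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace("\t", " ")
    return "".join(parts)
```

Handling the tabs last mirrors the postponed cleanup: the protected areas have to be known before any tab may be replaced.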

Unicode control codes

Writing left-to-right
The invisible left-to-right and right-to-left marks (200E₁₆ and 200F₁₆) can be very meaningful for interlanguage links and for pages with content in other languages. Basically they are kept.
All are consistently made visible as &lrm; and &rlm; entities.
If these codes are found
they will be removed programmatically.
Templates that support foreign languages are supposed to handle bidirectional issues internally. In plain article text the entities may then be removed without any consequence.
Zero width characters
These are currently 200B₁₆ (ZERO WIDTH SPACE), 200C₁₆ (ZERO WIDTH NON-JOINER) and 200D₁₆ (ZERO WIDTH JOINER); they are uniformly made visible as &#x200B;, &zwnj; and &zwj;.
In non-Latin scripts they carry meaning. In Latin text they are pointless; they usually result from copy and paste insertions, perhaps from database queries, and may be deleted. They occur regularly in interlanguage links and are mandatory there; within links they remain unchanged. To keep the diff page undisturbed, all entities within link targets are later turned into character codes.
Line Separator
In Unicode there is a “line separator” 2028₁₆, which is not visible in normal browser displays. This particular line break is not meaningful for a wiki and is replaced by a regular newline character. Within link targets the code is ignored by the parser anyway and is removed. The character is generated by copy and paste insertions when the rendering system has inserted a (mostly “soft”) line break into the displayed text.
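
The different treatment inside and outside of link targets might be sketched as follows. This is only an illustration under assumptions: the function name, the simplified link target pattern (everything from [[ up to the first | or ]]) and the numeric entity chosen for 200B₁₆ are not taken from the script.

```python
import re

# Zero width characters and the entities that make them visible.
# The numeric form for U+200B is an assumption; HTML 4 has no named entity for it.
ENTITIES = str.maketrans({
    "\u200b": "&#x200B;",
    "\u200c": "&zwnj;",
    "\u200d": "&zwj;",
})

# Simplified link target: "[[" followed by anything up to "|", "]" or a newline.
LINK_TARGET = re.compile(r"(\[\[[^\[\]|\n]*)")


def expose_zero_width(text: str) -> str:
    """Make zero width characters visible, except within link targets."""
    parts = LINK_TARGET.split(text)
    result = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Link target: zero width characters stay, line separator is dropped.
            result.append(part.replace("\u2028", ""))
        else:
            # Ordinary text: line separator becomes newline, rest becomes entities.
            result.append(part.replace("\u2028", "\n").translate(ENTITIES))
    return "".join(result)
```

For example, `expose_zero_width("x\u200cy [[fa:a\u200cb]]")` keeps the character inside the link target and yields `&zwnj;` only between x and y.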

Unicode typography

The following characters are replaced by simple (ASCII) spaces:
  • 2002₁₆ – EN SPACE
  • 2003₁₆ – EM SPACE
Non-breaking hyphen
This character (2011₁₆ = 8209₁₀ – NON-BREAKING HYPHEN) might be useful for future rendering of web sites with improved automatic syllabification; but searching for strings is obstructed.
Directly before or after a (regular) space or a line break (\n as well as <br />) it is always pointless. There it is always turned into an ASCII hyphen.
In some cases where no syllabification is possible the non-breaking hyphen is removed, since it never comes into effect.
Any further change into an ASCII hyphen, or removal, is left to the user.
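
A minimal sketch of the automatic part of these typography rules, assuming a hypothetical function name and a deliberately loose pattern for the <br /> variants; the removal in cases without possible syllabification is not covered here.

```python
import re

NB_HYPHEN = "\u2011"  # NON-BREAKING HYPHEN


def normalize_typography(text: str) -> str:
    """Replace EN/EM SPACE and pointless non-breaking hyphens."""
    # EN SPACE and EM SPACE become plain ASCII spaces.
    text = text.replace("\u2002", " ").replace("\u2003", " ")
    # Non-breaking hyphen directly before a space, a newline or <br />.
    text = re.sub(NB_HYPHEN + r"(?=[ \n]|<br\s*/?>)", "-", text)
    # Non-breaking hyphen directly after a space or a newline.
    text = re.sub(r"(?<=[ \n])" + NB_HYPHEN, "-", text)
    # Non-breaking hyphen directly after <br /> (variable width, so no lookbehind).
    text = re.sub(r"(<br\s*/?>)" + NB_HYPHEN, r"\1-", text)
    return text
```

A non-breaking hyphen between two letters, as in e‑f, is left alone by this sketch.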

Exotic characters


Invisible characters

Invisible characters, transformed into entities
Hex code  Decimal  Named   Meaning                 Remark
00A0      160      nbsp    non-breaking space      &#160; if found as code in text; see below.
2009      8201     thinsp  thin space
200A      8202     -       HAIR SPACE
200C      8204     zwnj    ZERO WIDTH NON-JOINER   see above
200E      8206     lrm     LEFT-TO-RIGHT MARK      see above
200F      8207     rlm     RIGHT-TO-LEFT MARK
2060      8288     -       WORD JOINER

Zero width characters


Some non-Latin scripts use invisible “zero width” characters. Within link targets (mostly interlanguage) or recognized text sequences (especially the entire page) they are encoded as invisible characters; elsewhere they are visualized as entities.

Entity Languages
&zwj; kn ml sa si
&zwnj; bn fa kn ml mr mzn te
&#x200A; bo km

Spaces at line end

  1. If a line consists of spaces only, all these spaces are always removed.
    This prevents search expressions for \n\n from failing.
  2. If the text is changed for any other reason, all invisible spaces at the end of a line are removed when automatic processing terminates.
    • If this would be the only change, it is skipped when the control diff page is active, as is usual; otherwise the user would be confused by solely invisible modifications.
    • If the line ends with one single equal sign, exactly one of the existing space characters is kept. Some people prefer this for template parameters that have not yet been assigned.
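
The two rules and their exceptions might look like this in a small Python sketch. The function name and the other_changes flag are assumptions; the flag stands in for the script's own knowledge of whether the text has already been modified for another reason.

```python
def strip_line_end_spaces(text: str, other_changes: bool) -> str:
    """Remove spaces at line end following the two rules above."""
    result = []
    for line in text.split("\n"):
        stripped = line.rstrip(" ")
        if stripped == "":
            # Rule 1: a line of spaces only is always emptied.
            result.append("")
            continue
        if not other_changes or stripped == line:
            # Rule 2 applies only if the text changes for another reason.
            result.append(line)
            continue
        if stripped.endswith("=") and not stripped.endswith("=="):
            # One single equal sign at the end: keep exactly one space.
            stripped += " "
        result.append(stripped)
    return "\n".join(result)
```

With other_changes set to False only the lines consisting of spaces are touched, so a save never consists solely of invisible modifications at the end of otherwise non-empty lines.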

Character entities


Kept characters


It is debatable whether some character codes are meaningful for a wiki project. The script does not replace them by default, but they might be changed or removed by user-defined rules.

U+00A0 – non-breaking space
For some years now this code (not only as the entity &nbsp;) has been possible in wikitext. In the first years the database changed it into a normal space.
Currently these characters are made visible as &#160; and can be assessed for intended usage.
U+00AD – soft hyphen
Sometimes AD₁₆ = 173₁₀ is present in wikitext, mainly if an entire text has been imported from other text systems. In future it might improve the layout of websites with automatic syllabification.
Searching for strings is obstructed. If no wikilink is adjacent, the character can be removed without any consequences by user-defined rules, or replaced by the &shy; entity.
U+2010 – hyphen
The Unicode hyphen 8208₁₀ is generated by text systems when automatic syllabification breaks a line. It is unlikely to result from manual input; it probably results from copy and paste of an entire word which has been separated. In wikitext it may be assumed that the parts are to be joined again into a single word. The Unicode hyphen and an adjacent space are to be removed or replaced by a soft hyphen, but this requires human control.
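
Since these characters are kept by default, any treatment would come from user-defined rules. The following Python sketch shows what such rules could look like; both function names are hypothetical, the check for adjacent wikilinks is omitted, and the Unicode hyphen is only reported, not changed, because the decision requires human control.

```python
import re

SOFT_HYPHEN = "\u00ad"
UNICODE_HYPHEN = "\u2010"


def soft_hyphen_to_entity(text: str) -> str:
    """Make soft hyphens visible and searchable as &shy; entities."""
    return text.replace(SOFT_HYPHEN, "&shy;")


def hyphen_join_candidates(text: str) -> list[tuple[int, str]]:
    """List positions where a Unicode hyphen plus space splits a word.

    Each entry holds the start offset and the word as it would read
    after joining; nothing is modified here.
    """
    pattern = re.compile(r"(\w+)" + UNICODE_HYPHEN + r" (\w+)")
    return [(m.start(), m.group(1) + m.group(2)) for m in pattern.finditer(text)]
```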

127–159 (Windows code pages)



  1. ^ In a common TEXTAREA without WikEd the browser could miscount the character position. If such a character (shown here as »|«) is present in the source text, e.g. between A and B: A|BCEF, and the cursor is between C and E, then it might happen that the intended ‘d’ is inserted at the wrong position: A|BdCEF or A|BCEdF instead of A|BCdEF.
