<!DOCTYPE html>
<!-- ========================================== kroc camen of camen design ============================================= -->
<title>code · Transliteration</title>
<link rel="stylesheet" type="text/css" href="/design/design.css" />
<meta name="viewport" content="width=device-width, maximum-scale=1.0, user-scalable=no" />
<link rel="alternate" type="application/rss+xml" href="/code/rss" title="Just code" />
<link rel="canonical" href="/code/transliteration" />
<!-- =================================================================================================================== -->
<header>
	<h1><a href="/" rel="index">
		Camen Design
	</a></h1>
</header>
<!-- =================================================================================================================== -->
<article><header>
	<!-- date published or updated -->
	<time pubdate datetime="2012-09-15T09:43:00+01:00">
		<sup>9:43<abbr>am</abbr> • 2012</sup>
		<abbr title="September">Sep</abbr> 15
	</time>
	<!-- categories -->
	<ul>
		<li><a href="/code/transliteration" rel="bookmark tag">code</a></li>
		<li><a href="/code-is-art/transliteration">code-is-art</a></li>
		<li><a href="/web-dev/transliteration">web-dev</a></li>
	</ul>
	<!-- licence -->
	<small>
		<a rel="license" href="http://creativecommons.org/licenses/by/3.0/deed.en_GB">c</a>
		share + remix
	</small>
</header>
<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
<section>
<h1>Transliteration</h1>
<aside>
	<strong>Update:</strong> Reverted o-umlaut conversion to “<code>ö-&gt;o</code>” based on <a href="http://forum.camendesign.com/language_specific_transliteration_re_transliteration" rel="external">this discussion</a>.
</aside>
<p>
	<strong>Transliteration is when you replace individual characters in a piece of text with alternative
	characters</strong>. It is commonly used on the web to replace accented characters
	<ins>“<code>ẚèìòù</code>”</ins> with their ASCII equivalents <ins>“<code>aeiou</code>”</ins> for use
	as filenames or URLs.
</p><p>
	This is my take on the subject. It uses several levels of fallback to guarantee a result whatever your
	environment, and it transliterates better than the example snippets I have seen elsewhere, for reasons explained
	in the code comments and in the notes after the code.
</p>

<pre><code>//safeTransliterate v3, copyright (cc-by 3.0) Kroc Camen &lt;camendesign.com&gt;
//generate a safe (a-z0-9_) string, for use as filenames or URLs, from an arbitrary string
function safeTransliterate ($text) {
	//if available, this function uses PHP5.4's transliterate, which is capable of converting arabic, hebrew, greek,
	//chinese, japanese and more into ASCII! however, we use our manual (and crude) fallback *first* instead because
	//we will take the liberty of transliterating some things into more readable ASCII-friendly forms,
	//e.g. "100℃" &gt; "100degc" instead of "100oc"
	
	/* manual transliteration list:
	   -------------------------------------------------------------------------------------------------------------- */
	/* this list is supposed to be practical, not comprehensive, representing:
	   1. the most common accents and special letters that get typed, and
	   2. the most practical transliterations for readability;
	   
	   given that I know nothing of other languages, I will need your assistance to improve this list,
	   mail kroc@camendesign.com with help and suggestions.
	   
	   this data was produced with the help of:
	   http://www.unicode.org/charts/normalization/
	   http://www.yuiblog.com/sandbox/yui/3.3.0pr3/api/text-data-accentfold.js.html
	   http://www.utf8-chartable.de/
	*/
	static $translit = array (
		'a'	=&gt; '/[ÀÁÂẦẤẪẨÃĀĂẰẮẴȦẲǠẢÅÅǺǍȀȂẠẬẶḀĄẚàáâầấẫẩãāăằắẵẳȧǡảåǻǎȁȃạậặḁą]/u',
		'b'	=&gt; '/[ḂḄḆḃḅḇ]/u',			'c'	=&gt; '/[ÇĆĈĊČḈçćĉċčḉ]/u',
		'd'	=&gt; '/[ÐĎḊḌḎḐḒďḋḍḏḑḓð]/u',
		'e'	=&gt; '/[ÈËĒĔĖĘĚȄȆȨḔḖḘḚḜẸẺẼẾỀỂỄỆèëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]/u',
		'f'	=&gt; '/[Ḟḟ]/u',				'g'	=&gt; '/[ĜĞĠĢǦǴḠĝğġģǧǵḡ]/u',
		'h'	=&gt; '/[ĤȞḢḤḦḨḪĥȟḣḥḧḩḫẖ]/u',		'i'	=&gt; '/[ÌÏĨĪĬĮİǏȈȊḬḮỈỊiìïĩīĭįǐȉȋḭḯỉị]/u',
		'j'	=&gt; '/[Ĵĵǰ]/u',				'k'	=&gt; '/[ĶǨḰḲḴKķǩḱḳḵ]/u',
		'l'	=&gt; '/[ĹĻĽĿḶḸḺḼĺļľŀḷḹḻḽ]/u',		'm'	=&gt; '/[ḾṀṂḿṁṃ]/u',
		'n'	=&gt; '/[ÑŃŅŇǸṄṆṈṊñńņňǹṅṇṉṋ]/u',
		'o'	=&gt; '/[ÒÖŌŎŐƠǑǪǬȌȎȪȬȮȰṌṎṐṒỌỎỐỒỔỖỘỚỜỞỠỢØǾòöōŏőơǒǫǭȍȏȫȭȯȱṍṏṑṓọỏốồổỗộớờởỡợøǿ]/u',
		'p'	=&gt; '/[ṔṖṕṗ]/u',				'r'	=&gt; '/[ŔŖŘȐȒṘṚṜṞŕŗřȑȓṙṛṝṟ]/u',
		's'	=&gt; '/[ŚŜŞŠȘṠṢṤṦṨſśŝşšșṡṣṥṧṩ]/u',	'ss'	=&gt; '/[ß]/u',
		't'	=&gt; '/[ŢŤȚṪṬṮṰţťțṫṭṯṱẗ]/u',		'th'	=&gt; '/[Þþ]/u',
		'u'	=&gt; '/[ÙŨŪŬŮŰŲƯǓȔȖṲṴṶṸṺỤỦỨỪỬỮỰùũūŭůűųưǔȕȗṳṵṷṹṻụủứừửữựµ]/u',
		'v'	=&gt; '/[ṼṾṽṿ]/u',				'w'	=&gt; '/[ŴẀẂẄẆẈŵẁẃẅẇẉẘ]/u',
		'x'	=&gt; '/[ẊẌẋẍ×]/u',			'y'	=&gt; '/[ÝŶŸȲẎỲỴỶỸýÿŷȳẏẙỳỵỷỹ]/u',
		'z'	=&gt; '/[ŹŻŽẐẒẔźżžẑẓẕ]/u',				
		//combined letters and ligatures:
		'ae'	=&gt; '/[ÄǞÆǼǢäǟæǽǣ]/u',			'oe'	=&gt; '/[Œœ]/u',
		//(each class below holds single ligature code points, e.g. U+FB01, not separate letters)
		'dz'	=&gt; '/[ǄǅǱǲǆǳ]/u',
		'ff'	=&gt; '/[ﬀ]/u',	'fi'	=&gt; '/[ﬁ]/u',	'fl'	=&gt; '/[ﬂ]/u',
		'ffi'	=&gt; '/[ﬃ]/u',	'ffl'	=&gt; '/[ﬄ]/u',
		'ij'	=&gt; '/[Ĳĳ]/u',	'lj'	=&gt; '/[Ǉǈǉ]/u',	'nj'	=&gt; '/[Ǌǋǌ]/u',
		'st'	=&gt; '/[ﬅﬆ]/u',	'ue'	=&gt; '/[ÜǕǗǙǛüǖǘǚǜ]/u',
		//currencies:
		'eur'   =&gt; '/[€]/u',	'cents'	=&gt; '/[¢]/u',	'lira'	=&gt; '/[₤]/u',	'dollars' =&gt; '/[$]/u',
		'won'	=&gt; '/[₩]/u',	'rs'	=&gt; '/[₨]/u',	'yen'	=&gt; '/[¥]/u',	'pounds'  =&gt; '/[£]/u',
		'pts'	=&gt; '/[₧]/u',
		//misc:
		'degc'	=&gt; '/[℃]/u',	'degf'  =&gt; '/[℉]/u',
		'no'	=&gt; '/[№]/u',	'tm'	=&gt; '/[™]/u'
	);
	//do the manual transliteration first
	$text = preg_replace (array_values ($translit), array_keys ($translit), $text);
	
	//flatten the text down to just a-z0-9, with underscores instead of spaces
	//(note: \p{P} also matches "-" and "_", so dashes and underscores already in the text are stripped)
	$text = preg_replace (
		//remove punctuation	//replace non a-z	//deduplicate	//trim underscores from start &amp; end
		array ('/\p{P}/u',	'/[^_a-z0-9-]/i',	'/_{2,}/',	'/^_|_$/'),
		array ('',		'_',			'_',		''),
		
		//attempt transliteration with PHP5.4's transliteration engine (best):
		//(this method can handle near anything, including converting chinese and arabic letters to ASCII.
		// requires the 'intl' extension to be enabled)
		function_exists ('transliterator_transliterate') ? transliterator_transliterate (
			//split unicode accents and symbols, e.g. "Å" &gt; "A°":
			'NFKD; '.
			//convert everything to the Latin charset e.g. "ま" &gt; "ma":
			//(splitting the unicode before transliterating catches some complex cases,
			// such as: "㏳" &gt;NFKD&gt; "20日" &gt;Latin&gt; "20ri")
			'Latin; '.
			//because the Latin unicode table still contains a large number of non-pure-A-Z glyphs (e.g. "œ"),
			//convert what remains to an even stricter set of characters, the US-ASCII set:
			//(we must do this because "Latin/US-ASCII" alone is not able to transliterate non-Latin characters
			// such as "ま". this two-stage method also means we catch awkward characters such as:
			// "㏀" &gt;Latin&gt; "kΩ" &gt;Latin/US-ASCII&gt; "kO")
			'Latin/US-ASCII; '.
			//remove the now stand-alone diacritics from the string
			'[:Nonspacing Mark:] Remove; '.
			//change everything to lowercase; anything non A-Z 0-9 that remains will be removed by
			//the letter stripping above
			'Lower',
		$text)
		
		//attempt transliteration with iconv: &lt;php.net/manual/en/function.iconv.php&gt;
		: strtolower (function_exists ('iconv') ? str_replace (array ("'", '"', '`', '^', '~'), '', strtolower (
			//note: results of this are different depending on iconv version,
			//      sometimes the diacritics are written to the side e.g. "ñ" = "~n", which are removed
			iconv ('UTF-8', 'US-ASCII//IGNORE//TRANSLIT', $text)
		)) : $text)
	);
	
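	//old iconv versions and certain inputs may cause a nullstring. don't allow a blank response
	//(compare strictly, otherwise a valid result of "0" would count as empty)
	return ($text === null || $text === '') ? '_' : $text;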
}</code></pre>

<p>
	<small><a href="https://gist.github.com/3551351" rel="external">Edit this code</a> on GitHub &bull; You may do
	anything with this code as long as you leave credit in the code</small>
	<br /><br />
	There is no ‘guaranteed available’ method with which to transliterate effectively in PHP; the functions that can
	do so vary by PHP version and are not always installed and enabled on every PHP server out there. The
	only built-in function guaranteed to be present,
	<a href="http://uk.php.net/manual/en/function.strtr.php" rel="external"><samp>strtr</samp></a>, is not enough
	by itself. In its three-argument form it swaps one byte for another, so it cannot handle multi-byte UTF-8
	letters or the common need to expand one character to multiple, such as converting “<code>ß</code>” to
	“<code>ss</code>”. In its array form it can replace whole strings, but only those you list, so you still
	have to supply the entire mapping yourself.
</p><p>
	There are a number of libraries and functions out there for transliteration. Those that are comprehensive are
	massive, and therefore total overkill for a small project, and often have unhelpful licences; those that are
	small and compact are usually very incomplete and rely upon a single method that might not be available to you.
</p><p>
	My method uses fallbacks so as to guarantee a result and improves upon other methods out there:
</p>
<dl>
	<dt>Better unicode normalisation</dt>
	<dd>
		All the transliteration examples I have seen out there that use PHP5.4’s transliterator make the mistake
		of copy-pasting the example normalisation string given on the website. The results are simply wrong. My
		code uses a unicode normalisation method that I have worked out and have not seen used anywhere else.
		It handles thousands of cases that the other libraries and functions I have seen do not, because it uses
		two stages of transliteration: first from any script to Latin, and then from Latin to US-ASCII. This
		means that characters such as “<code>㏀</code>” become “<code>ko</code>”, whereas other code I have
		seen fails here because it transliterates “<code>㏀</code>” to “<code>kΩ</code>” and then removes
		the “<code>Ω</code>” for being non-ASCII.
	</dd>
	<dt>Readability</dt>
	<dd>
		Since the transliteration will be used for filenames or URLs, readability is important, so it’s better to
		convert some things to more meaningful words, such as “<code>¥</code>” to “<code>yen</code>”
		instead of just “<code>y</code>”. In the example above, it would be trivial for the function to
		convert “<code>㏀</code>” to “<code>k-ohm</code>”, but I chose not to since the single character
		“<code>㏀</code>” is almost never used on the Web.
	</dd>
</dl>
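<p>
	To see what each stage of the transliterator chain contributes, the rules can be applied cumulatively. This is
	only a sketch: <samp>traceStages</samp> is a hypothetical helper for illustration, not part of
	<samp>safeTransliterate</samp>, and it assumes the ‘intl’ extension is enabled (it returns
	<code>null</code> when it is not).
</p>

<pre><code>//a minimal sketch: traceStages is a hypothetical helper for illustration, not part of safeTransliterate
function traceStages ($text) {
	//the 'intl' extension is optional; report that it is unavailable instead of failing
	if (!function_exists ('transliterator_transliterate')) return null;
	
	//the same rules, in the same order, as the compound ID used in safeTransliterate
	$rules  = array ('NFKD', 'Latin', 'Latin/US-ASCII', '[:Nonspacing Mark:] Remove', 'Lower');
	$stages = array ();
	foreach ($rules as $i =&gt; $rule) {
		//apply the first rule, then the first two, and so on, so each entry shows one more stage
		$id = implode ('; ', array_slice ($rules, 0, $i + 1));
		$stages[$rule] = transliterator_transliterate ($id, $text);
	}
	return $stages;
}</code></pre>

<p>
	Running <code>print_r (traceStages ('㏀'));</code> prints the string after each rule, which makes it easy to
	compare against a single-stage ID on your own server. No expected output is listed here because it can vary
	with the ICU data installed.
</p>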

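<p>
	The readability point comes down to ordering: the hand-picked replacements run before anything generic gets a
	chance to see the character. A minimal sketch of just that step, assuming a cut-down copy of the manual list;
	<samp>readableSlug</samp> is a hypothetical name and it skips the intl and iconv stages entirely.
</p>

<pre><code>//a minimal sketch: readableSlug is a hypothetical helper using a cut-down copy of the manual list
function readableSlug ($text) {
	static $readable = array (
		'ss'	=&gt; '/[ß]/u',	'yen'	=&gt; '/[¥]/u',	'degc'	=&gt; '/[℃]/u'
	);
	//readable replacements go first, so that a generic engine never sees these characters
	$text = preg_replace (array_values ($readable), array_keys ($readable), $text);
	
	//then the same flattening patterns as safeTransliterate (a transliteration engine would run here)
	return preg_replace (
		array ('/\p{P}/u',	'/[^_a-z0-9-]/i',	'/_{2,}/',	'/^_|_$/'),
		array ('',		'_',			'_',		''),
		strtolower ($text)
	);
}

echo readableSlug ('100℃ Straße: ¥500!');	//prints "100degc_strasse_yen500"</code></pre>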
<hr />

<p>
	<strong>I don’t know anything about other languages and writing systems</strong>, so if there’s something amiss
	in my code, please let me know <a href="mailto:kroc@camendesign.com">via email</a>,
	<a href="http://forum.camendesign.com" rel="external">the forums</a> or by editing the
	<a href="https://gist.github.com/3551351" rel="external">GitHub gist</a>.
</p>
</section>
<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
</article>
<footer>
	<nav><a href="http://forum.camendesign.com">‹ Discuss this in the Forum ›</a></nav>
		
	<a href="mailto:kroc@camendesign.com">kroc@camendesign.com</a>
	<nav>view-source:
		<a href="/code/transliteration.rem">Rem</a> •
		<a href="/code/transliteration.html">HTML</a> •
		<a href="/design/">CSS</a> •
		<a href="/.system/">PHP</a> •
		<a href="/.htaccess">.htaccess</a>
	</nav>
</footer>
<!-- =================================================================================================== code is art === -->