This is a document written using ReMarkable, a shorthand syntax for generating HTML.

{	"date"		:	201208311148,
	"updated"	:	201209150943,
	"licence"	:	"cc-by",
	"tags"		:	["code-is-art", "web-dev"]
}

<section>

# Transliteration #

<aside>
	*Update:* Reverted o-umlaut conversion to "``ö->o``" based on <this discussion (http://forum.camendesign.com/language_specific_transliteration_re_transliteration)>.
</aside>

*Transliteration is when you replace individual characters in a piece of text with alternative characters*. It is commonly used on the web to replace accented characters ["``ẚèìòù``"] with their ASCII equivalents ["``aeiou``"] for use in filenames or URLs.

This is my take on the subject. It uses several levels of fallback to guarantee a result whatever your environment, and it transliterates better than the example snippets you will have seen out there, as explained in the code and afterwards.

~~~ PHP ~~~>
//safeTransliterate v3, copyright (cc-by 3.0) Kroc Camen <camendesign.com>
//generate a safe (a-z0-9_) string, for use as filenames or URLs, from an arbitrary string
function safeTransliterate ($text) {
	//if available, this function uses PHP5.4's transliterate, which is capable of converting arabic, hebrew, greek,
	//chinese, japanese and more into ASCII! however, we use our manual (and crude) fallback *first* instead because
	//we will take the liberty of transliterating some things into more readable ASCII-friendly forms,
	//e.g. "100℃" > "100degc" instead of "100oc"
	
	/* manual transliteration list:
	   -------------------------------------------------------------------------------------------------------------- */
	/* this list is supposed to be practical, not comprehensive, representing:
	   1. the most common accents and special letters that get typed, and
	   2. the most practical transliterations for readability;
	   
	   given that I know nothing of other languages, I will need your assistance to improve this list,
	   mail kroc@camendesign.com with help and suggestions.
	   
	   this data was produced with the help of:
	   http://www.unicode.org/charts/normalization/
	   http://www.yuiblog.com/sandbox/yui/3.3.0pr3/api/text-data-accentfold.js.html
	   http://www.utf8-chartable.de/
	*/
	static $translit = array (
		'a'	=> '/[ÀÁÂẦẤẪẨÃĀĂẰẮẴȦẲǠẢÅÅǺǍȀȂẠẬẶḀĄẚàáâầấẫẩãāăằắẵẳȧǡảåǻǎȁȃạậặḁą]/u',
		'b'	=> '/[ḂḄḆḃḅḇ]/u',			'c'	=> '/[ÇĆĈĊČḈçćĉċčḉ]/u',
		'd'	=> '/[ÐĎḊḌḎḐḒďḋḍḏḑḓð]/u',
		'e'	=> '/[ÈËĒĔĖĘĚȄȆȨḔḖḘḚḜẸẺẼẾỀỂỄỆèëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]/u',
		'f'	=> '/[Ḟḟ]/u',				'g'	=> '/[ĜĞĠĢǦǴḠĝğġģǧǵḡ]/u',
		'h'	=> '/[ĤȞḢḤḦḨḪĥȟḣḥḧḩḫẖ]/u',		'i'	=> '/[ÌÏĨĪĬĮİǏȈȊḬḮỈỊiìïĩīĭįǐȉȋḭḯỉị]/u',
		'j'	=> '/[Ĵĵǰ]/u',				'k'	=> '/[ĶǨḰḲḴKķǩḱḳḵ]/u',
		'l'	=> '/[ĹĻĽĿḶḸḺḼĺļľŀḷḹḻḽ]/u',		'm'	=> '/[ḾṀṂḿṁṃ]/u',
		'n'	=> '/[ÑŃŅŇǸṄṆṈṊñńņňǹṅṇṉṋ]/u',
		'o'	=> '/[ÒÖŌŎŐƠǑǪǬȌȎȪȬȮȰṌṎṐṒỌỎỐỒỔỖỘỚỜỞỠỢØǾòöōŏőơǒǫǭȍȏȫȭȯȱṍṏṑṓọỏốồổỗộớờởỡợøǿ]/u',
		'p'	=> '/[ṔṖṕṗ]/u',				'r'	=> '/[ŔŖŘȐȒṘṚṜṞŕŗřȑȓṙṛṝṟ]/u',
		's'	=> '/[ŚŜŞŠȘṠṢṤṦṨſśŝşšșṡṣṥṧṩ]/u',	'ss'	=> '/[ß]/u',
		't'	=> '/[ŢŤȚṪṬṮṰţťțṫṭṯṱẗ]/u',		'th'	=> '/[Þþ]/u',
		'u'	=> '/[ÙŨŪŬŮŰŲƯǓȔȖṲṴṶṸṺỤỦỨỪỬỮỰùũūŭůűųưǔȕȗṳṵṷṹṻụủứừửữựµ]/u',
		'v'	=> '/[ṼṾṽṿ]/u',				'w'	=> '/[ŴẀẂẄẆẈŵẁẃẅẇẉẘ]/u',
		'x'	=> '/[ẊẌẋẍ×]/u',			'y'	=> '/[ÝŶŸȲẎỲỴỶỸýÿŷȳẏẙỳỵỷỹ]/u',
		'z'	=> '/[ŹŻŽẐẒẔźżžẑẓẕ]/u',				
		//combined letters and ligatures:
		'ae'	=> '/[ÄǞÆǼǢäǟæǽǣ]/u',			'oe'	=> '/[Œœ]/u',
		'dz'	=> '/[ǄǅǱǲǆǳ]/u',
		'ff'	=> '/[ﬀ]/u',	'fi'	=> '/[ﬃﬁ]/u',	'ffl'	=> '/[ﬄﬂ]/u',
		'ij'	=> '/[Ĳĳ]/u',	'lj'	=> '/[Ǉǈǉ]/u',	'nj'	=> '/[Ǌǋǌ]/u',
		'st'	=> '/[ﬅﬆ]/u',	'ue'	=> '/[ÜǕǗǙǛüǖǘǚǜ]/u',
		//currencies:
		'eur'   => '/[€]/u',	'cents'	=> '/[¢]/u',	'lira'	=> '/[₤]/u',	'dollars' => '/[$]/u',
		'won'	=> '/[₩]/u',	'rs'	=> '/[₨]/u',	'yen'	=> '/[¥]/u',	'pounds'  => '/[£]/u',
		'pts'	=> '/[₧]/u',
		//misc:
		'degc'	=> '/[℃]/u',	'degf'  => '/[℉]/u',
		'no'	=> '/[№]/u',	'tm'	=> '/[™]/u'
	);
	//do the manual transliteration first
	$text = preg_replace (array_values ($translit), array_keys ($translit), $text);
	
	//flatten the text down to just a-z0-9 and dash, with underscores instead of spaces
	$text = preg_replace (
		//remove punctuation	//replace non a-z	//deduplicate	//trim underscores from start & end
		array ('/\p{P}/u',	'/[^_a-z0-9-]/i',	'/_{2,}/',	'/^_|_$/'),
		array ('',		'_',			'_',		''),
		
		//attempt transliteration with PHP5.4's transliteration engine (best):
		//(this method can handle near anything, including converting chinese and arabic letters to ASCII.
		// requires the 'intl' extension to be enabled)
		function_exists ('transliterator_transliterate') ? transliterator_transliterate (
			//split unicode accents and symbols, e.g. "Å" > "A°":
			'NFKD; '.
			//convert everything to the Latin charset e.g. "ま" > "ma":
			//(splitting the unicode before transliterating catches some complex cases,
			// such as: "㏳" >NFKD> "20日" >Latin> "20ri")
			'Latin; '.
			//because the Latin unicode table still contains a large number of non-pure-A-Z glyphs (e.g. "œ"),
			//convert what remains to an even stricter set of characters, the US-ASCII set:
			//(we must do this because "Latin/US-ASCII" alone is not able to transliterate non-Latin characters
			// such as "ま". this two-stage method also means we catch awkward characters such as:
			// "㏀" >Latin> "kΩ" >Latin/US-ASCII> "kO")
			'Latin/US-ASCII; '.
			//remove the now stand-alone diacritics from the string
			'[:Nonspacing Mark:] Remove; '.
			//change everything to lowercase; anything non A-Z 0-9 that remains will be removed by
			//the letter stripping above
			'Lower',
		$text)
		
		//attempt transliteration with iconv: <php.net/manual/en/function.iconv.php>
		: strtolower (function_exists ('iconv') ? str_replace (array ("'", '"', '`', '^', '~'), '', strtolower (
			//note: results of this are different depending on iconv version,
			//      sometimes the diacritics are written to the side e.g. "ñ" = "~n", which are removed
			iconv ('UTF-8', 'US-ASCII//IGNORE//TRANSLIT', $text)
		)) : $text)
	);
	
	//old iconv versions and certain inputs may cause a nullstring. don't allow a blank response
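	//(compare against an empty string rather than using `!$text`, so that a result of "0" is not treated as blank)
	return (string) $text === '' ? '_' : $text;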
}
<~~~

((<Edit this code (https://gist.github.com/3551351)> on GitHub &bull; You may do anything with this code as long as you leave credit in the code))
¬¬
There is no 'guaranteed available' method with which to transliterate effectively in PHP; the functions that can do so vary by PHP version and are not likely to be installed and enabled on every PHP server out there. The only built-in function guaranteed to be present, <`strtr` (http://uk.php.net/manual/en/function.strtr.php)>, in its basic three-argument form only swaps one byte for another, so it cannot handle the common need to expand one character into several, such as converting "``ß``" to "``ss``" (its array form can, but only with a replacement table that you supply yourself).
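
A minimal sketch of that limitation, assuming the source file is saved as UTF-8 (the variable names are only for illustration). It compares a three-argument `strtr` call with the regex replacement used in the function above:

~~~ PHP ~~~>
$text = 'straße';

//three-argument strtr swaps single bytes; "ß" is two bytes in UTF-8,
//so only its first byte is swapped and the second is left dangling
$bytewise = strtr ($text, 'ß', 's');

//a regex with the /u flag works on whole characters and the replacement can be any length
$expanded = preg_replace ('/[ß]/u', 'ss', $text);

var_dump (preg_match ('//u', $bytewise));	//bool(false): the string is no longer valid UTF-8
var_dump ($expanded);				//string(7) "strasse"
<~~~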

There are a number of libraries and functions out there for transliteration. Those that are comprehensive are massive, and therefore total overkill for a small project, and often have unhelpful licences; those that are small and compact are usually very incomplete and rely upon a single method that might not be available to you.
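
To see which methods a given server actually offers, something like the following could be used. This is only a sketch: `availableTransliterators` is a hypothetical helper, not part of the function above, and it lists the methods in the same order of preference that the function uses.

~~~ PHP ~~~>
//hypothetical helper: list the transliteration methods present on this server, best first
function availableTransliterators () {
	$methods = array ();
	//PHP5.4's transliterator, provided by the 'intl' extension
	if (function_exists ('transliterator_transliterate')) $methods[] = 'intl';
	//iconv, whose results vary between versions
	if (function_exists ('iconv')) $methods[] = 'iconv';
	//a hand-written replacement table only needs preg_replace, so it is the last resort
	$methods[] = 'manual';
	return $methods;
}

print_r (availableTransliterators ());
<~~~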

My method uses fallbacks so as to guarantee a result and improves upon other methods out there:

:: Better unicode normalisation
	All the transliteration examples I have seen out there that use PHP5.4's transliterator make the mistake of copy-pasting the example normalisation string given on the website. The results are simply wrong. My code uses a normalisation chain that I worked out myself and have not seen used anywhere else. It handles thousands of cases that other libraries and functions miss because it transliterates in two stages: first from any script to the Latin script, and then from Latin to US-ASCII. This means that a character such as "``㏀``" becomes "``ko``"; other code I have seen fails here because it transliterates "``㏀``" to "``kΩ``" and then removes the "``Ω``" for being non-ASCII.

:: Readability
	Since the transliteration will be used for filenames or URLs, readability is important, so it's better to convert some things to more meaningful words such as "``¥``" to "``yen``" instead of just "``y``". In the example above, it would be trivial for the function to convert "``㏀``" to "``k-ohm``" but I chose not to since the single character "``㏀``" is almost never used on the Web.
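
Both points can be tried in isolation. The sketch below is not the full function: `$words` is a cut-down, hypothetical table in the style of the manual list, and the second half only runs where the 'intl' extension is enabled. What the transliterator prints depends on the ICU version PHP was built against, so no output is promised for it here.

~~~ PHP ~~~>
//readable words first: a cut-down table in the style of the manual list
$words = array (
	'yen'	=> '/[¥]/u',
	'degc'	=> '/[℃]/u',
	'ss'	=> '/[ß]/u'
);
$text = preg_replace (array_values ($words), array_keys ($words), '100℃ / ¥500 / straße');
echo $text, "\n";	//100degc / yen500 / strasse

//the two-stage chain used by safeTransliterate: any script to Latin, then Latin to US-ASCII
$chain = 'NFKD; Latin; Latin/US-ASCII; [:Nonspacing Mark:] Remove; Lower';
if (function_exists ('transliterator_transliterate')) {
	foreach (array ('㏀', 'ま', 'Å') as $sample) {
		//var_export shows `false` if this ICU version rejects the chain
		echo $sample, ' > ', var_export (transliterator_transliterate ($chain, $sample), true), "\n";
	}
}
<~~~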

* * *

*I don't know anything about other languages and writing systems*, so if there's something amiss in my code, please let me know <via email (mailto:kroc@camendesign.com)>, <the forums (//forum.camendesign.com)> or by editing the <GitHub gist (https://gist.github.com/3551351)>.

</section>