Camen Design

c share + remix


Transliteration is the replacement of individual characters in a piece of text with alternative characters. On the web it is commonly used to replace accented characters such as ẚèìòù with their ASCII equivalents aeiou, so that the text can be used in filenames or URLs.
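To make the idea concrete, here is a minimal sketch of one-to-one replacement using a tiny lookup table. The function name `toAsciiVowels` and its five-character table are made up for this illustration and are far smaller than any real transliteration list.

```php
<?php
// Hypothetical helper, for illustration only: swaps a handful of accented
// vowels for their plain ASCII letters. The source file must be saved as UTF-8.
function toAsciiVowels(string $text): string
{
    $from = ['à', 'è', 'ì', 'ò', 'ù'];
    $to   = ['a', 'e', 'i', 'o', 'u'];

    // str_replace works on byte sequences, so multi-byte UTF-8 characters
    // are matched and replaced as whole units
    return str_replace($from, $to, $text);
}

echo toAsciiVowels('caffè città'), "\n";   // prints "caffe citta"
```

Anything not in the table passes through untouched, which is why a table this small is of little practical use on its own.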

This is my take on the subject. It uses different levels of fallback to guarantee a result whatever your environment, and it transliterates better than any example snippet you have seen out there, as explained in the code and afterwards.

~~~ PHP ~~~>
//safeTransliterate v3, copyright (cc-by 3.0) Kroc Camen
//generate a safe (a-z0-9_) string, for use as filenames or URLs, from an arbitrary string
function safeTransliterate ($text) {
	//if available, this function uses PHP5.4's transliterate, which is capable of converting arabic, hebrew, greek,
	//chinese, japanese and more into ASCII! however, we use our manual (and crude) fallback first instead because
	//we will take the liberty of transliterating some things into more readable ASCII-friendly forms,
	//e.g. "100℃" > "100degc" instead of "100oc"

	/* manual transliteration list:
	   -------------------------------------------------------------------------------------------------------------- */
	/* this list is supposed to be practical, not comprehensive, representing:
	   1. the most common accents and special letters that get typed, and
	   2. the most practical transliterations for readability;

	   given that I know nothing of other languages, I will need your assistance to improve this list,
	   mail with help and suggestions.

	   this data was produced with the help of:
	*/
	static $translit = array (
		'a'	=> '/[ÀÁÂẦẤẪẨÃĀĂẰẮẴȦẲǠẢÅÅǺǍȀȂẠẬẶḀĄẚàáâầấẫẩãāăằắẵẳȧǡảåǻǎȁȃạậặḁą]/u',
		'b'	=> '/[ḂḄḆḃḅḇ]/u',
		'c'	=> '/[ÇĆĈĊČḈçćĉċčḉ]/u',
		'd'	=> '/[ÐĎḊḌḎḐḒďḋḍḏḑḓð]/u',
		'e'	=> '/[ÈËĒĔĖĘĚȄȆȨḔḖḘḚḜẸẺẼẾỀỂỄỆèëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]/u',
		'f'	=> '/[Ḟḟ]/u',
		'g'	=> '/[ĜĞĠĢǦǴḠĝğġģǧǵḡ]/u',
		'h'	=> '/[ĤȞḢḤḦḨḪĥȟḣḥḧḩḫẖ]/u',
		'i'	=> '/[ÌÏĨĪĬĮİǏȈȊḬḮỈỊiìïĩīĭįǐȉȋḭḯỉị]/u',
		'j'	=> '/[Ĵĵǰ]/u',
		'k'	=> '/[ĶǨḰḲḴKķǩḱḳḵ]/u',
		'l'	=> '/[ĹĻĽĿḶḸḺḼĺļľŀḷḹḻḽ]/u',
		'm'	=> '/[ḾṀṂḿṁṃ]/u',
		'n'	=> '/[ÑŃŅŇǸṄṆṈṊñńņňǹṅṇṉṋ]/u',
		'o'	=> '/[ÒÖŌŎŐƠǑǪǬȌȎȪȬȮȰṌṎṐṒỌỎỐỒỔỖỘỚỜỞỠỢØǾòöōŏőơǒǫǭȍȏȫȭȯȱṍṏṑṓọỏốồổỗộớờởỡợøǿ]/u',
		'p'	=> '/[ṔṖṕṗ]/u',
		'r'	=> '/[ŔŖŘȐȒṘṚṜṞŕŗřȑȓṙṛṝṟ]/u',
		's'	=> '/[ŚŜŞŠȘṠṢṤṦṨſśŝşšșṡṣṥṧṩ]/u',
		'ss'	=> '/[ß]/u',
		't'	=> '/[ŢŤȚṪṬṮṰţťțṫṭṯṱẗ]/u',
		'th'	=> '/[Þþ]/u',
		'u'	=> '/[ÙŨŪŬŮŰŲƯǓȔȖṲṴṶṸṺỤỦỨỪỬỮỰùũūŭůűųưǔȕȗṳṵṷṹṻụủứừửữựµ]/u',
		'v'	=> '/[ṼṾṽṿ]/u',
		'w'	=> '/[ŴẀẂẄẆẈŵẁẃẅẇẉẘ]/u',
		'x'	=> '/[ẊẌẋẍ×]/u',
		'y'	=> '/[ÝŶŸȲẎỲỴỶỸýÿŷȳẏẙỳỵỷỹ]/u',
		'z'	=> '/[ŹŻŽẐẒẔźżžẑẓẕ]/u',
		//combined letters and ligatures:
		'ae'	=> '/[ÄǞÆǼǢäǟæǽǣ]/u',
		'oe'	=> '/[Œœ]/u',
		'dz'	=> '/[ǄǅǱǲǆǳ]/u',
		'ff'	=> '/[ﬀ]/u',
		'fi'	=> '/[ﬃﬁ]/u',
		'ffl'	=> '/[ﬄﬂ]/u',
		'ij'	=> '/[Ĳĳ]/u',
		'lj'	=> '/[Ǉǈǉ]/u',
		'nj'	=> '/[Ǌǋǌ]/u',
		'st'	=> '/[ﬅﬆ]/u',
		'ue'	=> '/[ÜǕǗǙǛüǖǘǚǜ]/u',
		//currencies:
		'eur'	=> '/[€]/u',
		'cents'	=> '/[¢]/u',
		'lira'	=> '/[₤]/u',
		'dollars' => '/[$]/u',
		'won'	=> '/[₩]/u',
		'rs'	=> '/[₨]/u',
		'yen'	=> '/[¥]/u',
		'pounds' => '/[£]/u',
		'pts'	=> '/[₧]/u',
		//misc:
		'degc'	=> '/[℃]/u',
		'degf'	=> '/[℉]/u',
		'no'	=> '/[№]/u',
		'tm'	=> '/[™]/u'
	);
	//do the manual transliteration first
	$text = preg_replace (array_values ($translit), array_keys ($translit), $text);

	//flatten the text down to just a-z0-9 and dash, with underscores instead of spaces
	$text = preg_replace (
		//remove punctuation	//replace non a-z	//deduplicate	//trim underscores from start & end
		array ('/\p{P}/u',	'/[^_a-z0-9-]/i',	'/_{2,}/',	'/^_|_$/'),
		array ('',		'_',			'_',		''),

		//attempt transliteration with PHP5.4's transliteration engine (best):
		//(this method can handle near anything, including converting chinese and arabic letters to ASCII.
		// requires the 'intl' extension to be enabled)
		function_exists ('transliterator_transliterate') ? transliterator_transliterate (
			//split unicode accents and symbols, e.g. "Å" > "A°":
			'NFKD; '.
			//convert everything to the Latin charset e.g. "ま" > "ma":
			//(splitting the unicode before transliterating catches some complex cases,
			// such as: "㏳" >NFKD> "20日" >Latin> "20ri")
			'Latin; '.
			//because the Latin unicode table still contains a large number of non-pure-A-Z glyphs (e.g. "œ"),
			//convert what remains to an even stricter set of characters, the US-ASCII set:
			//(we must do this because "Latin/US-ASCII" alone is not able to transliterate non-Latin characters
			// such as "ま". this two-stage method also means we catch awkward characters such as:
			// "㏀" >Latin> "kΩ" >Latin/US-ASCII> "kO")
			'Latin/US-ASCII; '.
			//remove the now stand-alone diacritics from the string
			'[:Nonspacing Mark:] Remove; '.
			//change everything to lowercase; anything non A-Z 0-9 that remains will be removed by
			//the letter stripping above
			'Lower',
		$text)

		//attempt transliteration with iconv:
		: strtolower (function_exists ('iconv') ? str_replace (array ("'", '"', '`', '^', '~'), '', strtolower (
			//note: results of this are different depending on iconv version,
			//      sometimes the diacritics are written to the side e.g. "ñ" = "~n", which are removed
			iconv ('UTF-8', 'US-ASCII//IGNORE//TRANSLIT', $text)
		)) : $text)
	);

	//old iconv versions and certain inputs may cause a nullstring. don't allow a blank response
	return !$text ? '_' : $text;
}
<~~~

((Edit this code on GitHub • You may do anything with this code as long as you leave credit in the code))

There is no 'guaranteed available' method with which to transliterate effectively in PHP; the functions that can do so vary by PHP version and are not likely to be installed and enabled on every PHP server out there. The only built-in function guaranteed to be present, strtr, in its three-argument form only replaces one character with another, and so doesn't handle the common need to expand one character to multiple, such as converting "ß" to "ss".

There are a number of libraries and functions out there for transliteration. Those that are comprehensive are massive, and therefore total overkill for a small project, and often have unhelpful licences; those that are small and compact are usually very incomplete and rely upon a single method that might not be available to you.

My method uses fallbacks so as to guarantee a result and improves upon other methods out there:

:: Better unicode normalisation
All the transliteration examples I have seen out there that use PHP5.4's transliterator make the mistake of copy-pasting the example normalisation string given on the website. The results are simply wrong. My code uses a unicode normalisation method that I have worked out and have not seen used anywhere else.
It handles thousands of cases that no other library/function does because it uses two stages of transliteration, first from any script to the Latin script and then from Latin to US-ASCII. This means that characters such as "㏀" become "ko", whereas other code I have seen would fail here because it would transliterate "㏀" to "kΩ" and then remove the "Ω" for being non-ASCII.

:: Readability
Since the transliteration will be used for filenames or URLs, readability is important, so it's better to convert some things to more meaningful words, such as "¥" to "yen" instead of just "y". In the example above, it would be trivial for the function to convert "㏀" to "k-ohm", but I chose not to since the single character "㏀" is almost never used on the Web.
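The two points above can be sketched together in a few lines: readable replacements are applied first, then the staged ICU transform runs over whatever is left. This is only an illustration under stated assumptions, not a substitute for the full function. `readableThenIcu` and its three-entry table are hypothetical, the stage list is copied from the transform string in the code above, and the ICU step is skipped entirely when the 'intl' extension is missing or the transform cannot be created.

```php
<?php
// Hypothetical sketch: manual "readable" replacements first, then the
// staged ICU transform if the 'intl' extension happens to be available.
function readableThenIcu(string $text): string
{
    // 1. readable words win over whatever the generic engine would choose
    $readable = ['¥' => 'yen', '℃' => 'degc', 'ß' => 'ss'];
    $text = str_replace(array_keys($readable), array_values($readable), $text);

    // 2. without 'intl' there is nothing more this sketch can do
    if (!class_exists('Transliterator')) {
        return $text;
    }

    // the stages, in order, joined into one compound transform ID
    $stages = [
        'NFKD',                        // split accents and symbols apart
        'Latin',                       // any script to the Latin script
        'Latin/US-ASCII',              // Latin to plain US-ASCII
        '[:Nonspacing Mark:] Remove',  // drop the now stand-alone diacritics
        'Lower',                       // lowercase what remains
    ];
    $engine = Transliterator::create(implode('; ', $stages));
    if ($engine === null) {
        return $text;   // this ICU build rejected the transform ID
    }

    $result = $engine->transliterate($text);
    return $result === false ? $text : $result;
}

echo readableThenIcu('100℃'), "\n";   // prints "100degc"
```

Because the manual table runs first, "℃" has already become "degc" by the time the generic engine sees the text, so the result is the same with or without 'intl'. Input that the table does not cover will differ between the two paths, which is the reason the full function carries further fallbacks.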

I don’t know anything about other languages and writing systems, so if there’s something amiss in my code, please let me know via email, the forums or by editing the GitHub gist.