camen design

Improved Title Case Function for PHP

Update 2: The script is failing the test suite. I’ll have to update again once I’ve solved this, but it looks tricky: PHP and Unicode issues.

Update: Fixed a couple of major problems:
1. HTML entities were not being removed before processing
2. Made a typo when porting from JavaScript that caused some failures

John Gruber originally published a script to Title Case text that works around the fringe cases.

A number of ports of the script followed, of which David Gouch’s JavaScript port is particularly noteworthy: it is smaller, simpler and handles more fringe cases.

I’ve ported this to PHP and put it to use on this site. My version is based on David Gouch’s JavaScript port, unlike the WordPress port which is, frankly, crap.

Code below.

//original Title Case script © John Gruber <daringfireball.net>
//javascript port © David Gouch <individed.com>
//PHP port of the above by Kroc Camen <camendesign.com>

//this is required for PHP to not break unicode characters in your titles when using `strtolower`/`strtoupper`
//you can place this near the top of your script, or within the function itself
mb_internal_encoding ("UTF-8");

function titleCase ($title) {
	//remove HTML, storing it for later
	//       HTML elements to ignore    | tags  | entities
	$regx = '/<(code|var)[^>]*>.*?<\/\1>|<[^>]+>|&\S+;/';
	preg_match_all ($regx, $title, $html, PREG_OFFSET_CAPTURE);
	$title = preg_replace ($regx, '', $title);
	
	//break by punctuation, find the start of words
	preg_match_all ('/[\w&`\'‘’"“\.@:\/\{\(\[<>_]+-? */', $title, $matches, PREG_OFFSET_CAPTURE);
	foreach ($matches[0] as &$m) $title = substr_replace ($title, $m[0]=
		//find words that should be lowercase
		$m[1]>0 && mb_substr ($title, $m[1]-2, 1) !== ':' && preg_match (
			'/^(a(nd?|s|t)?|b(ut|y)|en|for|i[fn]|o[fnr]|t(he|o)|vs?\.?|via)[ \-]/i', $m[0]
		//convert them to lowercase
		) ?	mb_strtolower ($m[0])
		
		//else:	brackets and other wrappers
		: (	preg_match ('/[\'"_{(\[]/', mb_substr ($title, $m[1]-1, 3))
		//convert first letter within wrapper to uppercase
		? 	mb_substr ($m[0], 0, 1).mb_strtoupper (mb_substr ($m[0], 1, 1)).mb_substr ($m[0], 2)
		
		//else:	do not uppercase these cases
		: (	preg_match ('/[A-Z]+|&|[\w]+[._][\w]+/', mb_substr ($m[0], 1)) ||
			preg_match ('/[\])}]/', mb_substr ($title, $m[1]-1, 3))
		?	$m[0]
		:	mb_strtoupper (mb_substr ($m[0], 0, 1)).mb_substr ($m[0], 1)
		)),
		$m[1], strlen ($m[0])
	);
	
	//restore the HTML
	foreach ($html[0] as &$tag) $title = substr_replace ($title, $tag[0], $tag[1], 0);
	return $title;
}
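
One place where PHP and Unicode can trip up a port like this (an observation about PHP itself, not a diagnosis confirmed for this script): `PREG_OFFSET_CAPTURE` reports offsets in bytes, while `mb_substr` counts in characters. A minimal sketch, using a made-up sample title, of how the two drift apart after a multi-byte character, and one way to convert between them:

```php
<?php
mb_internal_encoding ('UTF-8');

// hypothetical sample title; "\xC3\xA9" is the UTF-8 encoding of "é" (two bytes)
$title = "caf\xC3\xA9 society";

// PREG_OFFSET_CAPTURE returns the *byte* offset of each match
preg_match ('/society/', $title, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

// substr() works in bytes, so the byte offset lands in the right place
$viaBytes = substr ($title, $byteOffset, 7);

// mb_substr() works in characters, so the same number starts one character late
$viaChars = mb_substr ($title, $byteOffset, 7);

// converting: count the characters that come before the byte offset
$charOffset = mb_strlen (substr ($title, 0, $byteOffset));
$converted  = mb_substr ($title, $charOffset, 7);

echo $byteOffset, ' ', $charOffset, "\n";
echo $viaBytes, ' | ', $viaChars, ' | ', $converted, "\n";
```

With plain ASCII titles the two offsets are identical, which is why a mismatch like this only shows up once accented characters or curly quotes appear before the matched word.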

Anything broken, please let me know.
Kind regards,

Gosh, the stuff you can do in FF3.1 is sick. Rotated, skewed, blurred and colour-adjusted <video> all with no slow down, using standards—

So sick, I don’t know how to make use of all this new technology without going back to 1996 and blinging everything to the max.

I should not have to type “http://” into any web form, ever. This is a fundamental usability flaw that is rampant across the web.

How to Use <abbr> in HTML5, and in General

Before I begin, I should confess that I am guilty of never having followed any of these rules in the past. However, the whole reason for writing this article was to solve that problem. After moving to my new website back-end, I decided to go through the entire site’s contents with a fine brush and polish all of the code.

In doing that, I discovered how vague I was on the semantics of the abbr element, and working through all the test-cases that have sprung up in the wealth of HTML I’ve written for this site, I’ve documented here my new understanding of the often-abused abbr element.

Ⅰ. Abbreviations Are Not Dictionary Definitions

Let’s first define abbreviation clearly:

An abbreviation is where you have shortened one or more words into:
either one word, or an alternative phrase or acronym

The problem with the use of <abbr> so far has been that developers have assumed that every abbreviation and acronym has to be defined in full. This is incorrect.

BAD:	I made some <abbr title="American Standard Code for Information Interchange">ASCII</abbr> art.

An abbr element expands its contents into the desired spoken form. When you read a document, you naturally expand the abbreviations as appropriate in your mind.

GOOD:	Red <abbr title="versus">vs.</abbr> Blue

You would not read aloud the abbreviated “et cetera” in “Granny went to the market and bought apples, bread & milk etc.” as “eee-tee-see”, would you? So it should be with HTML abbreviations. Here are some examples:

BAD:	My <abbr title="Cascading Style Sheets">CSS</abbr> is tweaked almost daily.
GOOD:	My <abbr title="style sheet">CSS</abbr> is tweaked almost daily.

Here we’ve used the abbr element to span over an abbreviation and provide an alternative, natural way of reading the abbreviation.

GOOD:	price <abbr title="does not equal">!=</abbr> <abbr title="total cost of ownership">TCO</abbr>

We have adapted something unpronounceable as letters into something perfectly readable.

In general, abbreviations should maintain the grammar. Whilst not necessary, this example demonstrates how grammatical flow can be improved, whilst also expanding a Latin abbreviation:

Along the way, open-source has forgotten what it really means (<abbr title="that is,">i.e.</abbr> in real life) to give.

Try to communicate your intentions. If you would personally read something one way, define the abbreviation how you intend it to be read:

Switch to using the <abbr title="“wizzy-wig”">WYSIWYG</abbr> editor, instead.

In the example below, however, there’s an abbreviation (“CDs”) inside the abbreviation title:

<abbr title="recordable CDs">CD-Rs</abbr> and <abbr title="recordable DVDs">DVD±Rs</abbr> are susceptible to literal bit-rot.

Isn’t this wrong? No, because remember that the point of abbreviations is to expand one phrase into another. The user is assumed to already know what a CD is; it doesn’t have to be spelt out for them.

This follows neatly into the next point: when and where to expand abbreviations at all…

Ⅱ. The title Attribute Is Optional

Oh man, this is so important. The abbr element is misused because almost everybody is under the assumption that abbr elements must have a title attribute; in fact, it’d seem pointless otherwise!

Your users do not need to know the definition of every single acronym and abbreviated technical term. In fact, they don’t care. They don’t have to know what the V in DVD stands for if they know a DVD when they see one.

Only give a title to abbreviations that you expect people to read as the expanded form, in their mind or aloud.

An abbr element without a title attribute should be used on any abbreviation / acronym that is written in all-capitals (unless you are providing a spoken alternative, like the WYSIWYG example from earlier), to communicate that the abbreviation is either unpronounceable as a word, or that it is capitalised—not for emphasis—but because each letter has an individual meaning. E.g.

The <abbr>FBI</abbr> are like the British <abbr>MI5</abbr>.
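
For plain text that has not been marked up yet, a rule like this could be applied mechanically. A rough sketch, assuming that a capital letter followed by one or more capitals or digits counts as an all-capitals abbreviation; the function name `markAbbreviations` is made up for illustration, and it should not be run over text that already contains HTML tags or attributes:

```php
<?php
// wrap all-capitals abbreviations in a title-less <abbr> (hypothetical helper)
function markAbbreviations (string $text): string {
	// a capital letter followed by at least one more capital or digit, as a whole word;
	// single capitals such as "I" or "A" are deliberately left alone
	return preg_replace ('/\b[A-Z][A-Z0-9]+\b/', '<abbr>$0</abbr>', $text);
}

echo markAbbreviations ('The FBI are like the British MI5.'), "\n";
```

A pattern like this cannot tell an abbreviation from a word that is merely shouted in capitals, so the output still needs a human eye.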

Ⅲ. Citations Are Not Abbreviations

This one is very sneaky and can easily catch you out.

BAD:	The site will be built using <abbr title="Hypertext Pre-Processor">PHP</abbr>.

Firstly, this reads wrong; the abbreviation breaks the grammar. Secondly, remember that abbreviations are to communicate how things should be read, not to define terms.

But thirdly, it is not an abbreviation. It is not a section of the document that has been shortened or re-phrased by the author to fit their chosen grammar. It is not a personal rendering of words. The sentence is referring to a software product. This is a citation.

GOOD:	The site will be built using <cite>PHP</cite>.

Even though a cited name can be an abbreviation of something else, the name seals that abbreviation and turns that name into a real word of sorts (a brand). Names that are already made from an abbreviation can then even be abbreviated themselves, since they behave as normal words. For example, “Mac OS X” is already an abbreviation of “Macintosh Operating System version Ten”, and people then often abbreviate further, calling it “OS X” or referring to the version number / name “10.5 / Leopard”.

What Counts as a Citation?

A citation represents the title of a work, where you are referring to it in the context of your sentence, or in passing. A work is defined as an intellectual human creation.

A work can be a book, a poem, a published piece of writing, a piece of art, a website, a song, a film, a TV show, a game &c. and also software.

However, this does not include the following: people’s names, the name of a ship, or real products in general, such as a packet of crisps, a stereo or computer hardware.

There Are Exceptions

I won’t go into details, but there are exceptions here and there that mostly depend on context: whether you are referring to the cited work itself, or to the use of that work in a specific case - particularly with broadly used technologies like HTML, CSS and PHP.

I am referring directly to the <cite>PHP</cite> language/technology.
My website’s <abbr>PHP</abbr> is small.

That said, details like this will boil down to personal taste, and it’ll never really hurt to just stick to using one element or the other for all such instances, regardless.

Ⅳ. Abbreviations Should Be Meek

An abbreviation is merely anything that is read differently from how it is written, and vice versa. It does not need to be in-your-face, JavaScript-powered “intelli-text”.

What if the Reader Does Not Know What a Technical Abbreviation / Acronym Means?

Isn’t the point of an abbr element so that these technical terms can be defined by hovering the mouse over the term?

There are two valid answers to this:

  1. That’s what the <dfn> element is for, and…
  2. It is not your responsibility to be an encyclopædia.
    Being paranoid about your reader’s abilities is just going to make your life difficult

You, the author, only have to take the responsibility to know your audience and define those terms which you think they won’t know, or that you may be newly introducing to them.

If a user does not know a term, your website is not the only resource in existence where they can then find the definition! The user can easily google the term. In many browsers they can just right-click the word and choose to search the web for it. On a Mac, there’s a system-wide integrated dictionary you can access in a number of ways. There is no end to the ways a user can find out what a term means if they need to.

How to Style Your Abbreviations

The traditional way to style abbreviations is a grey dotted line, like so:

abbr	{border-bottom: 1px dotted #666;}

However, this was under the previous model of using abbr as some kind of inline dictionary. Abbreviations are for the benefit of screen readers, search engines and enthusiasts like me. Generally, abbreviations shouldn’t be styled at all.

That said, abbreviations still do provide a useful service by allowing readers to uncover how something should be read. We need a subtle approach that doesn’t fill the user’s screen with grey dotted lines, but at the same time does allow them to discover where you’ve provided reading “hints”.

The method I’m using is to only show the grey-dotted underline when the user’s mouse is within the paragraph containing the abbreviations, so that when the user moves their mouse into the surrounding text, the abbreviations (with titles) will be marked, and the user can hover over them to then see the tooltip.

*:hover>abbr[title]	{border-bottom: 1px dotted #666; cursor: help;}

Update: The above code only works with abbreviations directly within paragraphs; if the abbr element is wrapped in a link or any other kind of tag, the grey dotted line won’t appear until you hover directly over it. The new CSS below fixes this (where section is the element/ID containing your blog posts):

/* first, the immediate descendants of the content area are set to highlight abbreviations on hover, but avoiding lists; as I don’t want *all* abbreviations highlighted when you hover on a root list… */
section>*:not(ol):not(ul):not(dl):hover abbr[title],
/* …only when hovering on each list-item */
 p:hover abbr[title], li:hover abbr[title], dl>*:hover abbr[title] {
	border-bottom: 1px dotted #666; cursor: help;
}

I hope this article provides you with some practical guidance and enthusiasm.

If you spot any flaws in the HTML of my articles, please do contact me and let me know; I’ve got so many thousands of lines that I’m sure to have made mistakes everywhere. You’re also free to e-mail me if you’ve any questions about this article and using abbr, cite and HTML5 in general.

Special thanks go to Adam of firsttube.com for reviewing the article whilst it was being prepared and spotting a number of flaws.

Under the Hood #5:
New Website-Ish

Welcome to Camen Design v0.2-ish. I’ve replaced the publishing code in the site, leaving the HTML5 & CSS intact. They will be replaced in the next update. I plan to target Firefox 3.1 (and hopefully Safari 4 may be out by then too), allowing me to make use of CSS animation/transitions and border-image.

In fact, because my site has its PHP / HTML / CSS entirely separated, any one can be replaced without touching a line of the others.

On the subject of future-proof CSS, I noted:

A CSS file is such that you can throw it away easily and start again. I could design my website any way I wanted without ever changing the HTML.

Clean and separated HTML/CSS means that parts can be replaced. That’s what future-proof is about – the ability to adapt to changes.

Being bit-rot proof is an entirely different matter!

That’s a different topic for another day though. This article is about the new back-end:

Clean URLs

In the previous version of the site, a PHP script handled the database, spitting out the data as requested. v0.2 is now all static XHTML5 files, ensuring faster load speed and better caching. The publishing script generates pre-gzipped “.xhtml” files for each of the articles, as well as each of the index pages. The home page is “1.xhtml”, the second page “2.xhtml” and so on. Each content-type is a folder, containing another set of numbered files for the pages. e.g. “blog/1.xhtml” and so on.
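
The numbered-page layout can be sketched in a few lines of PHP. This is only an illustration of the naming scheme described above, not the actual publishing script: the function name `indexPages` is made up, and the number of entries per page is an assumption, since the article does not state it.

```php
<?php
// map a list of article names onto numbered index pages: "1.xhtml", "2.xhtml", …
// (hypothetical helper; $perPage is an assumed value)
function indexPages (array $articles, string $folder = '', int $perPage = 10): array {
	$pages = [];
	foreach (array_chunk ($articles, $perPage) as $i => $chunk) {
		// page numbers start at 1, so the first page of each folder is "1.xhtml"
		$pages[$folder . ($i + 1) . '.xhtml'] = $chunk;
	}
	return $pages;
}

// five made-up articles, two per page, in the "blog" content-type folder
$pages = indexPages (['a', 'b', 'c', 'd', 'e'], 'blog/', 2);
print_r (array_keys ($pages));
```

Calling the same function with an empty folder name gives the file names for the site-wide index pages at the root.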

Apache’s mod_rewrite is used to mask the “.xhtml” extension, so it isn’t required, giving nice-looking URLs with no query string as before. “/?blog&page=2” now becomes “/blog/2”.

The ‘.htaccess’ file I wrote now handles everything dynamic, applying the ‘application/xhtml+xml’ mime-type to the HTML, but falling back to ‘text/html’ for browsers that can’t deal with that.

I’ve opened up my ‘.htaccess’ file so you can view it fully, but a detailed break-down is covered below.

Serving Compressed XHTML5

A big problem with the old code was that a single PHP page was not being cached very well, relying on me manually setting all the HTTP-Headers for the various pages requested. In this new version each page is a separate file, and so Apache and your browser can handle things fine.

FileETag MTime Size
AddDefaultCharset utf-8

This declaration tells Apache which file attributes to build the ETag HTTP header from. The ETag identifies a particular version of the file, so that the browser knows when the file has actually changed. Apache sends ETags automatically anyway, but uses the default “MTime INode Size”, which ties the ETag to the file’s inode (where it is stored on the disk). If you were to upload the same file again, Apache would send a different ETag even though its contents had not changed.

# .xhtml files are gzipped html5 documents ready to serve
AddType application/xhtml+xml .xhtml
AddEncoding gzip .xhtml

This creates a new file-type “.xhtml”, and serves it as ‘application/xhtml+xml’ by default. Though it is possible to serve HTML5 as ‘text/html’ in Firefox 3 & Safari, the <legend> tag will not work correctly when used inside a <figure> element. This is due to the all-round broken-ness of the <legend> tag in all browsers (caused by pandering to IE’s even more broken implementation).

The publishing script uses the gzencode PHP function when saving the files to compress the contents for bandwidth savings and fast delivery. The “AddEncoding” declaration tells Apache about this, so that it adds the necessary “Content-Encoding: gzip” HTTP header automatically.
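
As a rough sketch of the saving step (not the actual publishing script; the page content and output path here are made up, and PHP’s zlib extension is assumed to be available), compressing once at publish time looks something like this:

```php
<?php
// hypothetical page content and output path
$html = '<!DOCTYPE html><html><head><title>Hello</title></head><body><p>Hello</p></body></html>';
$path = sys_get_temp_dir () . '/hello.xhtml';

// compress once, at publish time (9 = maximum compression), and save the bytes as they are
file_put_contents ($path, gzencode ($html, 9));

// the file on disk is a gzip stream: it starts with the magic bytes 1F 8B
$saved = file_get_contents ($path);
echo bin2hex (substr ($saved, 0, 2)), "\n";

// decoding gives back the original markup, which is what a browser does
// once it sees "Content-Encoding: gzip"
echo gzdecode ($saved) === $html ? "round-trip ok\n" : "mismatch\n";
```

Because the compression happens when publishing rather than on every request, the server only has to hand over the bytes that are already on disk.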

# load page 1 by default
DirectoryIndex 1.xhtml index.php index.html

The home page is just page 1 of however many pages of the full archive of the site. Therefore “1.xhtml” is set as the default page to go to in a folder so that “/art/” returns “/art/1.xhtml”.

# if the url contains the ".xhtml", show the source code
RewriteCond %{THE_REQUEST} \b([^\.]*[^/])\.xhtml\b
RewriteRule ^ - [T=text/plain,L]

Viewing the HTML source of the pages on this site is an integral part of its design, so I wanted to make it very easy to do so. Just click the “html” link at the top of any page. The code above checks if the URL typed into the browser had the “.xhtml” included and if so, keeps the URL as is, but serves it as “text/plain” instead, preventing the browser from rendering the HTML.

# leave the ".xhtml" off (clean urls)
RewriteRule ^([^\.]*[^/])$ /$1.xhtml [L]

This finds any URL that contains no dot and does not end in a slash, such as a page inside a content-type folder. It then rewrites the URL to append “.xhtml” as the actual file to return. This is so “/blog/hello” secretly returns the file “/blog/hello.xhtml”.
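
The pattern in this rule can be tried out against sample paths in PHP, whose preg functions use the same PCRE-style syntax. A small sketch, assuming the usual “.htaccess” behaviour where the path is matched without its leading slash; the helper name `cleanUrlTarget` and the sample paths are made up:

```php
<?php
// the same pattern as the RewriteRule, with "#" as the delimiter
$pattern = '#^([^\.]*[^/])$#';

// returns the file the rule would map a path to, or null if the rule does not apply
function cleanUrlTarget (string $path, string $pattern): ?string {
	return preg_match ($pattern, $path, $m) ? '/' . $m[1] . '.xhtml' : null;
}

var_dump (cleanUrlTarget ('blog/hello', $pattern));        // rewritten to "/blog/hello.xhtml"
var_dump (cleanUrlTarget ('blog/2', $pattern));            // rewritten to "/blog/2.xhtml"
var_dump (cleanUrlTarget ('blog/', $pattern));             // ends in a slash: left alone
var_dump (cleanUrlTarget ('design/design.css', $pattern)); // contains a dot: left alone
```

Folder URLs ending in a slash fall through to the DirectoryIndex setting instead, and anything with a file extension is served as the file it names.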

# although I don’t support IE, I do have to fall back to text/html,
# otherwise it will try and download the page instead of rendering it
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml
RewriteCond %{REQUEST_FILENAME} .*\.xhtml
RewriteRule ^ - [T=text/html,L]

As described, this will check the browser capabilities to see if it does not accept ‘application/xhtml+xml’ and revert to ‘text/html’. If this is not done, IE will try and download the file instead of showing it. In 2008.
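
The same decision could be made in PHP, for sites that are not served as static files. A minimal sketch that mirrors the RewriteCond above with a plain substring test on the Accept header (so it ignores q-values, just as the rule does); the function name `pageMimeType` is made up, and the sample headers are abbreviated for illustration:

```php
<?php
// pick the Content-Type for a page from the browser's Accept header (hypothetical helper)
function pageMimeType (string $accept): string {
	return strpos ($accept, 'application/xhtml+xml') !== false
		? 'application/xhtml+xml'
		: 'text/html';
}

// a browser that advertises XHTML support
echo pageMimeType ('text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'), "\n";
// a browser that does not mention it at all
echo pageMimeType ('image/gif, image/jpeg, */*'), "\n";
```

In a live script the header would come from `$_SERVER['HTTP_ACCEPT']` and the result would be sent with `header ('Content-Type: …')`.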

Compressed CSS

# "csz" compressed CSS filetype
AddType text/css .csz
AddEncoding gzip .csz

As with the “.xhtml” definition, this creates a “.csz” filetype of mime-type “text/css”, and default gzip (compressed) encoding. The publishing script takes the normal ‘design.css’ file and spins off a compressed copy ‘design.csz’.

# on my localhost, don’t use a cached CSS file
RewriteCond %{DOCUMENT_ROOT} "^/Users/kroc/Sites/Camen Design/upload"
RewriteRule ^design/$ /design/design.css [L]
RewriteRule ^design/$ /design/design.csz [L]

When I’m editing the website on my computer, I’m refreshing constantly to see new CSS changes. This code checks if the webroot is that of my Mac’s localhost and passes the standard ‘.css’ file and stops processing. The next line passes the compressed CSS file in the case the document root match was not made (live server).

Compressed RSS

The publishing script creates a compressed ‘rss.rsz’ file in each folder and on root.

AddType application/rss+xml .rsz
AddEncoding gzip .rsz

RewriteRule ^([a-z0-9-]+/)?rss$ /$1rss.rsz [L]

This rewrites URLs ending in “rss” to the compressed “rsz” file, e.g. “/tweet/rss” becomes “/tweet/rss.rsz”.

Static Publishing

When I mentioned my plans for v0.2, I noted one particular fallacy:

A simple text field is never going to replicate the editing power I have with TextMate. I’ve got no search and replace, no syntax highlighting, no keyboard shortcuts.

Trying to add these things is just reinventing the wheel, and thus breaks my very own design principle №3, Let Everybody Else Do Their Job.

Therefore, I removed the administration interface, in favour of a Laguna 2 (sadly offline now) style system.

The publish script is available to view online, but is not much use out of context. You can download a stub copy of this website with everything necessary to roll your own using the enclosure in your RSS reader, or the attachment at the bottom of this article.

Content on Disk

The source content of this website is just a folder, with a sub folder for each of the “content-types” (blog | tweet | photo &c.). In each folder is a file containing a JSON meta-data header and the raw HTML of the article. This layout directly maps to the new clean URLs too.

camen design v0.2’s data folder layout

Creating a new blog post is nothing more than creating a new file. Because content is now disk files, instead of database entries, I can use my text-editor’s global search and replace and HTML editing capabilities and I can use my O.S. to manage the files instead of having to implement more and more server-side administration pages to do the same thing.

Now I can blog the same way I create the stuff I blog about.

Inside a Content Entry

A content file looks like this inside (this one is from an earlier article):

{	"date"		:	200807101232,
	"updated"	:	200807101232,
	"title"		:	"Under The Hood #3: ¬Using A Quick &amp; Easy SQLite Database",
	"licence"	:	"cc-by",
	"tags"		:	["code-is-art", "web-dev"],
	"enclosure"	:	"sqlite.php"
}

HTML content goes here...

Pretty self-explanatory. When creating a new article, the “date” and “updated” fields are left out, and the publishing script then adds them in automatically. If I want to mark an article as updated, and thus push it to the top of the RSS regardless of its original publishing date, I just delete the “updated” line and the publishing script puts in a new timestamp.
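
Reading such a file back is straightforward. The following is a sketch under stated assumptions rather than the real publishing script: it assumes the header and the HTML are separated by the first blank line, that line endings are plain “\n”, and that the timestamps use PHP’s “YmdHi” date format (read off the sample value 200807101232). The sample file contents are made up.

```php
<?php
// hypothetical content file: JSON header, a blank line, then the raw HTML
$raw = <<<'TXT'
{  "title"    :  "Hello World",
   "licence"  :  "cc-by",
   "tags"     :  ["web-dev"]
}

<p>HTML content goes here...</p>
TXT;

// split at the first blank line: header on one side, article HTML on the other
list ($header, $html) = explode ("\n\n", $raw, 2);
$meta = json_decode ($header, true);

// fill in whichever timestamps are missing (the "YmdHi" format is an assumption,
// and a 64-bit PHP build is assumed so that the number fits in an integer)
$now = (int) date ('YmdHi');
if (!isset ($meta['date']))    $meta['date']    = $now;
if (!isset ($meta['updated'])) $meta['updated'] = $now;

echo $meta['title'], ' / ', $meta['updated'], "\n", $html, "\n";
```

Deleting only the “updated” line from an existing file leaves “date” untouched and gives “updated” a fresh value, which matches the behaviour described above.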


The attached zip file is updated every time I publish, so it always contains the most up to date code.