Removing Ligatures in HTML Files Generated from LaTeX Files

I recently had to convert a LaTeX document to HTML and, after looking into several alternatives, decided to go with htlatex. Because my document contains accented characters, I chose the UTF-8 encoding, as that seems to be the trend. To convert a LaTeX source file called file.tex, you can issue the command below, which creates two main files, file.css and file.html (warning: the space before -cunihtf is a must):

htlatex file.tex "xhtml,charset=utf-8" " -cunihtf -utf8"

Overall, I’m very happy with the results produced by htlatex. Nevertheless, when I loaded file.html on my iPhone, I noticed that mobile Safari does not render all ligatures properly. For example, it has no problem with the ‘fi’ ligature, but it displays a hollow square in place of the ‘ff’ and ‘ffi’ ligatures. I have not tested other mobile browsers, so I’m not sure whether this is only an issue with mobile Safari. Safari on my desktop computer does not exhibit this problem.

To be safe, I thought I’d be better off removing all ligatures from the HTML file, which led me to search around for their UTF-8 codes and to write a little command-shell script that uses Perl to perform the task. Since this might turn out to be useful to someone else out there, I decided to post my shell script here. Use it at your own risk and enjoy!

perl -pi -e 's/\xef\xac\x80/ff/g' file.html
perl -pi -e 's/\xef\xac\x81/fi/g' file.html
perl -pi -e 's/\xef\xac\x82/fl/g' file.html
perl -pi -e 's/\xef\xac\x83/ffi/g' file.html
perl -pi -e 's/\xef\xac\x84/ffl/g' file.html
perl -pi -e 's/\xc5\x92/OE/g' file.html
perl -pi -e 's/\xc5\x93/oe/g' file.html
perl -pi -e 's/\xc3\x86/AE/g' file.html
perl -pi -e 's/\xc3\xa6/ae/g' file.html
perl -pi -e 's/\xef\xac\x86/st/g' file.html
perl -pi -e 's/\xc4\xb2/IJ/g' file.html
perl -pi -e 's/\xc4\xb3/ij/g' file.html

By the way, I’m only concerned with Latin ligatures, but you can find UTF-8 codes for other ligatures on this page. Bonus: here’s another useful article related to this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
