Octopress Invalid Byte Sequence in UTF-8 Error After Rake Generate
I didn’t spend 3 hours last night figuring this out…
I finished writing a summary of HTML and CSS, ran rake generate and got the following error:
Error invalid byte sequence in UTF-8 (ArgumentError)
## Generating Site with Jekyll
unchanged sass/screen.scss
Configuration from /Users/tomordonez/rails_projects/octopress/_config.yml
Building site: source -> public
/Users/tomordonez/rails_projects/octopress/plugins/raw.rb:11:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/tomordonez/rails_projects/octopress/plugins/raw.rb:11:in `unwrap'
from /Users/tomordonez/rails_projects/octopress/plugins/octopress_filters.rb:18:in `post_filter'
from /Users/tomordonez/rails_projects/octopress/plugins/octopress_filters.rb:33:in `post_render'
from /Users/tomordonez/rails_projects/octopress/plugins/post_filters.rb:124:in `block in post_render'
from /Users/tomordonez/rails_projects/octopress/plugins/post_filters.rb:123:in `each'
from /Users/tomordonez/rails_projects/octopress/plugins/post_filters.rb:123:in `post_render'
from /Users/tomordonez/rails_projects/octopress/plugins/post_filters.rb:151:in `transform'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/gems/jekyll-0.11.2/lib/jekyll/convertible.rb:84:in `do_layout'
from /Users/tomordonez/rails_projects/octopress/plugins/post_filters.rb:167:in `do_layout'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/gems/jekyll-0.11.2/lib/jekyll/post.rb:189:in `render'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/gems/jekyll-0.11.2/lib/jekyll/site.rb:193:in `block in render'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/gems/jekyll-0.11.2/lib/jekyll/site.rb:192:in `each'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/gems/jekyll-0.11.2/lib/jekyll/site.rb:192:in `render'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/gems/jekyll-0.11.2/lib/jekyll/site.rb:40:in `process'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/gems/jekyll-0.11.2/bin/jekyll:250:in `<top (required)>'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/bin/jekyll:19:in `load'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/bin/jekyll:19:in `<main>'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/bin/ruby_noexec_wrapper:14:in `eval'
from /Users/tomordonez/.rvm/gems/ruby-1.9.2-p320/bin/ruby_noexec_wrapper:14:in `<main>'
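The top of the trace points at a gsub call in plugins/raw.rb. As a minimal sketch of why a call like that can fail (this is not the Octopress plugin code; the sample string and the helper name gsub_error_message are made up for illustration), Ruby refuses to run a regular expression over a string whose bytes are not valid UTF-8:

```ruby
# Hypothetical helper: returns the error message if gsub raises, nil otherwise.
def gsub_error_message(text)
  text.gsub(/x/, "y")
  nil
rescue ArgumentError => e
  e.message
end

# 0x93 on its own is a UTF-8 continuation byte with no lead byte before it,
# so this string is tagged as UTF-8 but is not valid UTF-8.
broken = "Dara\x93_Briain.jpg".dup.force_encoding("UTF-8")

puts broken.valid_encoding?       # false
puts gsub_error_message(broken)   # the ArgumentError message, if one was raised
```

So the files named in the trace are probably fine; the bad bytes are more likely in the text being passed through them.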
Troubleshooting
I had already posted about a YAML error with Octopress, so I knew this one had nothing to do with the titles or the categories of the postings I recently wrote.
I opened each of the files mentioned in the error above but did not make any changes. The blog was working before and I have never touched those files.
The problem had to be with the 3 new postings I wrote.
I changed the file titles to make sure they didn’t have any characters other than letters and numbers.
I also changed the titles inside the posting.
None of this worked.
I also found this posting called Octopress UTF-8 Issues.
The author says he encountered this error when importing from WordPress. But I am not importing anything.
He gives some tips:
- Octopress’ default markdown parser, rdiscount, REALLY doesn’t like UTF-8 characters in URLs.
- It was converting DaraÓBriain.jpg to Dara?%93_Briain.jpg, where the ? is an invalid UTF-8 character. (Should be Dara%C3%93_Briain.jpg)
The problem might be with a link on one of the postings…
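To see where the %C3%93 in that tip comes from, here is a small sketch. The helper percent_encode_non_ascii is a hypothetical name of mine, not part of Octopress or rdiscount, and I am assuming the file name contained an underscore, as the encoded form suggests. It escapes every non-ASCII byte of the UTF-8 encoding:

```ruby
# Hypothetical helper: percent-encodes only the non-ASCII bytes of a string.
def percent_encode_non_ascii(text)
  text.encode("UTF-8").bytes.map { |byte|
    byte < 0x80 ? byte.chr : "%%%02X" % byte
  }.join
end

# "\u00D3" is the letter O with an acute accent (U+00D3); the escape keeps
# this example independent of the encoding of the source file.
puts percent_encode_non_ascii("Dara\u00D3_Briain.jpg")
# prints Dara%C3%93_Briain.jpg
```

The character takes two bytes in UTF-8, C3 and 93. The broken version in the tip kept the 93 but lost the C3, and a 93 with no lead byte in front of it is not valid UTF-8.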
What is UTF-8
I had to look up what UTF-8 was about. I have seen it before but I didn’t care to understand what it was.
http://en.wikipedia.org/wiki/UTF-8
This article explains it in a slightly less confusing way.
UTF-8 is a popular encoding form of the Unicode/ISO-10646 standard.
ISO came up with a standard called ISO 10646 which defines a huge 31-bit Universal Character Set (UCS).
31-bit UCS can contain about two billion characters…Version 3.2 of the Unicode standard, for instance, provided codes for 95221 characters
Unicode and ISO 10646 are first of all code tables that assign integer numbers to characters. Hexadecimal numbers for those integer values are commonly preceded by “U+”. For instance, U+0041 is the character “Latin capital letter A”.
Unicode standard defines three encoding forms that allow the same character data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16, or 32-bits per code unit). These three encoding forms are called UTF-8, UTF-16 and UTF-32 respectively.
UTF-8 transforms all Unicode characters into a variable length byte sequence, it has the following properties:
- Characters U+0000 to U+007F (ASCII) are encoded as a single byte 0x00 to 0x7F, this means UTF-8 is fully compatible with ASCII.
- All characters greater than U+007F are encoded as a sequence of several bytes, all of which are above 0x7F (namely no ASCII byte)
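These two properties are easy to check in Ruby. This is only a sketch; utf8_bytes is a helper name I made up, and the sample characters are my own choice:

```ruby
# Hypothetical helper: the UTF-8 bytes of a string as an array of integers.
def utf8_bytes(char)
  char.encode("UTF-8").bytes.to_a
end

# "A" is ASCII, "\u00D3" is O with an acute accent, "\u20AC" is the euro sign.
["A", "\u00D3", "\u20AC"].each do |char|
  bytes = utf8_bytes(char)
  hex = bytes.map { |byte| "%02X" % byte }.join(" ")
  puts "U+%04X -> %d byte(s): %s" % [char.ord, bytes.length, hex]
end
# prints:
# U+0041 -> 1 byte(s): 41
# U+00D3 -> 2 byte(s): C3 93
# U+20AC -> 3 byte(s): E2 82 AC
```

The ASCII letter stays a single byte below 0x80, and every byte of the other two characters is above 0x7F.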
Examples of characters and hexadecimal UTF-8
This topic on UTF-8 can get pretty dense. It might be incorrect to explain like this but this is how I understand it.
UTF-8 converts your characters into bytes (1s and 0s, like in The Matrix), which is what computers really understand.
The character “$”:
Source: http://en.wikipedia.org/wiki/UTF-8
- U+0024
- Binary code point: 0100100
- Binary UTF-8: 00100100
- Hexadecimal UTF-8: 24
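The same values can be reproduced in Ruby. A sketch only: describe_char is a hypothetical helper of mine, and it only handles characters that fit in a single UTF-8 byte, like "$":

```ruby
# Hypothetical helper for single-byte (ASCII) characters only.
def describe_char(char)
  code_point = char.ord
  byte = char.encode("UTF-8").bytes.first
  {
    :code_point        => "U+%04X" % code_point,
    :binary_code_point => code_point.to_s(2).rjust(7, "0"),  # 7 bits are enough for ASCII
    :binary_utf8       => byte.to_s(2).rjust(8, "0"),        # the byte as stored
    :hex_utf8          => "%02X" % byte
  }
end

describe_char("$").each { |name, value| puts "#{name}: #{value}" }
# prints:
# code_point: U+0024
# binary_code_point: 0100100
# binary_utf8: 00100100
# hex_utf8: 24
```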
Solution
In the Octopress documentation I found that you can add published: false to the YAML header of a posting to keep it as a draft without publishing it.
But I also found out that you can use this to narrow down which posting is creating the issue.
I had 3 new postings. I added published: false to all of them, ran rake generate, and the error went away.
I then narrowed it down: 2 of the 3 postings triggered the error, so I left published: false on those.
Finding the URL with the issue
As pointed out in this article, the issue appeared to come from the markdown parser in Octopress not liking UTF-8 characters in URLs.
There was one URL on each of the articles pointing to the source of where I got the information from:
http://learn.shayhowe.com/html-css/
There were no weird characters in that link. The only thing I found was that it didn’t have the slash “/” at the end of the URL. But other links in other postings didn’t have that either.
What I did was simply delete the URL and paste it again. This solved the issue.
Lesson learned
If you find the error explained in this posting:
- Add published: false to the YAML header of your postings and run rake generate to narrow down which posting has the issue.
- Go to that posting and check the URLs.
- Delete the URL and paste it again.
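A possible shortcut, which I did not use at the time, is to let Ruby look for the bad bytes directly. This is a sketch under assumptions: invalid_utf8_lines is a helper name I made up, and the source/_posts/*.markdown pattern assumes the default Octopress layout. It also only helps if the offending bytes are in the post file itself.

```ruby
# Hypothetical helper: returns [path, line_number] pairs for every line
# that is not valid UTF-8.
def invalid_utf8_lines(paths)
  hits = []
  paths.each do |path|
    # Read raw bytes so Ruby does not try to interpret them while reading.
    File.open(path, "rb") do |file|
      file.each_line.with_index(1) do |line, number|
        hits << [path, number] unless line.force_encoding("UTF-8").valid_encoding?
      end
    end
  end
  hits
end

# Assumed location of the posts; adjust the pattern for your own blog.
invalid_utf8_lines(Dir.glob("source/_posts/*.markdown")).each do |path, number|
  puts "#{path}:#{number}"
end
```

Run it from the root of the Octopress folder. If it prints nothing, every line in the posts is valid UTF-8 and the published: false approach is still the way to narrow things down.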