Power Users

If you want to repair your files the way David Cary's iconv approach does, but you can't tell which chain of mis-encodings produced your text, a little Python and the ftfy library^[1] (available on PyPI) can undo the mess.

Some quick examples

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'
Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
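
To see why the information survives, here is a minimal sketch using only the standard library, with no ftfy involved. It assumes the mix-up was specifically UTF-8 bytes read as Windows-1252 (`cp1252`), which is my guess at what produced the check mark example above; ftfy itself handles many more cases and does not need to be told the encoding.

```python
# Simulate the mix-up: UTF-8 bytes wrongly decoded as Windows-1252.
# (The cp1252 choice is an assumption made for this illustration.)
original = "✔ No problems"
garbled = original.encode("utf-8").decode("cp1252")

# Reverse it: turn the garbled characters back into the bytes they
# came from, then decode those bytes as the UTF-8 they always were.
recovered = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(recovered)
```

The round trip is lossless here because every byte in the UTF-8 sequence maps to a distinct character in Windows-1252, so nothing is thrown away by the wrong decode.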

ftfy can fix multiple layers of mojibake simultaneously:
>>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."

I learned about ftfy several years after I wrote some (much less rigorous) tools to detect and unscramble content that had made its way through one or more different encodings.
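
For a sense of what a much less rigorous tool can look like, here is a naive sketch (not the tools mentioned above, and not how ftfy works internally). The function name `peel_mojibake` and its parameters are hypothetical. It assumes every layer of damage was UTF-8 read as Windows-1252 and keeps undoing layers until the text stops decoding cleanly.

```python
def peel_mojibake(text, wrong_encoding="cp1252", max_layers=10):
    """Undo repeated 'UTF-8 read as wrong_encoding' damage, one layer at a time.

    Hypothetical helper for illustration; it only knows one kind of mix-up.
    """
    for _ in range(max_layers):
        try:
            candidate = text.encode(wrong_encoding).decode("utf-8")
        except UnicodeError:
            # The text is no longer valid mojibake of this kind, so stop.
            break
        if candidate == text:
            # Pure ASCII survives the round trip unchanged; nothing left to fix.
            break
        text = candidate
    return text


# Build a two-layer example from a sentence with a curly apostrophe (U+2019).
clean = "The Mona Lisa doesn\u2019t have eyebrows."
garbled = clean
for _ in range(2):
    garbled = garbled.encode("utf-8").decode("cp1252")

print(garbled)
print(peel_mojibake(garbled))
```

Unlike the ftfy output shown earlier, this sketch leaves the apostrophe curly, and it can be fooled by text that legitimately contains sequences such as "Ã©". Avoiding that kind of false positive is a good reason to prefer ftfy.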

[1] "Fixed that for you"

posted over 1 year ago

CC BY-SA 4.0

Michael‭
