Parent
Determine encoding of text
I have some text files that claim to be encoded in UTF-8:
file test.txt
test.txt: Unicode text, UTF-8 text, with CRLF line terminators
(https://github.com/samcarter/shared/blob/main/test.txt)
However, if I look at their content, I think they might really be in some other encoding:
ÒHi there. IÕm a test documentÓ
ÒTouchŽ.Ó
From context, this should read as
“Hi there. I'm a test document”
“Touché.”
How can I determine the original encoding of the text, so that I can re-encode the file with iconv and hopefully get readable text?
Post
If your goal is to fix your files the way David Cary's iconv-ing does, but you can't tell which mis-encodings your text went through, you can use a little Python and the ftfy library[1] from PyPI to undo the mess.
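For the sample in the question, where one garbled line and its intended reading are both known, the guessing can also be sketched by hand with only the standard library. The candidate codec list and the helper name `find_codec_pairs` below are assumptions made for illustration; this is a brute-force search, not how ftfy works internally.

```python
from itertools import permutations

# Candidate single-byte codecs to try. This list is an assumption;
# extend it with whatever encodings seem plausible for your files.
CANDIDATES = ["cp1252", "latin-1", "mac_roman", "cp437", "iso8859_15"]


def find_codec_pairs(garbled, expected, codecs=CANDIDATES):
    """Return (wrong, right) pairs such that
    garbled.encode(wrong).decode(right) == expected."""
    hits = []
    for wrong, right in permutations(codecs, 2):
        try:
            candidate = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot represent the text at all
        if candidate == expected:
            hits.append((wrong, right))
    return hits


# Garbled line and intended reading, both taken from the question.
print(find_codec_pairs("ÒTouchŽ.Ó", "“Touché.”"))
# [('cp1252', 'mac_roman')]
```

That result suggests the text was originally Mac OS Roman, was misread as Windows-1252, and was then saved as UTF-8. The matching iconv pipeline would presumably look something like `iconv -f UTF-8 -t WINDOWS-1252 test.txt | iconv -f MACINTOSH -t UTF-8`, though the exact encoding names vary between iconv builds.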
Some quick examples
Here are some examples (found in the real world) of what ftfy can do:
ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'
Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
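The claim that the garbled string still carries the original information can be checked with the standard library alone. The sketch below simulates one specific mix-up (UTF-8 bytes read as Windows-1252, chosen here as an assumption) and then reverses it; it does not use ftfy.

```python
original = "✔ No problems"

# Simulate the mix-up: the UTF-8 bytes are read as if they were Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # âœ” No problems

# Reverse it: turn the wrong characters back into the bytes they came from,
# then decode those bytes the way they were meant to be decoded.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == original)  # True
```

The round trip works because each of the three UTF-8 bytes of the check mark maps to exactly one Windows-1252 character. It can fail when a byte has no assignment in the wrong codec, since Windows-1252 leaves a few byte values undefined.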
ftfy can fix multiple layers of mojibake simultaneously:
>>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."
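To illustrate what "multiple layers" means, here is a minimal standard-library sketch that keeps undoing the same mix-up until the text stops looking like misread UTF-8. The helper name `peel_layers` and the assumption that every layer used Windows-1252 are mine; ftfy detects the encodings itself and is far more careful.

```python
def peel_layers(text, wrong="cp1252", max_layers=5):
    """Undo repeated 'UTF-8 read as `wrong`' mix-ups.

    Returns (text, number_of_layers_removed).
    """
    layers = 0
    for _ in range(max_layers):
        try:
            candidate = text.encode(wrong).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text no longer looks like misread UTF-8
        if candidate == text:
            break  # nothing changed (e.g. pure ASCII), so stop
        text = candidate
        layers += 1
    return text, layers


# Build a three-layer example by applying the mix-up three times.
clean = "The Mona Lisa doesn’t have eyebrows."
garbled = clean
for _ in range(3):
    garbled = garbled.encode("utf-8").decode("cp1252")

print(peel_layers(garbled))
# ('The Mona Lisa doesn’t have eyebrows.', 3)
```

Unlike the ftfy output shown above, this sketch leaves the curly apostrophe in place; it only reverses the encoding layers.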
I learned about ftfy several years after I wrote some (much less rigorous) tools to detect and unscramble content that had made its way through one or more different encodings.
[1] "Fixed that for you"