

Comments on Determine encoding of text


Determine encoding of text

+5
−0

I have some text files which think they are encoded in UTF-8:

file test.txt
test.txt: Unicode text, UTF-8 text, with CRLF line terminators

(https://github.com/samcarter/shared/blob/main/test.txt)

However, if I look at their content, I think they might in reality have some other encoding:

ÒHi there. IÕm a test documentÓ

ÒTouchŽ.Ó 

From context, this should read as

“Hi there. I'm a test document”

“Touché.”

How can I determine the original encoding of the text so that I can re-encode the file with iconv to hopefully get a readable text?
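One hypothesis consistent with the visible characters: the file was originally Mac OS Roman, but was decoded as Windows-1252 somewhere along the way (Ò, Ó and Õ are the Windows-1252 appearances of the Mac OS Roman curly quotes, and Ž is byte 0x8E, which is é in Mac OS Roman). A minimal Python round trip to test that guess:

```python
# Guess: the original bytes were Mac OS Roman, but something decoded them
# as Windows-1252 and re-saved the result as UTF-8. Undo that by mapping
# the visible characters back to their Windows-1252 bytes and decoding
# those bytes as Mac OS Roman.
garbled = "ÒTouchŽ.Ó"
recovered = garbled.encode("cp1252").decode("mac_roman")
print(recovered)  # → “Touché.”
```

If a round trip like this produces readable text, the iconv equivalent would be along the lines of `iconv -f UTF-8 -t CP1252 test.txt | iconv -f MACINTOSH -t UTF-8` (encoding names vary between iconv implementations).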


0 comment threads

Answer
+3
−0

If your goal is to fix your files, as David Cary's iconv approach does, but you can't tell which mis-encodings were applied to produce your text, you can use a little Python and the ftfy library[1] from PyPI to undo the mess.

Some quick examples

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups) by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'

Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
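The mechanism can be sketched with the standard library alone: single-layer mojibake is just UTF-8 bytes that were decoded with the wrong codec, so re-encoding with that codec and decoding as UTF-8 inverts it (ftfy's contribution is guessing which wrong codec was involved):

```python
# 'é' is UTF-8 bytes C3 A9; misread as Windows-1252 it displays as 'Ã©'.
# Re-encode with the wrong codec, then decode correctly:
print("Ã©".encode("cp1252").decode("utf-8"))    # → é
print("â€œ".encode("cp1252").decode("utf-8"))   # → “ (a curly quote)
```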

ftfy can fix multiple layers of mojibake simultaneously:

>>> ftfy.fix_text('The Mona Lisa doesnÃ¢â‚¬â„¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."

I learned about ftfy several years after I wrote some (much less rigorous) tools to detect and unscramble content that had made its way through one or more different encodings.
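In the same do-it-yourself spirit (a sketch, not the tooling mentioned above), a brute-force detector can simply try (wrong-codec, original-codec) pairs and keep the ones that survive the round trip, leaving you to eyeball the candidates:

```python
def unscramble_candidates(text, codecs=("cp1252", "latin-1", "mac_roman", "utf-8")):
    """Try reversing each (decoded-as, originally-was) codec pair."""
    results = {}
    for wrong in codecs:
        for original in codecs:
            if wrong == original:
                continue
            try:
                results[(wrong, original)] = text.encode(wrong).decode(original)
            except (UnicodeEncodeError, UnicodeDecodeError):
                pass  # this pair can't reproduce the visible characters
    return results

for pair, candidate in unscramble_candidates("ÒTouchŽ.Ó").items():
    print(pair, candidate)  # the ('cp1252', 'mac_roman') pair yields “Touché.”
```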


  1. "Fixed that for you" ↩︎


1 comment thread

samcarter wrote about 1 year ago:
Thanks for your answer!