
Welcome to the Power Users community on Codidact!

Power Users is a Q&A site for questions about the usage of computer software and hardware. We are still a small site and would like to grow, so please consider joining our community. We are looking forward to your questions and answers; they are the building blocks of a repository of knowledge we are building together.


Determine encoding of text

+5
−0

I have some text files that claim to be encoded in UTF-8:

file test.txt
test.txt: Unicode text, UTF-8 text, with CRLF line terminators

(https://github.com/samcarter/shared/blob/main/test.txt)

However, if I look at their content, I think they might in reality have some other encoding:

ÒHi there. IÕm a test documentÓ

ÒTouchŽ.Ó 

From context, this should read as

“Hi there. I’m a test document”

“Touché.”

How can I determine the original encoding of the text so that I can re-encode the file with iconv to hopefully get a readable text?


Post
+3
−0

quick fix

I agree with Canina that you need to do two translations to fix this problem. Fortunately, it appears that you can recover the original text without loss.

Try this:

# first convert from UTF-8 to WINDOWS-1252
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
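If Python is available, the same two-step repair can be done in one pass, without the temporary file. This is a sketch using a sample line from the question rather than the file itself; `windows-1252` and `mac_roman` are Python's names for the encodings iconv calls WINDOWS-1252 and MACINTOSH:

```python
# A garbled line from the question, as it appears after the
# file was wrongly decoded as Windows-1252 and saved as UTF-8
garbled = "ÒHi there. IÕm a test documentÓ"

# Encoding to Windows-1252 undoes the wrong decoding and recovers
# the original bytes; decoding those bytes as Mac OS Roman then
# yields the intended text
fixed = garbled.encode("windows-1252").decode("mac_roman")
print(fixed)  # “Hi there. I’m a test document”
```

The key point is that the first step recovers the original byte sequence, and only the second step assigns it the correct meaning.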

details

I've had the same thing happen to curly quotes when reading text files I created on my old Macintosh that had been misinterpreted as ISO-8859-1 or ISO-8859-15 text. Other source encodings would fix the curly quotes just as well, since several different character encodings happen to put the curly quotes at the same byte values, for example

# first convert from UTF-8 to ISO-8859-15
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt

which was the solution for my text, but would mess up other letters in your particular text.

I used https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing) to figure out which two character sets use the same byte value to represent "Z with caron" in one set and "e with acute" in the other, and so on.
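Instead of reading the tables by hand, you can let a script try the candidates. This sketch (the candidate list and sample string are my own, not from the question) re-encodes the garbled text with each candidate encoding and re-reads the bytes as Mac OS Roman; the candidate that yields readable text is the one to pass to iconv:

```python
# Garbled sample from the question; should read "Touché"
sample = "TouchŽ"

# Candidate single-byte encodings the text may have been
# misinterpreted as (illustrative, not exhaustive)
for enc in ["windows-1252", "iso-8859-15", "iso-8859-1", "cp850"]:
    try:
        # Recover the original bytes, then re-read them as Mac OS Roman
        print(f"{enc:>14}: {sample.encode(enc).decode('mac_roman')}")
    except UnicodeEncodeError:
        # This candidate cannot represent every character in the sample,
        # so it cannot be the encoding the text was misread as
        print(f"{enc:>14}: cannot represent all characters")
```

Here only windows-1252 produces "Touché"; iso-8859-15 maps the same character to a different byte and yields a wrong letter instead.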

Fortunately, I saw there that WINDOWS-1252 also lines up with the other letters in your text: it translates C5 BD (the UTF-8 encoding of U+017D, "Z with caron") to the single byte 8E, and that byte, re-interpreted as Mac OS Roman, represents "e with acute" (U+00E9 in Unicode).
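Those byte values can be verified directly, e.g. in Python (codec names are Python's; the byte values are the ones given above):

```python
# U+017D "Z with caron" is the two-byte sequence C5 BD in UTF-8
assert "Ž".encode("utf-8") == b"\xc5\xbd"

# Windows-1252 stores that same character as the single byte 8E
assert "Ž".encode("windows-1252") == b"\x8e"

# Byte 8E read as Mac OS Roman is "e with acute" (U+00E9)
assert b"\x8e".decode("mac_roman") == "é"

print("byte mapping checks out")
```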

(I feel that named HTML character entity references are often a better way to represent characters than ambiguous raw binary codes, and would have prevented such problems.)


1 comment thread

samcarter wrote 8 months ago:

Thank you for your answer! It worked perfectly!