Welcome to the Power Users community on Codidact!
Determine encoding of text
I have some text files that claim to be encoded in UTF-8:
file test.txt
test.txt: Unicode text, UTF-8 text, with CRLF line terminators
(https://github.com/samcarter/shared/blob/main/test.txt )
However, if I look at their content, I think they might in reality have some other encoding:
ÒHi there. IÕm a test documentÓ
ÒTouchŽ.Ó
From context, this should read as
“Hi there. I'm a test document”
“Touché.”
How can I determine the original encoding of the text, so that I can re-encode the file with iconv and hopefully get readable text?
The following users marked this post as Works for me:
User | Comment | Date |
---|---|---|
samcarter | (no comment) | Sep 6, 2023 at 09:47 |
quick fix
I agree with Canina that you need to do two translations to fix this problem. Fortunately, it appears that you can recover the original text without loss.
Try this:
# first convert from UTF-8 to WINDOWS-1252
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# next re-interpret those bytes as Mac OS Roman
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
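If you want to try the two steps without touching your real files, here is a minimal sketch that runs them on one garbled line from the question. The function name fix_mojibake is just something I made up for illustration, and the sample bytes are typed in by hand with printf so the snippet does not depend on your terminal's encoding. It assumes your iconv knows the names WINDOWS-1252 and MACINTOSH.

```shell
# Hypothetical helper: undo "Mac OS Roman bytes that were read as WINDOWS-1252".
fix_mojibake() {
    iconv -f UTF-8 -t WINDOWS-1252 | iconv -f MACINTOSH -t UTF-8
}

# UTF-8 bytes of the garbled sample line: ÒTouchŽ.Ó
garbled=$(printf '\303\222Touch\305\275.\303\223')

# Run both conversions as one pipeline, with no intermediate file.
fixed=$(printf '%s' "$garbled" | fix_mojibake)
printf '%s\n' "$fixed"    # prints “Touché.”
```

The same pipeline should work on the whole file, for example fix_mojibake < test.txt > output.txt.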
details
I've had the same thing happen to curly quotes in text files I created on my old Macintosh, when they were misinterpreted as ISO-8859-1 or ISO-8859-15 text. For the curly quotes alone, other intermediate encodings would work just as well, because ISO-8859-1, ISO-8859-15, and WINDOWS-1252 all put Ò, Ó, and Õ at the same byte values (D2, D3, D5) that Mac OS Roman uses for the curly quotes. For example:
# first convert from UTF-8 to ISO-8859-15
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# next re-interpret those bytes as Mac OS Roman
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
That was the solution for my text, but it would mess up other letters in your particular text: ISO-8859-15 encodes Ž as byte B4, which Mac OS Roman reads as ¥ rather than é.
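To see the difference on your sample, here is a small sketch that pushes the garbled "Touché" line through the ISO-8859-15 variant. The variable names are mine, and the input bytes are written out with printf rather than read from your file.

```shell
# UTF-8 bytes of the garbled sample line: ÒTouchŽ.Ó
garbled=$(printf '\303\222Touch\305\275.\303\223')

# Same two steps as above, but with ISO-8859-15 as the intermediate encoding.
via_8859_15=$(printf '%s' "$garbled" \
    | iconv -f UTF-8 -t ISO-8859-15 \
    | iconv -f MACINTOSH -t UTF-8)

printf '%s\n' "$via_8859_15"    # prints “Touch¥.”
```

The quotes come out right, but the accented letter does not, which is why the intermediate encoding matters for this file.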
I used https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing) to figure out which two character sets use the same byte value for "Z with caron" in one set and "e with acute" in the other, and so on for the other garbled characters.
Fortunately, I saw there that WINDOWS-1252 lines up with the other letters in your text: it encodes Ž (U+017D "Z with caron", C5 BD in UTF-8) as the single byte 8E, and the byte 8E re-interpreted as Mac OS Roman is é (U+00E9 "e with acute").
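You don't have to take the table's word for it. As a rough check, assuming od and tr are available next to iconv, you can ask iconv which byte a character becomes in a given encoding and what that byte means in another. The variable names here are only for illustration.

```shell
# UTF-8 bytes of Ž (U+017D) are C5 BD.
z_caron=$(printf '\305\275')

# Which single byte does WINDOWS-1252 use for Ž? Show it as hex.
byte_hex=$(printf '%s' "$z_caron" \
    | iconv -f UTF-8 -t WINDOWS-1252 \
    | od -An -tx1 | tr -d ' \n')
echo "$byte_hex"    # prints 8e

# What does byte 8E (octal 216) mean when read as Mac OS Roman?
reread=$(printf '\216' | iconv -f MACINTOSH -t UTF-8)
echo "$reread"      # prints é
```

Repeating this for each garbled character and each candidate encoding is a slower but more mechanical alternative to reading the comparison table.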
(I feel that named HTML character entity references are often a better way to represent characters than ambiguous raw byte values, and would have prevented such problems.)