Power Users


The following users marked this post as Works for me:

User	Comment	Date
samcarter‭	(no comment)	Sep 6, 2023 at 09:47

Quick fix

I agree with Canina that you need to do two translations to fix this problem. Fortunately, it appears that you can recover the original text without loss.

Try this:

# first convert from UTF-8 to WINDOWS-1252
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# next, reinterpret those bytes as Mac OS Roman
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
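
For anyone without iconv at hand, the same two-step round trip can be sketched in Python. This is only an illustration, not part of the original fix: the function name `undo_mojibake` and its parameter names are made up here, and `cp1252` and `mac_roman` are Python's codec names for WINDOWS-1252 and Mac OS Roman.

```python
def undo_mojibake(text, wrong="cp1252", right="mac_roman"):
    """Encode text with the charset it was wrongly read as,
    then decode those bytes with the charset it was written in."""
    raw = text.encode(wrong)   # roughly: iconv -f UTF-8 -t WINDOWS-1252
    return raw.decode(right)   # roughly: iconv -f MACINTOSH -t UTF-8


# "Z with caron" (U+017D) is byte 8E in WINDOWS-1252,
# and byte 8E is "e with acute" (U+00E9) in Mac OS Roman.
fixed = undo_mojibake("caf\u017d")
print(fixed == "caf\u00e9")  # True
```

If the text contains a character that WINDOWS-1252 cannot represent, `encode` raises `UnicodeEncodeError`, which is a useful hint that the guessed "wrong" encoding is not the right guess.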

Details

I've had the same thing happen to curly quotes in text files I created on my old Macintosh, when they were misinterpreted as ISO-8859-1 or ISO-8859-15 text. Other conversions would fix the curly quotes just as well, because ISO-8859-1, ISO-8859-15 and WINDOWS-1252 all assign the same characters to the bytes that Mac OS Roman uses for curly quotes (D2 to D5). One such option is

# first convert from UTF-8 to ISO-8859-15
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# next, reinterpret those bytes as Mac OS Roman
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt

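which was the solution for my text, but it would mess up other letters in your particular text: ISO-8859-15 puts "Z with caron" at byte B4, which Mac OS Roman reads as the yen sign rather than "e with acute".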
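
The difference between the two routes can be checked with a small Python sketch. The helper name `reinterpret` and the sample strings are made up for illustration; `iso8859_15`, `cp1252` and `mac_roman` are Python's names for the codecs involved.

```python
def reinterpret(text, wrong, right="mac_roman"):
    """Encode with the misapplied charset, decode with the intended one."""
    return text.encode(wrong).decode(right)


garbled_quotes = "\u00d2hi\u00d3"  # O-grave, "hi", O-acute: misread curly quotes
garbled_letter = "\u017d"          # Z with caron: a misread "e with acute"

for wrong in ("cp1252", "iso8859_15"):
    quotes = reinterpret(garbled_quotes, wrong)
    letter = reinterpret(garbled_letter, wrong)
    # ascii() keeps the output readable on terminals with any encoding
    print(wrong, ascii(quotes), ascii(letter))
```

Both routes turn the quote stand-ins back into U+201C and U+201D, but only the `cp1252` route turns U+017D into U+00E9; the `iso8859_15` route yields U+00A5 (the yen sign) instead.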

I used Wikipedia's list of Latin character sets to figure out which two character sets use the same byte value for "Z with caron" in one set and "e with acute" in the other, and so on for the other garbled letters.

Fortunately, I saw there that WINDOWS-1252 lines up with the other letters in that text as well: it converts C5 BD (the UTF-8 encoding of U+017D, "Z with caron") to the single byte 8E, and byte 8E reinterpreted as Mac OS Roman is "e with acute" (U+00E9 in Unicode).
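
The table lookup described above can also be automated. The following Python sketch is only one possible way to do it, not the method I actually used: the candidate list, the function name `matching_pairs` and the sample pairs are all assumptions chosen for illustration. It tries every ordered pair of candidate codecs and keeps those that turn each garbled sample into the intended character.

```python
from itertools import permutations

# Hypothetical shortlist; extend it with whatever charsets seem plausible.
CANDIDATES = ["cp1252", "cp1250", "latin_1", "iso8859_2",
              "iso8859_15", "mac_roman", "cp437"]


def matching_pairs(samples, candidates=CANDIDATES):
    """Return (wrong, right) codec pairs that repair every sample.

    samples is a list of (garbled, intended) strings.
    """
    hits = []
    for wrong, right in permutations(candidates, 2):
        try:
            if all(garbled.encode(wrong).decode(right) == intended
                   for garbled, intended in samples):
                hits.append((wrong, right))
        except UnicodeError:
            # The garbled text does not exist in `wrong`, or the bytes
            # are undefined in `right`: this pair cannot be the answer.
            continue
    return hits


samples = [("\u017d", "\u00e9"),   # Z with caron -> e with acute
           ("\u00d2", "\u201c")]   # O with grave -> left curly quote
print(matching_pairs(samples))
```

A single character often matches more than one pair, so it helps to pass in several garbled letters at once; each extra sample rules out more candidates.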

(I feel that named HTML character entity references are often a better way to represent characters than ambiguous raw binary codes, and using them would have prevented such problems.)
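
As a rough illustration of that idea, here is a Python sketch that rewrites non-ASCII characters as named entity references, so the file itself stays pure ASCII and reads the same under any of the encodings discussed above. The function name `to_named_entities` is made up here; `html.entities.codepoint2name` and `html.unescape` are part of the standard library. The sketch deliberately leaves ASCII characters such as `&` and `<` alone, so it is not a full HTML escaper.

```python
import html
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace non-ASCII characters with named (or numeric) references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                             # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";") # e.g. &eacute;
        else:
            out.append("&#%d;" % cp)                   # no name: numeric reference
    return "".join(out)


encoded = to_named_entities("caf\u00e9")
print(encoded)                                # caf&eacute;
print(html.unescape(encoded) == "caf\u00e9")  # True
```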

posted over 1 year ago

CC BY-SA 4.0

1mo ago by Michael‭

DavidCary‭


1 comment thread

Thank you for your answer! It worked perfectly! (1 comment)


Determine encoding of text
