Post History

Q&A Determine encoding of text

posted 1y ago by DavidCary‭ · edited 2mo ago by Michael‭

Answer

#2: Post edited by

Michael‭ · 2025-03-31T00:44:34Z (2 months ago)
Pretty print. Links to bib section. Capitalize subheadings.

### Quick fix
I agree with Canina that you need to do *two* translations
to fix this problem.
Fortunately, it appears that you can recover the original text without loss.
Try this:
```bash
# First convert from UTF-8 to WINDOWS-1252; this recovers the original bytes.
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# Next re-interpret those bytes as Mac OS Roman (iconv calls it MACINTOSH)
# and convert back to UTF-8.
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```
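
To check the two-step conversion before running it on a real file, here is a minimal sketch on a made-up sample. It assumes an `iconv` that accepts the same encoding names as the commands above. `fix_mojibake` is a hypothetical helper name, and the sample word is invented for illustration rather than taken from the question.

```bash
# Hypothetical helper (not part of the original answer): undo the double
# mis-encoding for text on stdin and write the repaired UTF-8 to stdout.
fix_mojibake() {
  iconv -f UTF-8 -t WINDOWS-1252 | iconv -f MACINTOSH -t UTF-8
}

# Made-up sample: "café" whose "é" was garbled into "Ž" (UTF-8 bytes C5 BD,
# written here as octal escapes so that any POSIX printf understands them).
printf 'caf\305\275\n' | fix_mojibake
```

If the encodings are the ones assumed here, the last line should print the word with its "e with acute" restored. Using a pipe instead of `junk.txt` is only a convenience; the two `iconv` steps are the same.
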
### Details
I've had the same thing happen to curly quotes in my files
when trying to read text files I created on my old Macintosh
that were misinterpreted as ISO-8859-1 or ISO-8859-15 text.
Other intermediate encodings would work just as well for the curly quotes alone, since several encodings (WINDOWS-1252, ISO-8859-1, and ISO-8859-15 among them) assign the same characters to the bytes that Mac OS Roman uses for curly quotes. One such alternative is
```bash
# First convert from UTF-8 to ISO-8859-15; this recovers the original bytes.
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# Next re-interpret those bytes as Mac OS Roman (iconv calls it MACINTOSH)
# and convert back to UTF-8.
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```
which was the solution for my text,
but would mess up other letters in your particular text.
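
The difference between the two intermediate encodings can be made visible by printing the byte each one produces. This is a rough sketch, assuming `iconv` and `od` are available; `bytes_in` is a hypothetical helper name, and the two test characters are chosen for illustration.

```bash
# Hypothetical helper: print, as hex, the byte(s) that UTF-8 text on stdin
# becomes in the target encoding given as the first argument.
bytes_in() {
  iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

for enc in WINDOWS-1252 ISO-8859-15; do
  # "O with grave" (U+00D2) is how the Mac OS Roman left curly quote (byte D2)
  # appears when misread as either encoding; "Z with caron" (U+017D) is a
  # letter on which the two encodings disagree.
  printf '%s  O-grave=%s  Z-caron=%s\n' "$enc" \
    "$(printf '\303\222' | bytes_in "$enc")" \
    "$(printf '\305\275' | bytes_in "$enc")"
done
```

Both encodings should report `d2` for "O with grave", which is why either one repairs curly quotes. For "Z with caron", WINDOWS-1252 should report `8e` while ISO-8859-15 reports `b4`, and `b4` is not "e with acute" in Mac OS Roman.
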

I used [Wikipedia's list of Latin charsets][wiki-charsets]
to figure out which two character sets
use the same byte value for
"Z with caron" in one set and
"e with acute" in the other,
and likewise for the other garbled letters.

Fortunately, I saw there that WINDOWS-1252 also lines up for the other letters in that text:
it translates
`C5 BD` (`U+017D`, "Z with caron") to `8E`,
and the byte `8E`, when re-interpreted as Mac OS Roman,
represents "e with acute" (`U+00E9` in Unicode).
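
As an alternative to reading the tables by hand, the candidate encodings can simply be tried one after another. The sketch below assumes the text was originally Mac OS Roman, as in this answer; `try_encodings` is a hypothetical helper name and the sample text is made up.

```bash
# Hypothetical helper: for each candidate "wrong" encoding, undo it and show
# how the file reads when the recovered bytes are taken as Mac OS Roman.
# Usage: try_encodings FILE ENCODING...
try_encodings() {
  file=$1
  shift
  tmp=$(mktemp) || return 1
  for enc in "$@"; do
    # iconv fails if the file contains a character the candidate lacks.
    if iconv -f UTF-8 -t "$enc" < "$file" > "$tmp" 2>/dev/null; then
      printf '%s: %s\n' "$enc" "$(iconv -f MACINTOSH -t UTF-8 < "$tmp")"
    else
      printf '%s: (cannot represent every character)\n' "$enc"
    fi
  done
  rm -f "$tmp"
}

# Made-up sample: a quoted "café" typed on a Mac (bytes D2 ... 8E D3), then
# misread as WINDOWS-1252 and saved as UTF-8.
sample=$(mktemp)
printf '\303\222caf\305\275\303\223\n' > "$sample"
try_encodings "$sample" WINDOWS-1252 ISO-8859-15 ISO-8859-1
rm -f "$sample"
```

With this sample, only the WINDOWS-1252 line should show the word and its curly quotes intact. The candidate whose output reads correctly for every garbled letter is the one to use in the first `iconv` step.
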

(I feel that named
[HTML character entity references][wiki-entities]
are often a better way to represent characters than ambiguous raw binary codes,
and that using them would have prevented such problems.)

[wiki-charsets]: https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)
[wiki-entities]: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

#1: Initial revision by

DavidCary‭ · 2023-09-05T23:20:31Z (over 1 year ago)

### quick fix

I agree with Canina that you need to do *two* translations
to fix this problem.
Fortunately, it appears that you can recover the original text without loss.

Try this:
```
# first convert from UTF-8 to WINDOWS-1252
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```

### details

I've had the same thing happen to curly quotes in my files
when trying to read text files I created on my old Macintosh
that were mis-interpreted as ISO-8859-1 or ISO-8859-15 text.
Other options would work just as well to fix the curly quotes, since several different character encodings happen to put the curly quotes in the same place, such as

```
# first convert from UTF-8 to ISO-8859-15
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```

which was the solution for my text,
but would mess up other letters in your particular text.

I used
https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)
to figure out which 2 character sets
had the same byte value representing
"Z with caron" in one set and
"e with acute" in the other set,
etc.

Fortunately I saw there that WINDOWS-1252 lines up with other letters in that text,
translating
C5 BD ( U+017D "Z with caron" ) to 8E,
where the byte 8E when re-interpreted as "MAC OS Roman"
represents "e with acute" (U+00E9 in Unicode).

(I feel that using named
[HTML character entity references](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
are often a better way to represent characters than ambiguous raw binary codes,
and would have prevented such problems.
).
