Post History
#2: Post edited
### Quick fix

I agree with Canina that you need to do *two* translations
to fix this problem.
Fortunately, it appears that you can recover the original text without loss.
Try this:

```bash
# First, convert from UTF-8 to WINDOWS-1252.
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# Next, reinterpret those bytes as Mac OS Roman
# and convert back to UTF-8.
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```
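To check the two-step fix on a small sample before touching a real file, a round trip like the following may help. It is only a sketch: the file names and the scratch directory are made up for illustration, and it assumes the damage came from Mac OS Roman bytes being misread as WINDOWS-1252, which is what the fix above implies.

```bash
workdir=$(mktemp -d)   # scratch directory for the hypothetical sample files
cd "$workdir"

# "café" in UTF-8 (the é is the two bytes C3 A9).
printf 'caf\303\251\n' > original.txt

# Simulate the assumed damage: encode as Mac OS Roman,
# then misread those bytes as WINDOWS-1252.
iconv -f UTF-8 -t MACINTOSH < original.txt \
    | iconv -f WINDOWS-1252 -t UTF-8 > damaged.txt

# Apply the two-step fix from the answer.
iconv -f UTF-8 -t WINDOWS-1252 < damaged.txt > junk.txt
iconv -f MACINTOSH -t UTF-8 < junk.txt > repaired.txt

# Compare the repaired file with the original, byte for byte.
if cmp -s original.txt repaired.txt; then
    result="round trip OK"
else
    result="round trip FAILED"
fi
echo "$result"
```

If the comparison fails on your own sample, the damage probably went through a different pair of encodings than the one assumed here.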
### Details

I've had the same thing happen to curly quotes in my files
when trying to read text files I created on my old Macintosh
that were misinterpreted as ISO-8859-1 or ISO-8859-15 text.
Other intermediate encodings would work just as well to fix the curly quotes,
since several character encodings (ISO-8859-1, ISO-8859-15, and WINDOWS-1252 among them)
assign the same characters to the byte values that Mac OS Roman uses for curly quotes.
For example:

```bash
# First, convert from UTF-8 to ISO-8859-15.
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# Next, reinterpret those bytes as Mac OS Roman
# and convert back to UTF-8.
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```

That was the solution for my text,
but it would mess up other letters in your particular text.
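As a rough illustration of why either intermediate encoding repairs curly quotes, the sketch below damages a short sample and then tries both repairs. The sample text, file names, and scratch directory are hypothetical, and the damage is assumed to be Mac OS Roman bytes misread as ISO-8859-15, as in my own files.

```bash
workdir=$(mktemp -d)   # scratch directory for the hypothetical sample files
cd "$workdir"

# A sample line with curly double quotes in UTF-8 (E2 80 9C ... E2 80 9D).
printf '\342\200\234hi\342\200\235\n' > original.txt

# Simulate the assumed damage: Mac OS Roman bytes misread as ISO-8859-15.
iconv -f UTF-8 -t MACINTOSH < original.txt \
    | iconv -f ISO-8859-15 -t UTF-8 > damaged.txt

# Try the repair with each intermediate encoding and record which ones
# reproduce the original file exactly.
repaired_by=""
for charset in ISO-8859-15 WINDOWS-1252; do
    iconv -f UTF-8 -t "$charset" < damaged.txt \
        | iconv -f MACINTOSH -t UTF-8 > "repaired-$charset.txt"
    if cmp -s original.txt "repaired-$charset.txt"; then
        repaired_by="$repaired_by $charset"
    fi
done
echo "intermediates that restored the sample:$repaired_by"
```

Both intermediates restore this sample because it contains only curly quotes. A sample that includes "Z with caron" would tell the two apart.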
I used [Wikipedia's list of Latin charsets][wiki-charsets]
to figure out which two character sets
had the same byte value representing
"Z with caron" in one set and
"e with acute" in the other set,
etc.
Fortunately, I saw there that WINDOWS-1252 also lines up with the other letters in that text:
it translates
`C5 BD` (the UTF-8 encoding of `U+017D`, "Z with caron") to the single byte `8E`,
and the byte `8E`, when reinterpreted as Mac OS Roman,
represents "e with acute" (`U+00E9` in Unicode).
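The table lookup described above can also be approximated with `iconv` itself. The sketch below is an assumption-laden helper rather than part of the original fix: `byte_in` is a made-up function name, and the candidate list contains only the encodings mentioned in this answer.

```bash
# Print, in hex, the byte(s) that CHARSET ($1) uses for a character given as
# UTF-8 octal escapes ($2); prints nothing if CHARSET lacks the character.
byte_in() {
    printf "$2" | iconv -f UTF-8 -t "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

z_caron='\305\275'   # U+017D "Z with caron", UTF-8 bytes C5 BD
e_acute='\303\251'   # U+00E9 "e with acute", UTF-8 bytes C3 A9

# The byte that Mac OS Roman uses for "e with acute".
target=$(byte_in MACINTOSH "$e_acute")
echo "Mac OS Roman: e with acute -> $target"

# Look for candidate encodings that put "Z with caron" at that same byte.
matches=""
for charset in ISO-8859-1 ISO-8859-15 WINDOWS-1252; do
    byte=$(byte_in "$charset" "$z_caron")
    echo "$charset: Z with caron -> ${byte:-not representable}"
    if [ "$byte" = "$target" ]; then
        matches="$matches $charset"
    fi
done
echo "candidates:$matches"
```

A match on one letter only narrows the field. As noted above, the candidate still has to line up with the other damaged letters in the text.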
(I feel that using named
[HTML character entity references][wiki-entities]
is often a better way to represent characters than ambiguous raw binary codes,
and would have prevented such problems.)

[wiki-charsets]: https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)
[wiki-entities]: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
#1: Initial revision
### quick fix

I agree with Canina that you need to do *two* translations
to fix this problem.
Fortunately, it appears that you can recover the original text without loss.
Try this:

```
# first convert from UTF-8 to WINDOWS-1252
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```

### details

I've had the same thing happen to curly quotes in my files
when trying to read text files I created on my old Macintosh
that were mis-interpreted as ISO-8859-1 or ISO-8859-15 text.
Other options would work just as well to fix the curly quotes, since several different character encodings happen to put the curly quotes in the same place, such as

```
# first convert from UTF-8 to ISO-8859-15
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```

which was the solution for my text,
but would mess up other letters in your particular text.

I used https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)
to figure out which 2 character sets
had the same byte value representing
"Z with caron" in one set and
"e with acute" in the other set,
etc.
Fortunately I saw there that WINDOWS-1252 lines up with other letters in that text,
translating
C5 BD ( U+017D "Z with caron" ) to 8E,
where the byte 8E when re-interpreted as "MAC OS Roman"
represents "e with acute" (U+00E9 in Unicode).

(I feel that using named
[HTML character entity references](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
are often a better way to represent characters than ambiguous raw binary codes,
and would have prevented such problems. ).