Post History

Q&A Determine encoding of text


posted 1y ago by DavidCary‭

Answer
#1: Initial revision by DavidCary · 2023-09-05T23:20:31Z (about 1 year ago)
### quick fix

I agree with Canina that you need to do *two* translations
to fix this problem.
Fortunately, it appears that you can recover the original text without loss.

Try this:
```
# first convert from UTF-8 to WINDOWS-1252
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```
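
One way to check the "without loss" claim is to reverse both steps and compare the result with the input. The following is only a sketch under assumptions: it works in a scratch directory, builds a hypothetical one-line stand-in for `test.txt`, and `roundtrip.txt` is a file name made up for this illustration.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for test.txt: "caf" followed by U+017D
# (UTF-8 bytes C5 BD, written as octal escapes), i.e. a damaged "cafe with acute".
printf 'caf\305\275\n' > test.txt

# The two conversions from above; && stops if the first step fails.
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt &&
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt

# Reverse both steps and compare with the input to confirm nothing was lost.
iconv -f UTF-8 -t MACINTOSH < output.txt | iconv -f WINDOWS-1252 -t UTF-8 > roundtrip.txt
if cmp -s test.txt roundtrip.txt; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

If the comparison reports a mismatch on a real file, or the first `iconv` stops with an error, some character did not survive the intermediate encoding and a different one may be needed.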

### details

I've had the same thing happen to curly quotes
in text files that I created on my old Macintosh
and that were later misinterpreted as ISO-8859-1 or ISO-8859-15 text.
Other options would work just as well to fix the curly quotes, because ISO-8859-1, ISO-8859-15 and WINDOWS-1252 all assign the same characters to the bytes that Mac OS Roman uses for curly quotes. One such option is

```
# first convert from UTF-8 to ISO-8859-15
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
```

which was the solution for my text,
but would mess up other letters in your particular text.
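
To see why, here is a small sketch that pushes a single sample character, the "Z with caron" from the text, through the ISO-8859-15 variant. The variable names are made up for this illustration. As far as I can tell from the code charts, ISO-8859-15 stores that letter at byte B4, which Mac OS Roman reads as a yen sign rather than as "e with acute".

```shell
# U+017D "Z with caron" as UTF-8 (bytes C5 BD), written with octal escapes
# so the script does not depend on its own file encoding.
z_caron=$(printf '\305\275')

# ISO-8859-15 stores this letter at byte B4, not at 8E as WINDOWS-1252 does.
byte_8859_15=$(printf '%s' "$z_caron" | iconv -f UTF-8 -t ISO-8859-15 | od -An -tx1 | tr -d ' \n')
echo "ISO-8859-15 byte: $byte_8859_15"

# Reading that byte as Mac OS Roman therefore cannot give "e with acute".
result_hex=$(printf '%s' "$z_caron" | iconv -f UTF-8 -t ISO-8859-15 | iconv -f MACINTOSH -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "result bytes: $result_hex (UTF-8 for e with acute would be c3a9)"
```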

I used
https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)
to figure out which 2 character sets
had the same byte value representing
"Z with caron" in one set and
"e with acute" in the other set,
etc.

Fortunately, I saw there that WINDOWS-1252 lines up with the other letters in that text:
it translates the UTF-8 bytes
C5 BD (U+017D, "Z with caron") to the single byte 8E,
and the byte 8E, when reinterpreted as Mac OS Roman,
represents "e with acute" (U+00E9 in Unicode).
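
This byte path can be checked from the shell. A minimal sketch, assuming your `iconv` accepts the same encoding names as in the commands above; the variable names are only for illustration:

```shell
# U+017D "Z with caron" as UTF-8 (bytes C5 BD), written with octal escapes
# so the script does not depend on its own file encoding.
mojibake=$(printf '\305\275')

# Step 1: UTF-8 -> WINDOWS-1252 gives the single byte 8E.
middle_hex=$(printf '%s' "$mojibake" | iconv -f UTF-8 -t WINDOWS-1252 | od -An -tx1 | tr -d ' \n')
echo "intermediate byte: $middle_hex"

# Step 2: read that byte as Mac OS Roman and convert back to UTF-8.
# c3a9 is the UTF-8 encoding of U+00E9 "e with acute".
recovered_hex=$(printf '%s' "$mojibake" | iconv -f UTF-8 -t WINDOWS-1252 | iconv -f MACINTOSH -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "recovered bytes: $recovered_hex"
```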

(I feel that named
[HTML character entity references](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
are often a better way to represent characters than ambiguous raw binary codes,
and that using them would have prevented such problems.)