Determine encoding of text

I have some text files that the file utility reports as UTF-8:

file test.txt
test.txt: Unicode text, UTF-8 text, with CRLF line terminators

However, when I look at their content, I suspect they really have some other encoding:

ÒHi there. IÕm a test documentÓ

ÒTouchŽ.Ó

From context, this should read as

“Hi there. I'm a test document”

“Touché.”

How can I determine the original encoding of the text, so that I can re-encode the file with iconv and get readable text?

character-encoding

posted almost 2 years ago

CC BY-SA 4.0

samcarter‭

Determining the encoding of binary data representing text is a notoriously difficult problem, which basically comes down to probabilities.

UTF-8 is one of the few exceptions that I know of. Even without a byte order mark or out-of-band metadata about the character set and encoding, UTF-8 has a rather specific structure which allows it to be identified with a high degree of certainty through fairly simple heuristics. The longer the UTF-8 text is, the more reliable the heuristic becomes. (The one notable exception is that, by design, any US-ASCII text, which uses only byte values 0x00..0x7F, is simultaneously valid UTF-8 representing the same textual content.)
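As a rough illustration of that heuristic, here is a minimal Python sketch. It is an assumption on my part that a strict decode is a good enough stand-in for "looks like UTF-8"; the helper names looks_like_utf8 and is_pure_ascii are made up for this example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a structurally valid UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in 0x00..0x7F (valid ASCII and valid UTF-8)."""
    return all(b < 0x80 for b in data)


# The same text stored in two different encodings, plus a plain ASCII sample.
sample_utf8 = "“Touché.”".encode("utf-8")
sample_mac = "“Touché.”".encode("mac_roman")
sample_ascii = b"Hi there."

print(looks_like_utf8(sample_utf8))   # structurally valid UTF-8
print(looks_like_utf8(sample_mac))    # 0xD2 followed by "T" is not valid UTF-8
print(looks_like_utf8(sample_ascii), is_pure_ascii(sample_ascii))
```

Note that a positive result only says the bytes are well-formed UTF-8. It cannot tell whether the text was mangled before it was encoded, which is exactly the situation in the question.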

According to GitHub, the content of your file is:

00000000: c392 4869 2074 6865 7265 2e20 49c3 956d  ..Hi there. I..m
00000010: 2061 2074 6573 7420 646f 6375 6d65 6e74   a test document
00000020: c393 0d0a 0d0a c392 546f 7563 68c5 bd2e  ........Touch...
00000030: c393 20                                  ..

In UTF-8, the byte sequence C3 92 encodes the code point U+00D2, which really is the character Ò. (The same character can also be written in decomposed form as U+004F U+0300, which would become the byte sequence 4F CC 80.)
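If you want to check this yourself, a short Python sketch using only the standard library could look like the following. The variable names are just for illustration.

```python
import unicodedata

# Decode the first two bytes of the file as UTF-8.
raw = bytes.fromhex("c392")
char = raw.decode("utf-8")
print(f"U+{ord(char):04X}", unicodedata.name(char))

# Canonical decomposition (NFD) splits the letter from its combining accent.
decomposed = unicodedata.normalize("NFD", char)
print(" ".join(f"U+{ord(c):04X}" for c in decomposed))
print(" ".join(f"{b:02X}" for b in decomposed.encode("utf-8")))
```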

So what you're seeing is correct, as far as anyone could tell without further context. It appears that somewhere along the way, some process has converted some text data into UTF-8 and, in doing so, made an invalid assumption about the input data's encoding.

You'd need to reverse that last step by converting from UTF-8 back into each candidate for the encoding that was wrongly assumed, then reinterpreting the resulting bytes in each candidate for the true original encoding and converting that to a normalized encoding such as UTF-8. In the output, you'd be looking for a first code point of U+201C (“), which UTF-8 encodes as the byte sequence E2 80 9C (or its equivalent in whatever other encoding you choose to normalize to).

Doing so should give you a UTF-8 file representing the textual content of your mojibake'd file, as well as a good indication of its original encoding.
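The search described above can be sketched in Python roughly as follows. The candidate list is an arbitrary assumption for illustration (extend it with whatever encodings seem plausible for your files), and find_encoding_pairs is a hypothetical helper name.

```python
from itertools import permutations

# Candidate encodings to try; this short list is only an assumption.
CANDIDATES = ["cp1252", "latin_1", "iso8859_15", "mac_roman", "cp437", "cp850"]


def find_encoding_pairs(garbled: str, expected_first: str = "\u201c"):
    """Try every (wrongly assumed, original) encoding pair.

    Returns the pairs whose repaired text starts with expected_first.
    """
    hits = []
    for wrong, original in permutations(CANDIDATES, 2):
        try:
            # Undo the wrong decoding, then decode with the candidate original.
            repaired = garbled.encode(wrong).decode(original)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot represent the text at all
        if repaired.startswith(expected_first):
            hits.append((wrong, original, repaired))
    return hits


garbled = "ÒTouchŽ.Ó"  # already decoded from the UTF-8 file
for wrong, original, repaired in find_encoding_pairs(garbled):
    print(f"misread as {wrong}, originally {original}: {repaired}")
```

More than one pair can satisfy the first-code-point test, so the candidates still need a human look to pick the one where the remaining letters read correctly.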

posted almost 2 years ago

CC BY-SA 4.0

Canina‭


Worked for samcarter‭ (marked as "Works for me" on Sep 6, 2023 at 09:47)

Quick fix

I agree with Canina that you need to do two translations to fix this problem. Fortunately, it appears that you can recover the original text without loss.

Try this:

# first convert from UTF-8 to WINDOWS-1252
iconv -f UTF-8 -t WINDOWS-1252 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt
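If you'd rather do the same two steps in Python, a minimal sketch might look like this. It assumes Python's cp1252 and mac_roman codecs correspond to iconv's WINDOWS-1252 and MACINTOSH, and it uses the bytes from the hex dump in Canina's answer instead of reading test.txt.

```python
# Bytes of test.txt, copied from the hex dump in Canina's answer.
data = bytes.fromhex(
    "c392 4869 2074 6865 7265 2e20 49c3 956d "
    "2061 2074 6573 7420 646f 6375 6d65 6e74 "
    "c393 0d0a 0d0a c392 546f 7563 68c5 bd2e "
    "c393 20"
)

# Step 1: UTF-8 -> WINDOWS-1252 recovers the original single-byte data.
garbled = data.decode("utf-8")
mac_bytes = garbled.encode("cp1252")

# Step 2: reinterpret those bytes as Mac OS Roman.
repaired = mac_bytes.decode("mac_roman")
print(repaired)

# Encode again as UTF-8 if you want to write the result to a file.
output_bytes = repaired.encode("utf-8")
```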

Details

I've had the same thing happen to curly quotes when trying to read text files that I created on my old Macintosh and that were misinterpreted as ISO-8859-1 or ISO-8859-15 text. Other options would work just as well to fix the curly quotes, since several character encodings happen to put Ò and Ó at the same byte values (0xD2 and 0xD3), which is where Mac OS Roman has the curly quotes. For example:

# first convert from UTF-8 to ISO-8859-15
iconv -f UTF-8 -t ISO-8859-15 < test.txt > junk.txt
# next re-interpret the text as "MAC OS Roman"
# and convert back to UTF-8
iconv -f MACINTOSH -t UTF-8 < junk.txt > output.txt

which was the solution for my text, but would mess up other letters in your particular text.

I used Wikipedia's list of Latin charsets to figure out which two character sets use the same byte value for "Z with caron" in one set and "e with acute" in the other, etc.

Fortunately I saw there that WINDOWS-1252 lines up with other letters in that text, translating C5 BD (U+017D "Z with caron") to 8E, where the byte 8E when re-interpreted as "MAC OS Roman" represents "e with acute" (U+00E9 in Unicode).
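The same lookup can be done without the charts. Here is a small Python sketch, assuming Python's cp1252, iso8859_15 and mac_roman codecs match the tables on Wikipedia; the results dictionary is only there for illustration.

```python
z_caron = "\u017d"  # Ž, stored as C5 BD in the UTF-8 file
print(z_caron.encode("utf-8").hex())

results = {}
for wrong in ("cp1252", "iso8859_15"):
    byte = z_caron.encode(wrong)           # byte value in the wrongly assumed set
    as_mac = byte.decode("mac_roman")      # the same byte read as Mac OS Roman
    results[wrong] = (byte.hex(), as_mac)
    print(wrong, byte.hex(), as_mac)
```

Only the WINDOWS-1252 route turns Ž back into é; going through ISO-8859-15 gives a different byte and therefore a different Mac OS Roman character.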

(I feel that using named HTML character entity references is often a better way to represent characters than ambiguous raw binary codes, and would have prevented such problems.)

posted over 1 year ago

CC BY-SA 4.0

DavidCary‭

If your goal is to fix your files the way David Cary's iconv commands do, but you can't tell which mis-encodings produced your text, you can use a little Python and the ftfy library [1], available on PyPI, to undo the mess.

Some quick examples

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'
Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.

ftfy can fix multiple layers of mojibake simultaneously:
>>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."

I learned about ftfy several years after I wrote some (much less rigorous) tools to detect and unscramble content that had made its way through one or more different encodings.
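To get a feel for what undoing a single layer involves, here is a standard-library sketch that reproduces the first example by hand. This is not how ftfy works internally; it assumes you already know the text was UTF-8 misread as Windows-1252, which is precisely the part ftfy works out for you.

```python
original = "\u2714 No problems"  # starts with a check mark (U+2714)

# Create the mojibake: UTF-8 bytes wrongly decoded as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Undo it: recover the bytes, then decode them as UTF-8 as originally intended.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```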

[1] "Fixed that for you"

posted over 1 year ago

CC BY-SA 4.0

Michael‭
