

Post History

Q&A: Determine encoding of text (score +5 −0, 77%)

posted 9mo ago by Canina · edited 9mo ago by Canina

Answer
#3: Post edited by Canina · 2023-08-24T18:05:35Z (9 months ago)
Determining the encoding of binary data representing text is a notoriously difficult problem, which basically comes down to probabilities.

UTF-8 is one of the few exceptions that I know of, because even absent a byte order mark or out-of-band metadata about the character set and encoding, UTF-8 has a rather specific structure which allows it to be identified with a high degree of certainty through fairly simple heuristics. The longer the text encoded in UTF-8 is, the more accurate this heuristic becomes. (The one notable exception being that by design, all US-ASCII using only binary values 0x00..0x7F is also simultaneously valid UTF-8 representing the same textual content.)

According to GitHub, the content of your file is:

    00000000: c392 4869 2074 6865 7265 2e20 49c3 956d  ..Hi there. I..m
    00000010: 2061 2074 6573 7420 646f 6375 6d65 6e74   a test document
    00000020: c393 0d0a 0d0a c392 546f 7563 68c5 bd2e  ........Touch...
    00000030: c393 20                                  .. 

In UTF-8, the byte sequence `C3` `92` encodes the code point U+00D2, which really is the glyph `Ò` (which can also be written as U+004F U+0300, which would become the byte sequence `4F` `CC` `80`).

So what you're seeing is correct, as far as anyone could tell without further context. It appears that somewhere along the way, *some process* has converted *some text data* into UTF-8 and, in doing so, made an invalid assumption about the input data's encoding.

**You'd need to** reverse that last step, converting back from UTF-8 into each candidate encoding, *then* converting *that* in turn back from the now-known candidate encoding to a normalized encoding such as UTF-8. In the output of *that*, you'd be looking for a first code point of U+201C (`“`), which in UTF-8 encodes as the byte sequence `E2` `80` `9C` (or its equivalent in whatever other encoding you choose to normalize to).

**Doing so should** give you a UTF-8 file representing the textual content of your [mojibake](https://en.wikipedia.org/wiki/Mojibake)'d file, as well as a good indication of its original encoding.
#2: Post edited by Canina · 2023-08-24T18:05:01Z (9 months ago)
#1: Initial revision by Canina · 2023-08-24T18:03:33Z (9 months ago)
Determining the encoding of binary data representing text is a notoriously difficult problem, which basically comes down to probabilities.

UTF-8 is one of the few exceptions that I know of, because even absent a byte order mark or out-of-band metadata about the character set and encoding, UTF-8 has a rather specific structure which allows it to be identified with a high degree of certainty through fairly simple heuristics. The longer the text encoded in UTF-8 is, the more accurate this heuristic becomes. (The one notable exception being that all US-ASCII using only binary values 0x00..0x7F is also simultaneously valid UTF-8.)
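As a minimal sketch of that heuristic, a strict UTF-8 decoder can do the work directly; in Python this comes down to attempting a decode (function name here is illustrative):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8.

    Because UTF-8's multi-byte structure is strict, a clean decode of
    any non-trivial input is strong evidence the data really is UTF-8
    (or pure US-ASCII, which is by design a subset of UTF-8).
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("Touch\u017d".encode("utf-8")))  # True
print(looks_like_utf8(b"\xd2Hi"))  # False: 0xD2 is a lead byte with no continuation byte
```

Note that the second call fails precisely because a legacy single-byte encoding put 0xD2 in front of plain ASCII, which is not a valid UTF-8 sequence.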

According to GitHub, the content of your file is:

    00000000: c392 4869 2074 6865 7265 2e20 49c3 956d  ..Hi there. I..m
    00000010: 2061 2074 6573 7420 646f 6375 6d65 6e74   a test document
    00000020: c393 0d0a 0d0a c392 546f 7563 68c5 bd2e  ........Touch...
    00000030: c393 20                                  .. 

In UTF-8, the byte sequence `C3` `92` encodes the code point U+00D2, which really is the glyph `Ò` (which can also be written as U+004F U+0300, which would become the byte sequence `4F` `CC` `80`).
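Each of those claims can be checked directly in Python; the `unicodedata.normalize` call produces the decomposed form mentioned above:

```python
import unicodedata

# C3 92 decodes, as UTF-8, to the precomposed code point U+00D2 (Ò).
assert b"\xc3\x92".decode("utf-8") == "\u00d2"

# NFD normalization decomposes it into U+004F (O) plus U+0300 (combining grave).
decomposed = unicodedata.normalize("NFD", "\u00d2")
assert decomposed == "\u004f\u0300"

# That decomposed form encodes, in UTF-8, as the bytes 4F CC 80.
assert decomposed.encode("utf-8") == b"\x4f\xcc\x80"
```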

So what you're seeing is correct, as far as anyone could tell without further context. It appears that somewhere along the way, *some process* has converted *some text data* into UTF-8 and, in doing so, made an invalid assumption about the input data's encoding.

**You'd need to** reverse that last step, converting back from UTF-8 into each candidate encoding, *then* converting *that* in turn back from the now-known candidate encoding to a normalized encoding such as UTF-8. In the output of *that*, you'd be looking for a first code point of U+201C (`“`), which in UTF-8 encodes as the byte sequence `E2` `80` `9C` (or its equivalent in whatever other encoding you choose to normalize to).
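The loop below sketches that search in Python. The candidate lists, function name, and sample string are illustrative, not exhaustive; in this particular sample, text that was really mac_roman was misread as latin-1 before being converted to UTF-8:

```python
MISREAD_AS = ["latin-1", "cp1252", "mac_roman"]   # encodings the converter may have assumed
REALLY_WAS = ["mac_roman", "cp1252", "cp437"]     # encodings the data may actually have been

def candidate_fixes(mojibake: str):
    """Yield (assumed, actual, fixed_text) triples whose round trip
    starts with U+201C, the opening curly quote we expect."""
    for assumed in MISREAD_AS:
        try:
            raw = mojibake.encode(assumed)   # undo the wrong conversion into UTF-8
        except UnicodeEncodeError:
            continue                         # text can't round-trip through this encoding
        for actual in REALLY_WAS:
            try:
                fixed = raw.decode(actual)   # reinterpret in a plausible original encoding
            except UnicodeDecodeError:
                continue
            if fixed.startswith("\u201c"):   # first code point should be the opening quote
                yield assumed, actual, fixed

# The first bytes of the file, decoded as UTF-8, give U+00D2 'Ò':
sample = "\u00d2Hi there."
for assumed, actual, fixed in candidate_fixes(sample):
    print(f"{assumed} -> {actual}: {fixed}")
```

On this sample, the round trip through latin-1 and back out of mac_roman recovers `“Hi there.`, which matches the expected opening quote.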

**Doing so should** give you a UTF-8 file representing the textual content of your [mojibake](https://en.wikipedia.org/wiki/Mojibake)'d file, as well as a good indication of its original encoding.