Post History
Answer
#3: Post edited
Determining the encoding of binary data representing text is a notoriously difficult problem, one that ultimately comes down to probabilities.

UTF-8 is one of the few exceptions that I know of, because even absent a byte order mark or out-of-band metadata about the character set and encoding, UTF-8 has a rather specific structure that allows it to be identified with a high degree of certainty through fairly simple heuristics. The longer the UTF-8-encoded text, the more accurate the heuristic becomes. (The one notable exception: by design, any US-ASCII text using only byte values 0x00..0x7F is simultaneously valid UTF-8 representing the same textual content.)
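As a minimal sketch of that heuristic (Python here purely for illustration; a real detector does considerably more), a strict UTF-8 decode already rejects most byte streams that were not produced by a UTF-8 encoder, while plain ASCII passes:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: strict UTF-8 decoding fails on most byte streams
    that were not actually produced by a UTF-8 encoder."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(b"\xc3\x92Hi there"))  # True  (valid UTF-8 multi-byte sequence)
print(looks_like_utf8(b"\xd2Hi there"))      # False (a lone 0xD2 is not valid UTF-8)
print(looks_like_utf8(b"Hi there"))          # True  (pure ASCII is also valid UTF-8)
```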
According to GitHub, the content of your file is:

    00000000: c392 4869 2074 6865 7265 2e20 49c3 956d ..Hi there. I..m
    00000010: 2061 2074 6573 7420 646f 6375 6d65 6e74 a test document
    00000020: c393 0d0a 0d0a c392 546f 7563 68c5 bd2e ........Touch...
    00000030: c393 20 ..
In UTF-8, the byte sequence `C3` `92` encodes the code point U+00D2, which is the character `Ò` (equivalently written in decomposed form as U+004F U+0300, which UTF-8-encodes as the byte sequence `4F` `CC` `80`).
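To see that concretely (a quick Python check; nothing here is specific to your file):

```python
import unicodedata

# The bytes C3 92 decode, as UTF-8, to the single code point U+00D2.
print(b"\xc3\x92".decode("utf-8"))            # Ò

# The same character in decomposed form is U+004F U+0300,
# which UTF-8-encodes as 4F CC 80.
decomposed = unicodedata.normalize("NFD", "\u00d2")
print([hex(ord(c)) for c in decomposed])      # ['0x4f', '0x300']
print(decomposed.encode("utf-8"))             # b'O\xcc\x80'
```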
So what you're seeing is correct, as far as anyone could tell without further context. It appears that somewhere along the way, *some process* has converted *some text data* into UTF-8 and, in doing so, made an invalid assumption about the input data's encoding.
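Purely as an illustration of that kind of mistake (the encodings named below are assumptions for the example, not a diagnosis of your file): if the original bytes used an encoding in which `D2` and `D3` are curly quotes, but some process decoded them as Latin-1 or Windows-1252 before re-encoding to UTF-8, you get exactly the `C3 92` / `C3 93` pattern in the dump above.

```python
# Hypothetical reconstruction of the mistake; the actual encodings involved
# are unknown and are exactly what the search described next should uncover.
original = b"\xd2Hi there.\xd3"                    # assumed original bytes (curly quotes in, e.g., Mac Roman)
mangled = original.decode("latin-1").encode("utf-8")
print(mangled)                                     # b'\xc3\x92Hi there.\xc3\x93'
```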
**You'd need to** reverse that last step: convert the text back from UTF-8 into each candidate for the wrongly assumed encoding, *then* decode *those* bytes in turn from each candidate for the original encoding into a normalized encoding such as UTF-8. In the output of *that*, you'd be looking for a first code point of U+201C (`“`), which in UTF-8 encodes as the byte sequence `E2` `80` `9C` (or its equivalent in whatever other encoding you choose to normalize to).
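Here is a rough sketch of that search in Python, assuming a short example list of candidate encodings and a hypothetical file name; in practice you would try whichever encodings are plausible for where the file came from:

```python
ENCODINGS = ["windows-1252", "latin-1", "mac_roman", "cp437", "iso-8859-15"]

with open("document.txt", "rb") as f:              # hypothetical file name
    text = f.read().decode("utf-8")                # undo the (valid) final UTF-8 step

for assumed in ENCODINGS:                          # encoding the converter wrongly assumed
    try:
        original_bytes = text.encode(assumed)      # reverse the bad conversion
    except UnicodeEncodeError:
        continue
    for actual in ENCODINGS:                       # candidate for the file's true encoding
        try:
            candidate = original_bytes.decode(actual)
        except UnicodeDecodeError:
            continue
        if candidate.startswith("\u201c"):         # first code point should be U+201C
            # candidate is already normalized Unicode text;
            # candidate.encode("utf-8") would write it back out as UTF-8.
            print(f"assumed={assumed!r} actual={actual!r}: {candidate!r}")
```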
**Doing so should** give you a UTF-8 file representing the textual content of your [mojibake](https://en.wikipedia.org/wiki/Mojibake)'d file, as well as a good indication of its original encoding.