

Post History

Q&A: Determine encoding of text (score +5 −0, 77%)

posted 9mo ago by Canina · edited 9mo ago by Canina

Answer
#3: Post edited by Canina · 2023-08-24T18:05:35Z (9 months ago)
Determining the encoding of binary data representing text is a notoriously difficult problem, which basically comes down to probabilities.

UTF-8 is one of the few exceptions that I know of, because even absent a byte order mark or out-of-band metadata about the character set and encoding, UTF-8 has a rather specific structure which allows it to be identified with a high degree of certainty through fairly simple heuristics. The longer the text encoded in UTF-8 is, the more accurate this heuristic becomes. (The one notable exception being that by design, all US-ASCII using only binary values 0x00..0x7F is also simultaneously valid UTF-8 representing the same textual content.)

According to GitHub, the content of your file is:

    00000000: c392 4869 2074 6865 7265 2e20 49c3 956d  ..Hi there. I..m
    00000010: 2061 2074 6573 7420 646f 6375 6d65 6e74   a test document
    00000020: c393 0d0a 0d0a c392 546f 7563 68c5 bd2e  ........Touch...
    00000030: c393 20                                  .. 

In UTF-8, the byte sequence `C3` `92` encodes the code point U+00D2, which really is the glyph `Ò` (which can also be written as U+004F U+0300, which would become the byte sequence `4F` `CC` `80`).

So what you're seeing is correct, as far as anyone could tell without further context. It appears that somewhere along the way, *some process* has converted *some text data* into UTF-8 and, in doing so, made an invalid assumption about the input data's encoding.

**You'd need to** reverse that last step, converting back from UTF-8 into each candidate encoding, *then* converting *that* in turn back from the now-known candidate encoding to a normalized encoding such as UTF-8. In the output of *that*, you'd be looking for a first code point of U+201C (`“`), which in UTF-8 encodes as the byte sequence `E2` `80` `9C` (or its equivalent in whatever other encoding you choose to normalize to).

**Doing so should** give you a UTF-8 file representing the textual content of your [mojibake](https://en.wikipedia.org/wiki/Mojibake)'d file, as well as a good indication of its original encoding.
#2: Post edited by Canina · 2023-08-24T18:05:01Z (9 months ago)
#1: Initial revision by Canina · 2023-08-24T18:03:33Z (9 months ago)
Determining the encoding of binary data representing text is a notoriously difficult problem, which basically comes down to probabilities.

UTF-8 is one of the few exceptions that I know of, because even absent a byte order mark or out-of-band metadata about the character set and encoding, UTF-8 has a rather specific structure which allows it to be identified with a high degree of certainty through fairly simple heuristics. The longer the text encoded in UTF-8 is, the more accurate this heuristic becomes. (The one notable exception being that all US-ASCII using only binary values 0x00..0x7F is also simultaneously valid UTF-8.)
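As a minimal sketch of that heuristic, a strict UTF-8 decoder can do the work directly; in Python this comes down to attempting a decode (function name here is illustrative):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8.

    Because UTF-8's multi-byte structure is strict, a clean decode of
    any non-trivial input is strong evidence the data really is UTF-8
    (or pure US-ASCII, which is by design a subset of UTF-8).
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("Touch\u017d".encode("utf-8")))  # True
print(looks_like_utf8(b"\xd2Hi"))  # False: 0xD2 is a lead byte with no continuation byte
```

Note that the second call fails precisely because a legacy single-byte encoding put 0xD2 in front of plain ASCII, which is not a valid UTF-8 sequence.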

According to GitHub, the content of your file is:

    00000000: c392 4869 2074 6865 7265 2e20 49c3 956d  ..Hi there. I..m
    00000010: 2061 2074 6573 7420 646f 6375 6d65 6e74   a test document
    00000020: c393 0d0a 0d0a c392 546f 7563 68c5 bd2e  ........Touch...
    00000030: c393 20                                  .. 

In UTF-8, the byte sequence `C3` `92` encodes the code point U+00D2, which really is the glyph `Ò` (which can also be written as U+004F U+0300, which would become the byte sequence `4F` `CC` `80`).
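Each of those claims can be checked directly in Python; the `unicodedata.normalize` call produces the decomposed form mentioned above:

```python
import unicodedata

# C3 92 decodes, as UTF-8, to the precomposed code point U+00D2 (Ò).
assert b"\xc3\x92".decode("utf-8") == "\u00d2"

# NFD normalization decomposes it into U+004F (O) plus U+0300 (combining grave).
decomposed = unicodedata.normalize("NFD", "\u00d2")
assert decomposed == "\u004f\u0300"

# That decomposed form encodes, in UTF-8, as the bytes 4F CC 80.
assert decomposed.encode("utf-8") == b"\x4f\xcc\x80"
```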

So what you're seeing is correct, as far as anyone could tell without further context. It appears that somewhere along the way, *some process* has converted *some text data* into UTF-8 and, in doing so, made an invalid assumption about the input data's encoding.

**You'd need to** reverse that last step, converting back from UTF-8 into each candidate encoding, *then* converting *that* in turn back from the now-known candidate encoding to a normalized encoding such as UTF-8. In the output of *that*, you'd be looking for a first code point of U+201C (`“`), which in UTF-8 encodes as the byte sequence `E2` `80` `9C` (or its equivalent in whatever other encoding you choose to normalize to).
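The loop below sketches that search in Python. The candidate lists, function name, and sample string are illustrative, not exhaustive; in this particular sample, text that was really mac_roman was misread as latin-1 before being converted to UTF-8:

```python
MISREAD_AS = ["latin-1", "cp1252", "mac_roman"]   # encodings the converter may have assumed
REALLY_WAS = ["mac_roman", "cp1252", "cp437"]     # encodings the data may actually have been

def candidate_fixes(mojibake: str):
    """Yield (assumed, actual, fixed_text) triples whose round trip
    starts with U+201C, the opening curly quote we expect."""
    for assumed in MISREAD_AS:
        try:
            raw = mojibake.encode(assumed)   # undo the wrong conversion into UTF-8
        except UnicodeEncodeError:
            continue                         # text can't round-trip through this encoding
        for actual in REALLY_WAS:
            try:
                fixed = raw.decode(actual)   # reinterpret in a plausible original encoding
            except UnicodeDecodeError:
                continue
            if fixed.startswith("\u201c"):   # first code point should be the opening quote
                yield assumed, actual, fixed

# The first bytes of the file, decoded as UTF-8, give U+00D2 'Ò':
sample = "\u00d2Hi there."
for assumed, actual, fixed in candidate_fixes(sample):
    print(f"{assumed} -> {actual}: {fixed}")
```

On this sample, the round trip through latin-1 and back out of mac_roman recovers `“Hi there.`, which matches the expected opening quote.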

**Doing so should** give you a UTF-8 file representing the textual content of your [mojibake](https://en.wikipedia.org/wiki/Mojibake)'d file, as well as a good indication of its original encoding.