Post History

Q&A Determine encoding of text

If your goal is to fix your files like David Cary's iconving does, but you can't tell the mis-encodings that transpired to create your text, you can use a little Python and the ftfy library[1] as f...

posted 1y ago by Michael‭ · edited 1y ago by Michael‭

Answer

#3: Post edited by

Michael‭ · 2023-10-24T19:45:57Z (over 1 year ago)
Commentary

If your goal is to fix your files like [David Cary's `iconv`ing does][davidcary], but you _can't tell_ the mis-encodings that transpired to create your text, you can use a little Python and [the `ftfy` library][ftfy][^1] as [found in PyPI][pip] to undo the mess.
> ## Some quick examples
> Here are some examples (found in the real world) of what ftfy can do:
>
> ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>
> ```py
> >>> import ftfy
> >>> ftfy.fix_text('âœ” No problems')
> '✔ No problems'
> ```
> Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
>
> ftfy can fix multiple layers of mojibake simultaneously:
>
> ```py
> >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
> "The Mona Lisa doesn't have eyebrows."
> ```
[^1]: "Fixed that for you"
[davidcary]: https://powerusers.codidact.com/posts/289529/289602#answer-289602
[ftfy]: https://ftfy.readthedocs.io/en/latest/
[pip]: https://pypi.org/project/ftfy/

If your goal is to fix your files the way [David Cary's `iconv`ing][davidcary] does, but you _can't tell_ which mis-encodings produced your text, you can use a little Python and [the `ftfy` library][ftfy][^1], [available on PyPI][pip], to undo the mess.
> ## Some quick examples
> Here are some examples (found in the real world) of what ftfy can do:
>
> ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>
> ```python
> >>> import ftfy
> >>> ftfy.fix_text('âœ” No problems')
> '✔ No problems'
> ```
> Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
>
> ftfy can fix multiple layers of mojibake simultaneously:
>
> ```python
> >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
> "The Mona Lisa doesn't have eyebrows."
> ```
I learned about `ftfy` several years after I wrote some (much less rigorous) tools to detect and unscramble content that had made its way through one or more different encodings.
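
To see why the quoted claim about UTF-8 holds, the first example can be reproduced by hand with nothing but the standard library. This is only a sketch of a single layer of mojibake, not how `ftfy` works internally: the helper names `garble` and `ungarble` are hypothetical, and it assumes the wrong decoder was Windows-1252, which you would normally not know in advance.

```python
# Sketch only: one layer of mojibake, made and undone with the standard library.
# Assumption: the text was UTF-8 that got decoded as Windows-1252 (cp1252).
# garble() and ungarble() are hypothetical helper names, not part of ftfy.

def garble(text: str, wrong: str = "cp1252") -> str:
    """Encode as UTF-8, then decode with the wrong codec (how mojibake is made)."""
    return text.encode("utf-8").decode(wrong)


def ungarble(text: str, wrong: str = "cp1252") -> str:
    """Undo one layer: re-encode with the wrong codec, then decode as UTF-8."""
    return text.encode(wrong).decode("utf-8")


original = "✔ No problems"
broken = garble(original)

print(broken)            # the mojibake form from the quoted example
print(ungarble(broken))  # the original string again
```

If the guess about the wrong codec is off, `ungarble` will typically raise `UnicodeEncodeError` or `UnicodeDecodeError` rather than return plausible text. That is the property the quoted documentation relies on, and guessing the codec (and the number of layers) is the part `ftfy` automates.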
[^1]: "Fixed that for you"
[davidcary]: https://powerusers.codidact.com/posts/289529/289602#answer-289602
[ftfy]: https://ftfy.readthedocs.io/en/latest/
[pip]: https://pypi.org/project/ftfy/

#2: Post edited by

Michael‭ · 2023-10-24T16:11:37Z (over 1 year ago)
Link David's answer

If your goal is to fix your files like David Cary's `iconv`ing does, but you _can't tell_ the mis-encodings that transpired to create your text, you can use a little Python and the [`ftfy`][ftfy] library[^1] [in PyPi][pip] to undo the mess.
> ## Some quick examples
> Here are some examples (found in the real world) of what ftfy can do:
>
> ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>
> ```py
> >>> import ftfy
> >>> ftfy.fix_text('âœ” No problems')
> '✔ No problems'
> ```
> Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
>
> ftfy can fix multiple layers of mojibake simultaneously:
>
> ```py
> >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
> "The Mona Lisa doesn't have eyebrows."
> ```
[^1]: "Fixed that for you"
[ftfy]: https://ftfy.readthedocs.io/en/latest/
[pip]: https://pypi.org/project/ftfy/

If your goal is to fix your files like [David Cary's `iconv`ing does][davidcary], but you _can't tell_ the mis-encodings that transpired to create your text, you can use a little Python and [the `ftfy` library][ftfy][^1] as [found in PyPI][pip] to undo the mess.
> ## Some quick examples
> Here are some examples (found in the real world) of what ftfy can do:
>
> ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>
> ```py
> >>> import ftfy
> >>> ftfy.fix_text('âœ” No problems')
> '✔ No problems'
> ```
> Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
>
> ftfy can fix multiple layers of mojibake simultaneously:
>
> ```py
> >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
> "The Mona Lisa doesn't have eyebrows."
> ```
[^1]: "Fixed that for you"
[davidcary]: https://powerusers.codidact.com/posts/289529/289602#answer-289602
[ftfy]: https://ftfy.readthedocs.io/en/latest/
[pip]: https://pypi.org/project/ftfy/

#1: Initial revision by

Michael‭ · 2023-10-20T20:55:39Z (over 1 year ago)

If your goal is to fix your files like David Cary's `iconv`ing does, but you _can't tell_ the mis-encodings that transpired to create your text, you can use a little Python and the [`ftfy`][ftfy] library[^1] [in PyPi][pip] to undo the mess.

> ## Some quick examples
> Here are some examples (found in the real world) of what ftfy can do:
> 
> ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
>
> ```py
> >>> import ftfy
> >>> ftfy.fix_text('âœ” No problems')
> '✔ No problems'
> ```
> Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
> 
> ftfy can fix multiple layers of mojibake simultaneously:
>
> ```py
> >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
> "The Mona Lisa doesn't have eyebrows."
> ```

[^1]: "Fixed that for you"

[ftfy]: https://ftfy.readthedocs.io/en/latest/
[pip]: https://pypi.org/project/ftfy/
