Coding Systems in `shell-command`
Problem
The day before yesterday, while editing a Markdown document, I noticed the "Preview" and "Export" commands in the Markdown menu of Emacs. I clicked one and opened the exported HTML file, only to find that almost all of my content was missing. The document was written in Chinese.
I had a separate Emacs instance running in a terminal. I did the same thing there, and the non-ASCII characters came out fine.
Solution
Before this, I had always run `markdown` directly through `shell-command`, and it worked well. So I suspected that something was wrong with the encoding handling in `markdown-mode`.
Then I checked.
Below is the definition of the `markdown` function in `markdown-mode`.
(defun markdown (&optional output-buffer-name)
The `cond` expression has two branches. The first runs `markdown` with `shell-command` directly on the file of the current buffer. The second instead uses `shell-command-on-region` on a region marked in the current buffer.
If I had written `markdown-mode`, I would always use `shell-command-on-region`, because being able to preview part of a Markdown document is definitely a good user experience. I guessed right: `markdown-command-needs-filename` is `nil` by default.
I wanted to fix my encoding problem, so I set `markdown-command-needs-filename` to `t` and previewed again. This time the non-ASCII characters showed up in the exported HTML.
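The workaround amounts to a single line in the init file. This is a minimal sketch, assuming `markdown-mode` is already installed and the buffer is visiting a file on disk:

```elisp
;; Workaround only: tell markdown-mode to pass the file name of the
;; current buffer to the external program instead of piping a region
;; through `shell-command-on-region'.
(setq markdown-command-needs-filename t)
```

The price is the one described above: the preview then always works on the whole file, not on a marked region.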
But I didn't want to stop there. It was only a workaround, and workarounds smell.
There might be some difference in how `shell-command` and `shell-command-on-region` handle encoding that caused this issue.
To specify a coding system for converting non-ASCII characters
Stepping into the `shell-command` function, I saw the comment above. Then I traced into `coding-system-for-read`:
coding-system-for-read is a variable defined in `C source code'.
It looks up the coding system in `file-coding-system-alist` first. Below is mine.
Value: (("\\.dz\\'" no-conversion . no-conversion)
All my documents are utf-8 encoded, so the output of `shell-command` is encoded as utf-8, too.
`shell-command-on-region` is different. Below is its comment.
To specify a coding system for converting non-ASCII characters
Instead, it takes its encoding options from `process-coding-system-alist`, which is `nil` on my machine. So it falls back to `default-process-coding-system`.
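On my machine the value of `default-process-coding-system` is `(undecided-unix . iso-latin-1-unix)`, but in the terminal Emacs instance it is `(utf-8 . utf-8)`. In that pair, the car decodes the output read from a subprocess and the cdr encodes the text sent to it, and iso-latin-1 cannot represent Chinese characters. That is why I got the right output in the terminal instance only.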
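To compare two Emacs instances the same way, the variables involved can be printed from each one. This is a small sketch meant to be evaluated with `M-:` or in `*scratch*`; the values shown in `*Messages*` will differ per machine:

```elisp
;; Per-program overrides; consulted first for process I/O.
(message "process-coding-system-alist: %S" process-coding-system-alist)

;; Fallback pair: (DECODING . ENCODING), i.e. the car decodes output read
;; from a subprocess and the cdr encodes text sent to it.
(message "default-process-coding-system: %S" default-process-coding-system)

;; Buffer-local: evaluate this one while the Markdown buffer is current.
(message "buffer-file-coding-system: %S" buffer-file-coding-system)
```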
Adding the line below to my dot emacs solved the problem.
(setq default-process-coding-system '(utf-8 . utf-8))
Or,
(set-language-environment "UTF-8")
I discussed this problem with Xah Lee. He mentioned that `default-process-coding-system` in his environment is utf-8, probably because he calls `set-language-environment` with UTF-8. I tested it, and it worked. I prefer this setting, because it looks environmentally global. :)
Complaint
Emacs supports nearly every encoding, and there are dozens of (default) coding-system variables for different modes that may need to be set.
Imagine a package that relies on a built-in function, and that function looks up its coding system in some built-in variables. A user may not even know these variables exist, so it is hard for them to find out what happened when the output comes out garbled.
For an amazingly good editor (or an operating system :)), supporting all kinds of encodings is necessary. But in a modern editor, people should not need to know what is under the hood every time they configure something.
We could move this responsibility to the packages that people use (some people call them plugins or extensions).
For example, a package can set the output coding system to match the input before calling built-in functions that support many coding systems.
A package can provide its own option for setting the coding system, so that people do not have to figure out what is under the hood.
A package can even restrict the coding system to `utf-8` and leave its users nothing to worry about.
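A package could cover all three ideas with one option and one wrapper. The sketch below is an illustration, not code from `markdown-mode`: the names `my-md-coding-system` and `my-md-run-on-region` are hypothetical, and it assumes that binding `coding-system-for-read` and `coding-system-for-write` around the call is enough to override the process coding-system lookup.

```elisp
;; Hypothetical package option: utf-8 by default, nil to follow the buffer.
(defcustom my-md-coding-system 'utf-8
  "Coding system for text sent to and read from the external command.
If nil, reuse the current buffer's `buffer-file-coding-system'."
  :type '(choice (const :tag "Same as buffer" nil) coding-system)
  :group 'processes)

;; Hypothetical wrapper around the built-in `shell-command-on-region'.
(defun my-md-run-on-region (start end command output-buffer)
  "Run COMMAND on the region START..END and put the result in OUTPUT-BUFFER.
Input and output use the same coding system, so the result does not
depend on `default-process-coding-system'."
  (let* ((cs (or my-md-coding-system buffer-file-coding-system 'utf-8))
         ;; Both variables are special, so these bindings are dynamic and
         ;; last only for the duration of the call below.
         (coding-system-for-write cs)   ; encodes the region sent to COMMAND
         (coding-system-for-read cs))   ; decodes what COMMAND prints
    (shell-command-on-region start end command output-buffer)))
```

A call such as `(my-md-run-on-region (point-min) (point-max) "markdown" "*markdown-output*")` would then behave the same in every Emacs instance, assuming a `markdown` executable is on the PATH.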
I think all of this is called ABSTRACTION.