Coding Systems in `shell-command`

Posted on 2012-11-22

Problem

The day before yesterday, while editing a Markdown document, I noticed the "Preview" and "Export" commands in the Markdown menu of Emacs. I clicked one and opened the exported HTML file, only to find that almost all of my content was missing. The document was written in Chinese.

I had a separate Emacs instance running in a terminal. I did the same thing there, and the non-ASCII characters were output correctly.

Solution

Until then, I had always run markdown directly through shell-command, and it worked well. I suspected something was wrong with how markdown-mode handled encoding, so I checked.

Below is the definition of the markdown function in markdown-mode.

(defun markdown (&optional output-buffer-name)
  "Run 'markdown' on current buffer and insert output in buffer given by
'output-buffer-name' (defaults to `markdown-output-buffer-name').  Return the
OUTPUT-BUFFER used."
  (interactive)
  (save-window-excursion
    (let ((begin-region)
          (end-region))
      (if (and (boundp 'transient-mark-mode) transient-mark-mode mark-active)
          (setq begin-region (region-beginning)
                end-region (region-end))
        (setq begin-region (point-min)
              end-region (point-max)))

      (unless output-buffer-name
        (setq output-buffer-name markdown-output-buffer-name))

      (cond
       ;; Handle case when `markdown-command' does not read from stdin
       (markdown-command-needs-filename
        (if (not buffer-file-name)
            (error "Must be visiting a file")
          (shell-command (concat markdown-command " "
                                 (shell-quote-argument buffer-file-name))
                         output-buffer-name)))
       ;; Pass region to `markdown-command' via stdin
       (t
        (shell-command-on-region begin-region end-region markdown-command
                                 output-buffer-name))))
    output-buffer-name))

The cond expression has two branches. The first runs markdown with shell-command directly on the file the current buffer is visiting. The second uses shell-command-on-region on the active region, or on the whole buffer when no region is active.

If I had written markdown-mode, I would always use shell-command-on-region, because being able to preview part of a Markdown document is a good user experience. Sure enough, markdown-command-needs-filename is nil by default.

To fix my encoding problem, I set it to t and ran "Preview" again. This time the non-ASCII characters showed up in the exported HTML.
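The workaround amounts to a single setting. A minimal sketch, assuming it goes into the init file and that markdown-mode is the version quoted above:

```elisp
;; Make `markdown' pass the visited file name to the command
;; instead of piping the buffer text through stdin.
;; This only helps when the buffer is visiting a saved file.
(setq markdown-command-needs-filename t)
```

With this set, the function takes the first cond branch, so previewing only a region is no longer possible.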

But I didn't want to stop there. It was a workaround, and workarounds smell.

There might be some difference in how shell-command and shell-command-on-region handle encoding, and that could be what caused the issue.

To specify a coding system for converting non-ASCII characters
in the shell command output, use C-x RET c before this command.

Noninteractive callers can specify coding systems by binding
`coding-system-for-read' and `coding-system-for-write'.

Looking into the shell-command function, I found the documentation above. I then looked up coding-system-for-read:

coding-system-for-read is a variable defined in `C source code'.
Its value is nil

Documentation:
Specify the coding system for read operations.
It is useful to bind this variable with `let', but do not set it globally.
If the value is a coding system, it is used for decoding on read operation.
If not, an appropriate element is used from one of the coding system alists.
There are three such tables: `file-coding-system-alist',
`process-coding-system-alist', and `network-coding-system-alist'.
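The advice to bind these variables with `let` can be sketched as follows. The helper name and the output buffer name are hypothetical, and the example assumes UTF-8 is the coding system wanted in both directions:

```elisp
;; Hypothetical helper: the function name and buffer name are
;; illustrative and not part of Emacs or markdown-mode.
(defun my/shell-command-on-region-utf-8 (begin end command)
  "Pipe the text between BEGIN and END through COMMAND, using UTF-8 both ways.
Return the name of the buffer that receives the output."
  (let ((coding-system-for-write 'utf-8)  ; encodes the text sent to COMMAND
        (coding-system-for-read 'utf-8)   ; decodes what COMMAND prints
        (output-buffer "*my-utf-8-output*"))
    (shell-command-on-region begin end command output-buffer)
    output-buffer))
```

Calling `(my/shell-command-on-region-utf-8 (point-min) (point-max) "markdown")` would then run the whole buffer through a `markdown` executable, assuming one is on the PATH. Because the bindings are local to the `let`, nothing is changed globally.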

One of the tables it consults is file-coding-system-alist. Here is mine:

Value: (("\\.dz\\'" no-conversion . no-conversion)
 ...
("\\.elc\\'" . utf-8-emacs)
("\\.utf\\(-8\\)?\\'" . utf-8)
("\\.xml\\'" . xml-find-file-coding-system)
 ...
("" undecided))

In the shell-command branch, markdown reads the file from disk itself, and all my documents are UTF-8 encoded, so the output of shell-command comes back as UTF-8, too.

shell-command-on-region is different. Here is its documentation:

To specify a coding system for converting non-ASCII characters
in the input and output to the shell command, use C-x RET c
before this command.  By default, the input (from the current buffer)
is encoded using coding-system specified by `process-coding-system-alist',
falling back to `default-process-coding-system' if no match for COMMAND
is found in `process-coding-system-alist'.

It takes its coding system from process-coding-system-alist instead, which is nil on my machine, so it falls back to default-process-coding-system.

On my machine, the value of default-process-coding-system is

(undecided-unix . iso-latin-1-unix)

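but in the terminal Emacs instance it is (utf-8 . utf-8), which is why I got the right output there. The cdr of the pair is the coding system used to encode text sent to the process, and iso-latin-1-unix cannot represent Chinese characters, which would explain why my content went missing.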
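To compare two Emacs instances the way I did, it helps to print both variables side by side. A small sketch, where the command name is hypothetical:

```elisp
;; Hypothetical diagnostic command; the name is illustrative.
(defun my/show-process-coding-systems ()
  "Echo the variables that decide how process input and output are coded."
  (interactive)
  (message "process-coding-system-alist: %S\ndefault-process-coding-system: %S"
           process-coding-system-alist
           default-process-coding-system))
```

Running `M-x my/show-process-coding-systems` in each instance shows at a glance whether the defaults differ.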

Adding the line below to my .emacs solved the problem.

(setq default-process-coding-system '(utf-8 . utf-8))

Or:

(set-language-environment "UTF-8")

I discussed this problem with Xah Lee. He mentioned that default-process-coding-system is utf-8 in his environment, probably because he calls set-language-environment with "UTF-8". I tested it, and it worked. I prefer this setting because it applies to the whole environment. :)

Complaint

Emacs supports nearly every encoding, and there are dozens of (default) coding-system variables across different modes that may need to be set.

Imagine a package that relies on a built-in function, and that function looks up its coding system in some built-in variables. A user may not even know these variables exist, so it will be hard for them to work out what happened when they get garbled output.

For an amazingly good editor, or an operating system :), supporting all kinds of encodings is necessary. But in a modern editor, people shouldn't need to know what's under the hood every time they configure something.

We could move this responsibility to the packages that people use (some people call them plugins or extensions).

For example, a package can set the output coding system to match the input before calling built-in functions that support different coding systems.
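A rough sketch of that idea, assuming the buffer's own buffer-file-coding-system is a sensible choice for both directions. The function name is hypothetical, and the fallback to utf-8 is my assumption:

```elisp
;; Hypothetical wrapper; not part of markdown-mode.
(defun my/shell-command-on-region-same-coding (begin end command output-buffer)
  "Run COMMAND on the text between BEGIN and END, keeping the buffer's coding.
The output goes to OUTPUT-BUFFER."
  (let* ((coding (or buffer-file-coding-system 'utf-8)) ; assumed fallback
         (coding-system-for-write coding)   ; input to COMMAND
         (coding-system-for-read coding))   ; output from COMMAND
    (shell-command-on-region begin end command output-buffer)))
```

A buffer whose coding system is still undecided would need extra handling, which this sketch leaves out.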

A package can also provide its own option for setting the coding system, so people don't have to figure out what's under the hood.
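Such an option could look roughly like this. Everything prefixed with `my-markdown` is hypothetical and only illustrates the shape. It is not how markdown-mode is implemented:

```elisp
;; Hypothetical package-level option, for illustration only.
(defgroup my-markdown nil
  "Illustrative settings for a Markdown helper package."
  :group 'tools)

(defcustom my-markdown-coding-system 'utf-8
  "Coding system used for text sent to and read from the markdown command."
  :type 'coding-system
  :group 'my-markdown)

(defun my-markdown-run (begin end command output-buffer)
  "Run COMMAND on the region from BEGIN to END using `my-markdown-coding-system'.
The output goes to OUTPUT-BUFFER."
  (let ((coding-system-for-write my-markdown-coding-system)
        (coding-system-for-read my-markdown-coding-system))
    (shell-command-on-region begin end command output-buffer)))
```

The user then has one clearly named setting to change through Customize, and the default of utf-8 covers the common case without any configuration.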

We could even restrict the coding system to utf-8 and leave the people who use the package nothing to worry about.

I think all of this is called ABSTRACTION.