Sunday, February 28, 2010

A Better Comment Script

A couple of years ago, I presented a shell script for commenting out lines, for use in a LaTeX mode for SubEthaEdit. The script is an improvement over the AppleScript approach used in other SubEthaEdit modes, but does something I don't really like: it always inserts or removes comments at the beginning of the lines, rather than at an appropriate indentation level.

Below, I present line-comment, an awk script that handles line-oriented comments, taking indentation into account. Lines to comment or uncomment are read from Standard Input, and the processed lines are written to Standard Output. The script uses the current commenting of the lines to determine whether to comment or uncomment.

There are two options that can be set. First, there is the TabWidth, which defines an indentation level; this defaults to the (basically useless) Unix standard of an eight-space tab. You'll almost always want to set this, even if you're using tabs, not spaces, for indentation. Second, there is the CommentString, whose meaning should be obvious; this defaults to the hash character # common to many programming languages.

As an example, a nice choice for Python could be line-comment TabWidth=4, while for Scala line-comment TabWidth=2 CommentString="//" would be more suitable.

Update: If the script doesn't seem to work, try setting the environment variable COMMAND_MODE=unix2003. This is relevant if you want to call it from SubEthaEdit: SEE uses COMMAND_MODE=legacy, which can cause the regular expressions to fail to match.

The script is unchanged:
#! /usr/bin/awk -f

function trimIndent(text, indRE, n, tokens) {
# Returns the text with the indentation removed. Sets
# global variable IndentLevel to show how many
# indentation levels were removed.
n = split(text, tokens, indRE)
if (n > 1) {
IndentLevel = n-1
rest = tokens[n]
} else {
IndentLevel = 0
rest = text
}
return rest
}

function commentIndex(txt, commtxt, n) {
n = index(txt, commtxt)
if (n > 0 && substr(txt, 1, n-1) !~ /^[ \t]*$/) {
n = 0
}
return n
}

function noncommentPrefix(txt) {
return match(txt, /^[ \t]*/) ? substr(txt, 1, RLENGTH) : ""
}

function multiString(str, mult, n, mstr) {
mstr = ""
for (n=0; n<mult; n++) {
mstr = mstr str
}
return mstr
}

function offsetString(offset) {
return multiString(" ", offset)
}

BEGIN {
TabWidth=8
CommentString="#"
}

NR == 1 {
# Establish regex based on tab settings. This comes after
# the BEGIN to allow the TabWidth to be overriden.
indentRegex = "( {0,"(TabWidth-1)"}\t| {"TabWidth"})"
# indentRegex = "( {0,3}\t|"offsetString(TabWidth)")"
}

{
# Common processing for all lines. Divide the line into a
# prefix of whitespace, followed immediately by the
# comment string, if present, or a non-tab, non-space
# character if not. The prefix consists of indentation
# steps followed by an offset, defined as a number of spaces
# insufficient to constitute an indentation step.
Line[NR] = $0
commInd = commentIndex($0, CommentString)
if (commInd > 0) {
prefix = substr($0, 1, commInd-1)
CommentPosition[NR] = commInd
} else {
prefix = noncommentPrefix($0)
}
offset = length(trimIndent(prefix, indentRegex))
}

NR == 1 {
BaseOffset = offset
BaseIndent = IndentLevel
MinOffset = BaseOffset
MinIndent = BaseIndent
}

NR > 1 {
if (IndentLevel < MinIndent ) {
MinIndent = IndentLevel
MinOffset = 0
} else if (offset < MinOffset) {
MinOffset = offset
}
}

END {
commLen = length(CommentString)
if (length(CommentPosition) == length(Line) && MinIndent == BaseIndent && MinOffset == BaseOffset) {
for (n=1; n<=NR; n++) {
commPos = CommentPosition[n]
print substr(Line[n], 1, commPos-1) substr(Line[n], commLen+commPos)
}
} else {
indentPart = MinIndent ? indentRegex"{"MinIndent"}" : ""
# indentPart = multiString(indentRegex, MinIndent)
offsetPart = offsetString(MinOffset)
# offsetPart = offsetString(MinOffset)
commRegex = "^" indentPart offsetPart
for (n=1; n<=NR; n++) {
match(Line[n], commRegex)
print substr(Line[n], 1, RLENGTH) CommentString substr(Line[n], 1+RLENGTH)
}
}
}

No comments: