Description
I am trying to search for words containing non-ASCII characters, and this does not seem to be possible:
ack '\büber'
ack -w über
The first should find exactly the lines containing a word starting with über, and the second should find exactly the lines where über is a single word, shouldn't they? The texts are UTF-8. Dropping the boundaries gives thousands of results, as does searching with pcregrep.
The first command also returns lines containing words that merely contain über (such as darüberhinaus), and the second also returns lines containing words ending in über (such as darüber). This seems to suggest that the boundary matches before ü, i.e. ü is not counted as a word character (but should be).
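To illustrate what I suspect is going on, here is a minimal Perl sketch. It assumes that the pattern and the input lines are handled as raw UTF-8 bytes and never decoded; I have not verified that this is what ack does internally. The helper matches() and the sample strings are made up for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as a program sees them when input is not decoded.
my $bytes_line    = "dar\xC3\xBCber hinaus";   # bytes of "darüber hinaus"
my $bytes_pattern = "\\b\xC3\xBCber";          # bytes of '\büber'

# The same data decoded into Perl character strings.
my $chars_line    = decode('UTF-8', $bytes_line);
my $chars_pattern = decode('UTF-8', $bytes_pattern);

# Hypothetical helper: returns 1 if $pattern matches $line, else 0.
sub matches {
    my ($pattern, $line) = @_;
    return $line =~ /$pattern/ ? 1 : 0;
}

# Bytes: 0xC3 is not a word character, so \b matches between "r" and "ü".
print "bytes: ", matches($bytes_pattern, $bytes_line), "\n";   # bytes: 1
# Characters: "ü" is a word character, so there is no boundary inside "darüber".
print "chars: ", matches($chars_pattern, $chars_line), "\n";   # chars: 0
```

With undecoded bytes the pattern matches inside darüber, which is the false positive described above. With decoded strings it does not.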
Locale is set to "de_DE.UTF-8", but unsetting it does not change anything.
(Tested with ack 2.12 / perl 5.18.2 on Ubuntu 14.04, and ack 2.14 / perl 5.22 on Mac OS X 10.11.)
Matches can be made correct, as far as I can see, by adding
use feature 'unicode_strings'; # optional?
use re "/u";
to the beginning of the ack script (probably also when added to the library). Maybe this can be made an option for non-ASCII-ists?
Switching to Unicode processing would probably also help to attack #262.
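A small sketch of the effect of the re pragma on undecoded byte strings. The subs match_default() and match_unicode() and the sample lines are hypothetical and only serve the demonstration. As far as I understand it, under /u each byte is treated as a Latin-1 code point, so the lead byte 0xC3 counts as the letter Ã and therefore as a word character. That would explain why the boundaries come out right for ü, but I would not assume it holds for every non-ASCII character.

```perl
use strict;
use warnings;

my $line_inside = "dar\xC3\xBCber hinaus";   # bytes of "darüber hinaus"
my $line_start  = "weit \xC3\xBCber uns";    # bytes of "weit über uns"
my $pattern     = "\\b\xC3\xBCber";          # bytes of '\büber'

# Default semantics: bytes above 127 are not word characters.
sub match_default {
    my ($p, $l) = @_;
    return $l =~ /$p/ ? 1 : 0;
}

{
    # The pragma is lexically scoped, so it only affects this block.
    use re '/u';

    sub match_unicode {
        my ($p, $l) = @_;
        return $l =~ /$p/ ? 1 : 0;
    }
}

print "default, inside word: ", match_default($pattern, $line_inside), "\n";  # 1 (false positive)
print "/u, inside word:      ", match_unicode($pattern, $line_inside), "\n";  # 0
print "/u, start of word:    ", match_unicode($pattern, $line_start),  "\n";  # 1
```

Decoding the input and the pattern as UTF-8 would presumably be the more robust fix, but that is a larger change than adding the pragma.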