Word boundaries with non-ASCII character #267

Open
@teoric

I am trying to search for a word containing non-ASCII characters, but word boundaries do not behave as expected:

ack '\büber'
ack -w über

The first should find exactly the lines containing a word starting with über, and the second exactly the lines where über is a word of its own, shouldn't they? The texts are UTF-8; dropping the boundaries gives thousands of results, as does searching with pcregrep.

The first command also returns lines containing words that merely contain über (such as darüberhinaus), and the second also returns lines containing words ending in über (such as darüber). This suggests that the boundary matches before ü, i.e. ü is not counted as a word character (but it should be).
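The boundary behaviour described above can be reproduced in plain Perl, independent of ack. The following is a minimal sketch, assuming the input is read as undecoded UTF-8 bytes (whether ack does exactly this internally is an assumption, and the variable names are only illustrative). It compares matching on raw bytes with matching on decoded characters under Unicode rules.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two sample lines as raw UTF-8 bytes; "ü" is the byte pair C3 BC.
my $inside_bytes = "dar\xC3\xBCberhinaus";   # "darüberhinaus"
my $start_bytes  = "die \xC3\xBCberall";     # "die überall"

# Byte-level matching with Perl's default rules: bytes above 127 are
# not \w, so \b "sees" a boundary between "r" and the first byte of "ü".
my $bytes_inside = $inside_bytes =~ /\b\xC3\xBCber/ ? 1 : 0;
my $bytes_start  = $start_bytes  =~ /\b\xC3\xBCber/ ? 1 : 0;

# Character-level matching: decode first, then match with Unicode rules (/u).
my $inside_chars = decode('UTF-8', $inside_bytes);
my $start_chars  = decode('UTF-8', $start_bytes);
my $chars_inside = $inside_chars =~ /\b\x{FC}ber/u ? 1 : 0;
my $chars_start  = $start_chars  =~ /\b\x{FC}ber/u ? 1 : 0;

print "bytes: inside=$bytes_inside start=$bytes_start\n";
print "chars: inside=$chars_inside start=$chars_start\n";
```

On raw bytes the pattern matches inside darüberhinaus but not at the real word start in "die überall"; on decoded characters it is the other way round, which is the behaviour one would expect from `\b`.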

Locale is set to "de_DE.UTF-8", but unsetting it does not change anything.

(ack 2.12 / perl 5.18.2 with Ubuntu 14.04, and ack 2.14 / perl 5.22 on Mac OS X 10.11)

As far as I can see, the matches become correct when the following is added to the beginning of the ack script (and probably also when it is added to the library):

use feature 'unicode_strings'; # optional?
use re "/u";

Maybe this could be made an option for non-ASCII-ists? Switching to Unicode processing would probably also help with #262.
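To illustrate what the `re` pragma changes, here is a small sketch in plain Perl (not ack's actual code; the sample strings and variable names are assumptions). It applies `use re '/u'` in a lexical scope to undecoded UTF-8 bytes, and also shows one case where, in this sketch, only decoding the input gives the expected result.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $line = "dar\xC3\xBCberhinaus";    # "darüberhinaus" as UTF-8 bytes

# Default rules: byte C3 is not \w, so \b matches before it.
my $default_hit = $line =~ /\b\xC3\xBCber/ ? 1 : 0;

my ($unicode_hit, $trailing_hit);
{
    use re '/u';    # Unicode rules for every regex in this block

    # Byte C3 is now read as the letter U+00C3, which is \w,
    # so there is no boundary between "r" and it.
    $unicode_hit = $line =~ /\b\xC3\xBCber/ ? 1 : 0;

    # Caveat: the second byte of "é" (A9) is read as U+00A9, which is
    # not \w, so a spurious boundary appears inside "cafés".
    my $cafes = "caf\xC3\xA9s";
    $trailing_hit = $cafes =~ /\bcaf\xC3\xA9\b/ ? 1 : 0;
}

# Decoding the bytes into characters avoids the spurious boundary.
my $cafes_chars = decode('UTF-8', "caf\xC3\xA9s");
my $decoded_hit = $cafes_chars =~ /\bcaf\x{E9}\b/u ? 1 : 0;

print "default=$default_hit /u=$unicode_hit\n";
print "trailing(/u on bytes)=$trailing_hit decoded=$decoded_hit\n";
```

In this sketch the pragma alone is enough for the über examples, while a word ending in a non-ASCII character still matches incorrectly unless the input is decoded first.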
