Word boundaries with non-ASCII character

I am trying to search for a word containing non-ASCII characters, but word boundaries do not behave as expected:

```
ack '\büber'
ack -w über
```

The first should find exactly the lines containing a word _starting with_ `über`, and the second exactly the lines where `über` is a _single word_, shouldn't they? Texts are UTF-8, and dropping the boundaries gives thousands of results, as does searching with `pcregrep`.

The first command also returns lines containing words that merely _contain_ `über` (such as `darüberhinaus`), and the second also returns lines containing words _ending in_ `über` (such as `darüber`). This suggests that the boundary matches before `ü`, i.e. `ü` is not counted as a word character (but should be).

Locale is set to "de_DE.UTF-8", but unsetting it does not change anything.

(ack 2.12 / perl 5.18.2 with Ubuntu 14.04, and ack 2.14 / perl 5.22 on Mac OS X 10.11)

As far as I can see, the matches can be made correct by adding

    use feature 'unicode_strings'; # optional?
    use re '/u';

to the beginning of the ack script (probably also when added to the library). Maybe this could be made an option for non-ASCII-ists?

Switching to Unicode processing would probably also help to tackle beyondgrep/ack3#262.

