groonga-normalizer-mysql
groonga-normalizer-mysql is a Groonga plugin. It provides MySQL-compatible normalizers and custom normalizers for Groonga.
Here are MySQL compatible normalizers:
- NormalizerMySQLGeneralCI for utf8mb4_general_ci
- NormalizerMySQLUnicodeCI for utf8mb4_unicode_ci
- NormalizerMySQLUnicode520CI for utf8mb4_unicode_520_ci
- NormalizerMySQLUnicode900 (deprecated by NormalizerMySQLUnicode) for:
  - utf8mb4_0900_ai_ci (NormalizerMySQLUnicode900)
  - utf8mb4_0900_as_ci (NormalizerMySQLUnicode900("weight_level", 2))
  - utf8mb4_0900_as_cs (NormalizerMySQLUnicode900("weight_level", 3))
  - utf8mb4_ja_0900_as_cs (NormalizerMySQLUnicode900("locale", "ja", "weight_level", 3))
  - utf8mb4_ja_0900_as_cs_ks (NormalizerMySQLUnicode900("locale", "ja", "weight_level", 4))
- NormalizerMySQLUnicode for:
  - utf8mb4_0900_ai_ci (NormalizerMySQLUnicode("version", "9.0.0"))
  - utf8mb4_0900_as_ci (NormalizerMySQLUnicode("version", "9.0.0", "accent_sensitive", true))
  - utf8mb4_0900_as_cs (NormalizerMySQLUnicode("version", "9.0.0", "accent_sensitive", true, "case_sensitive", true))
  - utf8mb4_ja_0900_as_cs (NormalizerMySQLUnicode("version", "9.0.0", "accent_sensitive", true, "case_sensitive", true, "locale", "ja"))
  - utf8mb4_ja_0900_as_cs_ks (NormalizerMySQLUnicode("version", "9.0.0", "accent_sensitive", true, "case_sensitive", true, "locale", "ja", "kana_sensitive", true))
  - utf8mb4_uca1400_ai_ci (NormalizerMySQLUnicode("version", "14.0.0"))
  - utf8mb4_uca1400_ai_cs (NormalizerMySQLUnicode("version", "14.0.0", "case_sensitive", true))
  - utf8mb4_uca1400_as_ci (NormalizerMySQLUnicode("version", "14.0.0", "accent_sensitive", true))
  - utf8mb4_uca1400_as_cs (NormalizerMySQLUnicode("version", "14.0.0", "accent_sensitive", true, "case_sensitive", true))
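The NormalizerMySQLUnicode options in the list above follow a regular pattern, which the Python sketch below tries to make explicit. It is illustrative only: `unicode_normalizer`, `COLLATION_OPTIONS` and `normalizer_for` are hypothetical names, not something the plugin provides, and the option values are simply transcribed from the list above.

```python
# Sketch only: these helpers are hypothetical and are NOT part of
# groonga-normalizer-mysql. They just assemble the specification strings
# that appear in the collation list.

def unicode_normalizer(version, *, accent_sensitive=False, case_sensitive=False,
                       locale=None, kana_sensitive=False):
    """Build a NormalizerMySQLUnicode(...) specification string."""
    args = ['"version"', f'"{version}"']
    if accent_sensitive:
        args += ['"accent_sensitive"', "true"]
    if case_sensitive:
        args += ['"case_sensitive"', "true"]
    if locale is not None:
        args += ['"locale"', f'"{locale}"']
    if kana_sensitive:
        args += ['"kana_sensitive"', "true"]
    return "NormalizerMySQLUnicode(" + ", ".join(args) + ")"


# Options per collation, transcribed from the list in this README.
COLLATION_OPTIONS = {
    "utf8mb4_0900_ai_ci": dict(version="9.0.0"),
    "utf8mb4_0900_as_ci": dict(version="9.0.0", accent_sensitive=True),
    "utf8mb4_0900_as_cs": dict(version="9.0.0", accent_sensitive=True,
                               case_sensitive=True),
    "utf8mb4_ja_0900_as_cs": dict(version="9.0.0", accent_sensitive=True,
                                  case_sensitive=True, locale="ja"),
    "utf8mb4_ja_0900_as_cs_ks": dict(version="9.0.0", accent_sensitive=True,
                                     case_sensitive=True, locale="ja",
                                     kana_sensitive=True),
    "utf8mb4_uca1400_ai_ci": dict(version="14.0.0"),
    "utf8mb4_uca1400_ai_cs": dict(version="14.0.0", case_sensitive=True),
    "utf8mb4_uca1400_as_ci": dict(version="14.0.0", accent_sensitive=True),
    "utf8mb4_uca1400_as_cs": dict(version="14.0.0", accent_sensitive=True,
                                  case_sensitive=True),
}


def normalizer_for(collation):
    """Return the normalizer specification listed for a collation name."""
    return unicode_normalizer(**COLLATION_OPTIONS[collation])


print(normalizer_for("utf8mb4_0900_as_ci"))
# prints: NormalizerMySQLUnicode("version", "9.0.0", "accent_sensitive", true)
```

The resulting string is what would be passed as the normalizer when creating a table.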
Here are custom normalizers:
- NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark
  - It's based on NormalizerMySQLUnicodeCI
- NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark
  - It's based on NormalizerMySQLUnicode520CI
Their names are self-descriptive but long. They are variants of
NormalizerMySQLUnicodeCI and NormalizerMySQLUnicode520CI and behave
differently from them. The differences are listed below. They are
described in terms of
NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark, but they
also hold for
NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark.
NormalizerMySQLUnicodeCI normalizes all small Hiragana such as ぁ and っ to regular Hiragana such as あ and つ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark normalizes neither ぁ to あ nor っ to つ. ぁ and あ are different characters, and so are っ and つ. This behavior is described by ExceptKanaCI in the long name.

The following behaviors are described by ExceptKanaWithVoicedSoundMark in the long name:

- NormalizerMySQLUnicodeCI normalizes all Hiragana with a voiced sound mark such as が to Hiragana without a voiced sound mark such as か. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize が to か. が and か are different characters.
- NormalizerMySQLUnicodeCI normalizes all Hiragana with a semi-voiced sound mark such as ぱ to Hiragana without a semi-voiced sound mark such as は. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ぱ to は. ぱ and は are different characters.
- NormalizerMySQLUnicodeCI normalizes all Katakana with a voiced sound mark such as ガ to Katakana without a voiced sound mark such as カ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ガ to カ. ガ and カ are different characters.
- NormalizerMySQLUnicodeCI normalizes all Katakana with a semi-voiced sound mark such as パ to Katakana without a semi-voiced sound mark such as ハ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize パ to ハ. パ and ハ are different characters.
- NormalizerMySQLUnicodeCI normalizes all halfwidth Katakana with a voiced sound mark such as ｶﾞ to halfwidth Katakana without a voiced sound mark such as ｶ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark normalizes all halfwidth Katakana with a voiced sound mark such as ｶﾞ to fullwidth Katakana with a voiced sound mark such as ガ.
NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark and
NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark
are not compatible with MySQL, but they are useful for Japanese
text. For example, ふらつく and ブラック have different
meanings. NormalizerMySQLUnicodeCI treats ふらつく and ブラック as the
same text, but NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark
keeps them distinct.
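The ふらつく / ブラック example can be reproduced with a toy model in Python. This is only a sketch of the idea, under loud assumptions: it does not use the plugin's normalization tables, the small-kana table covers only a few characters, the choice of fullwidth Katakana as the canonical form is arbitrary, and treating Hiragana and Katakana as equal in both functions is an assumption of this sketch. All function names are hypothetical.

```python
import unicodedata

# Toy model only: NOT the tables used by groonga-normalizer-mysql.
# A few small Katakana mapped to their regular-size counterparts.
SMALL_TO_REGULAR = str.maketrans("ァィゥェォッャュョ", "アイウエオツヤユヨ")
SOUND_MARKS = {"\u3099", "\u309A"}  # combining voiced / semi-voiced marks


def to_katakana(text):
    """Map Hiragana U+3041..U+3096 to the Katakana 0x60 code points above."""
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c for c in text
    )


def strip_sound_marks(text):
    """Drop voiced and semi-voiced sound marks (ガ -> カ, パ -> ハ)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if c not in SOUND_MARKS)
    return unicodedata.normalize("NFC", kept)


def fold_like_unicode_ci(text):
    """Rough stand-in for the NormalizerMySQLUnicodeCI kana behavior."""
    text = to_katakana(unicodedata.normalize("NFKC", text))
    return strip_sound_marks(text).translate(SMALL_TO_REGULAR)


def fold_except_kana(text):
    """Rough stand-in for the ...ExceptKanaCIKanaWithVoicedSoundMark variant:
    small kana and sound marks are kept, halfwidth becomes fullwidth."""
    return to_katakana(unicodedata.normalize("NFKC", text))


for fold in (fold_like_unicode_ci, fold_except_kana):
    same = fold("ふらつく") == fold("ブラック")
    print(fold.__name__, "treats the two words as equal:", same)
```

In this model the first function maps both words to the same string, while the second keeps ブ distinct from フ and ッ distinct from ツ, so the two words no longer collide.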
On Debian GNU/Linux or Ubuntu, add the apt-line for the Groonga deb package repository
and install the groonga-normalizer-mysql package:
% sudo apt-get -y install groonga-normalizer-mysql
On AlmaLinux 8, install the Groonga repository package (groonga-release):
% sudo dnf install -y https://packages.groonga.org/almalinux/8/groonga-release-latest.noarch.rpm
Then install the groonga-normalizer-mysql package:
% sudo dnf install -y --enablerepo=epel groonga-normalizer-mysql
On AlmaLinux 9, install the Groonga repository package (groonga-release):
% sudo dnf install -y https://packages.groonga.org/almalinux/9/groonga-release-latest.noarch.rpm
Then install the groonga-normalizer-mysql package:
% sudo dnf install -y --enablerepo=epel groonga-normalizer-mysql
On Amazon Linux 2023, install the Groonga repository package (groonga-release):
% sudo dnf install -y https://packages.groonga.org/amazon-linux/2023/groonga-release-latest.noarch.rpm
Then install the groonga-normalizer-mysql package:
% sudo dnf install -y --enablerepo=epel groonga-normalizer-mysql
With Homebrew, install the groonga package, which includes groonga-normalizer-mysql:
% brew install groonga
On Windows, you need to build from source. Here are the build instructions.
Install the build tools used in the commands below: Visual Studio and CMake.
Download the latest Groonga source from GitHub releases. The source file name has the form groonga-X.Y.Z.zip.
Extract the source and move to the source folder:
> cd ...\groonga-X.Y.Z
groonga-X.Y.Z>
Run CMake. Here is a command line to install Groonga to C:\groonga folder:
groonga-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga
Build:
groonga-X.Y.Z> cmake --build . --config Release
Install:
groonga-X.Y.Z> cmake --build . --config Release --target Install
Download the latest groonga-normalizer-mysql source from GitHub releases. The source file name has the form groonga-normalizer-mysql-X.Y.Z.zip.
Extract the source and move to the source folder:
> cd ...\groonga-normalizer-mysql-X.Y.Z
groonga-normalizer-mysql-X.Y.Z>
IMPORTANT!!!: Set PKG_CONFIG_PATH environment variable:
groonga-normalizer-mysql-X.Y.Z> set PKG_CONFIG_PATH=C:\groonga\local\lib\pkgconfig
Run CMake. Here is a command line to install groonga-normalizer-mysql to the C:\groonga folder:
groonga-normalizer-mysql-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga
Build:
groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release
Install:
groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release --target Install
First, you need to register the normalizers/mysql plugin:
groonga> register normalizers/mysql
Then, you can use NormalizerMySQLGeneralCI and
NormalizerMySQLUnicodeCI as normalizers:
groonga> table_create Lexicon TABLE_PAT_KEY --default_tokenizer TokenBigram --normalizer NormalizerMySQLGeneralCI
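If Groonga runs as an HTTP server rather than at the interactive prompt, the same two commands can be expressed as request URLs. The Python sketch below only builds those URLs and sends nothing. It rests on assumptions not stated in this README: a server at localhost:10041, the `/d/<command>` URL layout, and the parameter names `path`, `name`, `flags`, `default_tokenizer` and `normalizer`. `command_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: a Groonga HTTP server listening on this host and port.
GROONGA_BASE_URL = "http://localhost:10041"


def command_url(command, params=None, base=GROONGA_BASE_URL):
    """Build the URL for one Groonga command (hypothetical helper)."""
    query = urlencode(params or {})
    return f"{base}/d/{command}" + (f"?{query}" if query else "")


# Same commands as in the groonga> session above.
register_url = command_url("register", {"path": "normalizers/mysql"})
table_create_url = command_url("table_create", {
    "name": "Lexicon",
    "flags": "TABLE_PAT_KEY",
    "default_tokenizer": "TokenBigram",
    "normalizer": "NormalizerMySQLGeneralCI",
})

print(register_url)
print(table_create_url)
```

The URLs can then be fetched with any HTTP client, for example `urllib.request.urlopen`, once a server is actually running.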
- Groonga >= 8.0.4
- English: [email protected]
- Japanese: [email protected]
- Alexander Barkov <[email protected]>: The author of MYSQL_SOURCE/strings/ctype-utf8.c.
- ...
- Kouhei Sutou <[email protected]>
LGPLv2 only. See doc/text/lgpl-2.0.txt for details.
This program uses normalization tables defined in the MySQL source code, so
it is a derived work of MYSQL_SOURCE/strings/ctype-utf8.c,
MYSQL_SOURCE/strings/uca900_data.h and
MYSQL_SOURCE/strings/uca900_ja_data.h. This program is under the same
license as those files, which are licensed under LGPLv2 only.
This program also uses a normalization table defined in the MariaDB source code. The table is generated from https://www.unicode.org/Public/UCA/14.0.0/allkeys.txt , so it is licensed under https://www.unicode.org/copyright.html . That license is compatible with LGPLv2 only, so this program can remain under LGPLv2 only.
% editor doc/text/news.md
% rake release
The release task executes the following tasks:
- rake release:version:update
- rake release:tag
- rake dev:version:bump
The Ubuntu package is built and published automatically by GitHub Actions, so we only need to confirm the results of the build and publication on Launchpad.