vgram - V-gram indexing and statistics for PostgreSQL

Key idea

contrib/pg_trgm indexes substrings whose length is 3. However, some 3-grams appear to be very frequent, consider the, which is a very common trigram in English. Additionally, some 2-grams are rare enough to index them separately, such as zz. The idea of this module is to index q-grams of variable q, providing optimal frequency based on the real-life distribution.

Collecting V-gram statistics

The first thing you need to do is to collect the statistics of frequent q-grams. Based on these statistics, we know that everything longer is indexable. The qgram_stat() SQL aggregate function serves for this purpose.

qgram_stat(s text, min_q int, max_q int, threshold float8) returns text[]

Where min_q and max_q are the minimal and maximal length of q-gram to be extracted. threshold is a threshold frequency, everything with a frequency of this or greater wouldn't be indexed. The result is an array of frequent q-grams. This array is to be passed as the vgram option of vgram_gin_ops opclass, along with min_q and max_q. You can also check how vgram extraction works using the get_vgrams() SQL function.

See following example to collect V-gram statistics using dblp_titles.s column, then use these statistics to build an index and accelerate queries containing LIKE/ILIKE expressions.

-- Collect V-gram statistics
# SELECT qgram_stat(s, 2, 4, 0.05) AS vgrams FROM titles;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             vgrams
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {$a,$a$,$an,$and,$ap,$app,$b,$ba,$bas,$c,$co,$com,$con,$d,$de,$di,$e,$f,$fo,$for,$g,$h,$ho,$hom,$i,$in,$in$,$int,$l,$m,$ma,$me,$mo,$mod,$n,$ne,$net,$o,$of,$of$,$on,$on$,$p,$pa,$pag,$pr,$pro,$r,$re,$s,$se,$st,$sy,$sys,$t,$te,$th,$the,$to,$to$,$tr,$u,$us,$v,$w,$wi,$wit,a$,ab,ac,act,ad,ag,age,age$,al,al$,am,an,an$,ana,and,and$,ap,app,ar,as,ase,ased,at,ate,ati,atio,ba,bas,base,bi,bl,c$,ca,cal,cat,cati,ce,ce$,ces,ch,ci,co,com,comp,con,ct,cti,ctio,cu,d$,da,de,del,di,du,e$,ea,ec,ect,ed,ed$,ee,el,el$,em,eme,en,ent,ent$,er,er$,era,eri,es,es$,ess,et,etw,ev,ex,f$,fe,fi,fo,for,for$,form,g$,ge,ge$,gen,gi,gn,gr,gra,h$,ha,he,he$,hi,ho,hom,home,ia,ic,ic$,ica,icat,id,ie,if,ig,il,im,in,in$,ine,ing,ing$,int,inte,io,ion,ion$,ions,ir,is,ist,it,ith,iti,ity,ity$,iv,ive,ive$,iz,k$,l$,la,le,le$,li,lin,ll,lo,lt,lu,ly,m$,ma,man,mat,mati,me,me$,men,ment,mi,mo,mod,mode,mp,ms,ms$,mu,mul,n$,na,nal,nc,nce,nce$,nd,nd$,ne,net,ng,ng$,ni,no,ns,ns$,nt,nt$,nte,o$,ob,oc,od,ode,odel,of,of$,og,ol,om,ome,ome$,omp,on,on$,ons,ons$,op,or,or$,ori,ork,orm,orma,os,ot,ou,pa,pag,page,pe,per,pl,po,pp,pr,pro,pt,qu,r$,ra,ral,rat,rc,re,re$,res,ri,rit,rk,rm,rma,ro,rs,rt,ry,s$,sc,se,sed,sed$,si,sin,sing,so,sp,ss,st,ste,stem,str,su,sy,sys,syst,t$,ta,tat,te,ted,ted$,tem,ter,th,th$,the,the$,ti,tic,tim,tin,ting,tio,tion,tiv,tive,to,to$,tor,tr,tra,tri,ts,ts$,tu,tur,tw,two,ty,ty$,ua,uc,ue,ul,ult,un,ur,ure,us,usi,usin,ut,va,ve,ve$,ver,vi,wi,wit,with,wo,wor,work,y$,ys,yst,yste}
(1 row)

-- Save it as psql variable
# \gset

-- Check V-gram extraction
# SELECT get_vgrams('indexing', 2, 4, :'vgrams');
    get_vgrams
------------------
 {ind,nde,dex,xi}
(1 row)

-- Build V-gram index
# CREATE INDEX titles_s_idx ON titles USING gin (s vgram_gin_ops (minq=2, maxq=4, vgrams=:'vgrams'));
CREATE INDEX

# EXPLAIN SELECT * FROM titles WHERE s LIKE '%indexing%';
                                 QUERY PLAN
----------------------------------------------------------------------------
 Bitmap Heap Scan on titles  (cost=38.88..42.89 rows=1 width=48)
   Recheck Cond: (s ~~ '%indexing%'::text)
   ->  Bitmap Index Scan on titles_s_idx  (cost=0.00..38.88 rows=1 width=0)
         Index Cond: (s ~~ '%indexing%'::text)
(4 rows)

# SELECT * FROM titles WHERE s LIKE '%indexing%';
                                           s
----------------------------------------------------------------------------------------
 Hybrid term indexing for different IR models.
 Efficient speech indexing and search for embedded devices using uniterms.
 The EURATOM automatic indexing project.
 Spectral cross-correlation features for audio indexing of broadcast news and meetings.
(4 rows)

Advanced statistics

You also may use vgram_text datatype instead of simple text and take advantage of statistics for frequent individual characters, 2-grams, and 3-grams. These statistics provide selectivity estimation for LIKE/ILIKE using Markov's model. LIKE/ILIKE operators are implemented directly for vgram_text while the rest of string operations are available using implicit cast to text. Note that you must use vgram_gin_ops2 opclass for vgram_text columns.

# CREATE TABLE titles (s text);
CREATE TABLE

# \copy titles from 'data/titles.data'
COPY 10000

# ANALYZE titles;
ANALYZE

# EXPLAIN ANALYZE SELECT * FROM titles WHERE s LIKE '%the%';
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on titles  (cost=0.00..225.00 rows=1030 width=48) (actual time=0.018..2.075 rows=1103.00 loops=1)
   Filter: (s ~~ '%the%'::text)
   Rows Removed by Filter: 8897
   Buffers: shared hit=100
 Planning Time: 0.094 ms
 Execution Time: 2.137 ms
(6 rows)

Time: 2.783 ms
# EXPLAIN ANALYZE SELECT * FROM titles WHERE s LIKE '%zz%';
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on titles  (cost=0.00..225.00 rows=64 width=48) (actual time=0.043..1.556 rows=105.00 loops=1)
   Filter: (s ~~ '%zz%'::text)
   Rows Removed by Filter: 9895
   Buffers: shared hit=100
 Planning Time: 0.145 ms
 Execution Time: 1.590 ms
(6 rows)

Installation

vgram is PostgreSQL extension which requires PostgreSQL 13 or higher. Before build and install you should ensure following:

PostgreSQL version is 13 or higher.
You have development package of PostgreSQL installed or you built PostgreSQL from source.
Your PATH variable is configured so that pg_config command available, or set PG_CONFIG variable.

Typical installation procedure may look like this:

$ git clone https://github.com/akorotkov/vgram.git
$ cd vgram
$ make USE_PGXS=1
$ sudo make USE_PGXS=1 install
$ make USE_PGXS=1 installcheck
$ psql DB -c "CREATE EXTENSION vgram;"

Author

Alexander Korotkov [email protected], Supabase Inc.

Availability

vgram is realized as an extension and not available in default PostgreSQL installation. It is available from github under the same license as PostgreSQL and supports PostgreSQL 13+.

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
.github/workflows		.github/workflows
ci		ci
data		data
expected		expected
sql		sql
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
regression.conf		regression.conf
typedefs_gen.py		typedefs_gen.py
valgrind.supp		valgrind.supp
vgram--1.0.sql		vgram--1.0.sql
vgram.c		vgram.c
vgram.control		vgram.control
vgram.h		vgram.h
vgram_gin.c		vgram_gin.c
vgram_like.c		vgram_like.c
vgram_selfunc.c		vgram_selfunc.c
vgram_typanalyze.c		vgram_typanalyze.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

vgram - V-gram indexing and statistics for PostgreSQL

Key idea

Collecting V-gram statistics

Advanced statistics

Installation

Author

Availability

About

Uh oh!

Releases

Packages

Languages

akorotkov/vgram

Folders and files

Latest commit

History

Repository files navigation

vgram - V-gram indexing and statistics for PostgreSQL

Key idea

Collecting V-gram statistics

Advanced statistics

Installation

Author

Availability

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages