Accelerating Fuzzy Search in PostgreSQL with Tokenisation
Background and Challenges
As our business data surpassed 2 million rows, traditional LIKE '%keyword%' fuzzy queries triggered frequent database I/O alerts, with query response times degrading from milliseconds to seconds. To improve search efficiency and support Chinese semantics, we decided to introduce the zhparser extension for full-text search.
Evolution Path and Environment Adaptation
This implementation went through four key phases, each addressing distinct technical challenges:
CentOS 7.9 VM (Feasibility Validation)
- Goal: Validate the compatibility of
SCWS+zhparseron older systems. - Core Action: Manually compiled
postgresql-16.2from source in a CentOS 7.9 environment, got the extension working end-to-end.
Conclusion: Confirmed significant performance improvements from the tokenisation approach for Chinese search.
Local Docker Container (Containerisation Exploration)
- Goal: Initial testing in a complete local system.
- Core Action: Injected binary
.sofiles viadocker cp, resolvedldconfigdynamic library path visibility issues. - Discovery: Identified that missing dictionary files cause tokenisation to degrade into single-character (particle-level) tokenisation — a critical failure point.
EulerOS 2.0 Test Server (Self-Compiled Environment Adaptation)
- Goal: Adapt to the production OS architecture (x86_64) and self-compiled PostgreSQL installation.
- Core Issue: Resolved
libscws.so.1loading errors. - Key Solutions:
- Ensured the
postgresruntime user has access permissions to/usr/local/scws/lib. - Modified
systemdservice environment variables or created/usr/lib64symlinks to force refresh library search paths.
- Ensured the
Production Deployment Preparation (Final Tuning)
- Goal: Ensure query stability at 2M+ data volume.
- Optimisation: Addressed cases where non-semantic fragments (e.g., "古唐合") returned no results by establishing a "full-text search first +
pg_trgmindex assist" degraded query strategy.
Core Installation and Configuration Steps (Self-Compiled Environments)
Installing the SCWS Tokenisation Engine
SCWS is the underlying core dependency of zhparser and must be installed first.
-
Download and Extract: Download the source package (e.g.,
scws-1.2.3). -
Compile and Install:
-
Verify the Library: Ensure
/usr/local/scws/lib/libscws.so.1exists.
Compiling and Installing zhparser
This step requires pg_config from the self-compiled PostgreSQL installation.
-
Get the Source: Clone the
zhparserproject from GitHub. -
Compile with Specified Path:
Note: The
installstep automatically placeszhparser.sointo PG'spkglibdirand extension scripts into theextensiondirectory.
Resolving Dynamic Library Dependencies
-
Refresh System Cache:
-
Permission Check: Ensure the OS user running
postgreshasrxpermission on/usr/local/scws/lib. -
Force Symlink (Alternative): If
ldconfigfails, symlink the library file to/usr/lib64.
Restarting the Database
After modifying system shared library configuration, the PostgreSQL process must be restarted to reload environment variables and linked libraries.
Database-Level Initialisation
Connect to psql and run the logical configuration:
Performance Analysis and Pitfalls
Index Performance Bottleneck Analysis
During testing, it was found that even exact queries suffered from Bitmap Index Scan due to an improperly designed composite index (with create_time as the leading column), resulting in query times as high as 482 ms.
- Improvement: Created single-column B-tree indexes on frequently searched columns, reducing response time to under 10 ms with
Index Scan.
GIN Index and Non-Semantic Matching
- Cross-Word Truncation: The tokenisation engine is semantic-based, so truncated strings like "古唐合" may fail to match with
@@due to tokenisation boundaries. - Mitigation Strategy: Adopt a "waterfall search" approach. Full-text search (FTS) first; if the result set is empty, automatically degrade to
LIKEfuzzy queries, assisted bypg_trgmindexing.
Final Deployment Strategy: Dual-Track Parallel Retrieval
After analysing the data, we found that approximately 1.24% of account names contain non-standard Simplified Chinese characters, and some company names in the database are unusual enough to cause search failures. We adopted a "stepwise degradation" strategy:
- Step 1: Full-Text Search (Fast Track): Use GIN index for
@@matching. - Step 2: Result Evaluation: If the result set is empty, check whether the search term contains letters or suspected Traditional Chinese characters.
- Step 3: Fuzzy Fallback (Safe Fallback): Execute
LIKE '%keyword%'. Although slower, since this serves as a "gap-fill" logic triggered only ~1% of the time, it won't impose overall system pressure.
Search Optimisation: Integrating a Custom Business Lexicon
To address issues like company brand names being incorrectly segmented by full-text search (e.g., "元一" being split into a numeral and a quantifier), we built an automated maintenance pipeline from data extraction to index rebuild.
Lexicon Extraction and Preprocessing
Leverage the structural parsing capabilities of companynameparser to strip region names and industry suffixes, and use jieba for semantic validation to ensure core brand name integrity:
- Extraction Logic: Traverse all
buyer_unique_namevalues via a Python script, extracting the corebrandfield. - Weight Compensation: For words prone to fragmentation (e.g., those containing "元", "一", "三"), manually boost TF (term frequency weight) to 50.0–60.0 to ensure their priority overrides built-in quantifier rules.
- Output Specification: Produce SCWS-compliant 4-field
UTF-8text (WORD, TF, IDF, ATTR). Use tab\tseparators to avoid parsing anomalies.
Lexicon Compilation and Deployment
Convert the text dictionary to SCWS's efficient binary format (XDB):
-
Compile the Binary Dictionary:
-
File Distribution and Permissions: Move the generated
.xdbfile to the tokenisation data directory and ensure thepostgresuser has read permission:
Database Parameter Configuration
Modify postgresql.conf to force-load zhparser and its custom extension lexicon:
Hot Index Rebuild and Verification
Since tokenisation rules have changed, existing data must be semantically synchronised via index rebuild:
Physically Restart the Service:
Online Index Rebuild: Use the CONCURRENTLY keyword to refresh the GIN index without blocking DML operations on 400K rows:
Tokenisation Effectiveness Verification:
Optimisation Notes:
- Explicit Weight Compensation: This is the key technique that resolved the "元一" tokenisation failure (shown as
x). - Distinguish Restart from Reload:
shared_preload_librariesmust be activated viarestart, not a simple reload.