Search for CJK + other Asian locales does not work

• Jun 28, 2010 - 07:55
S4 - Minor

Two tests:
Japanese: (0) vs (3)
Simplified Chinese: (0) vs (9)

The problem may be related to indexing multiple locales in a single Apache Solr instance.


The current Apache Solr search engine uses the default tokenizer, which does not handle CJK and other Asian locales correctly. That explains why search does not work. There are several options to solve this:

  • run a separate Solr instance for each tokenizer
  • add a field to schema.xml with a different tokenizer and alter the search queries to use that field
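
To illustrate why the tokenizer matters, here is a minimal Python sketch, not the actual Solr code. It assumes the default analysis effectively splits on whitespace, and compares that with the overlapping two-character (bigram) tokens that CJK-aware tokenizers typically produce. The field names `body` and `body_cjk` and the language mapping are hypothetical, only there to show how a query could be routed to a differently tokenized field as in the second option.

```python
import re

# Rough CJK ranges (an approximation): kana, CJK ideographs, Hangul syllables.
CJK = "\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"
RUNS = re.compile(f"[{CJK}]+|[^\\s{CJK}]+")
IS_CJK = re.compile(f"[{CJK}]")


def whitespace_tokens(text):
    """Split on whitespace only (assumed behaviour of the default analysis)."""
    return text.split()


def cjk_bigram_tokens(text):
    """Emit overlapping bigrams for CJK runs and lowercased words otherwise."""
    tokens = []
    for run in RUNS.findall(text):
        if IS_CJK.match(run):
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run.lower())
    return tokens


# Hypothetical mapping from site language to the Solr field to query.
FIELD_BY_LANGUAGE = {"ja": "body_cjk", "zh-hans": "body_cjk", "ko": "body_cjk"}


def search_field(language, default="body"):
    """Pick the field a query should target for the given language."""
    return FIELD_BY_LANGUAGE.get(language, default)


indexed = "東京都の天気"
query = "京都"

# Whitespace splitting keeps the whole sentence as one token, so the query
# term never matches it; with bigrams the query term is found.
print(query in whitespace_tokens(indexed))   # False
print(query in cjk_bigram_tokens(indexed))   # True
print(search_field("ja"), search_field("en"))
```

With whitespace splitting the whole Japanese phrase is indexed as a single token, which is consistent with the zero results reported above. Bigrams match more often, at the cost of some false positives (here 京都 matches inside 東京都).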

For the time being, a solution is being worked out using Google Custom Search Engine for CJK and other locales. In the proof of concept, the default search form is replaced by the Google one.

ToDo: extend the Google solution to the other CJK languages until there is a real Solr solution.

Google Custom Search has now been implemented for Japanese, Chinese, Russian, Arabic, Ukrainian, Slovenian and Romanian. While testing more languages, I found that the current Solr schema is probably not optimized for indexing multilingual content. More investigation is needed.

CJK search requires the Content Translation module to be enabled to run properly. Try enabling the module, rebuilding the index and running cron a few times (how many depends on how much content your site has). It should work :)