Search for CJK + other Asian locales does not work

• Jun 28, 2010 - 07:55
Type: musescore.org
Severity: S4 - Minor
Status: closed
Project:

Two test searches (number of results in parentheses):
Japanese: musescore.org (0) vs google.com (3)
Simplified Chinese: musescore.org (0) vs google.com (9)

The problem may be related to indexing multiple locales in a single Apache Solr instance.
See http://drupal.org/node/662736


Comments

The current Apache Solr search engine on musescore.org uses the default tokenizer, which does not handle CJK and other such locales correctly. That explains why search does not work for them. There are several options to solve this:

  • run a separate Solr instance for each tokenizer
  • add a field to schema.xml with a different tokenizer and alter the search queries to use that field
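
To illustrate why the tokenizer matters, here is a minimal Python sketch. It is not Solr code. It assumes the default analysis behaves roughly like splitting on whitespace, and it uses overlapping two-character tokens (bigrams) as a stand-in for a CJK-aware tokenizer. The function names and the sample text are made up for the example.

```python
def whitespace_tokens(text):
    """Split on whitespace (assumed stand-in for the default tokenizer)."""
    return text.split()


def cjk_bigrams(text):
    """Emit overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def matches(query_tokens, doc_tokens):
    """A document matches when every query token is among its tokens."""
    doc = set(doc_tokens)
    return bool(query_tokens) and all(t in doc for t in query_tokens)


# Hypothetical sample data: Japanese text is written without spaces.
document = "楽譜作成ソフトウェア"  # "music notation software"
query = "楽譜"                    # "sheet music"

# Whitespace splitting turns the whole document into one token,
# so the shorter query never equals it.
print(matches(whitespace_tokens(query), whitespace_tokens(document)))

# With bigrams, the query token is one of the document's tokens.
print(matches(cjk_bigrams(query), cjk_bigrams(document)))
```

With whitespace splitting the query finds nothing, which is consistent with the zero results reported above. With bigrams the same query matches. Either option in the list would need a tokenizer of this kind for the CJK content.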

For the time being, a workaround using Google Custom Search Engine is being worked out for CJK and other such locales. See a proof of concept at http://musescore.org/ja, where the default search form is replaced by the Google one.

To do: extend the Google solution to the other CJK languages on musescore.org until there is a real Solr solution.

Google Custom Search has now been implemented for Japanese, Chinese, Russian, Arabic, Ukrainian, Slovenian, and Romanian. While testing more languages, I found that the current Solr schema is probably not optimized for indexing multilingual content. More investigation is needed.
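
The per-language switch described here could be sketched roughly as follows. This is only an illustration, not the code running on musescore.org. The language codes, the `GOOGLE_CSE_LANGUAGES` set and the `search_backend` function are assumptions made for the example.

```python
# Assumed language codes for the languages listed above (hypothetical).
GOOGLE_CSE_LANGUAGES = {"ja", "zh", "ru", "ar", "uk", "sl", "ro"}


def search_backend(language_code):
    """Return which search form to show for a given site language."""
    # Reduce codes such as "zh-hans" to their primary part, "zh".
    primary = language_code.lower().split("-")[0]
    return "google_cse" if primary in GOOGLE_CSE_LANGUAGES else "solr"


print(search_backend("ja"))       # Japanese uses the Google form
print(search_backend("zh-hans"))  # Simplified Chinese too
print(search_backend("en"))       # English stays on Solr
```

Keeping the list in one place would make it easy to move a language back to Solr once the schema handles it.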

CJK search requires the Content Translation module to be enabled in order to work properly. You can try enabling the module, rebuilding the index, and running cron a few times (how many depends on how much content your site has). It should work :)