Drupal Core Search SQL: making it faster

Doug Green

on

September 13, 2007

Drupal Core Search SQL: making it faster

This is Part two of a six part series on Drupal's search functionality. This installment discusses how Drupal's search work from a SQL level and offer some solutions for improving performance.

Introduction

The Drupal search SQL is elegant and powerful. But it's time to update it. There are two issues that are no longer relevant that affect the current implementation and limit it's performance.
  1. Drupal's strive for software compatibility is a strength and a weakness. By not implementing features of newer versions of software (in this case, database software), people who don't have access to the newer versions aren't locked out of newer versions of Drupal. This is of course a trade off - compatibility means more people can deploy Drupal, but sacrifices potential performance features. Instead of degrading gracefully, the Drupal search module restricts implementation to the lowest common denominator. Changes in MySQL (the most common database used for drupal) between versions 3 and 4 open up some possibilities for improving Drupal's search performance.
  2. Also, a likely bug (that has probably already been fixed) is mostly responsible for the implementation that is not as fast as it could be.

Search Query Types

Search queries can be done in a variety of different ways. Drupal's search allows the user create very powerful search by merely adding a few key words or symbols.

AND Searches

To get results which match multiple words, simply type the words that you want to find. This is the default behavior of Drupal's search.

OR Searches

To search for data which may contain one of a few terms, but not necessarily all of the terms, you need an OR search. Use the keyword OR in between each search term: "dogs OR cats" or "dogs OR cats OR birds". Notice that the keyword "OR" must be capitalized. If you use the lower case "or" instead, search module thinks that this is another word to search for, and since it is probably smaller than the search word length (the default is 3 characters), search module simply ignores it and does a "AND" search on the two words.

Exclusionary Searches

You can search for things that don't exist. For example, if you wanted to search pet stores which do not have chickens, you need an exclusionary search. Drupal allows you to do this using the minus "-" sign. For example "pet stores -chickens" finds all nodes with "pet" and "stores that do not also contain the words "chickens".

Advanced Searches

The search module also supports the advanced options such as restricted results by kinds of content types. For example, if your site has forums, you can use "type:forum" to restrict the search only for content in forum posts. This can be helpful in sites with large amounts of diverse kinds of data. One of the best ways to explore the advanced options is to use the "Advanced Search" tab (of "Content" tab on the search page) and look at the string placed in the "keyword" input box, after you submit the form. This can give you clues about how to construct your own searches.

Overview of the Current SQL

The way that Drupal constructs the query that it uses to return results to a user is a three step process.
  1. According to the comments in search module (above do_search), the first pass selects all possible matching queries (anything that matches "dog AND cat" also matches "dog OR cat"). However, I think that this comment is slightly wrong. This query is actually more restrictive than the "OR" query. It does the "OR" or the "AND" query, the only difference being the number of results required for a match. (more on this below)
  2. The second pass narrows the results returned in the first pass by applying the "AND" and exclusionary terms. This pass is needed for two reasons, both of which can be overcome (see below).
  3. The final pass simply returns the pager query of the combined results from the first two passes.
It would be better for database caching purposes to have a single query because, if this were true, the query results could be cached. This is most noticable when paging through results. For example, when a search returns more than 10 matches, the results are paged, and each page requires a full re-query, because it can't be cached.

Solving the Problems with the current SQL

The "AND" case: unique search_index table and HAVING clause

If we know that the search_index table is unique, the "AND" terms can be applied in the first pass using the HAVING clause. The current "OR" SQL appends the clause HAVING COUNT(*) >= 1 and the "AND" SQL appends the clause HAVING COUNT(*) >= 2 (where 2 is the number of terms in the query). Notice that COUNT uses the greater-than-or-equal operator. This is necessary because prior to 5.x, frequently there were cron concurrency problems. Two separate executing threads could insert the same data into the database, creating more results than expected. This should not be a problem with current Drupal installs. However, there are some reports that duplicates still exist. There are two solutions:
  1. solve the problem of inserting duplicates, which may already be solved, and if not solved yet, is elusive, or
  2. add a unique index to the search_index table (see #143160)
Once this is done, the "AND" query will work as-is, although the code should be changed from an >= to an =.

Exclusionary Terms: use Subqueries

On the initial query, there is little (or no) performance difference between between temporary tables and sub-queries. However, on pager queries it makes a big difference. The database can cache sub-queries, where it can't cache temporary tables. The exclusionary terms could be done in a sub-query, but since Drupal 5.x supports mysql 3, subqueries could not be used. Now that Drupal 6.x requires MySQL 4.0.1 or higher, it's a simple solution to convert the current SQL to use sub-queries.

Groupby clause and INNODB

INNODB is slow when using our GROUP BY queries. I ran several empirical tests, and the queries using MyISAM were 15 times faster. Several months ago, I read that this is a known problem with INNODB and GROUP HAVING clauses, but I can't find the reference. Since search module uses these to provide search results, the search_index table should never be INNODB.

Views Fastsearch

Views fastsearch (views_fastsearch 5.x-1.x-dev) implement all of these ideas. It will become version 5.x-2.x after this views patch for subqueries is committed.

Share it!