Drupal Core Search SQL: making it faster
September 13, 2007
Drupal Core Search SQL: making it faster
This is Part two of a six part series on Drupal's search functionality. This installment discusses how Drupal's search work from a SQL level and offer some solutions for improving performance.
IntroductionThe Drupal search SQL is elegant and powerful. But it's time to update it. There are two issues that are no longer relevant that affect the current implementation and limit it's performance.
- Drupal's strive for software compatibility is a strength and a weakness. By not implementing features of newer versions of software (in this case, database software), people who don't have access to the newer versions aren't locked out of newer versions of Drupal. This is of course a trade off - compatibility means more people can deploy Drupal, but sacrifices potential performance features. Instead of degrading gracefully, the Drupal search module restricts implementation to the lowest common denominator. Changes in MySQL (the most common database used for drupal) between versions 3 and 4 open up some possibilities for improving Drupal's search performance.
- Also, a likely bug (that has probably already been fixed) is mostly responsible for the implementation that is not as fast as it could be.
Search Query TypesSearch queries can be done in a variety of different ways. Drupal's search allows the user create very powerful search by merely adding a few key words or symbols.
AND SearchesTo get results which match multiple words, simply type the words that you want to find. This is the default behavior of Drupal's search.
OR SearchesTo search for data which may contain one of a few terms, but not necessarily all of the terms, you need an OR search. Use the keyword OR in between each search term: "dogs OR cats" or "dogs OR cats OR birds". Notice that the keyword "OR" must be capitalized. If you use the lower case "or" instead, search module thinks that this is another word to search for, and since it is probably smaller than the search word length (the default is 3 characters), search module simply ignores it and does a "AND" search on the two words.
Exclusionary SearchesYou can search for things that don't exist. For example, if you wanted to search pet stores which do not have chickens, you need an exclusionary search. Drupal allows you to do this using the minus "-" sign. For example "pet stores -chickens" finds all nodes with "pet" and "stores that do not also contain the words "chickens".
Advanced SearchesThe search module also supports the advanced options such as restricted results by kinds of content types. For example, if your site has forums, you can use "type:forum" to restrict the search only for content in forum posts. This can be helpful in sites with large amounts of diverse kinds of data. One of the best ways to explore the advanced options is to use the "Advanced Search" tab (of "Content" tab on the search page) and look at the string placed in the "keyword" input box, after you submit the form. This can give you clues about how to construct your own searches.
Overview of the Current SQLThe way that Drupal constructs the query that it uses to return results to a user is a three step process.
- According to the comments in search module (above do_search), the first pass selects all possible matching queries (anything that matches "dog AND cat" also matches "dog OR cat"). However, I think that this comment is slightly wrong. This query is actually more restrictive than the "OR" query. It does the "OR" or the "AND" query, the only difference being the number of results required for a match. (more on this below)
- The second pass narrows the results returned in the first pass by applying the "AND" and exclusionary terms. This pass is needed for two reasons, both of which can be overcome (see below).
- The final pass simply returns the pager query of the combined results from the first two passes.
Solving the Problems with the current SQL
The "AND" case: unique search_index table and HAVING clauseIf we know that the search_index table is unique, the "AND" terms can be applied in the first pass using the HAVING clause. The current "OR" SQL appends the clause
HAVING COUNT(*) >= 1and the "AND" SQL appends the clause
HAVING COUNT(*) >= 2(where 2 is the number of terms in the query). Notice that COUNT uses the greater-than-or-equal operator. This is necessary because prior to 5.x, frequently there were cron concurrency problems. Two separate executing threads could insert the same data into the database, creating more results than expected. This should not be a problem with current Drupal installs. However, there are some reports that duplicates still exist. There are two solutions:
- solve the problem of inserting duplicates, which may already be solved, and if not solved yet, is elusive, or
- add a unique index to the search_index table (see #143160)