Wednesday, May 19, 2010

Google App Engine Basic Text Search

This post is part of the Who's @ Google I/O, a series of blog posts that give a closer look at developers who'll be speaking or demoing at Google I/O. This guest post is written by Brian Dorry from LTech who is demoing as part of the Developer Sandbox.

Having trouble implementing search on your App Engine Data Store? Try this technique for a basic search until official full text support is ready.

Since adding Google App Engine to our technical tool belt in 2008, we at LTech have utilized the platform for a wide range of products and customer solutions. It is cost effective, easy to use, and will automatically scale your application on Google's immense infrastructure. Knowing your applications will be running on the same technologies that Google's own systems take advantage of make it the easy choice again and again.

From our own experiences and participation in the developer community, the biggest complaint we hear is the lack of a full text search in the datastore. Google has marked this issue as "Started", but has not announced a release date yet, so alternative approaches are still going to be in play for the short term. We are big fans of Lucene (http://lucene.apache.org/), an open source indexing and search engine, but without the ability to save the index file to disk, it becomes a non-starter.

We need a quick, non-CPU taxing solution that still takes advantage of the Google infrastructure.

Problem
Taking advantage of the App Engine Datastore, we can issue an inequalities query to perform a basic "starts with" search. This can be a good solution for searching users, tags, and domains and works well for implementing a search box auto-complete feature.

Solution
Our example solution uses JDO to generate a query that instructs the DataStore to return all records that start with the search string. This is accomplished by issuing a greater than or equal condition against the search term, and a less than condition against the search input concatenated with the unicode replacement character ('\ufffd'). The resulting query limits results to items that start with the search input, as well as any other unicode characters that follow.

This code uses JDO behind the scenes, but this trick will work with straight GQL as well. Let's take a look at the sample:

import java.util.List;
import javax.jdo.PersistenceManager;
import javax.jdo.Query;

(...)

public static List searchGreeting(String query) {

// let's clean up the input
query = ( query != null ? query.toLowerCase() : "").trim();

PersistenceManager pm = PMF.get().getPersistenceManager();
Query q = pm.newQuery(Greeting.class);

// set the filter and params
q.setFilter("content >= :1 && content < :2");

// run query with param values and return results
return (List) q.execute(query, (query + "\ufffd"));

}

This code snippet is going to search the JDO defined Employee entity on the name column and return the full Employee payload for each match. Let's focus on the last two lines of code.

q.setFilter("name >= :1 && name < :2");

Here we set up the inequality. We are asking the data store to return all matches where name is between a set of two values. But how does that define a search?

return (List) q.execute(query, (query + "\ufffd"));

When we set our parameters, we pass the same query value to both with an extra character on the end of the second one. This is essentially telling the data store to return all records that start with the query term. In terms of sets, the first part of the query returns the set of all words greater than the query term, including words that don't even start with the query term. The second part of the query returns the set of all words less than the query term including any that start with the query term. The intersection of the two sets is the search result for all words starting with the search term.

This simple to implement technique will solve many basic search problems until a full text solution is available. It will work outside of JDO as well with regular GQL statements. For a python implementation, please see our friend Graeme's blog.

8 comments:

  1. Great!
    This is just what I've been looking out for.
    I'll try it asap.

    ReplyDelete
  2. Max implemented a LIKE operator for JPA (and JDO too, if I recall) that does exactly this under the covers -- no extra fiddling with the strings needed. It's been there for at least four months now, and is also a prefix-only search; the clause is:

    LIKE 'foobar%'

    where the % must be at the end of the string.

    ReplyDelete
  3. Neat! You're right "starts with" is probably enough for names and tags. Must try this immediately. For when starts-with ain't enough you can try the "like-ish" search technique that I posted about in /r/AppEngine: Simple like-ish serach.

    ReplyDelete
  4. Thanks for posting. Having experimented with this myself and various other techniques, I am somewhat disappointed and honestly expected a lot more from Google in this regard since were talking search here.. When will full text search for the datastore be available so we don't have to use these other techniques that are cumbersome and don't quite accomplish searches the way I have wanted to in the end.... Thank you in advance for any info you can provide.

    ReplyDelete
  5. I tried this technique and it does work to find the entities. But now the problem is, I cannot sort the results, even though the indexes are created for the sorting. Were you able to do this?

    ReplyDelete
  6. @Jim I totally agree. The search giant of the world does not implement a basic full-text search feature! That much irony is painful

    ReplyDelete
  7. I used it and it worked well, but it is case sensitive, which is a bit inefficient. Any ideas to make it NOT case sensitive?!

    ReplyDelete
  8. query.setOrdering("other");

    After I added the code, the exception is thrown:
    The first sort property must be the same as the property to which the inequality filter is applied.

    please help me

    ReplyDelete