It’s been a couple of months since I last posted about the MOOP Framework, but we’ve been working on some exciting new things. Today we’re going to talk about implementing search in Azure using Lucene. More precisely, we’re using Lucene.Net, the .NET port of Lucene, together with the Azure Library for Lucene.NET.
Implementing search actually breaks down into two separate tasks. The first is indexing your data and the second is delivering up search results. Let’s look at each of these in turn.
Indexing Data for Search
The easiest way to do this is to implement a worker role. The downside is that I am now paying for an additional processor, but the advantages make it worth the extra cost. Our worker role just checks whether there are any messages in the queue, processes them, and then sleeps for 10 seconds before trying again. For simplicity’s sake, I’m showing only the part of the code that processes a message.
if (messageData != string.Empty)
{
    string queryName = string.Empty;
    NameValueCollection htKeys = new NameValueCollection();
    // Lines containing a colon are name:value parameters; any other non-empty line is the query name.
    foreach (string messageDetail in messageData.Split("\r\n".ToCharArray()))
    {
        if (messageDetail.Contains(":"))
            htKeys.Add(messageDetail.Substring(0, messageDetail.IndexOf(":")), messageDetail.Substring(messageDetail.IndexOf(":") + 1));
        else
            if (messageDetail != string.Empty)
                queryName = messageDetail;
    }
    quoteSearchLoader qls = new quoteSearchLoader();
    // With no parameters the message is deleted before processing; otherwise it is deleted afterwards.
    if (htKeys.Count == 0)
        aqs.Messages(cmdType.delete, queueName, string.Empty, string.Format("popreceipt={0}", popReceipt), messageID);
    qls.processRequest(queryName, htKeys);
    if (htKeys.Count != 0)
        aqs.Messages(cmdType.delete, queueName, string.Empty, string.Format("popreceipt={0}", popReceipt), messageID);
}
else
    // An empty message is simply removed from the queue.
    aqs.Messages(cmdType.delete, queueName, string.Empty, string.Format("popreceipt={0}", popReceipt), messageID);
}
The most interesting thing here is how I package up the data in the message. The first line is basically a pointer to a blob that contains the information needed to build the index. Any other lines contain parameters that can be passed into the database. Let’s take a look at a message:
oneQuote
QuotePublicID:EA329A1E-179B-4890-ABEB-4C9F94638C50
And now the XML that is paired with that message:
<?xml version="1.0"?>
<moopdata>
<storedprocedure requirepost="true" connectionname="cipherSolverConnection" procedurename="dbo.getQuoteLucene">
<parameter isoutput="false" defaultvalue="DBNull" datalength="36" datatype="char" urlparametername="QuotePublicID" parametername="@QuotePublicID" />
<cacheinformation cacheability="nocache" expireseconds="0" />
</storedprocedure>
<lucenedata>
<field dataname="quotePublicID" lucenename="id" index="not_analyzed" store="Yes" />
<field dataname="quoteText" lucenename="quoteText" index="analyzed" store="Yes" />
<field dataname="quotedPersonName" lucenename="quotedPersonName" index="analyzed" store="Yes" />
<field dataname="quoteSourceName" lucenename="quoteSource" index="analyzed" store="Yes" />
<field dataname="approved" lucenename="approved" index="not_analyzed" store="No" />
<field dataname="currentStatus" lucenename="currentStatus" index="not_analyzed" store="No" />
</lucenedata>
</moopdata>
If you think that the stored procedure section looks familiar, you’re right. That’s the same style of XML we used for our data template before. Since we’ll be doing the same thing, there’s no reason to reinvent the wheel. The XML above is stored in a blob called oneQuote.xml, so the first line of our message defines which blob to use. Our stored procedure has one parameter, QuotePublicID, which appears on the second line of the message as a name-value pair separated by a colon. The worker just takes the queue message, parses it, and passes the name of the query and the name-value collection to the Quote Search Loader class, which is where the fun begins.
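To make the message format concrete, here is a minimal, self-contained sketch of the parsing step on its own. It is not the framework's actual code: the class name QueueMessageParser and the method name Parse are placeholders I made up for illustration, and the sketch assumes the first colon on a line separates the parameter name from its value.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical helper, not part of the MOOP Framework.
public static class QueueMessageParser
{
    // Fills 'parameters' with the name:value lines of the message
    // and returns the query (blob) name found on the remaining line.
    public static string Parse(string messageData, NameValueCollection parameters)
    {
        string queryName = string.Empty;
        string[] lines = messageData.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon >= 0)
            {
                // Everything before the first colon is the name, the rest is the value.
                parameters.Add(line.Substring(0, colon), line.Substring(colon + 1));
            }
            else
            {
                queryName = line;
            }
        }
        return queryName;
    }
}
```

Calling Parse with the sample message above would return oneQuote as the query name and leave a single QuotePublicID entry in the collection.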
Quote Search Loader
The Quote Search Loader class uses the lucenedata section of the XML to determine how to load the data from the database into Lucene. There are a number of good references on how to create your documents in Lucene if you search Google. This is more about how we go about indexing our data without recompiling. To do this, we have a definition of the fields that we are going to create. Each field has several attributes:
- dataname is the name of the field in the data set retrieved with the stored procedure call
- lucenename is the name of the field in the document that we are going to create
- index is a text representation of Lucene.Net.Documents.Field.Index and determines how the document will index this field for retrieval.
- store is a text representation of Lucene.Net.Documents.Field.Store and determines whether the data will be stored in the document so it can be presented in the search results
Now, my table of quotes contains two fields that I want indexed but don’t want to store: approved and currentStatus. If a quote has not been approved, I don’t want it appearing in the results. All of the other fields I want available to show in the output.
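As an illustration of how the field definitions described above could be read, here is a rough sketch using LINQ to XML. It is an assumption-laden stand-in rather than the loader's real code: IndexMode and StoreMode are local enums standing in for Lucene.Net's Field.Index and Field.Store so the sketch compiles without the Lucene.Net assembly, and LuceneFieldDefinition and LuceneFieldReader are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Stand-ins for Lucene.Net's Field.Index and Field.Store (assumption for this sketch).
public enum IndexMode { No, Analyzed, NotAnalyzed }
public enum StoreMode { No, Yes }

// Hypothetical container for one <field> element.
public sealed class LuceneFieldDefinition
{
    public string DataName;
    public string LuceneName;
    public IndexMode Index;
    public StoreMode Store;
}

public static class LuceneFieldReader
{
    // Reads every lucenedata/field element, in document order.
    public static List<LuceneFieldDefinition> Read(string controlXml)
    {
        XDocument doc = XDocument.Parse(controlXml);
        return doc.Descendants("lucenedata").Elements("field")
            .Select(f => new LuceneFieldDefinition
            {
                DataName = (string)f.Attribute("dataname"),
                LuceneName = (string)f.Attribute("lucenename"),
                Index = ParseIndex((string)f.Attribute("index")),
                Store = ParseStore((string)f.Attribute("store"))
            })
            .ToList();
    }

    // Maps the text form of the index attribute, ignoring case.
    private static IndexMode ParseIndex(string text)
    {
        switch ((text ?? string.Empty).ToLowerInvariant())
        {
            case "analyzed": return IndexMode.Analyzed;
            case "not_analyzed": return IndexMode.NotAnalyzed;
            case "no": return IndexMode.No;
            default: throw new ArgumentException("Unknown index value: " + text);
        }
    }

    // Maps the text form of the store attribute, ignoring case.
    private static StoreMode ParseStore(string text)
    {
        switch ((text ?? string.Empty).ToLowerInvariant())
        {
            case "yes": return StoreMode.Yes;
            case "no": return StoreMode.No;
            default: throw new ArgumentException("Unknown store value: " + text);
        }
    }
}
```

Failing loudly on an unrecognized attribute value is a design choice of the sketch; a typo in the XML then surfaces when the definition is read rather than as a silently unindexed field.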
So, the first thing the QuoteSearchLoader does is parse the lucenedata elements into a set of arrays. Then it parses the storedprocedure data and makes the database call. Once it has data, it walks through the returned data set:
while (dr.Read())
{
    StringBuilder olioSearch = new StringBuilder();
    Document doc = new Document();
    for (int i = 0; i <= dataName.GetUpperBound(0); i++)
    {
        if (i == 0)
        {
            // The first field is the unique key: delete any existing copy of the document before re-adding it.
            idxW.DeleteDocuments(new Term(luceneName[i], dr[dataName[i]].ToString().ToLower()));
            doc.Add(new Field(luceneName[i], dr[dataName[i]].ToString().ToLower(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        }
        else
            try
            {
                doc.Add(new Field(luceneName[i], dr[dataName[i]].ToString(), fieldStore[i], indexType[i]));
                // Analyzed fields are also appended to the catch-all olioSearch text.
                if (indexType[i] == Field.Index.TOKENIZED || indexType[i] == Field.Index.ANALYZED)
                    olioSearch.AppendFormat("\r\n{0}", dr[dataName[i]].ToString());
            }
            catch (Exception ex)
            {
                NotifyError(ex);
            }
    }
    if (olioSearch.ToString() != string.Empty)
        doc.Add(new Field("olioSearch", olioSearch.ToString(), Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
    idxW.AddDocument(doc);
}
There are only a couple of interesting things here. First, the first item in the lucenedata block is used as the key that uniquely identifies the document. This allows me to delete the existing document rather than ending up with multiple copies of it. I probably should make the key an attribute somewhere, but that’s an optimization for later. With the data in hand, I loop through the array of field definitions and create my document by adding fields as defined by the XML.
The second thing is that every column that’s defined as analyzed is also added to a generic field in the document called olioSearch. For my purposes, having one field that contains all of the searchable text makes sense, so that a search for WILD will return not only quotes with the word wild but also those by Oscar Wilde. This is another item that should be an attribute on the lucenedata/field element, but that will be for the next revision. With this in place, I can add new data to be indexed without having to recompile: I just define the database stored procedure and the XML file that describes how to index the data.
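The catch-all field is easier to see in isolation. The following is only a sketch under my own assumptions: CatchAllBuilder is a made-up name, the parallel arrays mirror the arrays the loader builds from the XML, a plain dictionary stands in for the data reader row, and a bool array stands in for the Lucene index type of each column.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper illustrating how the olioSearch text is assembled.
public static class CatchAllBuilder
{
    // dataNames[i] is a column name; analyzed[i] says whether that column is analyzed.
    // Index 0 is the key column and is skipped, as in the loader loop.
    public static string Build(string[] dataNames, bool[] analyzed, IDictionary<string, string> row)
    {
        var text = new StringBuilder();
        for (int i = 1; i < dataNames.Length; i++)
        {
            if (analyzed[i])
            {
                // Each analyzed value goes on its own line of the combined text.
                text.Append("\r\n").Append(row[dataNames[i]]);
            }
        }
        return text.ToString();
    }
}
```

Because the person's name and the quote text end up in the same string, a single query against that one field can match either of them.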
LuceneSearch Handler
Now that I have data indexed, I need a way to get search results. For this, I have created a generic handler and created a definition in my web.config to point all pages that have the ls (lucene search) extension to this handler. Just as I have done with data access, the page name is used to get an item out of blob storage that defines how to search and how to represent the results. Let’s look at a sample control XML for searching quotes:
<?xml version="1.0"?>
<MOOPData>
<luceneSearch>
<transform>searchQuotes.xslt</transform>
<query term="olioSearch" termValue="query" booleanClauseOccurs="must" variableValue="true" />
<query term="approved" termValue="Y" booleanClauseOccurs="must" variableValue="false" />
<query term="currentStatus" termValue="0" booleanClauseOccurs="must" variableValue="false"/>
</luceneSearch>
</MOOPData>
The luceneSearch element has two kinds of child elements: transform, which names the XSLT transform to apply to the search results, and query, which defines how to build the search query.
The query elements help shape the search. The simplest to explain are those with a variableValue of false. These are simply applied as boolean queries, with the value in termValue applied to the field named in term. Every search made with this control XML will return only indexed documents with approved=Y and currentStatus=0. When variableValue is true, however, the termValue is used as the name of a request parameter, and that parameter’s value becomes the search text. The beauty of this is that http://azurearchitect.cloudapp.net/quoteSearch.ls?query=truth returns a list of those documents with truth in the olioSearch field. Adding &approved=n will still return the same results, because the only parameter value that will ever be used in this control is query.
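The substitution rule described above can be sketched without any Lucene code at all. This is an illustrative sketch, not the handler itself: SearchTermResolver is a hypothetical name, a NameValueCollection stands in for the request parameters, and a missing parameter is treated as an empty string, which is my assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: turns the control XML plus request parameters into field/value pairs.
public static class SearchTermResolver
{
    public static List<KeyValuePair<string, string>> Resolve(
        string controlXml, NameValueCollection requestParams)
    {
        XDocument doc = XDocument.Parse(controlXml);
        var terms = new List<KeyValuePair<string, string>>();

        foreach (XElement q in doc.Descendants("luceneSearch").Elements("query"))
        {
            string field = (string)q.Attribute("term");
            string value = (string)q.Attribute("termValue");
            bool isVariable = (string)q.Attribute("variableValue") == "true";

            // A variable term uses termValue as the name of a request parameter;
            // a fixed term uses termValue literally and ignores the request.
            if (isVariable)
                value = requestParams[value] ?? string.Empty;

            terms.Add(new KeyValuePair<string, string>(field, value));
        }
        return terms;
    }
}
```

With the sample control XML and a request of query=truth&approved=n, the resolved pairs would be olioSearch/truth, approved/Y and currentStatus/0, which is why the extra approved parameter has no effect.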
Now, let’s look at the handler code.
var q = new BooleanQuery();
XmlNode xn = xdoc.SelectSingleNode("//luceneSearch[1]");
XmlNodeList xnl = xn.SelectNodes("query");
Query[] qArray = new Query[xnl.Count];
for (int i = xnl.Count - 1; i >= 0; i--)
{
    XmlNode node = xnl[i];
    string term = node.Attributes["term"].Value;
    string termValue = node.Attributes["termValue"].Value;
    string variableValue = node.Attributes["variableValue"].Value;
    if (variableValue == "true")
        termValue = context.Request.Params[termValue].Replace("+", " ").Replace("%20", " ");
    if (i == 0)
        qArray[i] = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, term, new SnowballAnalyzer("English")).Parse(termValue);
    else
        qArray[i] = new TermQuery(new Term(term, termValue));
    switch (node.Attributes["booleanClauseOccurs"].Value.ToLower())
    {
        case "must": q.Add(new BooleanClause(qArray[i], BooleanClause.Occur.MUST)); break;
        case "must_not": q.Add(new BooleanClause(qArray[i], BooleanClause.Occur.MUST_NOT)); break;
        case "should": q.Add(new BooleanClause(qArray[i], BooleanClause.Occur.SHOULD)); break;
        default: q.Add(new BooleanClause(qArray[i], BooleanClause.Occur.MUST)); break;
    }
}
IndexSearcher searcher2 = new IndexSearcher(azDir, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(1000, true);
searcher2.Search(q, collector);
int indexID = 1;
ScoreDoc[] hits = collector.TopDocs().scoreDocs;
StringBuilder xmlOutput = new StringBuilder();
xmlOutput.AppendFormat("<?xml version=\"1.0\"?><root>");
for (int i = 0; i < hits.Length; ++i)
{
    xmlOutput.AppendFormat("<hit>");
    int docId = hits[i].doc;
    Document d = searcher2.Doc(docId);
    xmlOutput.AppendFormat("{0}{1}", hits[i].score, indexID++);
    foreach (Field f in d.GetFields())
    {
        if (f.StringValue() == null)
            xmlOutput.AppendFormat("<{0} />", f.Name());
        else
            xmlOutput.AppendFormat("<{0}><![CDATA[{1}]]></{0}>", f.Name(), f.StringValue());
    }
    xmlOutput.AppendFormat("</hit>");
}
xmlOutput.AppendFormat("</root>");
retVal = xmlOutput.ToString();
It’s a long snippet but somewhat important. Again, I still have some dependencies that need to be ironed out and cleaned up, but it’s good enough for a start.
Whatever the first query item is, it will be run through the analyzer (line 14, the QueryParser call). Then, depending on the booleanClauseOccurs attribute, it’s joined together with the other query items until a query is created, passed in to the search, and a list of hits is returned. Lines 32 through 48 (the final for loop and what follows it) build an XML document that emits all of the stored fields as elements under a hit element:
<?xml version="1.0"?>
<root>
<hit>
<id>
<![CDATA[26ee9923-1f41-438e-980c-1da84664494d]]>
</id>
<quoteText>
<![CDATA[The world always makes the assumption that the exposure of an error is identical with the discovery of truth,
that the error and truth are simply opposite. They are nothing of the sort. What the world turns to, when it is cured on one error,
is usually simply another error, and maybe one worse than the first one.]]>
</quoteText>
<quotedPersonName>
<![CDATA[H.L. Mencken]]>
</quotedPersonName>
<quoteSource>
<![CDATA[]]>
</quoteSource>
</hit>
</root>
With this XML, I can easily use a transform to create an HTML page.
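For completeness, here is a minimal sketch of that transform step using the XslCompiledTransform class from the .NET Framework. The real searchQuotes.xslt is not shown in this post, so the SampleStylesheet constant below is a made-up stylesheet purely for illustration, and ResultTransformer is a hypothetical name.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: applies an XSLT stylesheet to the search result XML.
public static class ResultTransformer
{
    // Illustrative stylesheet only; it renders each hit's quoteText as a list item.
    public const string SampleStylesheet =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
        "<xsl:output method='html' />" +
        "<xsl:template match='/root'><ul>" +
        "<xsl:for-each select='hit'><li><xsl:value-of select='normalize-space(quoteText)' /></li></xsl:for-each>" +
        "</ul></xsl:template>" +
        "</xsl:stylesheet>";

    public static string Transform(string resultXml, string xslt)
    {
        var transform = new XslCompiledTransform();

        // Compile the stylesheet from its text.
        using (XmlReader styleReader = XmlReader.Create(new StringReader(xslt)))
            transform.Load(styleReader);

        // Run the transform and capture the output as a string.
        using (XmlReader input = XmlReader.Create(new StringReader(resultXml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

In a handler, the stylesheet text would presumably come from the blob named in the transform element rather than from a constant, and the returned string would be written to the response.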
When I finished, it only took a couple of minutes to add a new search page by defining a new control XML and a new XSLT, no recompiles necessary.