A re-introduction to the ContentSearch API in Sitecore - Part 1

In this blog post, which is part of the Sitecore Search Series, I'll re-introduce the ContentSearch API found in Sitecore. There are many articles on the web that introduce the ContentSearch API; this is my take.

A long time ago in a galaxy far, far away...

Deep within the darkest corners of the labs inside the Apache foundation, a powerful search engine framework named Lucene emerged. Lucene allows us to add search capability to our applications, and exposes an easy-to-use API while hiding all the complex search-related operations.

Although mighty powerful, the arcane powers of Lucene proved too difficult for some to wield, and thus Solr was created. Solr is built around Lucene, but is not merely a wrapper around it. Solr is a web application that offers an entire infrastructure and many features in addition to what Lucene offers, making it more manageable to work with the powers provided by Lucene.

From Stackoverflow: A simple way to conceptualize the relationship between Solr and Lucene is that of a car and its engine. You can't drive an engine, but you can drive a car. Similarly, Lucene is a programmatic library which you can't use as-is, whereas Solr is a complete application which you can use out-of-box.

Although Solr uses Lucene under the hood, Lucene has no knowledge of Solr, or of its API for that matter. This means that even though both technologies provide some of the same basic functionality, the way we as developers work with each of them is slightly different, as a gap has grown between the two APIs over time.

Entering the ContentSearch API

Looking at it from the perspective of Sitecore, Lucene was the first technology adopted, whereas support for Solr was added later. So you can work with either of the two technologies, but since there is a gap between the two APIs, this would usually mean that you have to write completely different code for Lucene and for Solr, and that you cannot reuse anything if you decide to switch from one technology to the other. Well... Good luck with that!

Lucky for us, that's not the case when using Sitecore. To make our lives easier, Sitecore provides an abstraction over the low-level details of working with native search technologies like Lucene and Solr. This means that we can use one API from Sitecore to work with either Lucene or Solr, although there will be differences in configuration and so forth that you need to address. It also means that if Sitecore were to support a new search index technology in addition to Lucene and Solr, like Elasticsearch (which is also built on top of Lucene), they could fit it in the same way, and we wouldn't need to change any of our query-specific code, only configuration. This is really neat, since it gives us developers great flexibility in choosing the right tool at a given time, and does not prevent us from choosing a different underlying technology later, if that technology becomes a better fit.

You can read more about the reasons for using Solr over Lucene in the official documentation from Sitecore on using Solr or Lucene.

Now that you know the idea behind the ContentSearch API, let's see how you can use it.

The basics of search index querying

If you have worked with Object-Relational Mappers (ORM), like NHibernate or the Entity Framework, search index querying in Sitecore should seem familiar. For those of you who haven't, I'll go over all the details you need to know in order to perform search index queries.

When performing queries against a search index, there are basically three steps that need to be done:

  1. Get a handle to the search index you want to work on
  2. Open a connection to the search index, also known as getting a context
  3. Perform queries on the search index, using the context

I've provided a small example that illustrates the steps involved in performing a search index query:
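A minimal sketch of such a query is shown here. The class and method names (BasicSearchExample, FindByContent) are hypothetical and only serve the illustration; the index name is passed in by the caller, and a name such as "sitecore_web_index" is an assumption that depends on your setup.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical helper class, for illustration only.
public static class BasicSearchExample
{
    // indexName is whatever your index is called, for example
    // "sitecore_web_index" (an assumption; check your own configuration).
    public static List<SearchResultItem> FindByContent(string indexName, string searchText)
    {
        // Step 1: get a handle to the search index.
        ISearchIndex index = ContentSearchManager.GetIndex(indexName);

        // Step 2: open a search context; the using statement disposes it.
        using (IProviderSearchContext context = index.CreateSearchContext())
        {
            // Step 3: build a LINQ query against the index.
            // ToList() executes the query before the context is disposed.
            return context.GetQueryable<SearchResultItem>()
                .Where(item => item.Content.Contains(searchText))
                .ToList();
        }
    }
}
```

Calling ToList() inside the using block matters here: the query is only sent to the search index when it is enumerated, so the results should be materialized while the context is still open.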

Step 1. Get a handle to the search index you want to work on: In order to work with one of the search indexes in Sitecore, you need to get a handle to the index. This is done by calling the static GetIndex(string indexName) method on the ContentSearchManager class, which returns an ISearchIndex instance. The ISearchIndex instance represents the given search index; through it you can get information about the actual index, and you can also do things like trigger a rebuild of the index.

Step 2. Open a connection to the search index: With a handle to the index, the next thing we need to do is create a search context, which allows us to perform queries against the search index. This is done by calling the CreateSearchContext() method, which effectively opens a connection to the search index - think of it like opening a database connection. The search context instance is wrapped in a using statement, since it needs to be disposed once we are done performing queries with it - again, this is the standard pattern when working with resources in .NET that need to be disposed after use.

Step 3. Perform queries on the search index: Once you have a connection to the search index, call the GetQueryable<T>() method on the context instance, which returns an instance of type IQueryable<T>. This is where the really cool part comes in, as you are now able to write standard LINQ queries using the IQueryable<T> instance, and tune your search query against the data in the search index.

The generic parameter T can be of any type, as long as it either is, or inherits from, the SearchResultItem base class, which is the default implementation provided by Sitecore.

In the example provided, we have performed a very simple query, where we ask the ContentSearch API to retrieve all search index entries whose content contains a specific text. However, we could have done other things as well, like asking for all search index entries that were created over the past 2 weeks, are based on a given template type, and also contain parts of a text in their content field. Basically, the search queries you can build are limited only by your imagination and by the properties accessible on the SearchResultItem implementation provided as the generic type.
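The combined query described above could be sketched roughly as follows. The class and method names are hypothetical, the template ID and search text are supplied by the caller, and the sketch assumes an already opened search context.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;
using Sitecore.Data;

// Hypothetical helper class, for illustration only.
public static class CombinedSearchExample
{
    public static List<SearchResultItem> FindRecentByTemplate(
        IProviderSearchContext context, ID templateId, string searchText)
    {
        // Lower bound for the creation date: two weeks back from now.
        DateTime twoWeeksAgo = DateTime.UtcNow.AddDays(-14);

        // Chained Where clauses are combined with a logical AND.
        return context.GetQueryable<SearchResultItem>()
            .Where(item => item.CreatedDate >= twoWeeksAgo)
            .Where(item => item.TemplateId == templateId)
            .Where(item => item.Content.Contains(searchText))
            .ToList();
    }
}
```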

And that's it, you are now able to perform basic queries against your Sitecore search indexes, which we created in the previous blog post on setting up Solr for Sitecore 8.x.

Hold on, how does this relate to the data stored in the search index?

Now that we've seen how we can write queries using the ContentSearch API, you might be wondering how this is tied together with the data stored in the search index.

Let's look at a simplified view on how data is stored in the search index:

The data stored in the search index is simple strings, consisting of key/value pairs. The question is, how are the properties on the SearchResultItem bound to the fields in Solr? The binding works like this: each property of the SearchResultItem is decorated with a custom attribute named [IndexField]. This attribute takes a name as its parameter, where the name corresponds to the name of a field in Solr. When performing a search query, the ContentSearch API uses the [IndexField] attribute to figure out which field it should use when building up the Solr query, based on the LINQ query.

You can choose to omit the [IndexField] attribute, and instead make sure that your property name is exactly the same as the field name in the Solr index. This will also work, since the internal implementation of the ContentSearch framework first looks for [IndexField] attributes on the properties, and then tries to find properties with matching names.
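To illustrate both binding styles, here is a minimal sketch of a custom result type. Everything specific in it is an assumption made for the example: the class name ArticleSearchResultItem and the Solr field names "title_t" and "author_s" are hypothetical and need to be replaced with fields that actually exist in your index.

```csharp
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical result type; inherits the default properties
// (Content, CreatedDate, ...) from SearchResultItem.
public class ArticleSearchResultItem : SearchResultItem
{
    // Explicit binding: the attribute names the Solr field.
    // "title_t" is a placeholder field name.
    [IndexField("title_t")]
    public string Title { get; set; }

    // Implicit binding: no attribute, so the property name itself
    // has to match the Solr field name ("author_s" is a placeholder).
    public string author_s { get; set; }
}
```

A type like this would then be used as the generic parameter, for example context.GetQueryable<ArticleSearchResultItem>(), so that Title and author_s become available in the LINQ query.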

I've added a small subset of the implementation for the SearchResultItem class, so you can see how this is working:
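The following is an abridged sketch of how those two properties might look, not the complete Sitecore source. The field names are the ones mentioned in this post; any additional attributes the real class may carry are left out.

```csharp
using System;
using Sitecore.ContentSearch;

// Abridged sketch of Sitecore's SearchResultItem; the real class
// contains many more properties than the two shown here.
public class SearchResultItem
{
    // Bound to the "_content" field in the index.
    [IndexField("_content")]
    public virtual string Content { get; set; }

    // Bound to the "__smallcreateddate" field in the index.
    [IndexField("__smallcreateddate")]
    public virtual DateTime CreatedDate { get; set; }
}
```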

As you can see, the Content property refers to the _content field in Solr via the [IndexField] attribute. Likewise, the CreatedDate property refers to the __smallcreateddate field in Solr. Apart from these two properties, the default SearchResultItem implementation contains a range of different properties you can use, when building up your query.

Alright, but how do we know whether to use a string, bool or even DateTime as the type of the property? That is actually a good question, and to answer it, we need to take a closer look at the configuration file named Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config, found under the \www\App_Config\Include folder:
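An abridged excerpt of the relevant section might look roughly like this. It is written from memory as a sketch: the exact entries, attribute values and additional attributes (such as settingType, omitted here) can differ between Sitecore versions, so treat your own configuration file as the source of truth.

```xml
<!-- Abridged sketch; check your own configuration file for the exact entries. -->
<typeMatches hint="raw:AddTypeMatch">
  <!-- Solr text fields map to .NET strings -->
  <typeMatch typeName="text" type="System.String" fieldNameFormat="{0}_t" />
  <!-- Solr date fields map to .NET DateTime values -->
  <typeMatch typeName="datetime" type="System.DateTime" fieldNameFormat="{0}_tdt" />
  <!-- Solr boolean fields map to .NET bool values -->
  <typeMatch typeName="bool" type="System.Boolean" fieldNameFormat="{0}_b" />
</typeMatches>
```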

If we review the <typeMatches> section of the configuration, we'll find that each field type in Solr has a direct match to a .NET type. When the Solr queries are executed through the ContentSearch API, the internals of the framework use the type match information to convert data coming back from Solr into the appropriate .NET type. If the data stored in Solr is text, we need to use a string as the type of the property; likewise, if the Solr field is a datetime, we need to use a DateTime as the type of our property.

To be continued...

In the next blog post, A re-introduction to the ContentSearch API in Sitecore - Part 2, I'll show you how to deal with more complex (and dynamic) queries, how to apply sorting and paging to the search query results, and how to use facets, and I'll go into the more exotic parts of the ContentSearch API. At the end I'll provide you with a fully fledged piece of working sample code that shows how all of the different bits and pieces fit together.

As always, if you have additional details on the content explained in this blog post, or feedback in general, please drop me a note in the comment section below.