ElasticSearch Nested Filter Not Matching Emails

There’s one situation where we need to help ElasticSearch understand the structure of our data in order to be able to query it fully – when dealing with arrays of complex objects.

ElasticSearch has one great feature: it allows us to search in nested properties of complex JSON objects. It’s normally used for lists of objects inside the parent document. Just to mention, I am using ElasticSearch 1.4 for legacy reasons. Here is an example of the model we have stored in ES:

{
	"total": 10,
	"hasNext": true,
	"contacts": [
		{
			"id": "123",
			"domains": [
				"gmail.com"
			],
			"knownBy": [
				{
					"provider": "friend@gmail.com",
					"email": "jon.doe@gmail.com",
					"name": "Jon",
					"surname": "Doe"
				}
			],
			"tags": [],
			"created": "2016-11-03T09:24:38.042+0000",
			"updated": "2016-11-25T05:08:39.754+0000",
			"active": true
		}
	]
}
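
One thing to note: a nested filter only works if the `knownBy` objects are mapped with type `nested`; otherwise ElasticSearch flattens the inner objects into the parent document and loses the association between `email` and the other fields of the same object. A minimal mapping sketch (the type name `contact` is just an assumption for illustration):

```json
PUT my_index/_mapping/contact
{
  "contact": {
    "properties": {
      "knownBy": {
        "type": "nested",
        "properties": {
          "email": { "type": "string" }
        }
      }
    }
  }
}
```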

We have a service that retrieves records based on an internal field in the document, contacts.knownBy.email.

However, ElasticSearch won’t be able to match the document for one simple reason: the structure isn’t stored the way it is represented in the response, and we need to specify the path of the nested object in the query.

{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "nested": {
               "path": "knownBy",
               "filter": {
                  "bool": {
                     "must": [
                        {
                           "term": {
                              "knownBy.email": "jon.doe@gmail.com"
                           }
                        }
                     ]
                  }
               }
            }
         }
      }
   }
}

In the code above we say that we’re targeting our nested objects with the path knownBy, which allows us to access the inner properties, in this case the email. Though we still have one more problem. The type of our field is string and we use the default Standard Tokenizer. The problem with the default tokenizer is that it treats the at sign (@) as a word delimiter, so it will split the email address and drop the symbol. The data will be indexed in the following format:

["jon.doe", "gmail.com"]

Obviously that will prevent us from searching on the whole email address; we would only get results when matching one of its two parts. That’s less than ideal, so we need to replace our tokenizer with one more appropriate for handling emails and URLs.

For that reason we can use the UAX URL Email Tokenizer. We can set it against the index with the following request:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "tokenizer": "email_tokenizer"
        }
      },
      "tokenizer": {
        "email_tokenizer": {
          "type": "uax_url_email"
        }
      }
    }
  }
}
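
Defining the analyzer alone isn’t enough; it also has to be assigned to the field in the mapping. A sketch of that step, again assuming the type is called `contact`:

```json
PUT my_index/_mapping/contact
{
  "contact": {
    "properties": {
      "knownBy": {
        "type": "nested",
        "properties": {
          "email": {
            "type": "string",
            "analyzer": "email_analyzer"
          }
        }
      }
    }
  }
}
```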

And now our data will be stored in the correct format and allow us to search by the full email address.
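
We can verify that the new analyzer keeps the address whole with the same _analyze API as before:

```json
GET my_index/_analyze?analyzer=email_analyzer&text=jon.doe@gmail.com
```

This should now return a single jon.doe@gmail.com token, which is exactly what our term filter needs to match.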

