Making a simple Recommender System: Azure Cosmos Db and Apache Tinkerpop

Graph systems are one of the most ubiquitous models found in almost any natural and man-made structures we see everyday. Since computer science evolved around what we see and devise or deduce, graphs do play a huge role in day to day computation and programming techniques. It’s birth dates back to even before it was widely adopted in mathematics since mechanical computing was graph driven implicitly. If only we could get Alan Turing to talk over this.

Before I start, a big inspiration behind this write-up definitely goes to the work of Marko.A.Rodriguez . I was lucky to stumble upon his work while frolicking over the web and his works on graph computing systems and tinkerpop/gremlin is seriously inspiring.

What we are going to do today:

Let’s go right to what we want to do today. Our component of interest is the graph api of Azure Cosmos Db along with Apache Tinkerpop. Azure Cosmos Db has a graph api that allows us to store data as a graph or a network. In simple words, instead of storing data in a tabular format where you have row and column, you can store data as they describe each other in terms of relations. If the text here is not really doing it for you, let’s jump into some examples.

Let’s have a look into this sample graph I made from HIMYM. The big ellipses and boxes here are called vertices and the lines that connect them are called edges. Since we have a little arrow head telling the directions of them, this is actually a Directed Graph. If we look at the data behind the graph, then you can see, we can represent this in two ways. In a table, you can list up the vertices in one table and the edges in other. The edges table might look like:

From To Type
Ted Barney  friend
Ted Umbrella  found
Barney Robin  wife
Robin Barney  husband
Tracy Umbrella  lost
Ted Tracy  lost
….

I didn’t add all the rows of course and if you look at the design you can see it is not properly normalized. That means in understandable terms that we needed a Type table with all the edge types so we don’t write a Type twice. But that is not in this article’s scope. So, let’s ignore that. The thing here to see is, graph nodes/vertices can be of different types and the edges can mean different relationships. Like, here there are vertices that doesn’t represent a character from HIMYM, like the Umbrella and the MacLaren’s Pub. Storing these data in a regular database is possible and has been done many times. This way is called the ‘implicit’ way to store graph like data. Now one might think, by this rule all data in the world can be represented by a graph. And it is indeed true. But we use a graph database only when we see that the data itself focuses more on relations than the data content itself. Like friend graph in a social network website. Or a geographical model of all the offices of a big corporation. The vertices do carry data but the relations are really important here. In these case, a graph database comes in very handy. Azure Cosmos Db recently came up with a graph api and it can store graph data natively where the cost to traverse a graph is constant. In a regular database we have to join multiple tables to formulate these relations properly and in a lot of cases, these computational costs are not constant.

Apache Tinkerpop joins the party:

I hope by now we understand why we need a graph database and now is the right time we talk about Apache Tinkerpop. It was previously known as Apache Gremlin. This is essentially a fantastic graph computing engine which allows you to connect to multiple graph database using a single domain specific language construct. It has language supports for most of the popular languages and it is pretty native to groovy and java. Fret not, we have our way to use it over here with C# too. Before we start we need to create a simple graph database on Azure. The quickstart is here. You can also opt for java and nodejs. I personally used an Azure CosmosDb emulator. The quickstart to install that instead of an actual Azure CosmosDb is here. You can opt for any of these but I have to remind you, at the time this article is being written, Azure CosmosDb Emulator do not support creating graphs and browsing graph data through it’s local web portal. You have to use Gremlin Console to connect and talk to the local emulator. Gremlin Console is a command line REPL that you can use to traverse a local gremlin/tinkerpop graph or a remote one. It can connect to any gremlin servers anywhere as long as you have the credentials. You can use the gremlin console to talk to the actual Azure CosmosDb graph database. I suggest using Windows Subsystem For Linux (WSL), better known as “Bash on Ubuntu on Windows” for using gremlin-console. The gremlin console quickstart is here.

Our sample data set today:

We are going to use the movie review data set from grouplens from here . We are using the small dataset of about 100,000 ratings applied to 9,000 movies by 700 users. If you download and unzip the data you will see multiple csv fields. Of which I only used the movies.csv and ratings.csv files. I made a separate users.csv file for the users list. Don’t worry, all these are attached with the sample code. The movies csv comes with the movie id, movie name  and genres column. The ratings.csv comes with the user id, movie id and a rating column where the user has rated the movie from 0 to 5. The sample code has a simple command line uploader tool that will let you upload the data in your desired gremlin server and graph database connected to it. So, to understand how we talk to gremlin, let’s have a peek at the code, shall we?

Let’s talk code:

I’m not going to focus on the full source at a time.  Let’s have a look on how can we connect to an existing Azure CosmosDb Graph database. First, we create a DocumentClient. 


DocumentClient client = new DocumentClient(
 new Uri(endpoint),
 authKey,
 new ConnectionPolicy { ConnectionMode = ConnectionMode.Direct, ConnectionProtocol = Protocol.Tcp });

Definitely the thing that you will notice missing is the authkey variable here. And that is actually the primary key of your graph database that you will find in your azure portal. If you use the emulator they use a fixed key which you will find in the quickstart link I shared above.

Now that we do have a DocumentClient let’s upload the sets of movies that we will use for reviews. I’m using the movie name and the movie id for the sake of simplicity here. In our graph, the movie vertices/nodes will have a label ‘movie’. This sets the type of the vertex and we will set the name and id as properties of this vertex. The id uniquely distinguish any vertex. So remember, no matter what the label of the vertex is the id has to be unique. Or Azure CosmosDb will let you know that it can not add a vertex because already a vertex of that same key exists. If you don’t provide an id property, Azure CosmosDb will generate one and put it on the vertex.

We make sure at the beginning of the upload to drop the existing graph database if there is any.

        private async Task NukeCollection(DocumentClient client)
        {
            try
            {
                Console.WriteLine("Nuking...");
                var response = await client.DeleteDocumentCollectionAsync(UriFactory.CreateDocumentCollectionUri("graphdb", "Movies"));
                Console.WriteLine(response.StatusCode);
            }
            catch (DocumentClientException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

The database name I used for these sample is graphdb and the collection name is Movies. Pardon me for my indecency to hard code these but it’s a sample code, so I put effectiveness in front of decency.

Adding vertices to our ‘Movies’ graph:

Now, we upload the movies. Of course make sure we create the database if it’s already not there.

        private async Task UploadMovies(DocumentClient client)
        {
            try
            {
                Console.WriteLine("Uploading movies");
                Database database = await client.CreateDatabaseIfNotExistsAsync(new Database { Id = "graphdb" });

                DocumentCollection graph = await client.CreateDocumentCollectionIfNotExistsAsync(
                    UriFactory.CreateDatabaseUri("graphdb"),
                    new DocumentCollection { Id = "Movies" },
                    new RequestOptions { OfferThroughput = 1000 });

                Console.WriteLine("Connected to graph Movies collection");

                Console.WriteLine("Reading movie list");
                using (TextReader reader = new StreamReader("movies2.csv"))
                using (CsvReader csv = new CsvReader(reader))
                {
                    while (csv.Read())
                    {
                        string idField = csv.GetField<string>(0);
                        string titleField = csv.GetField<string>(1);
                        titleField = JsonConvert.ToString(titleField, '\"', StringEscapeHandling.EscapeHtml);

                        Console.WriteLine("Uploading " + titleField);

                        IDocumentQuery<dynamic> query = client.CreateGremlinQuery<dynamic>(graph, $"g.addV('movie').property('id', '{idField}').property('title', {titleField})");
                        while(query.HasMoreResults)
                        {
                            await query.ExecuteNextAsync();
                        }
                    }
                }
            }
            catch (DocumentClientException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

For the collection, same strategy is followed. We get a DocumentCollection instance trying to create the Movies collection if it doesn’t exist or just fetching the existing one if there is one already. Then we start reading the csv file. We use the same DocumentClient instance to create a gremlin construct to add a vertex/node in the collection for each of the movies. We add a Movie label along with the title and id  property as promised. I want to focus a little bit in the gremlin/tinkerpop construct we used here.

The full construct in a more readable format is:

g.addV('movie')
    .property('id', '{idField}')
    .property('title', {titleField})

Let’s ignore the idField and titleField as we already know what is there. The first construct g  stands for the graph in the collection. addV(‘label’) is the method construct to add a vertex. You can see the whole construct is design wise fluent. The next two property(propetyName, propertyValue) construct adds two properties in the newly added movie node. Nice builder interface, isn’t it? Pretty expressive. We are using the groovy standard constructs for gremlin. There are *other language constructs too, including javascript. And following the same approach, I uploaded the users in the sample code too.

By the time of writing this article, the whole connector library from nuget is in preview version and I don’t have one for .net core. So, we need to sit this one out for .net core. Sorry for that.

Connecting the vertices with edges:

The next thing in line is of course to add edges that represents the relationship between an user and a movie. We label the relationships with ‘rates’ and also will add a property named weight and it’s value will be the actual rating value given by that user.

        private async Task UploadReviews(DocumentClient client)
        {
            try
            {
                Console.WriteLine("Uploading movie reviews");

                DocumentCollection graph = await client.CreateDocumentCollectionIfNotExistsAsync(
                    UriFactory.CreateDatabaseUri("graphdb"),
                    new DocumentCollection { Id = "Movies" },
                    new RequestOptions { OfferThroughput = 1000 });

                Console.WriteLine("Connected to graph Movies collection");

                Console.WriteLine("Reading review list");
                using (TextReader reader = new StreamReader("ratings2.csv"))
                using (CsvReader csv = new CsvReader(reader))
                {
                    while (csv.Read())
                    {
                        string userId = "user" + csv.GetField<string>(0);
                        string movieId = csv.GetField<string>(1);
                        float rating = csv.GetField<float>(2);

                        Console.WriteLine("Uploading review for user " + userId + " to " + movieId + " with rating "+ rating);
                        IDocumentQuery<dynamic> query = client.CreateGremlinQuery<dynamic>(graph, $"g.V().hasLabel('user').has('id', '{userId}').addE('rates').property('weight', {rating}).to(g.V().has('id', '{movieId}'))");
                        while (query.HasMoreResults)
                        {
                            var result = await query.ExecuteNextAsync();
                            foreach (var item in result)
                            {
                                Console.WriteLine(item);
                            }
                        }
                    }
                }
            }
            catch (DocumentClientException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

The only noticeable change here in this snippet that we created edges between a movie and an user vertex. Like before, let’s zoom in the gremlin construct we used this time to create an edge between two nodes.

g.V()
    .hasLabel('user')
    .has('id', '{userId}')
    .addE('rates')
    .property('weight', {rating})
    .to(g.V()
    .has('id', '{movieId}'))

The first enumeration g.V() should enumerate all the vertices in the graph. We need to filter the user vertex first to create an edge from. The next hasLabel(”user) filters all the user vertices. Subsequently .has(‘id’, ‘{userId}’) filters the vertex of the user we want bearing that user id. Then we use addE(‘rates’) method to ensure that we add an edge labeled ‘rates’ from it. The following property(‘weight’, {rating}) should add the weight property with the rating as value on the edge we just created. The last thing we do is to tell where this edge points to. With to(g.V() .has(‘id’, ‘{movieId}’))  we filter out the movie node we want and use it to define which vertex the edge points to. Decent huh? You can find the full tinkerpop reference here.

By now a single user rating a movie should look something like:

Finally, we have all our data ready. Time to traverse this graph and make a simple movie recommendation for any user. 🙂

The simplest movie recommender system in this world:

Let’s devise a simple movie recommender and we will follow the dumbest real life approach we see around. Usually when I pick a movie to see, I pick a movie based on the movies I have already seen and liked. If I liked Deadpool, there’s a very good chance I will like Deadpool 2 and other superhero movies like Logan. Remember we didn’t add/use any genre data in our graph. All we have here are the users, the movies and the ratings made by the users on these movies.

So, let’s find out the other users who likes the same movies as our reference user do. Our reference user is the user who has asked for a recommendation for movies that he should see from us. He has only given us a list of movies he saw and rated. Our current traversed graph should look like this:

We want to traverse the users who like the same movies our reference user likes using gremlin. To construct the query, first we need to get out of our user vertex and find out movies that our user likes. The gremlin construct for that will be:

g.V()
    .hasLabel('user')
    .has('id', 'user7')
    .outE('rates')
    .has('weight', gte(4.5))

Let’s assume our reference user id is user7. And at first we filter the vertices with label user and id user7. After that we use outE(‘rates’) to traverse edges that is labeled rates and has weight more than or equal to 4.5. That’s how we will land on the nodes that we think the user likes. But right now, we are standing on the edges. To land on the movie nodes, we have to use:

g.V()
    .hasLabel('user')
    .has('id', 'user7')
    .outE('rates')
    .has('weight', gte(4.5))
    .inV()
    .as('exclude')

inV() construct will enumerate all the attached movie nodes with the rates edges. The as(‘exclude’) construct is used for a specific purpose. For now, all I can say is, I’m marking all the movies I saw as ‘exclude’.  We will see why I’m doing this soon.

So, now we want to find out all the other users who likes the movies our reference user likes.

So, this is where we want to end up. We found user2 and user3 likes the same movies almost as much our reference user likes. To progress through gremlin, our new gremlin query is:

g.V()
    .hasLabel('user')
    .has('id', 'user7')
    .outE('rates')
    .has('weight', gte(4.5))
    .inV()
    .as('exclude')
    .inE('rates')
    .has('weight', gte(4.5))
    .outV()

We added a InE(‘rates’) construct since we want to know now which inward edges with label ‘rates’ are pointing to these movies our reference user likes. We also filter the edges more with weight property value greater than or equal to 4.5 since we only want users who likes the same movies. At last we added outV() to find the users attached to these edges. Now we are standing on the users who likes the same movies our reference user does.

We want to know what other movies these users likes that our reference user has not rated or seen yet. I’m assuming our reference user rated all the movies he has seen.

From the graph above we can clearly see that user3 and user2 likes movie4 and movie5 that our current user has not rated yet. These are viable candidates for our users movie recommendation. It definitely seems naive, but it’s a start. If you remember about the exclude nodes, we are going to use these now to make sure that our recommender doesn’t recommend the movies our user already has seen. Our desired gremlin query is:

g.V()
    .hasLabel('user')
    .has('id', 'user7')
    .outE('rates')
    .has('weight', gte(4.5))
    .inV()
    .as('exclude')
    .inE('rates')
    .has('weight', gte(4.5))
    .outV()
    .outE('rates')
    .has('weight', gte(4.5))
    .inV()
    .where(neq('exclude'))

We traveled the movies all these other user likes and made sure that we exclude the ones we have already seen using  where(neq(‘exclude’)). To fix our naivety a little bit, let’s take the distinct movies using dedup() and order it by the movies by the amount of ratings it has received.

g.V()
    .hasLabel('user')
    .has('id', 'user7')
    .outE('rates')
    .has('weight', gte(4.5))
    .inV()
    .as('exclude')
    .inE('rates')
    .has('weight', gte(4.5))
    .outV()
    .outE('rates')
    .has('weight', gte(4))
    .inV()
    .where(neq('exclude'))
    .dedup()
    .order().by(inE('rates').count(), decr)
    .limit(10)
    .values('title')

This has to be really really naive to survive any production requirement. But it is indeed an eye opener on how Apache Tinkerpop and Azure CosmosDB Graph databases can do.

If you look on the final query quickly you will see we ordered the final movie nodes by the count of incoming rates edges it has and ordered it in a descending order using order().by(inE(‘rates’).count(), decr). We limited the nodes to first 10 and we only took the titles of the movies.

Putting the system to test:

I wrote a simple REPL to try out various gremlin commands towards our Azure CosmosDb.  Our reference user id was ‘user7’. The list of the movies seen by our reference user is: MoviesSeenByUser

If it is too hard to read, let’s put out the movies here:

  • “Braveheart (1995)”
  • “Star Wars: Episode IV – A New Hope (1977)”
  • “Shawshank Redemption, The (1994)”
  • “Wallace & Gromit: The Best of Aardman Animation (1996)”
  • “Wallace & Gromit: A Close Shave (1995)”
  • “Wallace & Gromit: The Wrong Trousers (1993)”
  • “Star Wars: Episode V – The Empire Strikes Back (1980)”
  • “Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)”
  • “Star Wars: Episode VI – Return of the Jedi (1983)”
  • “Grand Day Out with Wallace and Gromit, A (1989)”
  • “Amadeus (1984)”
  • “Glory (1989)”
  • “Beavis and Butt-Head Do America (1996)”

The movies our simple recommender suggested are:

  • “Forrest Gump (1994)”,
  • “Pulp Fiction (1994)”,
  • “Fargo (1996)”,
  • “Silence of the Lambs, The (1991)”,
  • “Star Trek: Generations (1994) “,
  • “Jurassic Park (1993)”,
  • “Matrix, The (1999)”,
  • “Toy Story (1995)”,
  • “Schindler’s List (1993)”,
  • “Terminator 2: Judgment Day (1991)”

Clearly this is not the best recommender engine, but definitely one of the simplest. We can always reuse the genome data along with the dataset and the genre data and do a proper collaborative filtering. But the scope of this article was just to demonstrate the capabilities of a simple graph traversal.

Hope it was fun to read. Try out Azure Cosmos Db Graph Api and Apache Tinkerpop if you can. They are really fun together to use to and there’s so much you can do using simple graph traversals.

The sample code is hosted here in github. Until next time!

 

Building your own REST query language: An experiment with Azure Cosmos Db

For all the REST resources we deploy out every day, one of the most common scenario a developer ends up handling is designing a nice and effective search/filter/query endpoint for these aforementioned resources. Usually in these scenarios conventions play a big part and that starts from query filter parameters in the query string to using a full blown search engine like ElasticSearch, Lucene or Algolia. Vinay Sahni did a phenomenal job for putting out the gist for basic REST practices here. If you are new into REST and want to get yourself started talking REST, it’s worth a look.

Today, we will do a simple yet fun experiment. Our goal today is to create a simple nifty looking query language for a sample REST resource. We will try to mimic a couple of features Twitter Search Api provides and our sample data storage today will be Azure Cosmos Db: Document Db which previously was known as just Azure Document db. Recently it moved to the Azure Cosmos Db family and if you have missed it, here is the full details.

Therefore, the first thing we need is a set of sample data we can use as our REST resource. Since we are going to mimic the twitter public search api, we definitely need some sample data first. And of course github comes to the rescue. Download the sample data and put it to your own document db instance. If you don’t know how to create a document db instance, follow the quick start here. I’m referencing the link to the .net quick start, but there are node.js, Python, Java and Xamarin resources right by the side of it. For those of you who do not want to test your work against an Azure hosted document db and want to test Azure document db locally, you can always opt for the azure document db emulator.

Now, for the sake of this experiment I ported the github twitter sample data to an azure document db instance. It has 10 entries in that small data set and hosted here. If you want to browse the data, azure portal does allow it. But I suggest looking at this open source app here named Azure DocumentDB Studio. You can use the endpoint and the key written in the sample code in github to connect. The database is public just for you guys so you can test the sample code I have hosted in github and will list out in the end of this article. TL;DR people, this is your cue to go to the end of the article, but I highly suggest to stick around.

Features to be built:

Out of all the features twitters search api provides, I will port the to:UserAccount, from:UserAccount and “exact text” search capabilities. That means, We will be able to search tweets sent from one account to other and we will also be able to search tweets by mentioning a string we like to be present in the tweets. For an example if we want to find tweets

  1. from an user account named @terminator
  2. to another user account named @robocop 
  3. or any tweet where the words “I’m back”
  4. or the hashtag #HastaLaVista is present,

our sample query language excerpt will be

http://our-awesome-api/search/q?to:robocop AND from:terminator OR “I’m back” OR #HastaLaVista”

If you have an eye for detail, you will notice that this is not exactly like the twitter search api since doesn’t use an and operator. But for the sake of the simplicity in this example this should be enough.

Tools we are going to use:

Our api stack will be written in asp.net core. We will use ANTLR as our query language lexer-parser. If you need an ANTLR primer and another experiment I did with it, please have a look here where I tried to gobble up a simple scripting language. If you are not really accustomed to any of .net stacks, fret not. All of these are totally doable in any other tech stack you will possibly prefer to use. Since ANTLR has targets for multiple languages, api can be built in basically any language and even for our sample storage solution azure document db here, you have access to multiple client SDKs.

The data:

As I mentioned above, we have a little data set of 10 entries. I’m posting a sample entry here so the rest of the tutorial makes sense. One entry points to a single tweet made by a sample user.

{
  "entities": {
    "user_mentions": [
      {
        "indices": [
          3,
          15
        ],
        "id_str": "178253493",
        "screen_name": "mikalabrags",
        "name": "Mika Labrague",
        "id": 178253493
      }
    ],
    "urls": [],
    "hashtags": [ "#Confused" ]
  },
  "in_reply_to_screen_name": null,
  "text": "RT @mikalabrags: Bipolar weather #Confused",
  "id_str": "210621130703245313",
  "place": null,
  "retweeted_status": {
    "entities": {
      "user_mentions": [],
      "urls": [],
      "hashtags": []
    },
    "in_reply_to_screen_name": null,
    "text": "Bipolar weather",
    "id_str": "210619512855343105",
    "place": null,
    "in_reply_to_status_id": null,
    "contributors": null,
    "retweet_count": 0,
    "favorited": false,
    "truncated": false,
    "source": "http://ubersocial.com",
    "in_reply_to_status_id_str": null,
    "created_at": "Thu Jun 07 06:29:39 +0000 2012",
    "in_reply_to_user_id_str": null,
    "in_reply_to_user_id": null,
    "user": {
      "lang": "en",
      "profile_background_image_url": "http://a0.twimg.com/profile_background_images/503549271/tumblr_m25lrjIjgT1qb6nmgo1_500.jpg",
      "id_str": "178253493",
      "default_profile_image": false,
      "statuses_count": 13635,
      "profile_link_color": "06544a",
      "favourites_count": 819,
      "profile_image_url_https": "https://si0.twimg.com/profile_images/2240536982/AtRKA77CIAAJRHT_normal.jpg",
      "following": null,
      "profile_background_color": "373d3a",
      "description": "No fate but what we make",
      "notifications": null,
      "profile_background_tile": true,
      "time_zone": "Alaska",
      "profile_sidebar_fill_color": "1c1c21",
      "listed_count": 1,
      "contributors_enabled": false,
      "geo_enabled": true,
      "created_at": "Sat Aug 14 07:31:28 +0000 2010",
      "screen_name": "mikalabrags",
      "follow_request_sent": null,
      "profile_sidebar_border_color": "08080a",
      "protected": false,
      "url": null,
      "default_profile": false,
      "name": "Mika Labrague",
      "is_translator": false,
      "show_all_inline_media": true,
      "verified": false,
      "profile_use_background_image": true,
      "followers_count": 214,
      "profile_image_url": "http://a0.twimg.com/profile_images/2240536982/AtRKA77CIAAJRHT_normal.jpg",
      "id": 178253493,
      "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/503549271/tumblr_m25lrjIjgT1qb6nmgo1_500.jpg",
      "utc_offset": -32400,
      "friends_count": 224,
      "profile_text_color": "352e4d",
      "location": "Mnl"
    },
    "retweeted": false,
    "id": 2.106195128553431E+17,
    "coordinates": null,
    "geo": null
  },
  "in_reply_to_status_id": null,
  "contributors": null,
  "retweet_count": 0,
  "favorited": false,
  "truncated": false,
  "source": "&amp;amp;amp;lt;a href=\"http://blackberry.com/twitter\" rel=\"nofollow\"&amp;amp;amp;gt;Twitter for BlackBerry®&amp;amp;amp;lt;/a&amp;amp;amp;gt;",
  "in_reply_to_status_id_str": null,
  "created_at": "Thu Jun 07 06:36:05 +0000 2012",
  "in_reply_to_user_id_str": null,
  "in_reply_to_user_id": null,
  "user": {
    "lang": "en",
    "profile_background_image_url": "http://a0.twimg.com/profile_background_images/542537222/534075_10150809727636812_541871811_10087628_844237475_n_large.jpg",
    "id_str": "37200018",
    "default_profile_image": false,
    "statuses_count": 5715,
    "profile_link_color": "CC3366",
    "favourites_count": 46,
    "profile_image_url_https": "https://si0.twimg.com/profile_images/2276155427/photo_201_1_normal.jpg",
    "following": null,
    "profile_background_color": "dbe9ed",
    "description": "protège-moi de mes désirs  23107961 ☍",
    "notifications": null,
    "profile_background_tile": true,
    "time_zone": "Singapore",
    "profile_sidebar_fill_color": "ffffff",
    "listed_count": 2,
    "contributors_enabled": false,
    "geo_enabled": true,
    "created_at": "Sat May 02 13:55:49 +0000 2009",
    "screen_name": "yoursweetiethea",
    "follow_request_sent": null,
    "profile_sidebar_border_color": "91f50e",
    "protected": false,
    "url": "http://yoursweetiethea.tumblr.com",
    "default_profile": false,
    "name": "Althea Arellano",
    "is_translator": false,
    "show_all_inline_media": false,
    "verified": false,
    "profile_use_background_image": true,
    "followers_count": 306,
    "profile_image_url": "http://a0.twimg.com/profile_images/2276155427/photo_201_1_normal.jpg",
    "id": "37200018",
    "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/542537222/534075_10150809727636812_541871811_10087628_844237475_n_large.jpg",
    "utc_offset": 28800,
    "friends_count": 297,
    "profile_text_color": "fa3c6b",
    "location": "Christian's Heart"
  },
  "retweeted": false,
  "id": "210621130703245300",
  "coordinates": null,
  "geo": null
}

The ANTLR grammar:

The ANTLR grammar we are going to use here is:

grammar Search;

expr: term (op term)*;
term: exactText | hashText | toText | fromText;
op: AND | OR;

toText: 'to:'ID;
fromText: 'from:'ID;
hashText: '#'ID;
exactText: EXACTTEXT;

// lexer rule
EXACTTEXT: '"' ~'"'* '"';
OR: 'OR';
AND: 'AND';
ID: [a-zA-Z_] [a-zA-Z0-9_]*;
WS: [ \n\t\r]+ -> skip;

It’s a small and simple grammar like the one we used in our scripting language tutorial. Since our REST resource query is essentially an expression, our entry rule will be expr. The railroad diagram for expr looks like:

expr_rest
Rule expr

This means we can have a single search term or multiple search term chained by a op  rule. The op rule is nothing but the two relational operator we support: AND and OR.

relop
Rule op

Along side of the op rule we also need to know what the term rule stands for. The term rule stands for the search term format we are allowed to use. In our sample query stated above, we have terms like  from:terminatorto:robocop, “I’m back” or a hashtag like #HastaLaVista. That’s why we have 4 rules defining all of these cases and the term rule is a OR relationship between them.

term_rest
rule term

I’m not going to post the railroad diagram for toText, fromText, hashText and exactText rules since they are pretty self-explanatory if you have a cursory look at the grammar.

So, what are we waiting for? Let’s start writing our little codebase that will parse this query string and translate it to an azure document db SQL that we can use to fetch the tweets. For that, we need a small repository that will connect to our desired database and collection in document db and will let us fetch some items. I only added methods that will allow us to read the tweets and connect to the database. I ignored the rest since you can always have a look at those in the quick start for azure document db.

Here’s our small rough database repository. Please remember to keep your endpoint and key strings somewhere secret and safe in production environment, since this is a tutorial, I went with what is easy and fast to go for a proof of concept.

namespace TweetQuery.Lib
{
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents.Client;
    using Microsoft.Azure.Documents.Linq;

    public class CosmosDBRepository<T> where T : class
    {
        private readonly string Endpoint = "https://tweet.documents.azure.com:443/";
        private readonly string Key = "fjp9Z3qKPxSOfE0KS1aaKvUY27B8IoL347sdtMBMjkCQqPmoaKjGXoyltrItNXNN6h4QjAYLSY5nyb2djWWUOQ==";
        private readonly string DatabaseId = "tweetdb";
        private readonly string CollectionId = "tweets";
        private DocumentClient client;

        public CosmosDBRepository()
        {
            client = new DocumentClient(new Uri(Endpoint), Key);
        }

        public async Task<IEnumerable<T>> GetItemsAsync(string sql)
        {
            if (string.IsNullOrEmpty(sql))
                throw new ArgumentNullException(nameof(sql));

            FeedOptions queryOptions = new FeedOptions { MaxItemCount = -1 };

            var query = this.client.CreateDocumentQuery<T>(
                 UriFactory.CreateDocumentCollectionUri(DatabaseId, CollectionId),
                 sql, queryOptions)
                 .AsDocumentQuery();

            List<T> results = new List<T>();
            while (query.HasMoreResults)
            {
                results.AddRange(await query.ExecuteNextAsync<T>());
            }

            return results;
        }
    }
}

I opted for executing a SQL instead of a linq expression since constructing SQL is easier and simpler for a tutorial. Plus it decouples the query structure from compile time POCOs that we use as our models too.

I created a DocumentDbListener class based off the SearchBaseListener which was auto-generated from our ANTLR grammar. The sole purpose of this class is to generate a simple SQL against our search expression. To search inside nested arrays, I used a user defined function for azure document db. All of these are very crudely written, so forgive my indecency. Since this is just a tutorial, I tried to keep it as simple as possible.

function matchArrayElement(array, match) {
    for (var index = 0; index < array.length; index++) {
        var element = array[index];

        if (typeof match === "object") {
            for (var key in match) {
                if (match.hasOwnProperty(key) && element.hasOwnProperty(key)) {
                    var matchVal = match[key];
                    var elemVal = element[key];

                    return matchVal == elemVal;
                }
            }
        }
        else {
            return (element == match)
        }
    }

    return false;
}

All this method does is it tries to find nested array elements based on the match we send back. You can achieve the same result thorough JOINs in Azure Document Db or Array method ARRAY_CONTAINS, but I preferred a user defined function since it serves my purpose easily.

Constructing SQL from the query expression:

To understand how the SQL is generated from the query expression, let’s begin with the to:UserAccount expression. Since we start with the rule expr, let’s override the SearchBaseListener method EnterExpr first.

namespace TweetSearch.CosmosDb.DocumentDb
{
    using Antlr4.Runtime.Misc;
    using TweetSearch.CosmosDb.Util;

    public class DocumentDbListener : SearchBaseListener
    {
        private string projectionClause = "SELECT * FROM twt";
        private string whereClause;

        public string Output
        {
            get { return projectionClause + " " + whereClause; }
        }

        public override void EnterExpr([NotNull] SearchParser.ExprContext context)
        {
            this.whereClause = "WHERE";
        }
    }
}

The approach I took here is essentially the simplest. I handle the events fired the moment ANTLR enters a specific rule and I keep appending the SQL string to the whereClause. Since, entering the expr rule means that I will need a where SQL clause, I initialized it with “WHERE”. The thing to notice here is instead of concatenating I chose to initialize it because I expect this event to be fired exactly once since that is how the grammar is designed.

Following the same trail the next thing to handle will be the EnterTerm event. But, term is nothing but an OR relationship between 4 other rules. Handling them specifically gives me the edge since they produce simpler and smaller readable methods. For example, if we want to handle the to:UserAccount expression, a simple method like following should be sufficient for our use case.

        public override void EnterToText([NotNull] SearchParser.ToTextContext context)
        {
            var screenName = context.GetText().Substring(3).Enquote();
            this.whereClause = string.Concat(whereClause, " ", $"udf.matchArrayElement(twt.entities.user_mentions, {{ \"screen_name\" : {screenName} }} )");
        }

This is where our user defined function also comes in play though. I’m trying to find any tweet that has an user mention to the parsed user account I fetched from the query.

By following the same rule I completed the rest of the four rules and my full listener class looks like:

namespace TweetSearch.CosmosDb.DocumentDb
{
    using Antlr4.Runtime.Misc;
    using TweetSearch.CosmosDb.Util;

    public class DocumentDbListener : SearchBaseListener
    {
        private string projectionClause = "SELECT * FROM twt";
        private string whereClause;

        public string Output
        {
            get { return projectionClause + " " + whereClause; }
        }

        public override void EnterExpr([NotNull] SearchParser.ExprContext context)
        {
            this.whereClause = "WHERE";
        }

        public override void EnterFromText([NotNull] SearchParser.FromTextContext context)
        {
            var screenName = context.GetText().Substring(5).Enquote();
            this.whereClause = string.Concat(whereClause, " ", "twt.user.screen_name = ", screenName);
        }

        public override void EnterOp([NotNull] SearchParser.OpContext context)
        {
            var text = context.GetText();
            this.whereClause = string.Concat(this.whereClause, " ", text.ToUpper());
        }

        public override void EnterToText([NotNull] SearchParser.ToTextContext context)
        {
            var screenName = context.GetText().Substring(3).Enquote();
            this.whereClause = string.Concat(whereClause, " ", $"udf.matchArrayElement(twt.entities.user_mentions, {{ \"screen_name\" : {screenName} }} )");
        }

        public override void EnterHashText([NotNull] SearchParser.HashTextContext context)
        {
            var hashtag = context.GetText().Enquote();
            this.whereClause = string.Concat(whereClause, " ", $"udf.matchArrayElement(twt.entities.hashtags, {hashtag})");
        }

        public override void EnterExactText([NotNull] SearchParser.ExactTextContext context)
        {
            var text = context.GetText();
            this.whereClause = string.Concat(whereClause, " ", $"CONTAINS(twt.text, {text})");
        }
    }
}

We got our listener ready! Now, all we need is a context class that will bootstrap the lexer and parser and tokens so the input expression is transpiled and the output SQL is generated. Just like our last work on ANTLR, the TweetQueryContext class will look like the following:

namespace TweetSearch.CosmosDb.DocumentDb
{
    using Antlr4.Runtime;
    using Antlr4.Runtime.Tree;

    public class TweetQueryContext
    {
        private DocumentDbListener listener;

        public TweetQueryContext()
        {
            this.listener = new DocumentDbListener();
        }

        public SearchParser.ExprContext GenerateAST(string input)
        {
            var inputStream = new AntlrInputStream(input);
            var lexer = new SearchLexer(inputStream);
            var tokens = new CommonTokenStream(lexer);
            var parser = new SearchParser(tokens);
            parser.ErrorHandler = new BailErrorStrategy();

            return parser.expr();
        }

        public string GenerateQuery(string inputText)
        {
            var astree = this.GenerateAST(inputText);
            ParseTreeWalker.Default.Walk(listener, astree);
            return listener.Output;
        }
    }
}

Whew! That was easy, right?

Bootstrapping the api layer:

We have all we need except the api. Thanks to asp .net core, that is two clicks away. Open Visual Studio and open a .net core api project. Our TweetsController class looks like the following:

namespace TweetQuery.Controllers
{
    using Microsoft.AspNetCore.Mvc;
    using System.Threading.Tasks;
    using TweetQuery.Lib;
    using TweetQuery.Lib.Model;
    using TweetSearch.CosmosDb.DocumentDb;

    [Route("api/[controller]")]
    public class TweetsController : Controller
    {
        private CosmosDBRepository<Tweet> repository;
        private TweetQueryContext context;

        public TweetsController(CosmosDBRepository<Tweet> repository)
        {
            this.repository = repository;
            this.context = new TweetQueryContext();
        }

        [HttpGet("search")]
        public async Task<IActionResult> Search([FromQuery] string q)
        {
            if (string.IsNullOrEmpty(q))
                return BadRequest();

            var querySql = this.context.GenerateQuery(q).Trim();
            var result = await repository.GetItemsAsync(querySql);
            return Ok(result);
        }
    }
}

I reused the db repository we created earlier and as you see that is dependency injected in the controller which you have to configure in the ConfigureServices method in your Startup  class. I’m not adding that specific code here since it is already in the sample code and doesn’t belong to the scope of this tutorial. Same goes for the model class Tweet and the classes it uses inside.

Time to test!

The project is hosted here in github. Clone the code, and build and run it from your visual studio. As a sample query try the following:

http://localhost:5000/api/tweets/search?q=to:hatena_sugoi AND from:maeta_tw OR %23HopesUp OR "Surely June is a summer"

 

I url-encoded the hashtag here just to be nice on the REST client you might use. I highly suggest Postman if you don’t want anything heavy.

I only took the minimalists way of using ANTLR here, you can build your own expression tree based on the auto-generated listener and can do so much more if you want. A perfect example will be LinqToQuerystring written by Pete Smith. It generates the necessary LINQ expression for any IQueryable and thus allows you to write database agnostic Odata driven queries and it’s tons faster and lighter than the one Microsoft ships.

I hope this was fun. Happy RESTing!