Tuesday 24 November 2015

Interactive DataViz: Rock albums by the genre since 1960


Interactive DataViz here: http://wiki-rock.azurewebsites.net/top10-album-genres.html
Last week I presented a talk in #BuildStuffLT titled “From Power Chords to the Power of Models” which was a study of the Rock Music by the way of Data Mining, Mathematical Modelling and also Machine Learning. It is such a fun subject to explore, especially for me that Rock Music has been one of my passions since I was a kid.

The slides from the talk is available and the videos will be available soon (although my performance during the talk was suboptimal due to lack of sleep, a problem which seems to be shared by many at the event). BuildStuffLT is a great event, highly recommended if you have never been to. It is a software conference with known speakers such as Michael Feathers, Randy Shoup, Venkat Subramaniam, Pieter Hintjens and this year was the host of Melvin Conway (yeah, the visionary who came up with Conway’s law in 1968) with really mind stimulating talks. You also get a variety of other speakers with very interesting talks.

I will be presenting my talk in CodeMash 2016 so I cannot share all of the material yet but I think this interactive DataViz alone is many many slides in a single representation. I can see myself spending hours just looking at the trends and artist names and their album covers - yeah this is how much I love Rock Music and its history - but even for others this could be fun and also help you discover some new to listen to.

DataViz

This is an interactive percentage-based stacked area chart of top 10 genres in a year, since 1960, where Rock Music as we know it started to appear. That is a mouthful but basically for every year, top 10 genres selected so the dataset contains only those Rock (or related) genres that at some point were among the top 10 genres. You can access it here or simply clone GitHub repo (see below) and host your own.


The data was collected from Wikipedia by capturing Rock Albums and then processing their genres, finding top 10 in every year and then presenting in a chart - I am using Highcharts which is really powerful and simple to use and has a non-commercial license too. The data itself I have shared so you can run your own DataViz if you want to. The license for the data is of course Wikipedia’s, which covers these purposes.



I highly recommend you start with the Visualisation with “All Unselected” (Figure 2) and then select a genre and visualise its rise and fall in the history.


Then you can click on a point (year/genre) to list all albums of that genre for that year (Figure 3). Please note that even when the chart shows 0%, there could be some albums for that genre - which are from a year which that genre was not among the top 10 genres.

Looking at the data in a different way

Here is the 50 years of Rock (starting from 1965) with the selected albums:



Things to bear in mind

  • The data has been captured by capturing all albums for all links found in documents that traversed from the list of rock genres then to the artist pages. As far as I know, the list includes all albums by the major (and minor) rock artists - according to Wikipedia. If you find a missing album (or artist), please let me know.
  • Every album will contribute all its genres to the list. This means if it has genres “Blues Rock” and “Rock”, then it will be counted once for each of the its genres and you can find it if you look at both Rock or Blues Rock genres.
  • Data has some oddities, sometimes an album occurs more than once, mainly due to nuances of data in Wikipedia, there are multiple entries (URLs) for the same document, etc. Data has already been cleansed through many processes and these oddities do not materially change the results. In the future however, there are things that can be done remove these remaining oddities.
  • Again, it is highly recommended that you click the “Unselect All” button and click on the genres that you are interested one by one and explore the name of the albums.
  • Clicking “Select All” or “Unselect All” takes a bit too much time. I am sure it has an easy solution (turn rendering off when changing the state) but have not been able to find it. Expect your PRs!
  • There are some genres in the list which are not really Rock genres. These genres would have been mentioned alongside a rock genre in the album cover or had been a not-so-much-rock album by an otherwise Rock artist.

Code and Data

All code and data published in GitHub. Code uses Highchartsjs, knockoutjs and foundations UI framework. Have fun!

Saturday 19 September 2015

The Rule of "The Most Accessible" and how it can make you feel better

I remember when I was a kid, I watched a documentary on how to catch a monkey. Basically you dig a hole in a tree, big enough for a stretched monkey hand to go in but not too big that fisted hand can get out and sit and watch.

Source: http://www.tarekcoaching.com/blog/dont-fall-in-the-monkey-trap/


Apart from holes, buttons and levers (things that can be pushed) are concepts very easy for animals to learn. Without getting too Freudian, furrowing and protrusions (holes and buttons) are one of the first concepts we learn.

This is nice when dealing with animals. On the other hand, it can be dangerous - especially for kids. A meat mincer machine has exactly these two: a hole and a button. Without referring to the disturbing images of its victims on internet, it is imaginable what can happen - many children sadly lose their fingers or hands this way. Safety of these machines are much better now but I grew up with a kid who was left with pretty much a claw of his right hand after the accident.

Now, the point is: in confrontation with entities that we encounter for the first time or do not have enough appreciation of their complexity, we approach from the most accessible angle we can understand.  If this phenomenon did not have a name (and it is not BikeShedding, that is different), now it has: Rule of The Most Accessible TMA. The problem is, as the examples tried to illustrate, it is dangerous. Or it can be a sign of mediocrity.

*    *    *

Now what does it have to do with our geeky world?

Have you noticed that in some projects, critical bugs go unnoticed but there are half a dozen bugs raised for the layout being one pixel out? Have you written a document and the main feedback you got was the font being used? Have you attended a technical review meeting which you get a comment on the icons of your diagram? Have you seen a performance test project that focuses only on testing the API because it is the easiest to test? Have you witnessed a code review that results in a puny little comments on namings only?

When I say it can be a sign of mediocrity, I think you know now what I am talking about. I cannot describe my frustration when we replace the most critical with the most accessible. And I bet you feel the same.

Resist TMA in you

You know how bad it is when someone falls into the TMA trap? Then you shouldn't. Take your time, and approach from the angle which matters most. If you cannot comment anything worthwhile then don't. Don't be a hypocrite.

Ask for more time, break down the complexity and get a sense of the critical parts. And then comment.

Fight TMA in others

Someone does TMA to you? Show it to their face. Remind them that we need to focus on the critical aspects first. Ask them not to waste time on petty aspects of the problem.

If it cannot be fight, laugh inside

And I guess we all have cases where the person committing TMA is a manager high up that fighting TMA can have unpleasant consequences. Then you know what? Just remember face of the monkey cartoon above and laugh inside. It will certainly make you feel better :)



Thursday 27 August 2015

No-nonsense Azure Monitoring in 20 Minutes (maybe 21) using ECK stack

Azure platform has been there for 6 years now and going from strength to strength. With the release of many different services and options (and sometimes too many services), it is now difficult to think of a technology tool or paradigm which is not “there” - albeit perhaps not exactly in the shape that you had wished for. Having said that, monitoring - even to the admission of some of the product teams - has not been the strongest of the features in Azure. Sadly, when building cloud systems, monitoring/telemetry is not a feature: it is a must.

I do not want to rant for hours why and how a product that is mainly built for external customers is different from the internal one which on its strength and success gets packaged up and released (as is the case with AWS) but a consistent and working telemetry option in Azure is pretty much missing - there are bits and pieces here and there but not a consolidated story. I am informed that even internal teams within Microsoft had to build their own monitoring solutions (something similar to what I am about to describe further down). And as the last piece of rant, let me tell you, whoever designed this chart with this puny level of data resolution must be punished with the most severe penalty ever known to man: actually using it - to investigate a production issue.

A 7-day chart, with 14 data points. Whoever designed this UI should be punished with the most severe penalty known to man ... actually using it - to investigate a production issue.

What are you on about?

Well if you have used Azure to deliver any serious solution and then tried to do any sort of support, investigation and root cause analysis, without using one of the paid telemetry solutions (and even with using them), painfully browsing through gigs of data in Table Storage, you would know the pain. Yeah, that's what I am talking about! I know you have been there, me too.

And here, I am presenting a solution to the telemetry problem that can give you these kinds of sexy charts, very quickly, on top of your existing Azure WAD tables (and other data sources) - tried, tested and working, requiring some setup and very little maintenance.


If you are already familiar with ELK (Elasticsearch, LogStash and Kibana) stack, you might be saying you already got that. True. But while LogStash is great and has many groks, it has been very much designed with the Linux mindset: just a daemon running locally on your box/VM, reading your syslog and delivering them over to Elasticsearch. The way Azure works is totally different: the local monitoring agent running on the VM keeps shovelling your data to durable and highly available storages (Table or Blob) - which I quite like. With VMs being essentially ephemeral, it makes a lot master your logging outside boxes and to read the data from those storages. Now, that is all well and good but when you have many instances of the same role (say you have scaled to 10 nodes) writing to the same storage, the data is usually much bigger than what a single process can handle and shoveling needs to be scaled requiring a centralised scheduling.

The gist of it, I am offering ECK (Elasticsearch, ConveyorBelt and Kibana), an alternative to LogStash that is Azure friendly (typically runs in Worker Role), out-of-the-box can tap into your existing WAD logs (as well as custom ones) and with a push of a button can be horizontally scaled to N, to handle the load for all your projects - and for your enterprise if you work for one. And it is open source, and can be extended to shovel data from any other sources.

At core, ConveyorBelt employs a clustering mechanism that can break down the work into chunks (scheduling), keep a pointer to the last scheduled point, pushing data to Elasticsearch in parallel and in batches and gracefully retry the work if fails. It is headless, so any node can fail, be shut down, restarted, added or removed - without affecting integrity of the cluster. All of this, without waking you up at night, and basically after a few days, making you forget it ever existed. In the enterprise I work for, we use just 3 medium instances to power analytics from 70 different production Storage Tables (and blobs).

Basic Concepts

Before you set up your own ConveyorBelt CB, it is better to know a few concepts and facts.

First of all, there is a one-to-one mapping between an Elasticsearch cluster and a ConveyorBelt cluster. ConveyorBelt has a list of DiagnosticSources, typically stored in an Azure Table Storage, which contains all data (and state) pertaining to a source. A source typically is a Table Storage, or a blob folder containing diagnostic data (or other) - but CB is extensible to accept other data stores such as SQL, file or even Elasticsearch itself (yes if you ever wanted to copy data from one ES to another). DiagnosticSource contains connection information for the CB to connect. CB continuously breaks down the work (schedules) for its DiagnosticSources and keeps updating the LastOffset.

Once the work is broken down to bite size chunks, they are picked up by actors (it internally uses BeeHive) and data within each chunk pushed up to your Elasticsearch cluster. There is usually a delay between data captured (something that you typically set in Azure configuration: how often copy data), so you set a Grace Period after which if the data isn't there, it is assumed there won’t be. Your Elasticsearch data will usually be behind realtime by the Grace Period. If you left everything as defaults, Azure copies data every minute which Grace Period of 3-5 minutes is safe. For IIS logs this is usually longer (I use 15-20 minutes).

The data that is pushed to the Elasticsearch requires:
  • An index name: by default the date in the yyyyMMdd format is used as the index name (but you can provide your own index)
  • The type name: default is PartitionKey + _ + RowKey (or the one you provide)
  • Elasticsearch mapping: Elasticsearch equivalent of a schema which defines how to store and index data for a source. These mappings are stored on a URL (a web folder or a public read-only Azure Blob folder) - schema for typical Azure data (WAD logs, WAD Perf data and IIS Logs) already available by default and you just need to copy them to your site or public Blob folder.

Set up your own monitoring suite

OK, now time to create our own ConveyorBelt cluster! Basically the CB cluster will shovel the data to a cluster of Elasticsearch. And you would need Kibana to visualise your data. Here I will explain how to set up Elasticsearch and Kibana in a Linux VM box. Further below I am explaining how to do this. But ...

if you are just testing the waters and want to try CB, you can create a Windows VM, download Elasticsearch and Kibana and run their batch files and then move to setting up CB. But after you have seen it working, come back to the instructions and set it up in a Linux box, its natural habitat.

So setting this up in Windows is just to download the files from the links below, unzip and then running the batch files elasticsearch.bat and kibana.bat. Make sure you expose the ports 5601 and 9200 from your VM, by creating endpoints.

https://download.elastic.co/kibana/kibana/kibana-4.1.1-windows.zip
https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.zip

Set up ConveyorBelt

As discussed above, ConveyorBelt is typically deployed as an Azure Cloud Service. In order to do that, you need to clone Github repo, build and then deploy it with your own credentials and settings - and all of this should be pretty easy. Once deployed, you would need to define various diagnostic source and point them to your ElasticSearch and then just relax and let CB do its work. So we will look at the steps now.

Clone and build ConveyorBelt repo

You can use command line:
git clone https://github.com/aliostad/ConveyorBelt.git
Or use your tool of choice to clone the repo. Then open administrative PowerShell window, move to the build folder and execute .\build.ps1

Deploy mappings

Elasticsearch is able to guess the data types of your data and index them in a format that is usually suitable. However, this is not always true so we need to tell Elasticserach how to store each field and that is why CB needs to know this in advance.

To deploy mappings, create a Blob Storage container with the option "Public Container" - this allows the content to be publicly available in a read-only fashion. 

You would need the URL for the next step. It is in the format:
https://<storage account name>.blob.core.windows.net/<container name>/

Also use the tool of your choice and copy the mapping files in the mappings folder under ConveyorBelt directory.

Configure and deploy

Once you have built the solution, rename tokens.json.template file to tokens.json and edit tokens.json file (if you need some more info, find the instructions here). Then in the same PowerShell window, run the command below, replacing placeholders with your own values:
.\PublishCloudService.ps1 `
  -serviceName <name your ConveyorBelt Azure service> `
  -storageAccountName <name of the storage account needed for the deployment of the service> `
  -subscriptionDataFile <your .publishsettings file> `
  -selectedsubscription <name of subscription to use> `
  -affinityGroupName <affinity group or Azure region to deploy to>
After running the commands, you should see the PowerShell deploying CB to the cloud with a single Medium instance. In the storage account you had defined, you should now find a new table, whose name you defined in the tokens.json file.

Configure your diagnostic sources

Configuring the diagnostic sources can wildly differ depending on the type of the source. But for standard tables such as WADLogsTable, WADPerformanceCountersTable and WADWindowsEventLogsTable (whose mapping file you just copied) it will be straightforward.

Now choose an Azure diagnostic Storage Account with some data, and in the diagnostic source table, create a new row and add the entries below:

  • PartitionKey: whatever you like - commonly <top level business domain>_<mid level business domain>
  • RowKey: whatever you like - commonly <env: live/test/integration>_<service name>_<log type: logs/wlogs/perf/iis/custom>
  • ConnectionString (string): connection string to the Storage Account containing WADLogsTable (or others)
  • GracePeriodMinutes (int): Depends on how often your logs gets copied to Azure table. If it is 10 minutes then 15 should be ok, if it is 1 minute then 3 is fine.
  • IsActive (bool): True
  • MappingName (string): WADLogsTable . ConveyorBelt would look for mapping in URL "X/Y.json" where X is the value you defined in your tokens.json for mappings path   and Y is the TableName (see below).
  • LastOffsetPoint (string): set to ISO Date (second and millisecond MUST BE ZERO!!) from which you want the data to be copied e.g. 2015-02-15T19:34:00.0000000+00:00
  • LastScheduled (datetime): set it to a date in the past, same as the LastOffset point. Why do we have both? Well each does something different so we need both. 
  • MaxItemsInAScheduleRun (int): 100000 is fine
  • SchedulerType (string): ConveyorBelt.Tooling.Scheduling.MinuteTableShardScheduler
  • SchedulingFrequencyMinutes (int): 1
  • TableName (string): WADLogsTable, WADPerformanceCountersTable or WADWindowsEventLogsTable
And save. OK, now CB will start shovelling your data to your Elasticsearch and you should start seeing some data. If you do not, look at the entries you have created in the Table Storage and you will find an Error column which tells you what went wrong. Also to investigate further, just RDP to one of your ConveyorBelt VMs and run DebugView while having "Capture Global Win32" enabled - you should see some activity similar to below picture. Any exceptions will also show in there.


OK, that is it... you are done! ... well barely 20 minutes, wasn't it? :)


Now in case you are interested in setting up ES+Kibana in Linux, here is your little guide.

Set up your Elasticsearch in Linux

You can run Elasticsearch on Windows or Linux - I prefer the latter. To set up an Ubuntu box on Azure, you can follow instructions here. Ideally you need to add a Disk Volume as the VM disks are ephemeral - all you need to know is outlined here. Make sure you follow instructions to re-mount the drive after reboots. Another alternative, especially for your dev and test environments, is to go with D series machines (SSD hard disks) and use the ephemeral disks - they are fast and basically if you lose the data, you can always set ConveyorBelt to re-add the data, and it does it quickly. As I said before, never use Elasticsearch to master your logging data so you can recover losing it.

Almost all of the commands and settings below, needs to be run in an SSH session. If you are a geek with a lot of linux experience, you might find some of details below obvious and unnecessary - in which case just move on.

SSH is your best friend

Anyway, back to setting up ES - after you got your VM box provisioned, SSH to the box and install Oracle JDK:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
And then install Elasticsearch:
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.deb
sudo dpkg -i elasticsearch-1.7.1.deb
Now you have installed ES v 1.7.1. To set Elasticsearch to start at reboots (equivalent of Windows services) run these commands in SSH:
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
Now ideally you would want to move the data and logs to the durable drive you have mounted, just edit the Elasticsearch config in vim and change:
sudo vim /etc/elasticsearch/elasticsearch.yml
and then (note uncommented lines):
path.data: /mounted/elasticsearch/data
# Path to temporary files:
#
#path.work: /path/to/work

# Path to log files:
#
path.logs:  /mounted/elasticsearch/data
Now you are ready to restart Elasticsearch:
sudo service elasticsearch restart
Note: Elasticsearch is Memory, CPU and IO hungry. SSD drives really help but if you do not have them (class D VMs), make sure provide plenty of RAM and enough CPU. Searches are CPU heavy so it will depend on number of concurrent users using it.
If your machine has a lot of RAM, make sure you set ES memory settings as the default ones will be small. So update the file below and set the memory to 50-60% of the total memory size of the box:
sudo vim /etc/default/elasticsearch
And uncomment this line and set the memory size to half of your box’s memory (here 14GB, just an example!):
ES_HEAP_SIZE=14g
There are potentially other changes that you might wanna do. For example, based on number of your nodes, you wanna set the index.number_of_replicas in your elasticsearch.yml - if you have a single node set it to 0. Also turning off the multicast/Zen discovery since will not work in Azure. But these are things you can start learning about when you are completely hooked on the power of information provided by the solution. Believe me, more addicting than narcotics!

Set up the Kibana in Linux

Up until version 4, Kibana was simply a set of static HTML+CSS+JS files that would run locally on your browser by just opening root HTML in the browser. This model could not really be sustainable and with version 4, Kibana runs as a service on a box, most likely different to your ES nodes. But for PoC and small use cases it is absolutely fine to run it on the same box.
Installing Kibana is straightforward. You just need to download and unpack it:
wget https://download.elastic.co/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz
tar xvf kibana-4.1.1-linux-x64.tar.gz
So now Kibana will be downloaded to your home directory and be unpacked to kibana-4.1.1-linux-x64 folder. If you want to see where that folder is you can run pwd to get the folder name.
Now to run it you just run the command below to start kibana:
cd bin
./kibana
That will do for testing if it works but you need to configure it to start at the boot. We can use upstart for this. Just create a file in /etc/init folder:
sudo vim /etc/init/kibana.conf
and copy the below (path could be different) and save:
description "Kibana startup"
author "Ali"
start on runlevel [2345]
stop on runlevel [!2345]
exec /home/azureuser/kibana-4.1.1-linux-x64/bin/kibana
Now run this command to make sure there is no syntax error:
init-checkconf /etc/init/kibana.conf
If good then start the service:
sudo start kibana
If you have installed Kibana on the same box as the Elasticsearch and left all ports as the same, now you should be able to go to browser and browse to the server on port 5601 (make sure you expose this port on your VM by configuring endpoints) and you should see the Kibana screen (obviously no data).



Thursday 9 July 2015

Daft Punk+Tool=Muse: word2vec model trained on a small Rock music corpus

In my last blog post, I outlined a few interesting results from a word2wec model trained on half a million news documents. This was pleasantly met with some positive reactions, some of which not necessarily due to the scientific rigour of the report but due to awareness effect of such "populist treatment of the subject" on the community. On the other hand, there were more than some negative reactions. Some believing I was "cherry-picking" and reporting only a handful of interesting results out of an ocean of mediocre performances. Others rejecting my claim that training on a small dataset in any language can produce very encouraging results. And yet others literally threatening me so that I would release the code despite I reiterating the code is small and not the point.

Am I the only one here thinking word2vec is freaking awesome?!

So I am back. And this time I have trained the model on a very small corpus of Rock artists obtained from Wikipedia, as part of my Rock History project. And I have built an API on top of the model so that you could play with the model and try out different combinations to your heart's content - [but please be easy on the API it is a small instance only] :) strictly no bots. And that's not all: I am releasing the code and the dataset (which is only 36K Wiki entries).

But now, my turn to RANT for a few paragraphs.

First of all, quantification of the performance of an unsupervised learning algo in a highly subjective field is very hard, time-consuming and potentially non-repeatable. Google in their latest paper on seq2seq had to resort to reporting mainly man-machine conversations. I feel in these subjects crowdsourcing the quantification is probably the best approach. Hence you would help by giving a rough accuracy score according to your experience.


On the other hand, sorry, those who were expecting to see a formal paper - perhaps in laTex format - you completely missed the point. As others said, there are plenty of hardcode papers out there, feel free to knock yourselves down. My point was to evangelise to a much wider audience. And, if you liked what you saw, go and try it for yourself.

Finally, alluding to "cognition" turned a lot of eyebrows but as Nando de Freitas puts it when asked about intelligence, whenever we build an intelligent machine, we will look at it as bogus not containing the "real intelligence" and we will discard it as not AI. So the world of Artifical Intelligence is a world of moving targets essentially because intelligence has been very difficult to define.

For me, word2vec is a breath of fresh air in a world of arbitrary, highly engineered and complex NLP algorithms which can breach the gap forming a meaningful relationship between tokens of your corpus. And I feel it is more a tool enhancing other algorithms rather than the end product. But even on its own, it generates fascinating results. For example in this tiny corpus, it was not only able to find the match between the name of the artists, but it can successfully find matches between similar bands - able to be used it as a Recommender system. And then, even adding the vector of artists generates interesting fusion genres which tend to correspond to real bands influenced by them.

API

BEWARE: Tokens are case-sensitive. So u2 and U2 not the same.

The API is basically a simple RESTful flask on top of the model:
http://localhost:5000/api/v1/rock/similar?pos=<pos>&neg=<neg>
where pos and neg are comma separated list of zero to many 'phrases' (pos for similar, and neg for opposite) - that are English words, or multi-word tokens including name of the bands or phrases that have a Wiki entry (such as albums or songs) - list if which can be found here .
For example:
http://localhost:5000/api/v1/rock/similar?pos=Captain%20Beefheart


You can add vectors of words, for example to mix genres:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50
or add an artist with an adjective for example a softer Bob Dylan:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan,soft&min_freq=50
Or subtract:
http://localhost:5000/api/v1/rock/similar?pos=Bob%20Dylan&neg=U2
But the tokens do not have to be a band name or artist names:
http://localhost:5000/api/v1/rock/similar?pos=drug
If you pass a non-existent or misspelling (it is case-sensitive!) of a name or word, you will get an error:
http://localhost:5000/api/v1/rock/similar?pos=radiohead

{
  result: "Not in vocab: radiohead"
}

You may pass minimum frequency of the word in the corpus to filter the output to remove the noice:
http://localhost:5000/api/v1/rock/similar?pos=Daft%20Punk,Tool&min_freq=50

Code

The code on github as I said is tiny. Perhaps the most complex part of the code is the Dictionary Tokenisation which is one of the tools I have built to tokenise the text without breaking multi-word phrases and I have found it very useful allowing to produce much more meaningful results.

The code is shared under MIT license.

To build the model, uncomment the line in wiki_rock_train.py, specifying the location of corpus:

train_and_save('data/wiki_rock_multiword_dic.txt', 'data/stop-words-english1.txt', '<THE_LOCATION>/wiki_rock_corpus/*.txt')

Dataset

As mentioned earlier, dataset/corpus is the text from 36K Rock music artist entries on the Wikipedia. This list was obtained by scraping the links from the "List of rock genres". Dataset can be downloaded from here. For information on the Copyright of the Wikipedia text and its terms of use please see here.

Sunday 14 June 2015

Five crazy abstractions my Deep Learning word2vec model just did

Seeing is believing. 

Of course, there is a whole host of Machine Learning techniques available, thanks to the researchers, and to Open Source developers for turning them into libraries. And I am not quite a complete stranger to this field, I have been, on and off, working on Machine Learning over the last 8 years. But, nothing, absolutely nothing for me has ever come close to what blew my mind recently with word2vec: so effortless yet you feel like the model knows so much that it has obtained cognitive coherence of the vocabulary. Until neuroscientists nail cognition, I am happy to foolishly take that as some early form of machine cognition.

Singularity Dance - Wiki

But, no, don't take my word for it! If you have a corpus of 100s of thousand documents (or even 10s of thousands), feed it and see it for yourselves. What language? Doesn't really matter! My money is on that you will get results that equally blow your tops off.

What is word2vec?

word2vec is a Deep Learning technique first described by Tomas Mikolov only 2 years ago but due to its simplicity of algorithm and yet surprising robustness of the results, it has been widely implemented and adopted. This technique basically trains a model based on a neighborhood window of words in a corpus and then projects the result onto [an arbitrary number of] n dimensions where each word is a vector in the n dimensional space. Then the words can be compared using the cosine similarity of their vectors. And what is much more interesting is the arithmetics: vectors can be added or subtracted for example vector of Queen is almost equal to King + Woman - Man. In other words, if you remove Man from the King and add Woman to it, logically you get Queen and but this model is able to represent it mathematically.

LeCun recently proposed a variant of this approach in which he uses characters and not words. Altogether this is a fast moving space and likely to bring about significant change in the state of the art in Natural Language Processing.

Enough of this, show us ze resultz!

OK, sure. For those interested, I have brought the methods after the results.

1) Human - Animal = Ethics

Yeah, as if it knows! So if you remove the animal traits from human, what remains is Ethics. And in word2vec terms, subtracting the vector of Human by the vector of Animal results in a vector which is closest to Ethics (0.51). The other similar words to the Human - Animal vector are the words below: spirituality,  knowledge and piety. Interesting, huh?

2) Stock Market ≈ Thermometer

In my model the word Thermometer has a similarity of 0.72 to the Stock Market vector and the 6th similar word to it - most of closer words were other names for the stock market. It is not 100% clear to me how it was able to make such abstraction but perhaps proximity of Thermometer to the words increase/decrease or up/down, etc could have resulted in the similarity. In any case, likening Stock Market to Thermometer is a higher level abstraction.

3) Library - Books = Hall

What remains of a library if you were to remove the books? word2vec to the rescue. The similarity is 0.49 and next words are: Building and Dorm.  Hall's vector is already similar to that of Library (so the subtraction's effect could be incidental) but Building and Dorm are not. Now Library - Book (and not Books) is closest to Dorm with 0.51 similarity.

4) Obama + Russia - USA = Putin

This is a classic case similar to King+Woman-Man but it was interesting to see that it works. In fact finding leaders of most countries was successful using this method. For example, Obama + Britain - USA finds David Cameron (0.71).

5) Iraq - Violence = Jordan

So a country that is most similar to Iraq after taking its violence is Jordan, its neighbour. Iraq's vector itself is most similar to that of Syria - for obvious reasons. After Jordan, next vectors are Lebanon, Oman and Turkey.

Not enough? Hmm there you go with another two...

Bonus) President - Power = Prime Minister

Kinda obvious, isn't it? But of course we know it depends which one is Putin which one is Medvedev :)

Bonus 2) Politics - Lies = Germans??

OK, I admit I don't know what this one really means but according to my model, German politicians do not lie!

Now the boring stuff...

Methods

I used a corpus of publicly available online news and articles. Articles extracted from a number of different Farsi online websites and on average they contained ~ 8KB of text. The topics ranged from local and global Politics, Sports, Arts and Culture, Science and Technologies, Humanities and Religion, Health, etc.

The processing pipeline is illustrated below:

Figure 1 - Processing Pipeline
For word segmentation, an approach was used to join named entities using a dictionary of ~ 40K multi-part words and named entities.

Gensim's word2vec implementation was used to train the model. The default n=100 and window=5 worked very well but to find the optimum values, another study needs to be conducted.

In order to generate the results presented in this post, most_similar method was used. No significant difference between using most_similar and most_similar_cosmul was found.

A significant problem was discovered where words with spelling mistake in the corpus or infrequent words generate sparse vectors which result in a very high score of similar with some words. I used frequency of the word in the corpus to filter out such occasions.

Conclusion

word2vec is relatively simple algorithm with surprisingly remarkable performance. Its implementation are available in a variety of Open Source libraries, including Python's Gensim. Based on the preliminary results, it appears that word2vec is able to make higher levels abstractions which nudges towards cognitive abilities.

Despite its remarkable it is not quite clear how this ability can be used in an application, although in its current form, it can be readily used in finding antonym/synonym, spelling correction and stemming.

Wednesday 27 May 2015

PerfIt! decoupled from Web API: measure down to a closure in your .NET application

Level [T2]

Performance monitoring is an essential part of doing any serious-scale software. Unfortunately in .NET ecosystem, historically first looking for direction and tooling from Microsoft, there has been a real lack of good tooling - for some reason or another effective monitoring has not been a priority for Microsoft although this could be changing now. Healthy growth of .NET Open Source community in the last few years brought a few innovations in this space (Glimpse being one) but they focused on solving development problems rather than application telemetry.

2 years ago, while trying to build and deploy large scale APIs, I was unable to find anything suitable to save me having to write a lot of boilerplate code to add performance counters to my applications so I coded a working prototype of performance counters for ASP .NET Web API and open sourced and shared it on Github, calling it PerfIt! for the lack of a better name. Over the last few years PerfIt! has been deployed to production in a good number of companies running .NET. I added the client support too to measure calls made by HttpClient and it was a handy addition.
From Flickr

This is all not bad but in reality, REST API calls do not cover all your outgoing or incoming server communications (which you naturally would like to measure): you need to communicate to databases (relational or NoSQL), caches (e.g. Redis), Blob Storages, and many other. On top of that, there could be some other parts of your code that you would like to measure such as CPU intensive algorithms, reading or writing large local files, running Machine Learning classifiers, etc. Of course, PerfIt! in this current incarnation cannot help with any of those cases.

It turned out with a little change and separating performance monitoring from Web API semantic (which is changing with vNext again) this can be done. Actually, not getting much credit for it, it was mainly ideas from two of my best colleagues which I am grateful for their contribution: Andres Del Rio and JaiGanesh Sundaravel.

New PerfIt! features (and limitations)

So currently at version alpha2, you can get the new PerfIt! by using nuget (when it works):
PM> install-package PerfIt -pre
Here are the extra features that you get from the new PerfIt!.

Measure metrics for a closure


So at the lowest level of an aspect abstraction, you might be interested in measuring metrics for a closure, for example:
Action action = Thread.Sleep(1000);
action(); // measure
Or in case of an async operation:
foo result = null;
Func<Task> asyncCall = async () => result = await _command.ExecuteScalar();

// and then
await asyncCall();
This closure could be wrapped in a method of course, but there again, having a unified closure interface is essential in building a common tool: each method can have different inputs of outputs while all can be presented in a closure having the same interface.

Thames Barriers Closure - Flickr. Sorry couldn't find a more related picture, but enjoy all the same
So in order to measure metrics for the action closure, all we need to do is:
var ins = new SimpleInstrumentor(new InstrumentationInfo() 
{ 
   Counters = CounterTypes.StandardCounters, 
   Description = "test", 
   InstanceName = "Test instance" 
}, 
   TestCategory); 

ins.Instrument(() => Thread.Sleep(100));

A few things here:
  • SimpleInstrumentor is responsible for providing a hook to instrument your closures. 
  • InstrumentationInfo contains the metadata for publishing the performance counters. You provide the name of the counters to raise to it (provided if they are not standard, you have already defined )
  • You will be more likely to create a single instrumentor instance for each aspect of your code that you would like to instrument.
  • This example assumes the counters and their category are installed. PerfitRuntime class provides mechanism to register your counters on the box - which is covered in previous posts.
  • Instrument method has an option to pass the context as a string parameter. This context can be used to correlate metrics with application context in ETW events (see below).

Doing an async operation is not that different:
ins.InstrumentAsync(async () => await Task.Delay(100));

//or even simpler:
ins.InstrumentAsync(() => Task.Delay(100))

SimpleInstrumentor is the building block for higher level abstractions of instrumentation. For example, PerfitClientDelegatingHandler now uses SimpleInstrumentor behind the scene.

Raise ETW events, effortlessly


Event Tracing for Windows (ETW) is a low overhead framework for logging, instrumentation, tracing and monitoring that has been in Windows since version 2000. Version 4.5 of the .NET Framework exposes this feature in the class EventSource. Probably suffice to say, if you are not using ETW you are doing it wrong.

One problem with Performance Counters is that they use sampling, rather than events. This is all well and good but lacks the resolution you sometimes need to find problems. For example, if 1% of calls take > 2 seconds, you need on average 100 samples and if you are unlucky a lot more to see the spike.

Another problem is lack of context with the measurements. When you see such a high response, there is really no way to find out what was the context (e.g. customerId) for which it took wrong. This makes finding performance bottlenecks more difficult.

So SimpleInstrumentor, in addition to doing counters for you, raises InstrumentationEventSource ETW events. Of course, you can turn it off or just leave it as it has almost no impact. But so much better, is that use a sink (Table Storage, ElasticSearch, etc) and persist these events to a store and then analyse using something like ElasticSearch and Kibana - as we do it in ASOS. Here is a console log sink, subscribed to these events:
var listener = ConsoleLog.CreateListener();
listener.EnableEvents(InstrumentationEventSource.Instance, EventLevel.LogAlways,
    Keywords.All);
And you would see:


Obviously this might not look very impressive but when you take into account that you have the timeTakenMilli (here 102ms) and have the option to pass instrumentationContext string (here "test..."), you could correlate performance with the context of in your application.

PerfIt for Web API is all there just in a different nuget package


If you have been using previous versions of PerfIt, do not panic! We are not going to move the cheese, so the client and server delegating handlers are all there only in a different package, so you just need to install Perfit.WebApi package:
PM> install-package PerfIt.WebApi -pre
The rest is just the same.

Only .NET 4.5 or higher


After spending a lot of time writing async code in CacheCow which was .NET 4.0, I do not think anyone should be subjected to such torture, so my apologies to those using .NET 4.0 but I had to move PerfIt! to .NET 4.5. Sorry .NET 4.0 users.

PerfIt for MVC, Windsor Castle interceptors and more

Yeah, there is more coming. PerfIt for MVC has been long asked by the community and castle interceptors can simply remove all cross cutting concern codes out of your core business code. Stay tuned and please provide feedback before going fully to v1!

Sunday 10 May 2015

Machine Learning and APIs: introducing Mills in REST API Design

Level [C3]

REST (REpresentational State Transfer) was designed with the "state" at its heart, literally, standing for the S in the middle of the acronym.

TL;DR: Mill is a special type of resource where server's authority purely comes from exposing an algorithm, rather than "defining, exposing and maintaining integrity of a state". Unlike an RPC style endpoint, it has to adhere to a set of 5 constraints (see below). 

Historically, when there were only a few thousand servers around, the state was predominantly documents. People were creating, editing and sharing a lot of text documents, and some HTML. With HTTP 1.1, caching and concurrency was built into the protocol and enabled it to represent richer distributed computing concerns and we have been building . With the rising popularity of REST over the last 10 years, much of today's web has been built on top of RESTful thinking, whether what is visible or what is behind the presentation (most external layer) servers. Nowadays when we talk of state, we normally mean data or rather records persisted in a data store (relational or non-relational). A lot of today's data, directly or indirectly, is created, updated and deleted using REST APIs. And this is all cool, of course.

When we design APIs, we map the state into REST Resources. It is very intuitive to think of resources as collection and instance. It is unambiguous and useful to communicate these concepts when for example we refer to /books and /books/123 URLs, as the collection or instance resources, respectively. We interact with these resources using verbs, and although HTTP verbs are not meant to be used just for CRUD operations, interacting with the state that exists on the server is inherent in the design.

But that is not all the story. Mainstream adoption of Machine Learning in the industry means we need to expose Machine Learning applications using APIs. The problem is the resource oriented approach of REST (where the state is at the heart of the design) does not work very well.

By the way, I am NOT 51...
How-old.net is an example of a Machine Learning application where instead of being an application, it could have been an API. For example (just for illustration, you could use other media types too):

POST /age_gender_classifier HTTP/1.1
Content-Type: image/jpeg
And the response:
200 OK
Content-Type: application/json

{
    "gender":"M"
    "age":37
}

Server is generating a response to the request by carrying out complex face recognition and running a model, most likely a deep network model. Server is not returning a state stored on the server, in fact this whole process is completely stateless.

And why does this matter? Well I feel if REST is supposed to move forward with our needs and use cases, it should define, clarify, internalise and finally digest edge cases. While such edge cases were pretty uncommon, with the rise and popularity of Machine Learning, such endpoints will be pretty standard.

A few days ago, on the second day of APIdays Mediterranea 2015 conference, I presented a talk on Machine Learning and APIs. And in this talk I presented simple concept of Mills. Mills, where you take your wheat to be ground and you carry back the flour.



Basically, it all goes back to the origin of a server's authority. To bring an example, a "Customer Profile" service, exposed by a REST API, is the authority to go to when another service requires access to a customer's profile. The "Customer Profile" service has defined a state, which is profile of the customers, and is responsible for ensuring integrity of the state (enforcing business rules on the state). For example, marketing email preference can have values of None, WeeklyDigest or All, it should not allow the value to be set to MonthlyDigest. We are quite used to these type of services and building REST APIs on top: CustomerProfile becomes a resource that we can query or interact with.

On the other hand, a server's authority could be exposing an algorithm. For example, tokenisation of text is a non-trivial problem that requires not only breaking the text to its words, but also maintaining muti-words and named entities intact. A REST API that exposes this functionality will be a mill.

5 constraints of a Mill

1) It encapsulates an algorithm not a state

Which was discussed ad nauseum, however, the distinction is very important. For example let's say we have an algorithm that you provide the postcode and it returns to you houses within 1 mile radius of that postcode - this is not an example of a mill.

2) Raw data in, processed data out

For example you send your text and get back the translation.

3) Calls are both safe and idempotent

Calling the endpoint should not directly change any state within the server. For example, the endpoint should not be directly mapped to the ML training engine, e.g. sending a text 1000 times skew the trained model for that text. The training endpoint is usually a normal resource, not a mill - see my slides

4) It has a single specialty

And as such, it accepts a single HTTP verb apart from OPTIONS, normally POST (although a GET with entity payload would be more semantically correct but not my preferred option for practical reasons).

5) It is named not as a verb but as a tool

A mill that exposes tokenisation, is to be called tokeniser. In a similar way, classifier would be the appropriate name for a system that classifies on top of a neural network, for example. Or normalising text, would have a normaliser mill.



No this is not the same as an RPC endpoint. No RPC smell. Honest :) That is why those 5 constraints exists.



Wednesday 22 April 2015

Pilgrimage into the world of Tarkovsky: through the eyes of hope and suffering

[Level N]

The world is not perfect. It has given us scientists, authors, artists and politicians - and I have lived enough to know none of them were really perfect. Among these, we have personal heroes, personalities that have made great discoveries, built wonderful things or have lived extraordinary lives. Whether it is Obama, Einstein or George Orwell, they have their deficiencies.

I am saying this because the word Pilgrimage in the title can put you off. In fact it puts me off. But ... it is there for a reason, and I hope by the time you finish reading - if you hang on long enough - you would see it.

*  *  * 

A stuttering boy who finally mutters a few words with no pause after a session of hypnotherapy, and then leading to a black screen of titles with the music of Bach, is not a typical opening scene. But this for me has been the most memorable opening among all the films I have seen. If you are looking to describe the body of work by the late director Tarkovsky, look no further, it is all there in the opening scene of The Mirror (1975). This scene somehow encapsulates Tarkovsky's view of himself. A timid lad who can barely speak two words in sequence without constantly stuttering but with the help of "supernatural" powers can speak and tell us his stories. And the process is painful for him, it is only achieved with determination and sacrifice.



*  *  * 

Stumbling a few times along the way, I find my way with difficulty through the aisles of the dark cinema. I think I have missed the first few minutes but that should be OK.

I am lucky to be here. After queueing several hours in a cold sunny day on February 1988, I have managed to buy a ticket to Tarkovsky's Stalker (1979) in Fajr Film Festival. A special section of the Festival is dedicated to the memory of late Tarkovsky who died the previous year and they are showing all his films - with understandable cuts when it does not meet with "the code", at the end of the day Iran is run as an Islamic country. These are the films that intellectuals go to - and I should go to since I am planning to become one!



And I sit there in the dark, watching this 220 minute epic where very few things actually happen. And the film is in fluent Russian with no subtitles!

And through the confusion of barely knowing the storyline, and not getting any of the dialogues, as a young 19-year-old student, I am mesmerised. The film works its way through me, somehow, precipitates deep marks that are ingrained with me until this day. The film communicates with a strange language which I feel I have known but very remotely, as if in a previous life. It is hazy, sublime and next to impossible to translate to words.

And next thing I know, I am sitting watching Mirror (this time it is the public screening and is translated) and incoherent images and storylets come and go, with apparently no relationship. And yet, by the end, I cannot control myself and my eyes are wet. And again, I have no explanation, when being accused of pretentious intellectualism or sentimentalist.

My journey (or Pilgrimage) has started. These films, I have lived with. They grew with me, and gradually, over quarter of a century, made sense. And this post is about why and how.

*  *  * 

It was not a coincident that in the same Fajr Festival 1988 there was a screening of Parajanov's Colour of Pomegranates. It is generally believed that films of Tarkovsky and Parajanov are very similar. Tarkovsky indeed was a fan of Parajanov works and I later found out they were in fact friends. I did manage to watch it later on the public screening but when it was even more bizarre, I did not quite like it. Form is the vehicle to deliver the meaning and not the meaning itself. Parajanov felt overly concerned with form and while narrative and a story of love is there, the meaning did not quite live up to the novelty of the form - I don't know maybe I will think differently.

Going back to Tarkovsky, "the meaning/message" is not easy to grasp. Commonly there are different interpretations and even it is said that his films are meant to take us to a personal journey to understand hence all interpretations are correct because they are true to ourselves - so post-modern!



Did Tarkovsky hide specific messages for us to grasp in his often difficult and unusual films? Most works of art (and even more so for the music and modern art) are open to personal interpretation. Abstract paintings famously invite us to find our personal comprehension of the work of art. But how about Tarkovsky?

Only he can answer us. And he did.

*  *  * 

It is very rare for a director to uncover his tricks and spoil the meaning of his films in a book. Well, he did not quite do that in Sculpting in time but he did reveal his vision of cinema as an art form. And more importantly, why he made his films. While for many, making film is a means of gaining fame, a career, or a vehicle to project one's intellectual viewpoints, or (as Tarkovsky refutes) as a means of self expression, for Tarkovsky it was a selfless and painful endeavour to fulfil a responsibility he was trusted with. While for some, making on average one film every 7 years means they were striving for perfection, for him it was painstakingly ensuring his duty in this world gets fulfilled.



What do we mean by responsibility? Hard to explain in words but easier to point you to his films. We get to meet Tarkovsky himself in his films - whom do you think Andrei Rublev was then?! An artist monk, sick of the decadence of the world, taking a vow of silence only to understand at the end that he cannot forfeit his duty as an artist in order to stay pure. His work will involve suffering but that is the sacrifice he is meant to make. An artist is not free, despite the theories of modern art, artist is not solely responsible to himself and his art. Tarkovsky shunned the modern art:
"Modern art has taken a wrong turn in abandoning search for the meaning of existence in order to affirm the value of the individual for its own sake."
Tarkovsky sees the process of making art as a consummation of the artist for the cause - he called artists "sufferers". Artist is a martyr and artistic creation a sanctimonious sacrifice:
"Artistic creation demands that he 'perish utterly' in the full tragic sense of those words."
The word self-expression, this inner looking for fulfilment, utterly made him frustrated with the artistic culture of the day. Artist himself is the last person to gain from the artistic creation - very much like the character Stalker that could not benefit himself from "The Room", nor any of the other Stalkers.
"The artist is always a servant, and is perpetually trying to pay for the gift that has been given to him as if by miracle."
Also, the artist is not merely an intellectual concerned with the abstract notions of his art form, but he is an evangelist (in its literal meaning) making his art for everyone:
"Art addresses everybody, in the hope of making an impression, ... of winning people not by incontrovertible rational argument but through the spiritual energy with which the artist has charged the work."
And oh boy, that spiritual energy that sets you on fire, making you look for the answer - in my case for quarter a century, Now probably it makes a lot more sense to think of this man as a prophet.

*  *  * 

Tarkovsky films are slow - for some, painfully slow. They contain many long takes, and this by itself does not signify a technique but it is a by-product of his vision and language for the cinema as an art form. This vision was used later by Bela Tarr, a true student of this vision.



On the surface, it could appear that this is a stylistic decision to come up with a unique formalism, a pretentious intellectual gesture. But Tarkovsky himself disdained pure experimentation to come up with a new formalism:
"People talk about experiment and search above all in relation to the avant-garde. But what does it mean? ... For the work of art carries within it an integral aesthetic and philosophical unity; it is an organism ... Can we talk of experiment in relation to the birth of a child? It is senseless and immoral."
And this again reminds of the burden of responsibility he felt in making his films. On the other hand, he is regarded as one of the proponents of "poetic cinema", a term that Tarkovsky himself find almost offensive:
"I find particularly irritating the pretensions of modern 'poetic cinema', which involves breaking off contact with fact and with time realism."
Tarkovsky talks of the works of arts that have inspired him and have shaped his artistic language. These range from late middle ages icons, Italian paintings of the renaissance period to the works of literature by Dostoevsky, Tolstoy, Goethe and finally to the films by Dovzhenko, Bresson and others. In an effort to describe an ideal piece of work, he brings an example from a relatively obscure painter of the renaissance period, whose painting had a deep effect on him. In contrast to Raphael's "Virgin and Child", he was captivated by the inexplicability of the works of Carpaccio.

Preparation of Christ's Tomb (1505) - Vittorio Carpaccio


Back to cinema, he believed that the ideal film is countless meters of celluloid capturing entire life of a person. This probably make it easier to understand why his films were usually longer than 2 hours and in case of Stalker 3 hours and 40 minutes! Tarkovsky believed that a work of art need to be true to life. And when we think of life, there are fast cuts and edits: it is a very long take.
"I want to make the point yet again that in film, every time, the first essential in any plastic composition ... is whether it is true to life."
Tarkovsky explains some of the techniques he used in order to make his scenes have a deeper impression on the viewer. These techniques move away from the cinematic languages of the time, from the cliches symbolisms of the common cinematic cinema. They usually enhance and magnify the image and make it imprint on our psyche. An example, is the scene from Mirror where the Doctor meets Mother and at the end of the scene a strong wind blows, making the Doctor look back towards the house.

All in all, for Tarkovsky, a masterpiece is a work of art that you cannot remove anything from it without completely destroying the work. And that is exactly what he saw in the works of Carpaccio - a unity of that cannot be broken. As such, it is really difficult to pinpoint what makes the masterpiece exceptional as it is.

*  *  * 

So where does Tarkovsky get his inspiration from? Where is his true role model?
It might come as a surprise for some but Tarkovsky was a devout Christian. He knew two of the gospels by heart and would recite them in conversations. His book is full of quotations of the New Testament (1 Corinthians a favourite of him) and phrases that can only mean he truly believed. He was not after happiness (remember Stalker - the Black Dog of depression):
"Let us imagine for a moment that people have attained happiness ... Man becomes Beelzebub."
He saw a strong similarity between art and religion:
"In art, as in religion, intuition is tantamount to conviction, to faith. It is a state of mind, not a way of thinking."
He felt a deep connection in his role to that of an evangelist:
"Art ... expresses its own postulate of faith."
And his role model: of course, Jesus: Selfless sacrifice, Servant, For everyone, Winning people. All his films and writings points to him. It is not accidental that we hear John's Revelation in Stalker. As apocalyptic as the film is, this could have not been more literal. No accident that we meet God in Solaris (Ocean), or Stalker is so stricken by the lack of faith of others, and in Sacrifice, one can save everyone.

Commonly people ask why he made his films so difficult under layers of meanings. Why? Exactly for the same reason Jesus, as a teacher, used parables to convey messages, not plainly.

*  *  *

And my quest is not finished, but surely it has eased off. After believing in Jesus in 2001, I revisited Tarkovsky again lately. Now all symbols and meanings crystal clear. I feel very close to what he tried so hard to shape into images. It just makes sense.

Messages and meanings ... and what are those? It will be clear by the process of your personal pilgrimage. And it could begin now...