

YorkU SWE Visualization Tool (GUI4SWE)

This is a visualization tool for examining and improving the quality of a popular word embedding model. It is accessible through the following link:

YorkU SWE Visualization Tool

For a long time, people have been researching how to build a machine (or computer) that is able to understand human languages. This field is normally called natural language processing (NLP), a very important research and application area in computer science, and it is often regarded as a flagship task in artificial intelligence (AI). However, while this task seems effortless to us, it turns out to be extremely difficult for machines.

As a first step towards NLP, we need to teach computers to understand all the words used in a natural language. A popular recent approach is to represent each word as a point in a high-dimensional space, normally called a word embedding. In this way, the meanings of words, as well as the semantic relationships among them, may be inferred from the distances between the points representing those words. In other words, similarity between words in meaning and usage may be predicted from the distance between their corresponding points. Furthermore, these points may be fed into a more complex computational model to interpret a phrase, a sentence, or even a long article. This may be a plausible path towards NLP, because computing is what computers do best.
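
To make this concrete, here is a minimal Python sketch of the idea, using cosine similarity (one common way to measure closeness in the embedding space) on made-up 4-dimensional vectors; a real model uses learned vectors with hundreds of dimensions.

  import numpy as np

  def cosine_similarity(u, v):
      # cosine of the angle between two word vectors: close to 1 for
      # words used in similar ways, near 0 for unrelated ones
      return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

  # toy vectors fabricated purely for illustration
  king  = np.array([0.50, 0.68, 0.12, 0.30])
  queen = np.array([0.48, 0.71, 0.15, 0.28])
  apple = np.array([0.05, 0.10, 0.88, 0.62])

  print(cosine_similarity(king, queen))  # ~1.0: closely related
  print(cosine_similarity(king, apple))  # much lower: unrelated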

In the past few years, several methods have been proposed to learn so-called word embedding models, which project each discrete word onto a point in a continuous space [1]. These methods are typically designed to be efficient so that they can quickly process the large text corpora available on the Internet, such as Wikipedia (a minimal training sketch is given after the list below). However, the word embedding models learned in this way are not perfect and normally suffer from severe deficiencies. This visualization tool is designed for two purposes:

  1. Providing a graphical interface to visually examine the quality of an existing word embedding model, learned from English Wikipedia text of 5 billion words as in [2].
  2. Providing a user-friendly graphical interface through which users can teach the computer to correct its mistakes, thereby improving the quality of the above model.
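
As a rough illustration of how such a model is trained in practice, the following sketch uses the gensim library's implementation of the skip-gram method from [1] (parameter names assume gensim 4.x; the tiny corpus is a placeholder for billions of words of Wikipedia text):

  from gensim.models import Word2Vec

  # placeholder corpus: a list of tokenized sentences; a real model
  # is trained on billions of words of Wikipedia text
  corpus = [
      ["the", "king", "rules", "the", "kingdom"],
      ["the", "queen", "rules", "the", "kingdom"],
      ["an", "apple", "is", "a", "fruit"],
  ]

  # sg=1 selects the skip-gram model of [1]
  model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                   min_count=1, sg=1, workers=4)

  # nearest neighbours of a word by cosine similarity
  print(model.wv.most_similar("king", topn=20))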

The visualization tool shows the near neighbourhood of a particular word in the embedding space. For a word (randomly sampled or entered in the search bar), it shows a certain number of words located very close to that central word, implying some degree of similarity or relatedness to it. These neighbouring words are ranked and displayed counter-clockwise (starting from the upper right) in increasing order of their distance to the central word, i.e., most similar first, as follows:

 Figure: the top 20 neighbouring words displayed around the central word
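
The ranked list shown by the tool corresponds to a nearest-neighbour query in the embedding space. Below is a sketch of how such a list can be computed; the embeddings dictionary, mapping each word to its numpy vector, is a hypothetical stand-in for the loaded model:

  import numpy as np

  def nearest_neighbours(word, embeddings, k=20):
      # rank all other words by cosine distance to the central word,
      # closest first, and keep the top k
      w = embeddings[word]
      w = w / np.linalg.norm(w)
      scored = []
      for other, v in embeddings.items():
          if other == word:
              continue
          sim = np.dot(w, v / np.linalg.norm(v))
          scored.append((other, 1.0 - sim))  # cosine distance
      scored.sort(key=lambda pair: pair[1])
      return scored[:k]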

As you may see, the model is far from perfect: many of the displayed neighbouring words do not make sense at all, or their ranking order is unsatisfactory. If you spot such problems, you may use the mouse to drag and re-rank the neighbouring words to best reflect your own linguistic knowledge. Ideally, all synonyms should be ranked closest to the central word, followed by similar words, antonyms, relevant words and so on. Any irrelevant words may be dragged into the garbage bin to be discarded.

Your corrections are recorded by the system and will be used to improve the word embedding model with the learning algorithm in [2].
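
The algorithm in [2] expresses such ranking knowledge as inequality constraints of the form sim(centre, closer) > sim(centre, farther). As a rough sketch of the idea only (not the exact objective of [2]), each user re-ranking of two neighbours could contribute a hinge-loss penalty like this:

  import numpy as np

  def ordinal_hinge_loss(centre, closer, farther, margin=0.1):
      # penalize violations of the constraint
      #   sim(centre, closer) >= sim(centre, farther) + margin,
      # where one (closer, farther) pair comes from a user correction
      def cos(u, v):
          return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
      return max(0.0, margin - (cos(centre, closer) - cos(centre, farther)))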

Other hints on using this tool:

  • You may use any ID to log in. The ID is used ONLY for saving the corrections you make.
  • You may log in anonymously by clicking the 'X' symbol in the top-right corner.
  • You are asked to work on 20 random words in each session.
  • You may check the near neighbourhood of any word by entering it in the search bar.
  • You may change the number of neighbouring words to display (limited to a maximum of 50 for good viewing).
  • Use the left mouse button to drag and drop words when re-ranking them.
  • Use the right mouse button to look up an unknown word in an online dictionary.
  • Make changes only to the words you are certain about; leave the rest as is if you are unsure.
  • Simply skip a central word if it is a typo or a proper noun that you find hard to rank.
  • Drag all typo-like neighbouring words into the garbage bin.

References:

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Proceedings of NIPS, pages 3111–3119, 2013.

[2] Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling and Yu Hu, “Learning Semantic Word Embeddings based on Ordinal Knowledge Constraints,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), July 2015.
