This is a visualization tool for examining and improving the quality of popular word embedding models, accessible through the following link.
For a long time, people have been researching how to build a machine (or computer) that can understand human languages. This field is known as natural language processing (NLP), an important research and application area in computer science, and it is often regarded as a flagship task in artificial intelligence (AI). However, while this task seems effortless to us, it turns out to be extremely difficult for machines.
As the first step towards NLP, we need to teach computers to understand all the words used in a natural language. A popular recent approach is to represent each word as a point in a high-dimensional space, normally called a word embedding. In this way, the meanings of words, as well as the semantic relationships among them, may be inferred from the distances between the points representing these words. In other words, the similarity of two words in meaning and usage may be predicted from the distance between their corresponding points. Furthermore, these points may be fed into a more complex computational model to interpret a phrase, a sentence, or even a long article. This may be a plausible path towards NLP because computers are much better at computing than at anything else.
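To make the idea concrete, here is a minimal sketch of how similarity between two words can be computed from their embedding vectors using cosine similarity. The toy vectors below are made up for illustration and are not taken from any real model; real embeddings typically have a few hundred dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors: close to 1.0 means very similar directions."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional vectors (purely illustrative).
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.2, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9, 0.5]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related meanings
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low: unrelated meanings
```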
In the past few years, several methods have been proposed to learn so-called word embedding models that project each discrete word onto a point in a continuous space [1]. These methods are typically designed to be efficient so that they can quickly process the large text corpora available on the Internet, such as Wikipedia. However, the word embedding models learned in this way are not perfect and normally suffer from severe deficiencies. This visualization tool is designed for two purposes:
The visualization tool shows the near neighbourhood of a particular word in the embedding space. For a word (randomly sampled or entered in the search bar), it shows a certain number of words located very close to this central word, implying a certain degree of similarity or relatedness to it. These neighbouring words are ranked and displayed counter-clockwise (starting from directly above the centre) in descending order of similarity to the central word, i.e. by increasing distance, as follows:
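For reference, the sketch below shows one way such a neighbour list could be computed. It is only an illustration under simple assumptions (an L2-normalised embedding matrix and a small `k`), not the tool's actual internals.

```python
import numpy as np

def nearest_neighbours(word, vocab, vectors, k=10):
    """Return the k words whose vectors lie closest (by cosine distance) to the given word.

    vocab   : list of words, aligned with the rows of `vectors`
    vectors : (V, D) array of word embeddings, assumed L2-normalised row-wise
    """
    centre = vectors[vocab.index(word)]
    # With normalised rows, cosine distance = 1 - dot product.
    distances = 1.0 - vectors @ centre
    order = np.argsort(distances)  # ascending distance, i.e. closest words first
    return [(vocab[i], float(distances[i])) for i in order if vocab[i] != word][:k]
```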
As you can see, the model is far from perfect: many of the displayed neighbouring words do not make sense at all, or their ranking order is not satisfactory. When you notice such problems, you may use the mouse to drag and re-rank the neighbouring words so that they best reflect your own linguistic knowledge. Ideally, all synonyms should be ranked closest to the central word, followed by similar words, antonyms, relevant words and so on. Any irrelevant words may be dragged into the garbage bin to discard them.
Your corrections will be recorded by the system and will be used to improve the word embedding model based on the learning algorithm in [2].
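Roughly speaking, each correction can be turned into ordinal constraints of the form "word a should be closer to the central word w than word b is", which can then be added as a penalty term when (re-)training the embeddings. The sketch below only illustrates this general idea with a simple hinge penalty; it is not the exact formulation of the algorithm in [2], and the function names, margin value and data layout are assumptions.

```python
import numpy as np

def ordinal_hinge_penalty(vectors, constraints, margin=0.1):
    """Hinge penalty for ordinal constraints (w, a, b): word a should rank above word b
    as a neighbour of w, i.e. sim(w, a) >= sim(w, b) + margin.

    vectors     : dict mapping word -> embedding vector
    constraints : list of (w, a, b) triples derived from user re-rankings
    """
    def sim(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    penalty = 0.0
    for w, a, b in constraints:
        s_a = sim(vectors[w], vectors[a])
        s_b = sim(vectors[w], vectors[b])
        # Penalise only when the desired ordering is violated or not met by the margin.
        penalty += max(0.0, margin + s_b - s_a)
    return penalty
```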
Other hints for how to use this tool:
References:
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Proceedings of NIPS, pages 3111–3119, 2013.
[2] Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling and Yu Hu, “Learning Semantic Word Embeddings based on Ordinal Knowledge Constraints,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), July 2015. (here)