Abstract
Morphology is the branch of linguistics that deals with words, their internal structure, how they are
formed, and their relationship to other words in the same language. It involves analyzing the structure
of words and parts of words, such as stems, root words, prefixes, and suffixes. It also looks at parts of
speech, intonation and stress, and the ways context can change a word’s pronunciation and meaning.
In most, if not all, languages, many words can be related to other words by rules that collectively describe the grammar of that language. For example, English speakers recognize that the words dog and dogs are closely related, differentiated only by the plural morpheme “-s”, which occurs only bound to nouns.
With recent advances in computational linguistics, we can now learn a vector representation for each word, also called a word representation, from a monolingual corpus of a language (the training corpus). Word representations have been shown to capture syntactic as well as semantic regularities, including morphological ones. They are widely used to solve problems across natural language processing, including, but not limited to, dependency parsing and named entity recognition.
One major requirement for learning good word representations (word embeddings) is a sufficiently large training corpus: the size of the training corpus directly affects the quality of the word representations our model learns. Many languages, even though widely spoken, are computationally resource-poor, which results in relatively poorer trained word embeddings. In addition, morphologically rich languages suffer from morphologically induced data sparsity: one morphological form of a word may be common while another is rare in the same training corpus.
Hence, to learn better word representations for low-resourced languages, we present a language-independent, unsupervised method for building word embeddings through morphological expansion of the training text, exploiting the morphological regularities present in distributed word representations. Our model addresses the problem of data sparsity and yields improved word embeddings by training them on artificially generated sentences. We evaluate our method using small training sets on eleven test sets for the word similarity task across seven languages. Further, for English, we evaluate the impact of our approach using a large training set on three standard test sets. Our method improves results across all languages.
We also present an unsupervised, language-agnostic approach for exploiting the morphological regularities present in high-dimensional vector spaces. We propose a novel method for generating embeddings of words from their morphological variants using morphological transformation operators. We evaluate this approach on the MSR word analogy test set, achieving an accuracy of 85%, which is 12% higher than the previous best-known system.
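As a minimal illustration of the transformation-operator idea, assuming the operator is a simple additive offset (the two-dimensional vectors below are made up, not real embeddings): a plural regularity can be approximated by the average vector offset over known singular/plural pairs, then added to the vector of a singular form to generate an embedding for its plural variant.

```python
# Made-up 2-D "embeddings" for illustration; real vectors come from training.
emb = {
    "dog":  (1.0, 0.0),
    "dogs": (1.1, 0.9),
    "cat":  (0.0, 1.0),
    "cats": (0.1, 1.9),
}

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

pairs = [("dog", "dogs"), ("cat", "cats")]
offsets = [sub(emb[p], emb[s]) for s, p in pairs]
# The average offset serves as a simple additive transformation operator.
plural_op = tuple(sum(c) / len(offsets) for c in zip(*offsets))

def pluralize(word):
    """Generate an embedding for the plural variant of `word`."""
    return add(emb[word], plural_op)
```

In this toy setting, `pluralize("dog")` lands (up to floating-point error) on the stored vector for "dogs"; for an unseen plural form, the generated vector would stand in for the missing embedding.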