Abstract
Since the 1970s, databases have remained the predominant choice for storing information for commercial as well as personal needs. While many types of databases (hierarchical, network, file-based, object-based) have been developed and put to use, relational databases continue to hold the major share of the various means of information storage. A relational database management system (RDBMS) is a tool that stores data as a collection of rows and columns in tables and provides relational operators to manipulate the data in tabular form. The data is stored in a structured manner and is updated, manipulated or retrieved using a special language called the Structured Query Language (SQL).
There are several reasons for the popularity and widespread use of RDBMSs, including: 1) high performance; 2) scalability, robustness and flexibility; 3) strong data protection; and 4) ease of data management. In addition, structured data allows us to index, sort, filter or aggregate the data stored within the columns of the tables through SQL.
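As a minimal sketch of these operations, the following uses Python's built-in `sqlite3` module. The `employees` table, its columns and its rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 50000.0),
        ("Ravi", "Sales", 60000.0),
        ("Meera", "Research", 70000.0),
        ("John", "Research", 90000.0),
    ],
)

# Index: speeds up lookups on the dept column.
conn.execute("CREATE INDEX idx_employees_dept ON employees (dept)")

# Filter and sort: employees earning more than 55000, highest salary first.
high_paid = conn.execute(
    "SELECT name FROM employees WHERE salary > ? ORDER BY salary DESC",
    (55000,),
).fetchall()

# Aggregate: average salary per department, using the built-in AVG function.
avg_by_dept = conn.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
```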
However, RDBMSs suffer from a disadvantage as well. Presently, the assistance of an RDBMS expert, who knows the database schema and SQL, is required to perform computations on the data and provide the results to the user. Therefore, a non-technical person cannot retrieve data from an RDBMS unaided. Moreover, form-based data retrieval approaches lack the flexibility of a natural language and therefore cannot intuitively capture the user’s intent and instructions.
To overcome these limitations, a medium (a software tool) is required that accepts instructions (or questions) about the stored data in natural language from the user and sends the response back to the user. Such a tool is called a Natural Language Interface to Databases (NLIDB). An NLIDB system takes a natural language (NL) query from a user as input and returns the output retrieved from the database.
Despite being researched intensively for over five decades, NLIDB systems remain an open research problem. Earlier systems focused on implementing an NLIDB for a particular database. Once this was achieved with high accuracy, the focus shifted to the portability of NLIDB systems, which included dealing with various system engineering problems of robustness and scalability of the application. An important issue in portability is the ability to quickly customize an NLIDB system to a new language and domain (of the database). Conventionally, NLIDB systems depend heavily on linguistic and domain experts for customization. Quite often, the system modules integrating the language and the domain are tightly coupled, thus making customization either very expensive or sometimes impossible.
Another important aspect of NLIDB systems is their ability to handle the aggregations specified in a natural language query. Aggregations are a common phenomenon in natural language, and the words specifying aggregations are called aggregates. In an aggregation, a function is applied to a set of values or entities in a database to yield a single value. Applications that access databases through SQL are restricted in processing aggregates because of the limited set of corresponding aggregate functions available in a database. Moreover, complex aggregates have no inbuilt aggregate functions and therefore cannot be processed by the RDBMS. Successful resolution of aggregations in NLIDB systems increases the variety of queries that a system can process and improves overall accuracy.
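The general idea of evaluating an aggregate outside the database can be sketched as follows. This is only an illustration under stated assumptions, not the framework developed in this work: it assumes the database offers no median function, and the `employees` table and the helper `aggregate_outside_db` are hypothetical.

```python
import sqlite3
import statistics

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 40000.0),
        ("Ravi", "Sales", 50000.0),
        ("Kiran", "Sales", 90000.0),
        ("Meera", "Research", 70000.0),
    ],
)


def aggregate_outside_db(connection, dept, func):
    """Fetch the raw salaries with plain SQL, then apply func in application code.

    The database only selects the values; the aggregate itself is computed
    here, so it does not depend on the aggregate functions the RDBMS provides.
    """
    rows = connection.execute(
        "SELECT salary FROM employees WHERE dept = ?", (dept,)
    ).fetchall()
    values = [row[0] for row in rows]
    return func(values)


# A median and a range (max minus min), both computed outside the database.
median_sales = aggregate_outside_db(conn, "Sales", statistics.median)
range_sales = aggregate_outside_db(conn, "Sales", lambda v: max(v) - min(v))
```

Any function that maps a list of values to a single value can be passed as `func`, which is what makes this approach independent of the database's own aggregate set.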
Based on the above observations, in our research work, we have focussed on developing an NLIDB framework which allows us to overcome the limitation of portability across languages and domains while maintaining a high level of scalability and robustness. To achieve this, the framework uses a high-accuracy general-purpose syntactic parser along with semantic frames to translate an NL query into an SQL query. With this approach, we have successfully derived the semantics of a given NL query without using a semantic parser while maintaining an overall accuracy above 90%. Such accuracy cannot be achieved even by the state-of-the-art general-purpose semantic parsers available with today’s technology. We have also overcome the limitation of conventional RDBMSs in handling aggregations by developing a framework which can process simple or complex aggregates independently of the processing capability of the RDBMS.