Abstract
Humans exchange information through speech and text, a medium known as natural
language. Each day, people communicate with each other in various
languages through speech or text, sharing a vast amount of information. The data produced
through natural language communication can offer valuable insights despite being ambiguous,
unstructured, and noisy. However, computers cannot interpret natural language on their own
yet. To fully comprehend this data and respond intelligently, computers need the ability to
understand and emulate human language. Natural Language Processing (NLP) addresses
this need: it is a branch of Artificial Intelligence that focuses on enabling machines to read,
comprehend, and derive meaning from human languages. NLP integrates the disciplines of
linguistics and computer science. It decodes language, including its structure and rules, and
creates models capable of comprehending, analyzing, and extracting important information
from both text and speech.
An abundance of information is available in the form of text, including books, documents,
articles, social media posts, and more. A document, one of the oldest forms of information
exchange, refers to written, printed, or electronic material created to convey information
from its author to its intended audience. Documents contain valuable
information that can significantly benefit business activities. NLP applications make it
possible to extract insights from this text data. Enterprises use NLP applications for various
purposes, ranging from document understanding and information extraction to answering
common questions. In this thesis, we develop techniques for deeper analysis and understanding
of documents that are commonly used in enterprises.
A contract is a frequently used type of document in the corporate world. Contracts are
agreements between two or more parties that govern what each party can or cannot do, and
they are usually dense in information. Automatically extracting key components, or components
that contain rare or novel information, from these large documents makes reviewing contracts
easier. This task is challenging, however, because key and novel components do not appear
in isolation within a contract. Extracting significant components (both key and novel
components) from contracts aims to simplify the end user's comprehension and reduce
dependency on legal experts for reviewing contracts. In this thesis, we introduce approaches for the
automatic identification and extraction of significant components from a contract. We propose
a model based on Bidirectional Encoder Representations from Transformers (BERT) that
automatically identifies and highlights significant components of a contract.
In the corporate world, reports are also a frequently encountered type of document. A report is
a document that provides information and analysis on a particular topic or issue. Reports are
used to convey important information to stakeholders, such as managers, executives, investors,
and customers. The vast amount of data available in these reports has the potential to
revolutionize data-driven analysis. Causality identification and span detection is one such
data-driven task. A cause-and-effect relationship holds between two events when one causes
the other to happen. We explore various transformer-based models for classifying sentences
and for identifying cause and effect spans within a sentence.