Abstract
The practical applications of Handwritten Text Recognition (HTR) have flourished with many successful commercial APIs, solutions, and diverse use cases. Despite the
availability of numerous industrial solutions, academic research in HTR, particularly for English, has been hindered by the scarcity of publicly accessible data. To bridge this gap, this
paper introduces IIIT-HW-English-Word, a large and diverse collection of offline handwritten
English documents. The dataset comprises unconstrained, camera-captured images of
20,800 handwritten documents written by 1,215 writers. Across the 757,830 words in the dataset,
we identify 174,701 unique words spanning a variety of content types, such as alphabetic, numeric, and stop-words. We also establish a baseline for the proposed dataset
to facilitate evaluation and benchmarking, focusing specifically on word recognition. Our
findings suggest that our dataset can effectively serve as a training source to enhance performance on respective datasets.