Abstract
Pre-training language models such as BERT and then fine-tuning them on downstream tasks has achieved state-of-the-art results across a wide range of natural language processing tasks. However, pre-training is usually independent of the downstream task, and previous work has shown that pre-training alone may not be sufficient to capture task-specific nuances. We propose a novel way to tailor a pre-trained BERT model to a downstream task via task-specific masking applied before the standard supervised fine-tuning.
Our approach begins by creating a word list specific to the task at hand. For example, for sentiment analysis we gather a small set of words that express positive and negative sentiment; for hate speech detection, a small set of hateful terms; and for humor detection, a small set of words associated with humor. Next, we use word embeddings to measure how relevant each word is to the task with respect to this word list; we call this measure the word's task score. Based on the task score, we assign each word a masking probability, which determines how likely the word is to be masked during the intermediate masked language modeling (MLM) stage that precedes fine-tuning.
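As a concrete illustration, the sketch below computes a task score as the maximum cosine similarity between a word's embedding and the embeddings of the task word list. The function name and this particular similarity choice are illustrative assumptions, not necessarily the exact formulation used in our experiments.

import numpy as np

def task_score(word_vec, task_list_vecs):
    """Illustrative task score: maximum cosine similarity between a
    word's embedding and any embedding from the task word list."""
    # Normalize the word vector and each task-list vector to unit length.
    word_vec = word_vec / np.linalg.norm(word_vec)
    task_list_vecs = task_list_vecs / np.linalg.norm(
        task_list_vecs, axis=1, keepdims=True
    )
    # Cosine similarity reduces to a dot product after normalization;
    # take the best match against the task word list.
    return float(np.max(task_list_vecs @ word_vec))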
We experiment with several masking functions, namely a step function, a linear function, and an exponential function, to determine the best way to map a word's task score to its masking probability. We then use this selective masking strategy to train the BERT model on the MLM objective: rather than randomly masking 20% of the input tokens, we selectively mask tokens according to the masking probabilities derived from their task scores.
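The sketch below shows one way the three masking functions could map a task score (assumed normalized to [0, 1]) to a masking probability; the threshold, bounds, and rate are hypothetical hyperparameters chosen for illustration, not the values used in our experiments.

import numpy as np

def step_mask_prob(s, threshold=0.5, low=0.1, high=0.8):
    """Step function: a flat low probability below the threshold,
    a flat high probability at or above it."""
    return high if s >= threshold else low

def linear_mask_prob(s, p_min=0.1, p_max=0.8):
    """Linear function: probability grows linearly with the task score."""
    return p_min + (p_max - p_min) * s

def exp_mask_prob(s, p_min=0.1, p_max=0.8, rate=5.0):
    """Exponential function: probability stays low for small scores
    and rises sharply as the score approaches 1."""
    return p_min + (p_max - p_min) * (np.exp(rate * s) - 1) / (np.exp(rate) - 1)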
Finally, we fine-tune the BERT model on several downstream binary and multi-class classification tasks, including sentiment analysis, hate speech detection, formality style detection, named entity recognition, and humor detection. Our experiments show that our selective masking strategy outperforms random masking, indicating its effectiveness in adapting the pre-trained BERT model to specific tasks.
Overall, our approach provides a more targeted and effective way of fine-tuning pre-trained language models for specific tasks by incorporating task-specific knowledge between the pre-training and fine-tuning stages. By selectively masking input tokens based on their task relevance, we better capture the nuances of a particular task, leading to improved performance on a range of downstream classification tasks.