The 'Dev-Train' dataset plays the same role as a validation dataset. Even when a trained model performs well on 'Dev-Train', it may still perform much worse on the 'Test' dataset. One likely reason is that the distribution of the Train dataset differs from the distribution of the 'Test' dataset. For this reason, a small portion of the Test dataset (termed Dev-Test) can be held out and used to calibrate the training process.
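The split described above can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the function name `carve_dev_sets`, the 10% default fractions, and the fixed seed are assumptions made for the example. It holds out Dev-Train from the training pool and Dev-Test from the test pool, so that each dev set follows the distribution of the pool it came from.

```python
import random


def carve_dev_sets(train_pool, test_pool, dev_train_frac=0.1, dev_test_frac=0.1, seed=0):
    """Hold out Dev-Train from the training pool and Dev-Test from the test pool.

    The fractions and the seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)

    # Copy before shuffling so the caller's data is left untouched.
    train_pool = list(train_pool)
    test_pool = list(test_pool)
    rng.shuffle(train_pool)
    rng.shuffle(test_pool)

    n_dev_train = int(len(train_pool) * dev_train_frac)
    n_dev_test = int(len(test_pool) * dev_test_frac)

    return {
        "train": train_pool[n_dev_train:],      # used to fit the model
        "dev_train": train_pool[:n_dev_train],  # same distribution as Train
        "dev_test": test_pool[:n_dev_test],     # same distribution as Test
        "test": test_pool[n_dev_test:],         # final evaluation only
    }


splits = carve_dev_sets(range(100), range(100, 150))
print({name: len(items) for name, items in splits.items()})
```

If a model scores well on `dev_train` but noticeably worse on `dev_test`, that gap points to a mismatch between the Train and Test distributions rather than to ordinary overfitting.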
'Training bigger models' and 'Getting more data' are two things that one can always try.
Data synthesis refers to changing the training dataset so that performance on the real problem (i.e. the Test dataset) improves. For example, oversampling underrepresented examples can balance the dataset.
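As an illustration of the oversampling example, here is a minimal sketch of random oversampling using only the standard library. The function name `oversample_minority` and the choice to grow every class to the size of the largest one are assumptions made for this example; other balancing targets are equally possible.

```python
import random
from collections import Counter, defaultdict


def oversample_minority(examples, labels, seed=0):
    """Randomly duplicate examples of smaller classes until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)

    # Group the examples by their label.
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    if not by_label:
        return [], []

    target = max(len(items) for items in by_label.values())

    out_x, out_y = [], []
    for y, items in by_label.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


xs = list(range(10))
ys = ["a"] * 8 + ["b"] * 2
balanced_x, balanced_y = oversample_minority(xs, ys)
print(Counter(balanced_y))
```

Oversampling of this kind should be applied to the training split only. Duplicating examples in a dev or test split would distort the evaluation.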