Most Frequently Mentioned ML Topics in 2020

Progress in machine learning (ML) and artificial intelligence kept surprising us this year, too. In natural language processing (NLP), powerful new models such as GPT-3 and T5 were published one after another. The Transformer found its way into the computer vision (CV) field as well (Chen et al. 2020, Dosovitskiy et al. 2020). The exponential growth in the number of papers published on arXiv and at conferences hasn't slowed down yet.

In this post, I'm going to use NLP techniques to analyze all the ML/NLP/CV papers published on arXiv this year and summarize the "most frequently mentioned ML topics in 2020." These top-ranked keywords give a good picture of the ML trends in 2020, and knowing them in advance will make your job easier when you read more scientific articles (this is very important for non-native English speakers like me!).

Specifically, I collected the titles and abstracts of all the papers published on arXiv in 2020 via the arXiv API, and extracted named entities with a model trained on SciREX. The SciREX model extracts typed named entities (tasks, metrics, datasets, and methods), which lets us rank the mentions per type. The details of the analysis are described in the "Technical Details" section at the bottom of this article.

Most Frequently Mentioned Topics in 2020 (per Type)

I focus on only three AI fields: general machine learning (cs.LG), natural language processing (cs.CL), and computer vision (cs.CV). I'm aware that there are many other AI/ML categories on arXiv, but I limited the analysis to these three to keep things simple.

Machine Learning (cs.LG)

First, let's look at the most mentioned topics in machine learning per type below:

Datasets	Metrics	Tasks	Methods
CIFAR-10	accuracy	classification	neural network
ImageNet	robustness	machine learning	deep neural network
MNIST	complexity	training	convolutional neural network
COVID-19	convergence	learning	deep learning
CIFAR-100	computational cost	generalization	machine learning
SVHN	classification accuracy	prediction	reinforcement learning
chest x-ray	computational complexity	inference	GAN
COCO	precision	NLP	machine learning models
KITTI	F1 score	reinforcement learning	graph neural network
Twitter	sample complexity	artificial intelligence	classifier

If you look at the list of datasets, most of them are related to computer vision, which is arguably the most actively researched area in machine learning.

As for the methods, you see neural networks everywhere. Among generic methods such as "neural network" and "deep learning" you also see "graph neural network," which is one of the biggest recent trends in ML.

Natural Language Processing (cs.CL)

Next, the most mentioned topics in NLP are shown below:

Datasets	Metrics	Tasks	Methods
COVID-19	accuracy	NLP	BERT
English	F1 score	machine translation	language model
Twitter	BLEU score	question answering	transformer
Wikipedia	robustness	named entity recognition	LSTM
GLUE	word error rate	automatic speech recognition	neural network
German	quality	neural machine translation	deep neural network
SQuAD	precision	downstream tasks	NLP
LibriSpeech	recall	classification	recurrent neural network
Wikidata	translation quality	sentiment analysis	neural models
Hindi	evaluation metrics	generation	convolutional neural network

This was obvious in retrospect, but the most mentioned topic in NLP was "COVID-19." CORD-19, a dataset of COVID-related papers, was published. An information extraction shared task was held at the W-NUT 2020 workshop. A lot of research effort also went into analyzing COVID-related information on social media and in clinical text. It is also nice to see many authors name the language(s) they work on in the abstract.

The top-mentioned methods are all related to BERT, language models, and transformers. I wonder when the "ImageNet moment" of transformer-based transfer learning is going to peak in NLP.

Computer Vision (cs.CV)

Finally, here's the list of most mentioned topics in computer vision.

Datasets	Metrics	Tasks	Methods
ImageNet	accuracy	segmentation	convolutional neural network
CIFAR-10	robustness	classification	deep neural network
COCO	classification accuracy	computer vision	deep learning
KITTI	precision	object detection	neural network
CIFAR-100	computational cost	detection	GAN
COVID-19	maximum a posteriori	training	deep convolutional neural network
MNIST	speed	semantic segmentation	deep learning models
Cityscapes	computational complexity	image classification	transfer learning
chest x-ray	generalization ability	generalization	classifier
RGB images	sensitivity	inference	deep learning methods

Alongside the "regulars" such as CIFAR-10/100, ImageNet, and MNIST, COVID-19 and chest x-ray made it into the list of most mentioned datasets. There has been a lot of research on, for example, diagnosing COVID-19 from chest x-ray images.

The most mentioned task was (semantic) segmentation, which is an important CV task with a wide range of applications such as autonomous driving and medical image processing.

Topics on the Rise in 2020

In the second half of this article, we are going to dive into trends in individual fields by comparing the number of mentions between 2019 and 2020 and focusing on topics with significantly more (or fewer) mentions.

Machine Learning (cs.LG)

First, let's look at the mention trends in machine learning. The following table lists the 20 most mentioned topics in 2020 (regardless of their types) along with their monthly trends (relative number of papers that mention each keyword) in the past 24 months. If a topic has statistically significantly more mentions compared to the expected value based on the 2-year average, the trend is shown in green; red if it's significantly fewer.

Rank	Topic	# Papers (2019)	# Papers (2020)
1	accuracy	2209	2971
2	machine learning	1503	2079
3	neural network	1530	1984
4	deep neural network	1585	1969
5	deep learning	1174	1548
6	convolutional neural network	1150	1296
7	reinforcement learning	955	1204
8	classification	838	1127
9	robustness	680	970
10	training	635	828
11	learning	650	815
12	generalization	486	646
13	generative adversarial network	601	602
14	artificial intelligence	340	568
15	CIFAR-10	438	560
16	COVID-19	1	551
17	machine learning models	365	547
18	graph neural network	230	530
19	natural language processing	307	529
20	prediction	375	515
	Total	19,268	25,272
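
To make the significance check a bit more concrete, here is a minimal sketch of a 2x2 chi-square test on the yearly counts in the table. It rests on several assumptions of mine: that the Total row is the number of cs.LG papers per year, that comparing 2019 against 2020 is a fair stand-in for the monthly comparison against the 2-year average, and that a 5% significance level is used (critical value 3.841 for one degree of freedom). The function names are hypothetical, and this is not necessarily the exact test behind the green/red coloring.

```python
# Pearson chi-square statistic for a 2x2 table (no continuity correction).
CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, df=1, alpha=0.05 (assumed level)


def chi_square_2x2(mentions_a, total_a, mentions_b, total_b):
    """Return the chi-square statistic comparing two mention rates."""
    a, b = mentions_a, total_a - mentions_a  # period A: mentioning / not mentioning
    c, d = mentions_b, total_b - mentions_b  # period B: mentioning / not mentioning
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def trend(mentions_2019, total_2019, mentions_2020, total_2020):
    """Classify a topic as 'up', 'down', or 'flat' between the two years."""
    stat = chi_square_2x2(mentions_2019, total_2019, mentions_2020, total_2020)
    if stat < CRITICAL_5PCT_DF1:
        return "flat"
    rate_2019 = mentions_2019 / total_2019
    rate_2020 = mentions_2020 / total_2020
    return "up" if rate_2020 > rate_2019 else "down"


print(trend(230, 19268, 530, 25272))  # graph neural network
print(trend(601, 19268, 602, 25272))  # generative adversarial network
```

Note that the test compares rates, not raw counts: a topic whose paper count stays flat while the total grows can still come out as declining.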

Technical Details

I'll describe how I obtained the ranking. The code for the analysis is here.

First, I collected all the paper titles and abstracts in the target categories (cs.LG, cs.CL, cs.CV) published in 2020. All arXiv metadata is available under a public domain license. I used the arXiv API Python library to fetch the data. The total number of papers analyzed for this post is 83,339.
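
For readers who want to try the collection step themselves, here is a rough sketch that talks to the arXiv API directly with the Python standard library instead of the client library I used. The function names are hypothetical, paging and rate limiting are left out, and filtering by the published year on the client side is a simplification I'm assuming for illustration.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # the API returns an Atom feed


def build_query_url(category, start=0, max_results=100):
    """Build an arXiv API query URL for one category (e.g. 'cs.CL')."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml, year):
    """Extract (title, abstract) pairs for entries published in `year`."""
    papers = []
    for entry in ET.fromstring(atom_xml).iter(ATOM + "entry"):
        published = entry.findtext(ATOM + "published", default="")
        if not published.startswith(str(year)):
            continue
        # Collapse the line breaks and extra spaces found in feed text.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        abstract = " ".join(entry.findtext(ATOM + "summary", default="").split())
        papers.append((title, abstract))
    return papers
```

To fetch a page, pass the result of build_query_url("cs.CL") to urllib.request.urlopen and feed the response body to parse_feed, increasing start until no entries come back.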

I then extracted ML-related mentions from the collected titles and abstracts, using a named entity recognition (NER) system trained on SciREX. SciREX is a dataset of scientific papers annotated with mentions (datasets, methods, metrics, and tasks) and their relationships. The official repository has code for training a strong baseline based on SciBERT + CRF, which I used here. Note that SciREX also contains rich annotations such as coreference and N-ary relations in addition to individual mentions, although I only used the NER model.
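
A sequence tagger like this typically outputs one tag per token, which then has to be turned into typed mentions. The sketch below shows that decoding step under assumptions of mine: a BIO tagging scheme and tag names such as B-Method or B-Dataset. The SciREX code may use a different output format and label set, so treat the names here as hypothetical.

```python
def decode_bio(tokens, tags):
    """Turn parallel token/tag lists into (mention_text, type) pairs."""
    mentions, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new mention starts (a stray I- tag is treated leniently as B-).
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [token], label
        elif prefix == "I":
            current.append(token)  # continue the open mention
        else:
            # "O" closes any open mention.
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        mentions.append((" ".join(current), current_type))
    return mentions


tokens = ["We", "fine-tune", "BERT", "on", "SQuAD", "for", "question", "answering"]
tags = ["O", "O", "B-Method", "O", "B-Dataset", "O", "B-Task", "I-Task"]
print(decode_bio(tokens, tags))
```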

Extracted mentions were then normalized with a handcrafted mapping table, so that variants such as cnn, convolutional neural network, and convolutional neural networks all map to a single form. I then counted the number of papers that mention each phrase. I used a chi-square test to determine whether a topic is on the rise or in decline.
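
Here is a minimal sketch of the normalization and counting steps. The mapping table below is a tiny hypothetical stand-in for the real handcrafted one, and lowercasing every mention before the lookup is my own simplification.

```python
from collections import Counter

# Hypothetical mapping table: surface variant -> canonical form.
CANONICAL = {
    "cnn": "convolutional neural network",
    "convolutional neural networks": "convolutional neural network",
    "gan": "generative adversarial network",
}


def normalize(mention):
    """Lowercase, collapse whitespace, then apply the mapping table."""
    key = " ".join(mention.lower().split())
    return CANONICAL.get(key, key)


def count_papers(papers):
    """papers: iterable of lists of raw mentions, one list per paper."""
    counts = Counter()
    for mentions in papers:
        # A set ensures each paper counts at most once per topic.
        counts.update({normalize(m) for m in mentions})
    return counts


papers = [["CNN", "cnn", "GAN"], ["Convolutional Neural Networks"], ["BERT"]]
print(count_papers(papers).most_common())
```

Counting papers rather than raw mentions keeps a single paper that repeats "CNN" ten times from inflating the ranking.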

The "trend" graphs above were drawn with jQuery Sparklines. The plotted values are normalized: for each month, the number of papers that mention the topic is divided by the total number of papers published in that month.
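
The monthly normalization is simple enough to sketch in a few lines. The function name and the 'YYYY-MM' keys are assumptions of mine for illustration.

```python
def monthly_share(mention_counts, paper_totals):
    """Both arguments map 'YYYY-MM' -> count; returns month -> share of papers."""
    return {
        month: mention_counts.get(month, 0) / total
        for month, total in sorted(paper_totals.items())
        if total > 0  # skip months with no papers to avoid dividing by zero
    }


shares = monthly_share(
    {"2020-01": 5, "2020-02": 10},
    {"2020-01": 100, "2020-02": 200, "2020-03": 50},
)
print(shares)
```

Dividing by the monthly total matters because the number of papers itself grows over time, so raw counts would make almost every topic look like it is on the rise.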

If you are interested in the datasets (papers with extracted mentions) I used for this post, you can download them from the resources page (you need to sign up for a free membership).

Masato Hagiwara
