A dataset of mentorship in bioscience with semantic and demographic estimations
www.nature.com/scientificdata
Data Descriptor (open access)
Qing Ke1 ✉, Lizhen Liang2, Ying Ding3, Stephen V. David4 & Daniel E. Acuna2 ✉
Mentorship in science is crucial for topic choice, career decisions, and the success of mentees and
mentors. Typically, researchers who study mentorship use article co-authorship and doctoral
dissertation datasets. However, available datasets of this type focus on narrow selections of fields
and miss out on early career and non-publication-related interactions. Here, we describe Mentorship,
a crowdsourced dataset of 743176 mentorship relationships among 738989 scientists primarily in
biosciences that avoids these shortcomings. Our dataset enriches the Academic Family Tree project by
adding publication data from the Microsoft Academic Graph and “semantic” representations of research
using deep learning content analysis. Because gender and race have become critical dimensions when
analyzing mentorship and disparities in science, we also provide estimations of these factors. We
perform extensive validations of the profile–publication matching, semantic content, and demographic
inferences, which mostly cover neuroscience and biomedical sciences. We anticipate this dataset will
spur the study of mentorship in science and deepen our understanding of its role in scientists’ career
outcomes.
Background & Summary
Mentorship is a form of guidance provided by a more experienced person (mentor) to a less seasoned one
(mentee). In science, mentors draw on their experience to help mentees, who are often early-career
researchers, navigate various issues inside and outside of academia. Mentorship is a crucial phase in a scientist’s
development that has long-term effects throughout their career. Mentorship can occur formally through doctoral
and postdoctoral advisor–advisee relationships or informally through collaborations. Mentees not only learn
new knowledge and skills from mentors but also get involved in mentors’ social connections1. Numerous studies
have pointed out the association between mentor’s characteristics and mentee’s academic success, like productivity2–4, career preference and placement2,5,6, mentorship fecundity7,8, and impact9. Despite the large role of
mentorship and interest in studying it, previous studies have relied on single-field datasets and indirect signals
of mentorship (e.g., co-authorship) and therefore have limited generalizability. Large, curated, and open datasets on mentorship could substantially improve our understanding of the phenomenon,
similar to how citation and publication datasets have accelerated the emerging field of science of science10,11.
Studying mentorship requires access to a broad set of relationship types, including publication. There are a
few data sources for mentorship in science (Table 1); here, we list a handful of them. The Mathematics Genealogy
Project (MGP)12 is an online database for academic genealogy only in mathematics, though more broadly construed to include “mathematics education, statistics, computer science, or operations research”. MGP lacks publication records. The Astronomy Genealogy Project is a similar online database confined to astronomy that also
does not have publication information13,14. ProQuest is a database of theses and dissertations predominantly
from the US15. Although it is multi-disciplinary, it does not disambiguate researchers, making it hard to link
advisor and advisee and construct lineages. Also, it does not provide publication information. More importantly,
ProQuest is not publicly available, and its access is rate-limited. Apart from genealogy and thesis data, other
researchers have proposed to use paper co-authorships as indirect signals of mentorship16. However, mentorship
can start much earlier than publishing works, and it does not necessarily lead to publications17. To summarize,
datasets about mentorship in science are in general fragmented.

Table 1. Comparison of existing datasets of mentorship in science with ours (Mentorship).

Database                       Discipline  Country     Tree  Publication data  Open  Demographics  Semantics
Mentorship                     all         world-wide  ✓     ✓                 ✓     ✓             ✓
Academic Family Tree           all         world-wide  ✓     ✓                 ✓     ✗             ✗
Mathematics Genealogy Project  Math        world-wide  ✓     ✗                 ✓     ✗             ✗
Astronomy Genealogy Project    Astronomy   world-wide  ✓     ✗                 ✓     ✗             ✗
ProQuest                       all         US          ✗     ✗                 ✗     ✗             ✗

Author affiliations: 1School of Data Science, City University of Hong Kong, Kowloon, Hong Kong. 2School of Information Studies,
Syracuse University, Syracuse, New York, 13244, USA. 3School of Information, University of Texas at Austin, Austin,
Texas, 78712, USA. 4Oregon Hearing Research Center, Oregon Health and Science University, Portland, Oregon,
97239, USA.
Scientific Data | (2022) 9:467 | https://doi.org/10.1038/s41597-022-01578-x

Here, we start from the Academic Family Tree (AFT) website18 and extend it to create a large-scale dataset
of mentorship relationships in science. The AFT is an online portal for mentorship in science. We match each
AFT profile to the Microsoft Academic Graph (MAG) we retrieved in September 2020, a leading bibliographic
database19. Moreover, we apply natural language processing techniques to extract semantic representations of
researchers based on deep learning content analysis of their publications. Given the recent interest in understanding
the role of gender and race/ethnicity in science20, we also provide estimations of researchers’ demographics.
Compared to existing databases, our dataset, Mentorship (Mentorship with Semantic, Hierarchical, and demographIc Patterns), covers a wide range of disciplines with a richer set of features, making it ideal for studying
generalizable mentorship patterns. We expect it to serve as the basis for future studies covering various aspects of scientific mentorship, including semantic and demographic factors.
Methods
Data sources. The AFT website displays researchers’ profile information, such as direct academic parents and
children and a limited set of publication records from PubMed. Originally focused on neuroscience21, AFT
has been expanding to other areas such as chemistry, engineering, and education. AFT is a crowd-sourced website
whose content is contributed by registered users. Contributions are diverse, ranging from adding a new researcher
to adding mentors, trainees, and collaborators of an existing researcher. Visitors can also indicate whether the
website has correctly matched a profile with a publication. Because the data are crowd-sourced, researchers on AFT
may not be a representative sample of the academic population.
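
To make the notion of academic parents and children concrete, the following minimal Python sketch shows one way such relationships could be represented and traversed. It is an illustration only, not the AFT implementation: the (mentor, mentee) pairs, the integer IDs, and the function names build_children_index and descendants are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical (mentor_id, mentee_id) pairs; the IDs are invented for illustration
# and do not correspond to real AFT profiles.
relationships = [
    (1, 2),
    (1, 3),
    (2, 4),
    (3, 4),  # a researcher may have more than one mentor
    (4, 5),
]

def build_children_index(pairs):
    """Map each mentor to the set of their direct mentees (academic children)."""
    children = defaultdict(set)
    for mentor, mentee in pairs:
        children[mentor].add(mentee)
    return children

def descendants(children, root):
    """Return every researcher reachable from `root` by following mentor-to-mentee links."""
    seen = set()
    queue = deque([root])
    while queue:
        person = queue.popleft()
        for child in children.get(person, ()):
            if child not in seen:  # guards against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return seen

children = build_children_index(relationships)
lineage = descendants(children, 1)
```

Because a mentee can have several mentors, the structure is a directed graph rather than a strict tree, which is why the traversal tracks visited researchers.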
In AFT, the user-contributed data are stored in a database consisting of several tables that are available
online22. These tables are the starting point for the present work. In particular, we use four tables: (1) the people tabl (...truncated)