A dataset of mentorship in bioscience with semantic and demographic estimations

www.nature.com/scientificdata

Data Descriptor | OPEN

Qing Ke1 ✉, Lizhen Liang2, Ying Ding3, Stephen V. David4 & Daniel E. Acuna2 ✉

Mentorship in science is crucial for topic choice, career decisions, and the success of mentees and mentors. Typically, researchers who study mentorship use article co-authorship and doctoral dissertation datasets. However, available datasets of this type focus on narrow selections of fields and miss out on early career and non-publication-related interactions. Here, we describe Mentorship, a crowdsourced dataset of 743176 mentorship relationships among 738989 scientists primarily in biosciences that avoids these shortcomings. Our dataset enriches the Academic Family Tree project by adding publication data from the Microsoft Academic Graph and “semantic” representations of research using deep learning content analysis. Because gender and race have become critical dimensions when analyzing mentorship and disparities in science, we also provide estimations of these factors. We perform extensive validations of the profile–publication matching, semantic content, and demographic inferences, which mostly cover neuroscience and biomedical sciences. We anticipate this dataset will spur the study of mentorship in science and deepen our understanding of its role in scientists’ career outcomes.

Background & Summary

Mentorship is a form of guidance provided by a more experienced person (mentor) to a less seasoned one (mentee). Likewise, mentors in science draw from their experiences to help mentees–who often are early-career researchers–navigate various issues inside and outside of academia. Mentorship is a crucial phase in a scientist’s development that has long-term effects throughout her career. Mentorship can occur formally through doctoral and postdoctoral advisor–advisee relationships or informally through collaborations.
Mentees not only learn new knowledge and skills from mentors but also get involved in mentors’ social connections1. Numerous studies have pointed out the association between mentor’s characteristics and mentee’s academic success, like productivity2–4, career preference and placement2,5,6, mentorship fecundity7,8, and impact9.

Despite the large role of mentorship and interest in studying it, previous studies have relied on single-field datasets and indirect signals of mentorship (e.g., co-authorship) and therefore have limited generalizability. Large, curated, and open datasets on mentorship have the potential of bringing significant benefit to our understanding of the phenomenon, similar to how citation and publication datasets have accelerated the emerging field of science of science10,11.

Studying mentorship requires access to a broad set of relationship types, including publication. There are a few data sources for mentorship in science (Table 1); here, we list a handful of them. The Mathematics Genealogy Project (MGP)12 is an online database for academic genealogy only in mathematics, though more broadly construed to include “mathematics education, statistics, computer science, or operations research”. MGP lacks publication records. The Astronomy Genealogy Project is a similar online database confined to astronomy that also does not have publication information13,14. ProQuest is a database of theses and dissertations predominantly from the US15. Although it is multi-disciplinary, it does not disambiguate researchers, making it hard to link advisor and advisee and construct lineages. Also, it does not provide publication information. More importantly, ProQuest is not publicly available, and its access is rate-limited. Apart from genealogy and thesis data, other researchers have proposed to use paper co-authorships as indirect signals of mentorship16. However, mentorship can start much earlier than publishing works, and it does not necessarily lead to publications17. To summarize, datasets about mentorship in science are in general fragmented.

1School of Data Science, City University of Hong Kong, Kowloon, Hong Kong.
2School of Information Studies, Syracuse University, Syracuse, New York, 13244, USA.
3School of Information, University of Texas at Austin, Austin, Texas, 78712, USA.
4Oregon Hearing Research Center, Oregon Health and Science University, Portland, Oregon, 97239, USA.

Scientific Data | (2022) 9:467 | https://doi.org/10.1038/s41597-022-01578-x

| Database                      | Discipline | Country    | Tree | Publication data | Open | Demographics | Semantics |
|-------------------------------|------------|------------|------|------------------|------|--------------|-----------|
| Mentorship                    | all        | world-wide | ✓    | ✓                | ✓    | ✓            | ✓         |
| Academic Family Tree          | all        | world-wide | ✓    | ✓                | ✓    | ✗            | ✗         |
| Mathematics Genealogy Project | Math       | world-wide | ✓    | ✗                | ✓    | ✗            | ✗         |
| Astronomy Genealogy Project   | Astronomy  | world-wide | ✓    | ✗                | ✓    | ✗            | ✗         |
| ProQuest                      | all        | US         | ✗    | ✗                | ✗    | ✗            | ✗         |

Table 1. Comparison of existing datasets of mentorship in science with ours (Mentorship).

Here, we start from the Academic Family Tree (AFT) website18 and extend it to create a large-scale dataset of mentorship relationships in science. The AFT is an online portal for mentorship in science. We match each AFT profile to the Microsoft Academic Graph (MAG) we retrieved in September 2020, a leading bibliographic database19. Moreover, we apply natural language processing techniques to extract semantic representations of researchers based on deep learning content analysis of their publications. Given the recent interest to understand the role of gender and race/ethnicity in science20, we also provide estimations of researchers’ demographics. Compared to existing databases, our dataset, Mentorship (Mentorship with Semantic, Hierarchical, and demographIc Patterns), covers a wide range of disciplines with a richer set of features, making it ideal for studying generalizable mentorship patterns.
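As a rough illustration of the "Tree" property compared in Table 1, mentorship relationships can be treated as directed mentor-to-mentee links and traversed to recover a lineage. The sketch below is only an assumption about how such data might be handled: the identifiers, the `edges` list, and the helper names `build_children` and `descendants` are hypothetical and are not taken from the dataset or its schema.

```python
from collections import defaultdict, deque

# Hypothetical (mentor_id, mentee_id) pairs for illustration only;
# the real dataset's identifiers and column names are not shown here.
edges = [
    ("A", "B"),
    ("A", "C"),
    ("B", "D"),
    ("C", "D"),  # a mentee may have more than one mentor
    ("D", "E"),
]


def build_children(pairs):
    """Map each mentor to the list of their direct mentees."""
    children = defaultdict(list)
    for mentor, mentee in pairs:
        children[mentor].append(mentee)
    return children


def descendants(children, root):
    """Return everyone reachable from `root` by following mentor-to-mentee links."""
    seen = set()
    queue = deque([root])
    while queue:
        person = queue.popleft()
        # .get avoids creating empty entries for people without mentees
        for mentee in children.get(person, []):
            if mentee not in seen:
                seen.add(mentee)
                queue.append(mentee)
    return seen


children = build_children(edges)
direct_mentees = len(children["A"])   # number of direct trainees of "A"
lineage = descendants(children, "A")  # all academic descendants of "A"
```

A breadth-first traversal with a `seen` set is used here because a researcher can have several mentors, so the structure is a directed graph rather than a strict tree; the set keeps each person from being counted twice.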
We expect it to be the base of future studies covering various aspects of scientific mentorship, including semantic and demographic factors.

Methods

Data sources. The AFT website displays researchers’ profile information, such as direct academic parents and children and a limited set of publication records in PubMed. Originally focused on neuroscience21, AFT has been expanding to other areas such as chemistry, engineering, and education. As a crowd-sourcing website, AFT relies on content contributed by registered users. Contributions can be diverse, from adding a new researcher to adding mentors, trainees, and collaborators of an existing researcher. Visitors can also indicate whether the website has correctly matched a profile with a publication. Due to its crowd-sourced nature, researchers on AFT may not be a representative sample of the academic population. In AFT, the user-contributed data are stored in a database consisting of several tables that are available online22. These tables are the starting point for the present work. In particular, we use four tables: (1) the people tabl (...truncated)