Training a large language model to do scientific data extraction for a corpus of materials properties
A wealth of experimental and simulation results is contained in the scientific literature. Extracting them in a machine-readable format is daunting, as it typically requires domain expertise, rigid schemas, or intensive manual labor. Recently, Dunn et al. (DOI: 10.48550/arXiv.2212.05238) fine-tuned a large language model, OpenAI's GPT-3, on a collection of text snippets containing relevant materials information paired with structured outputs, with surprisingly good results on held-out text examples. Here, we extend Dunn et al.'s work for use with the Materials Platform for Data Science (MPDS), a platform based on the Pauling File that represents over 1000 person-years of curated, manual data extraction. MPDS covers approximately 200 distinct property types spanning electronic, mechanical, optical, thermodynamic, and magnetic properties. We propose fine-tuning GPT-3 on the structured information MPDS provides, merging the semantic prowess of one of the most capable language models to date with one of the largest experimental materials properties databases currently available. We plan to evaluate in- and out-of-distribution extraction from scientific texts as a function of property type and journal. This work helps pave the way toward human-like semantic parsing that leverages the vastness of the scientific literature.
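To make the proposed setup concrete, the snippet-to-structured-output pairs described above can be serialized into the prompt/completion JSONL format that GPT-3 fine-tuning consumes. The records, field names, and separator conventions below are hypothetical illustrations, not MPDS's actual schema or export format; this is a minimal sketch of the data-preparation step, assuming each curated record pairs a source text snippet with its extracted property values.

```python
import json

# Hypothetical MPDS-style records: a text snippet paired with the
# structured property data a curator extracted from it.
records = [
    {
        "snippet": "The band gap of rutile TiO2 was measured as 3.0 eV.",
        "extracted": {"formula": "TiO2", "property": "band gap",
                      "value": 3.0, "units": "eV"},
    },
    {
        "snippet": "We report a bulk modulus of 160 GPa for MgO.",
        "extracted": {"formula": "MgO", "property": "bulk modulus",
                      "value": 160, "units": "GPa"},
    },
]

def to_finetune_jsonl(records):
    """Convert snippet/extraction pairs into prompt/completion
    JSONL lines for GPT-3 fine-tuning."""
    lines = []
    for rec in records:
        # A fixed separator marks the end of the prompt; the completion
        # is the structured extraction serialized as JSON, with a stop
        # token so generation ends cleanly at inference time.
        prompt = rec["snippet"] + "\n\n###\n\n"
        completion = " " + json.dumps(rec["extracted"]) + " END"
        lines.append(json.dumps({"prompt": prompt, "completion": completion}))
    return "\n".join(lines)

print(to_finetune_jsonl(records))
```

At inference time, the fine-tuned model would be prompted with an unseen snippet plus the same separator, and its completion parsed back into JSON, which is how held-out evaluation by property type and journal could be scored.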