Innovation and Entrepreneurship Research
Domain-specific language models have become an important tool in the social sciences. They transform text into data points that can then be used in further analysis. However, such models are usually trained for a single purpose: a model trained on scientific publications, for example, learns different features than one trained on patents. We develop a textual relatedness model that covers both the scientific and the patent domain and is optimized for similarity comparisons. During training, we use citations as a proxy for semantic similarity. Once the model is trained, citations are no longer required; the model relies only on the text of new documents to identify similar ones. Throughout the project, we employ different strategies to build and train the models and, after a thorough comparison, select the best-performing one for real-world applications.
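To illustrate how such a model is used after training, the following is a minimal sketch in Python with the sentence-transformers library. The model identifier "mpi-inno-comp/paecter" is an assumption used for illustration; substitute the actual released checkpoint. Note that no citation information is needed at inference time, only the document texts.

```python
# Minimal sketch: comparing two documents with a citation-informed
# embedding model via the sentence-transformers library.
# The model ID "mpi-inno-comp/paecter" is an assumption; replace it
# with the released checkpoint if it differs.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mpi-inno-comp/paecter")

# Two patent-style texts; citations are not required at this stage.
docs = [
    "A lithium-ion battery electrode comprising a silicon-carbon composite.",
    "An anode material for rechargeable batteries based on silicon nanoparticles.",
]

# Encode the texts into dense vectors and compare them by cosine similarity;
# higher values indicate greater textual relatedness.
embeddings = model.encode(docs)
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.3f}")
```

In a larger corpus, the same embeddings can be indexed once and queried repeatedly, so each new document only needs to be encoded a single time before similarity search.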
Publications
Ghosh, Mainak; Erhardt, Sebastian; Rose, Michael E.; Buunk, Erik; Harhoff, Dietmar (2024). PaECTER: Patent-Level Representation Learning Using Citation-Informed Transformers, arXiv preprint 2402.19411.
Erhardt, Sebastian; Ghosh, Mainak; Buunk, Erik; Rose, Michael E.; Harhoff, Dietmar (2022). Logic Mill – A Knowledge Navigation System, arXiv preprint 2301.00200.
External Funding
EPO ARP (European Patent Office, Academic Research Programme 2021)
People
Michael E. Rose, Ph.D.
Sebastian Erhardt, M.Sc.
Mainak Ghosh, M.Sc.
Erik Buunk, M.Sc.
Cheng Li, M.Sc.