What is the goal?

Large-scale language modeling and natural language prompting have demonstrated exciting capabilities for few- and zero-shot learning in NLP. However, translating these successes to specialized domains such as biomedicine remains challenging, due in part to biomedical NLP’s significant dataset debt: the technical costs associated with datasets that are not consistently documented or easily incorporated into popular machine learning frameworks. To help address these challenges, we are launching a hackathon to create an open-source community resource of over 150 biomedical datasets. We need your help!

Our goals

Provide lightweight, programmatic access to biomedical datasets via the Datasets API
Improve documentation of dataset provenance, licensing, and other key attributes
Make it easier to generate prompt-based supervision and remix datasets using schemas standardized by task type
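
To illustrate the third goal, here is a minimal sketch of why a schema shared across a task type helps with prompting and remixing. Everything in it is an assumption made for illustration, not the project's actual schema: the TextClassificationExample class, its field names, the two source record formats, and the prompt template are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextClassificationExample:
    """Hypothetical shared schema for text-classification datasets."""
    id: str
    text: str
    labels: List[str]


def from_source_a(record: Dict) -> TextClassificationExample:
    # Hypothetical source A stores a single label under "category".
    return TextClassificationExample(
        id=str(record["doc_id"]),
        text=record["abstract"],
        labels=[record["category"]],
    )


def from_source_b(record: Dict) -> TextClassificationExample:
    # Hypothetical source B stores several labels in one semicolon-separated string.
    return TextClassificationExample(
        id=str(record["pmid"]),
        text=record["sentence"],
        labels=[label.strip() for label in record["tags"].split(";")],
    )


def to_prompt(example: TextClassificationExample) -> str:
    # A single template serves every dataset that has been mapped to the schema.
    return f"Text: {example.text}\nWhich labels apply?\nAnswer: {', '.join(example.labels)}"


# Records from two differently shaped sources end up in one mixed list.
examples = [
    from_source_a({"doc_id": 1, "abstract": "Aspirin reduces fever.", "category": "drug"}),
    from_source_b({"pmid": "42", "sentence": "BRCA1 is linked to cancer.", "tags": "gene; disease"}),
]
prompts = [to_prompt(e) for e in examples]
```

Once each source has its own small conversion function, the prompt template and any downstream code only need to know about the shared schema, which is what makes mixing datasets of the same task type straightforward.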

Where should I start?

Go to the GitHub repository.

Join the Discord server.

The hackathon runs from 2nd April 2022 to 15th April 2022. We have detailed instructions for participation on our GitHub page.

What will I do during the hackathon?

We’re asking participants to implement standardized data loading scripts for a curated list of biomedical datasets. Visit our project board to volunteer for specific datasets.

Implementing three or more dataset loaders guarantees authorship on our forthcoming academic paper. We recognize that some datasets require more effort than others, so please reach out if you have questions. Our goal is to be inclusive with credit!

What is the BigScience initiative?