Bases: DatasetBuilder
Multiple-choice questions based on the United States Medical Licensing Examination (USMLE).
Paper: What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
28 Sep 2020 · Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits
https://arxiv.org/abs/2009.13081
Currently uses only the med_qa_en_bigbio_qa subset of the dataset, but can be extended to other subsets.
Train/validation/test splits available.
We use the following version uploaded on HuggingFace datasets: https://huggingface.co/datasets/bigbio/med_qa
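Since the builder reads the subset from a `config` dictionary, switching subsets could look roughly like the sketch below. It only mirrors the default-handling logic with stand-in names: `SubsetName` and `resolve_subset` are hypothetical and not part of medplexity, and the `med_qa_zh_bigbio_qa` name is an assumption that should be checked against the bigbio/med_qa dataset card.

```python
from enum import Enum
from typing import Optional


class SubsetName(str, Enum):
    """Hypothetical stand-in for MedQASubsetConfig (not the real class)."""

    EN_BIGBIO_QA = "med_qa_en_bigbio_qa"
    # Assumed config name; verify against the bigbio/med_qa dataset card.
    ZH_BIGBIO_QA = "med_qa_zh_bigbio_qa"


def resolve_subset(config: Optional[dict] = None) -> str:
    """Return the subset name to load, defaulting to the English QA subset."""
    if config is None:
        config = {"subset": SubsetName.EN_BIGBIO_QA}
    return config["subset"].value


print(resolve_subset())
print(resolve_subset({"subset": SubsetName.ZH_BIGBIO_QA}))
```

Keeping the subset in a dictionary rather than a positional argument leaves room for further options later without changing the method signature.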
Source code in medplexity/benchmarks/medqa/medqa_dataset_builder.py
class MedQADatasetBuilder(DatasetBuilder):
    """Multiple-choice questions based on the United States Medical Licensing Examination (USMLE).

    Paper: What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
    28 Sep 2020 · Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits
    <https://arxiv.org/abs/2009.13081>

    Currently uses only the med_qa_en_bigbio_qa subset of the dataset, but can be extended to other subsets.
    Train/validation/test splits available.

    We use the following version uploaded on HuggingFace datasets: <https://huggingface.co/datasets/bigbio/med_qa>
    """

    EXAMPLE_QUESTIONS_PATH = Path(__file__).resolve().parent / "examples.json"

    def build_dataset(
        self,
        split_type: MedQADatasetSplitType = "train",
        config=None,
    ) -> Dataset[MedQADataPoint]:
        # Fall back to the English BigBio QA subset when no config is given.
        if config is None:
            config = {"subset": MedQASubsetConfig.med_qa_en_bigbio_qa}

        dataset = self.loader.load("bigbio/med_qa", config["subset"], split=split_type)

        # Validate each raw row by parsing it into a MedQAQuestion.
        questions = [MedQAQuestion(**row) for row in dataset]

        data_points = [
            MedQADataPoint(
                id=question.id,
                input=MultipleChoiceInput(
                    question=question.question,
                    options=question.choices,
                ),
                # always expect just one answer
                expected_output=format_answer_to_letter(
                    question.choices, question.answer[0]
                ),
                metadata=None,
            )
            for question in questions
        ]

        # The class docstring doubles as the dataset description.
        return Dataset[MedQADataPoint](
            data_points=data_points, description=self.__doc__
        )
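The conversion step relies on `format_answer_to_letter`, whose implementation is not shown in this listing. Below is a minimal sketch of what such a helper might do, assuming it returns the letter (A, B, C, ...) at the position of the answer text within the choices. The function `answer_to_letter` and the sample row are hypothetical stand-ins, not the library's code or real dataset content.

```python
from string import ascii_uppercase
from typing import List


def answer_to_letter(choices: List[str], answer: str) -> str:
    """Map an answer's text to its option letter (hypothetical stand-in)."""
    # list.index raises ValueError if the answer is not among the choices.
    index = choices.index(answer)
    if index >= len(ascii_uppercase):
        raise ValueError("more choices than available letters")
    return ascii_uppercase[index]


# Made-up example row, shaped like the fields used in build_dataset.
choices = ["Aspirin", "Ibuprofen", "Warfarin", "Heparin"]
answer = ["Warfarin"]  # stored as a list; only the first entry is used
print(answer_to_letter(choices, answer[0]))  # prints "C"
```

Raising on an unknown answer, rather than returning a default letter, would surface malformed rows early instead of silently producing a wrong expected output.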