MedQA

Bases: DatasetBuilder

Multiple-choice questions based on the United States Medical Licensing Examination (USMLE).

Paper: What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams

28 Sep 2020 · Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits https://arxiv.org/abs/2009.13081

Currently uses only the med_qa_en_bigbio_qa subset of the dataset, but can be extended to other subsets.

Train/validation/test splits available.

We use the version hosted on the Hugging Face Hub: https://huggingface.co/datasets/bigbio/med_qa

Source code in medplexity/benchmarks/medqa/medqa_dataset_builder.py
class MedQADatasetBuilder(DatasetBuilder):
    """Multiple-choice questions based on the United States Medical Licensing Examination (USMLE).

    Paper: What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams

    28 Sep 2020 · Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits
    <https://arxiv.org/abs/2009.13081>

    Currently uses only the med_qa_en_bigbio_qa subset of the dataset, but can be extended to other subsets.

    Train/validation/test splits available.

    We use the version hosted on the Hugging Face Hub: <https://huggingface.co/datasets/bigbio/med_qa>
    """

    EXAMPLE_QUESTIONS_PATH = Path(__file__).resolve().parent / "examples.json"

    def build_dataset(
        self,
        split_type: MedQADatasetSplitType = "train",
        config=None,
    ) -> Dataset[MedQADataPoint]:
        if config is None:
            config = {"subset": MedQASubsetConfig.med_qa_en_bigbio_qa}

        dataset = self.loader.load("bigbio/med_qa", config["subset"], split=split_type)

        questions = [MedQAQuestion(**row) for row in dataset]

        data_points = [
            MedQADataPoint(
                id=question.id,
                input=MultipleChoiceInput(
                    question=question.question,
                    options=question.choices,
                ),
                # always expect just one answer
                expected_output=format_answer_to_letter(
                    question.choices, question.answer[0]
                ),
                metadata=None,
            )
            for question in questions
        ]

        return Dataset[MedQADataPoint](
            data_points=data_points, description=self.__doc__
        )
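
The builder above turns each raw row into a data point whose expected output is a letter rather than the answer text. The following is a minimal sketch of that conversion, not the library's implementation: `ChoiceQuestion`, `answer_to_letter`, and `to_data_point` are hypothetical stand-ins for `MedQAQuestion`, `format_answer_to_letter`, and the `MedQADataPoint` construction. It assumes that the helper maps an answer to the letter of its position in the list of choices, which the source does not show.

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class ChoiceQuestion:
    """Hypothetical stand-in for MedQAQuestion; fields mirror those used by the builder."""

    id: str
    question: str
    choices: list[str]
    answer: list[str]


def answer_to_letter(choices: list[str], answer: str) -> str:
    """Return the letter (A, B, C, ...) for the position of `answer` in `choices`.

    Assumed behavior only; the real format_answer_to_letter may differ.
    Raises ValueError if `answer` is not one of `choices`.
    """
    return ascii_uppercase[choices.index(answer)]


def to_data_point(q: ChoiceQuestion) -> dict:
    """Build a plain-dict analogue of the data point created in build_dataset."""
    return {
        "id": q.id,
        "input": {"question": q.question, "options": q.choices},
        # Only the first answer is used, as in the builder.
        "expected_output": answer_to_letter(q.choices, q.answer[0]),
        "metadata": None,
    }


# Placeholder row with made-up content, shaped like the fields the builder reads.
row = {
    "id": "example-1",
    "question": "Example question text",
    "choices": ["first option", "second option", "third option"],
    "answer": ["third option"],
}
data_point = to_data_point(ChoiceQuestion(**row))
print(data_point["expected_output"])
```

Storing a letter keeps the expected output independent of the option wording, so a model's reply can be compared against a single character.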