Bases: DatasetBuilder
PubMedQA is a biomedical QA dataset designed to answer research questions with yes/no/maybe. The dataset consists of 1k expert-annotated questions, 61.2k unlabeled questions, and an additional 211.3k artificially generated QA instances. Every instance contains a question sourced or derived from a research article title, context from the abstract without its conclusion, a long answer in the form of the abstract's conclusion, and a summarized yes/no/maybe answer.
Paper: PubMedQA: A Dataset for Biomedical Research Question Answering
13 Sep 2019 · Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu
https://arxiv.org/abs/1909.06146
Only train split available.
Divided into three subsets:
- pqa_artificial: 211.3k artificially generated QA instances
- pqa_labeled: 1k expert-annotated questions
- pqa_unlabeled: 61.2k unlabeled questions
Dataset version used: https://huggingface.co/datasets/pubmed_qa
Source code in medplexity/benchmarks/pubmedqa/pubmedqa_dataset_builder.py
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95 | class PubmedQADatasetBuilder(DatasetBuilder):
"""PubMedQA is a biomedical QA dataset designed to answer research questions with yes/no/maybe. The dataset consists of 1k expert-annotated questions, 61.2k unlabeled questions, and an additional 211.3k artificially generated QA instances. Every instance contains a question sourced or derived from a research article title, context from the abstract without its conclusion, a long answer in the form of the abstract's conclusion, and a summarized yes/no/maybe answer.
Paper: PubMedQA: A Dataset for Biomedical Research Question Answering
13 Sep 2019 · Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu
<https://arxiv.org/abs/1909.06146>
Only train split available.
Divided into three subsets:
- pqa_artificial: 211.3k artificially generated QA instances
- pqa_labeled: 1k expert-annotated questions
- pqa_unlabeled: 61.2k unlabeled questions
Dataset version used: <https://huggingface.co/datasets/pubmed_qa>
"""
EXAMPLE_QUESTIONS_PATH = Path(__file__).resolve().parent / "examples.json"
def build_dataset(
self, split_type: PubMedQADatasetSplitType = "train", config=None
) -> Dataset[PubmedQADataPoint]:
if config is None:
config = {"subset": PubMedQADatasetTypes.pqa_labeled}
dataset = self.loader.load("pubmed_qa", config["subset"], split=split_type)
questions = [PubMedQAQuestion(**row) for row in dataset]
options = ["Yes", "No", "Maybe"]
data_points = [
PubmedQADataPoint(
id=f"{split_type}-{i}",
input=MultipleChoiceInput(
question=question.question,
options=options,
context=" ".join(question.context.contexts),
),
expected_output=format_answer_to_letter(
options, question.final_decision.value.capitalize().strip()
),
metadata=PubmedQAMetadata(
explanation=question.long_answer,
labels=question.context.labels,
meshes=question.context.meshes,
),
)
for i, question in enumerate(questions)
]
return Dataset[PubmedQADataPoint](
data_points=data_points, description=self.__doc__
)
|