Bases: DatasetBuilder
Dataset of consumer health questions released by Google for the Med-PaLM paper.
This HealthSearchQA dataset consists of 3,173 commonly searched consumer health questions. These questions were curated using seed medical conditions and their associated symptoms, reflecting real-world consumer concerns in the healthcare domain.
Paper: Large Language Models Encode Clinical Knowledge
2022 * Singhal, K., Azizi, S., Tu, T. et al.
https://arxiv.org/abs/2212.13138
No dataset splitting (only "train" split).
Dataset version used: https://huggingface.co/datasets/katielink/healthsearchqa
Source code in medplexity/benchmarks/healthsearchqa/healthsearchqa_dataset_builder.py
class HealthSearchQADatasetBuilder(DatasetBuilder):
    """Dataset of consumer health questions released by Google for the Med-PaLM paper.
    This HealthSearchQA dataset consists of 3,173 commonly searched consumer health questions. These questions were curated using seed medical conditions and their associated symptoms, reflecting real-world consumer concerns in the healthcare domain.
    Paper: Large Language Models Encode Clinical Knowledge
    2022 * Singhal, K., Azizi, S., Tu, T. et al.
    <https://arxiv.org/abs/2212.13138>
    No dataset splitting (only "train" split).
    Dataset version used: <https://huggingface.co/datasets/katielink/healthsearchqa>
    """

    def build_dataset(
        self,
        split_type: str = "train",
        config=None,
    ) -> Dataset[HealthSearchQADataPoint]:
        if config is None:
            config = {"subset": HealthSearchQASubsetConfig.all_data}
        dataset = self.loader.load(
            "katielink/healthsearchqa", config["subset"], split=split_type
        )
        questions = [HealthSearchQAQuestion(**row) for row in dataset]
        data_points = [
            HealthSearchQADataPoint(
                id=str(question.id),
                input=question.question,
                expected_output=None,
                metadata=None,
            )
            for question in questions
            if question.id is not None and question.question is not None
        ]
        return Dataset[HealthSearchQADataPoint](
            data_points=data_points, description=self.__doc__
        )
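The row-filtering step in build_dataset can be illustrated in isolation. The sketch below is not the medplexity implementation: `Question`, `DataPoint`, and `to_data_points` are hypothetical stand-ins built on plain dataclasses, and the sample rows are invented placeholders rather than entries from HealthSearchQA. It only mirrors the logic visible in the source above: rows missing an id or a question are dropped, ids are converted to strings, and no expected output or metadata is attached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Question:
    # Hypothetical stand-in for HealthSearchQAQuestion; both fields may be missing.
    id: Optional[int] = None
    question: Optional[str] = None


@dataclass
class DataPoint:
    # Hypothetical stand-in for HealthSearchQADataPoint.
    id: str
    input: str
    expected_output: Optional[str] = None
    metadata: Optional[dict] = None


def to_data_points(rows):
    """Convert raw rows to data points, skipping rows without an id or a question."""
    questions = [Question(**row) for row in rows]
    return [
        DataPoint(id=str(q.id), input=q.question)
        for q in questions
        if q.id is not None and q.question is not None
    ]


# Placeholder rows (not real dataset entries): one complete, two incomplete.
rows = [
    {"id": 1, "question": "Example question A?"},
    {"id": None, "question": "Row without an id"},
    {"id": 3, "question": None},
]

points = to_data_points(rows)
print([p.id for p in points])
```

Only the complete row survives the filter, and its integer id is stored as a string, which matches the `id=str(question.id)` conversion in the listing above.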