HealthSearchQA

Bases: DatasetBuilder

Dataset of consumer health questions released by Google for the Med-PaLM paper. HealthSearchQA consists of 3,173 commonly searched consumer health questions, curated from seed medical conditions and their associated symptoms to reflect real-world consumer concerns in the healthcare domain.

Paper: Large Language Models Encode Clinical Knowledge

2022 * Singhal, K., Azizi, S., Tu, T. et al. https://arxiv.org/abs/2212.13138

No dataset splitting (only "train" split).

Dataset version used: https://huggingface.co/datasets/katielink/healthsearchqa

Source code in medplexity/benchmarks/healthsearchqa/healthsearchqa_dataset_builder.py
class HealthSearchQADatasetBuilder(DatasetBuilder):
    """Dataset of consumer health questions released by Google for the Med-PaLM paper.
    This HealthSearchQA dataset consists of 3,173 commonly searched consumer health questions. These questions were curated using seed medical conditions and their associated symptoms, reflecting real-world consumer concerns in the healthcare domain.

    Paper: Large Language Models Encode Clinical Knowledge

    2022 * Singhal, K., Azizi, S., Tu, T. et al.
    <https://arxiv.org/abs/2212.13138>

    No dataset splitting (only "train" split).

    Dataset version used: <https://huggingface.co/datasets/katielink/healthsearchqa>
    """

    def build_dataset(
        self,
        split_type: str = "train",
        config=None,
    ) -> Dataset[HealthSearchQADataPoint]:
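        # Fall back to the full dataset when the caller passes no config.
        if config is None:
            config = {"subset": HealthSearchQASubsetConfig.all_data}

        # Only the "train" split exists for this dataset.
        dataset = self.loader.load(
            "katielink/healthsearchqa", config["subset"], split=split_type
        )

        # Parse each raw row into a typed question object.
        questions = [HealthSearchQAQuestion(**row) for row in dataset]

        # Skip rows missing an id or question text. The dataset contains
        # questions only, so there is no expected output or metadata.
        data_points = [
            HealthSearchQADataPoint(
                id=str(question.id),
                input=question.question,
                expected_output=None,
                metadata=None,
            )
            for question in questions
            if question.id is not None and question.question is not None
        ]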

        return Dataset[HealthSearchQADataPoint](
            data_points=data_points, description=self.__doc__
        )
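
The row-to-data-point step in `build_dataset` can be tried offline with a small sketch. This is an illustration under stated assumptions, not the library's code: `Question`, `DataPoint`, and `rows_to_data_points` are hypothetical stand-ins for `HealthSearchQAQuestion`, `HealthSearchQADataPoint`, and the body of `build_dataset`, and the sample rows are made up rather than taken from the dataset. It shows that rows missing an `id` or a `question` are dropped, and that ids are converted to strings.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Question:
    """Hypothetical stand-in for HealthSearchQAQuestion."""

    id: Optional[int] = None
    question: Optional[str] = None


@dataclass
class DataPoint:
    """Hypothetical stand-in for HealthSearchQADataPoint."""

    id: str
    input: str
    expected_output: Optional[str] = None
    metadata: Optional[dict] = None


def rows_to_data_points(rows: Iterable[dict]) -> List[DataPoint]:
    """Mirror the filtering and conversion done in build_dataset."""
    questions = [Question(**row) for row in rows]
    return [
        # Ids become strings; expected_output and metadata stay None.
        DataPoint(id=str(q.id), input=q.question)
        for q in questions
        # Keep only rows that have both an id and question text.
        if q.id is not None and q.question is not None
    ]


# Made-up rows for illustration only (not real dataset content).
sample_rows = [
    {"id": 1, "question": "illustrative question A"},
    {"id": None, "question": "row without an id"},
    {"id": 3, "question": None},
]

data_points = rows_to_data_points(sample_rows)
for point in data_points:
    print(point.id, point.input)
```

Only the first sample row survives the filter. Because every data point carries `expected_output=None`, any evaluation built on this dataset needs a reference-free scoring method.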