Bases: DatasetBuilder
MTS-Dialog (Medical Training Summarization Dialog) is a comprehensive dataset featuring 1.7k doctor-patient conversations, along with their corresponding summaries, including section headers and contents.
The dataset is structured as follows:
-
Training set: Comprises 1,201 pairs of conversations and associated summaries, aimed at facilitating the training of models for medical dialogue understanding and summarization.
-
Validation set: Contains 100 pairs of conversations and their summaries, used for model tuning and intermediate evaluation.
-
Test sets: Includes two distinct test sets, each with 200 conversations and corresponding section headers and contents:
-
MTS-Dialog-TestSet-1-MEDIQA-Chat-2023.csv: Serves as the official test set for the MEDIQA-Chat 2023 challenge (Task A), focusing on chat-based medical consultations.
-
MTS-Dialog-TestSet-2-MEDIQA-Sum-2023.csv: Used as the official test set for the MEDIQA-Sum 2023 challenge (Task A & Task B), emphasizing the summary generation from medical dialogues.
Paper: "MTS-Dialog: A New Dataset for Medical Training Summarization in Doctor-Patient Conversations" - https://aclanthology.org/2023.eacl-main.1681
Authors: Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin
Dataset version from the GitHub repository: https://github.com/abachaa/MTS-Dialog
Source code in medplexity/benchmarks/mts_dialog/mts_dialog_dataset_builder.py
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103 | class MTSDialogDatasetBuilder(DatasetBuilder):
"""
MTS-Dialog (Medical Training Summarization Dialog) is a comprehensive dataset featuring 1.7k doctor-patient conversations, along with their corresponding summaries, including section headers and contents.
The dataset is structured as follows:
- Training set: Comprises 1,201 pairs of conversations and associated summaries, aimed at facilitating the training of models for medical dialogue understanding and summarization.
- Validation set: Contains 100 pairs of conversations and their summaries, used for model tuning and intermediate evaluation.
- Test sets: Includes two distinct test sets, each with 200 conversations and corresponding section headers and contents:
1. MTS-Dialog-TestSet-1-MEDIQA-Chat-2023.csv: Serves as the official test set for the MEDIQA-Chat 2023 challenge (Task A), focusing on chat-based medical consultations.
2. MTS-Dialog-TestSet-2-MEDIQA-Sum-2023.csv: Used as the official test set for the MEDIQA-Sum 2023 challenge (Task A & Task B), emphasizing the summary generation from medical dialogues.
Paper: "MTS-Dialog: A New Dataset for Medical Training Summarization in Doctor-Patient Conversations" - <https://aclanthology.org/2023.eacl-main.1681>
Authors: Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin
Dataset version from the GitHub repository: <https://github.com/abachaa/MTS-Dialog>
"""
def __init__(self, loader: Loader = None):
if loader is None:
loader = MTSDialogGithubDatasetLoader()
super().__init__(loader)
def build_dataset(
self,
split_type: MTSDialogDatasetSplitType = "test",
config=None,
) -> Dataset[MTSDialogDataPoint]:
dialog_raw_data = self.loader.load(split_type)
dialog_entries = [
MTSDialogEntry(**dialog_raw) for dialog_raw in dialog_raw_data
]
data_points = [
MTSDialogDataPoint(
id=str(dialog_entry.ID),
input=MTSDialogInput(
dialog=dialog_entry.dialogue,
),
expected_output=None,
metadata=MTSDialogMetadata(
section_header=dialog_entry.section_header,
reference_summary=dialog_entry.section_text,
),
)
for dialog_entry in dialog_entries
]
return Dataset[MTSDialogDataPoint](
data_points=data_points, description=self.__doc__
)
|