
Commit b44e1ee

Merge pull request #22 from florencejt/refactor/traintestsplit
Refactor/traintestsplit: Adding train test split customisability to the data loading
2 parents af26d5f + 0f5fdb8 commit b44e1ee

File tree

9 files changed: +353 additions, -67 deletions


.gitignore

Lines changed: 5 additions & 0 deletions
@@ -164,3 +164,8 @@ cython_debug/

# tracks the version
_version.py

# rogue directories from example notebooks running in local space
checkpoints/
loss_figures/
loss_logs/

docs/customising_training.rst

Lines changed: 63 additions & 0 deletions
@@ -11,6 +11,7 @@ We will cover the following topics:
* Number of epochs
* Checkpoint suffix modification
* Number of workers in PyTorch DataLoader
* Train/test and cross-validation splitting yourself

Early stopping
--------------
@@ -248,3 +249,65 @@ You can change the number of workers in the PyTorch DataLoader using the ``num_w
    fusion_model=example_model,
)

-----

Train/test and cross-validation splitting yourself
---------------------------------------------------

By default, fusilli splits your data randomly into train/test sets or cross-validation folds, based on the test size or number of folds you specify in the :func:`~.fusilli.data.prepare_fusion_data` function.

You can remove this randomness and specify the indices for the train and test sets, or for the individual cross-validation folds, by passing optional arguments to :func:`~.fusilli.data.prepare_fusion_data`.

For train/test splitting, the argument ``test_indices`` should be a list of indices for the test set. For example, to make the test set the first 6 data points in the overall dataset:

.. code-block:: python

    from fusilli.data import prepare_fusion_data
    from fusilli.train import train_and_save_models

    test_indices = [0, 1, 2, 3, 4, 5]

    datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        test_indices=test_indices,
    )
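If you would rather draw the test set at random but reproducibly, you can generate the index list yourself before passing it in. A minimal sketch using only the Python standard library; the dataset size and test fraction here are hypothetical, not fusilli defaults:

```python
import random

# Hypothetical sizes for illustration; use the length of your own dataset.
n_samples = 20
test_fraction = 0.3

# Seed the generator so the same test set is drawn every run,
# then pass the resulting list as `test_indices`.
rng = random.Random(42)
test_indices = sorted(rng.sample(range(n_samples), k=int(n_samples * test_fraction)))
```

Because the sample is seeded, re-running the script reproduces the same split, which keeps results comparable across experiments.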

For specifying your own cross-validation folds, the argument ``own_kfold_indices`` should be a list of ``(train_indices, test_indices)`` tuples, one for each fold.

If you want non-random cross-validation folds through your data, you can either specify the folds explicitly, like so for 3 folds:

.. code-block:: python

    own_kfold_indices = [
        ([4, 5, 6, 7, 8, 9, 10, 11], [0, 1, 2, 3]),   # first fold
        ([0, 1, 2, 3, 8, 9, 10, 11], [4, 5, 6, 7]),   # second fold
        ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11]),   # third fold
    ]
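With hand-written folds it is easy to mistype an index, so a quick sanity check (not part of fusilli, just plain Python) can confirm that the test sets are disjoint and together cover every data point exactly once:

```python
own_kfold_indices = [
    ([4, 5, 6, 7, 8, 9, 10, 11], [0, 1, 2, 3]),   # first fold
    ([0, 1, 2, 3, 8, 9, 10, 11], [4, 5, 6, 7]),   # second fold
    ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11]),   # third fold
]

# The test sets should partition the whole dataset (here 12 data points).
all_test = sorted(i for _, test in own_kfold_indices for i in test)
assert all_test == list(range(12)), "test sets must cover each index exactly once"

# Within each fold, train and test must not overlap.
for train, test in own_kfold_indices:
    assert not set(train) & set(test), "train and test indices overlap"
```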

Or, to generate the folds automatically, use scikit-learn's `KFold functionality <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html>`_ outside of the fusilli functions, like so:

.. code-block:: python

    from sklearn.model_selection import KFold

    num_folds = 5

    own_kfold_indices = [
        (train_index, test_index)
        for train_index, test_index in KFold(n_splits=num_folds).split(range(len(dataset)))
    ]

    datamodule = prepare_fusion_data(
        kfold=True,
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        own_kfold_indices=own_kfold_indices,
        num_folds=num_folds,
    )
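The example above uses plain ``KFold``. If your binary labels are imbalanced, you could instead build the fold indices with scikit-learn's `StratifiedKFold <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html>`_ and pass them in the same way. This sketch assumes fusilli only consumes the resulting ``(train, test)`` index tuples; the ``labels`` list here is hypothetical:

```python
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels for illustration: 12 samples with an imbalanced
# binary outcome (9 negatives, 3 positives).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
own_kfold_indices = [
    (train_index.tolist(), test_index.tolist())
    for train_index, test_index in skf.split(range(len(labels)), labels)
]
```

Stratification keeps the class balance roughly constant across folds: with 3 positives and 3 folds, each test set receives exactly one positive sample, which plain ``KFold`` does not guarantee.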
