Open MPI IO ("OMPIO")
=====================

- .. TODO How can I create a TOC just for this page here at the top?
+ OMPIO is an Open MPI-native implementation of the MPI I/O functions
+ defined in the MPI specification.

- /////////////////////////////////////////////////////////////////////////
-
- What is the OMPIO?
- ------------------
-
- OMPIO is an implementation of the MPI I/O functions defined in version
- two of the Message Passing Interface specification. The main goals of
- OMPIO are:
+ The main goals of OMPIO are:

#. Increase the modularity of the parallel I/O library by separating
   MPI I/O functionality into sub-frameworks.
@@ -26,43 +20,34 @@ OMPIO are:
   usage of optimized data type to data type copy operations.

OMPIO is fundamentally a component of the ``io`` framework in Open
- MPI. Upon opening a file, the OMPIO component initializes a number of
+ MPI. Upon opening a file, the OMPIO component initializes a number of
sub-frameworks and their components, namely:

* ``fs``: responsible for all file management operations
- * ``fbtl``: support for individual blocking and non-blocking
+ * ``fbtl``: support for blocking and non-blocking individual
  I/O operations
- * ``fcoll``: support for collective blocking and non-blocking I/O
+ * ``fcoll``: support for blocking and non-blocking collective I/O
  operations
* ``sharedfp``: support for all shared file pointer operations.

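
Since each of these is a regular MCA framework, the usual component
selection mechanisms apply to them. As a quick, purely illustrative
example (the application name below is a placeholder), OMPIO itself can
be requested explicitly through the ``io`` framework:

.. code-block:: sh

   # explicitly request the ompio component of the io framework
   shell$ mpirun --mca io ompio -n 4 ./my_parallel_io_app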

- /////////////////////////////////////////////////////////////////////////
-
- How can I use OMPIO?
- --------------------
-
- OMPIO is included in Open MPI and is used by default when invoking
- MPI IO API functions.

- /////////////////////////////////////////////////////////////////////////
+ MCA parameters of OMPIO and associated frameworks
+ -------------------------------------------------

- How do I know what MCA parameters are available for tuning the performance of OMPIO?
- ------------------------------------------------------------------------------------
-
- The ``ompi_info`` command can display all the parameters available for the
- OMPIO ``io``, ``fcoll``, ``fs``, and ``sharedfp`` components:
+ The :ref:`ompi_info(1) <man1-ompi_info>` command can display all the
+ parameters available for the OMPIO ``io``, ``fcoll``, ``fs``,
+ ``fbtl``, and ``sharedfp`` components:

.. code-block:: sh

- shell$ ompi_info --param io ompio
- shell$ ompi_info --param fcoll all
- shell$ ompi_info --param fs all
- shell$ ompi_info --param sharedfp all
-
- /////////////////////////////////////////////////////////////////////////
+ shell$ ompi_info --param io ompio --level 9
+ shell$ ompi_info --param fcoll all --level 9
+ shell$ ompi_info --param fs all --level 9
+ shell$ ompi_info --param fbtl all --level 9
+ shell$ ompi_info --param sharedfp all --level 9

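
Any parameter reported by ``ompi_info`` can then be set through the
usual Open MPI mechanisms, e.g. on the ``mpirun`` command line or via
an ``OMPI_MCA_`` environment variable. The parameter value and
application name below are only illustrations:

.. code-block:: sh

   # value and application name are examples only
   shell$ mpirun --mca io_ompio_bytes_per_agg 33554432 -n 8 ./my_parallel_io_app
   shell$ export OMPI_MCA_io_ompio_bytes_per_agg=33554432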

- How can I choose the right component for a sub-framework of OMPIO?
- ------------------------------------------------------------------
+ OMPIO sub-framework components
+ ------------------------------

The OMPIO architecture is designed around sub-frameworks, which allow
you to develop a relatively small amount of code optimized for a
@@ -86,35 +71,51 @@ mechanism available in Open MPI to influence a parameter value, e.g.:

``fs`` and ``fbtl`` components are typically chosen based on the file
system type utilized (e.g. the ``pvfs2`` component is chosen when the
- file is located on an PVFS2 file system, the ``lustre`` component is
- chosen for Lustre file systems, etc.).
-
- The ``fcoll`` framework provides several different implementations,
- which provide different levels of data reorganization across
- processes. ``two_phase``, ``dynamic`` segmentation, ``static``
- segmentation and ``individual`` provide decreasing communication costs
- during the shuffle phase of the collective I/O operations (in the
- order listed here), but provide also decreasing contiguity guarantuees
- of the data items before the aggregators read/write data to/from the
- file. The current decision logic in OMPIO is using the file view
- provided by the application as well as file system level
- characteristics (stripe width of the file system) in the selection
- logic of the fcoll framework.
+ file is located on a PVFS2/OrangeFS file system, the ``lustre``
+ component is chosen for Lustre file systems, etc.). The ``ufs`` ``fs``
+ component is used if no file system-specific component is available
+ (e.g. local file systems, NFS, BeeGFS, etc.), and the ``posix``
+ ``fbtl`` component is used as the default component for read/write
+ operations.
+
+ The ``fcoll`` framework provides several different components. The
+ current decision logic in OMPIO uses the file view provided by the
+ application as well as file system level characteristics (e.g. file
+ system, stripe width) to determine which component to use. The most
+ important ``fcoll`` components are:
+
+ * ``dynamic_gen2``: the default component used on Lustre file
+ systems. This component is based on the two-phase I/O algorithm with
+ a static file partitioning strategy, i.e. an aggregator process will
+ by default only write data to a single storage server.
+
+ * ``vulcan``: the default component used on all other file
+ systems. This component is based on the two-phase I/O algorithm with
+ an even file partitioning strategy, i.e. each of the *n* aggregators
+ will write 1/n-th of the overall file.
+
+ * ``individual``: this component executes all collective I/O
+ operations in terms of individual I/O operations.
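
The automatic selection can be overridden for experimentation by
requesting one of these components explicitly; the process count and
application name below are placeholders:

.. code-block:: sh

   # force the vulcan collective I/O component for this run
   shell$ mpirun --mca fcoll vulcan -n 32 ./my_parallel_io_app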

The ``sharedfp`` framework provides a different implementation of the
- shared file pointer operations depending on file system features, such
- as:
+ shared file pointer operations depending on file system features.
+
+ * ``lockfile``: this component will be used on file systems which
+ support file locking.
+
+ * ``sm``: component used in scenarios in which all processes of the
+ communicator are on the same physical node.

- * ``lockfile``: support for file locking.
- * ``sm``: locality of the MPI processes in the communicator that has
- been used to open the file.
- * ``individual``: guarantees by the application on using only a subset
- of the available functionality (i.e. write operations only).
+ * ``individual``: a component that can be used if neither of the other
+ two components is available. This component, however, provides only
+ limited functionality (i.e. write operations only).

- /////////////////////////////////////////////////////////////////////////
+ .. note:: See :ref:`the section on the individual sharedfp component
+ <label-ompio-individual-sharedfp>` to understand its
+ functionality and limitations.
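
As with the other sub-frameworks, a specific ``sharedfp`` component can
be requested explicitly if the automatic selection is not appropriate
for a given setup (the application name is a placeholder):

.. code-block:: sh

   # e.g. force the shared-memory based shared file pointer component
   shell$ mpirun --mca sharedfp sm -n 8 ./my_parallel_io_app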

- How can I tune OMPIO parameters to improve performance?
- -------------------------------------------------------
+ Tuning OMPIO performance
+ ------------------------

The most important parameters influencing the performance of an I/O
operation are listed below:
@@ -147,16 +148,25 @@ operation are listed below:
   regular 2-D or 3-D data decomposition can try changing this
   parameter to 4 (hybrid) algorithm.

- /////////////////////////////////////////////////////////////////////////
+ #. ``fs_ufs_lock_algorithm``: Parameter used to determine what part of
+ a file needs to be locked for a file operation. Since the ``ufs``
+ ``fs`` component is used on multiple file systems, OMPIO
+ automatically chooses the value required for correctness on all
+ file systems, e.g. enforcing locking on an NFS file system, while
+ disabling locking on a local file system. Users can adjust the
+ required locking behavior based on their use case, since the
+ default value might often be too restrictive for their application.
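
For example, the supported values of this parameter (and its default)
can be inspected with ``ompi_info``, and the parameter can then be
overridden at run time. The value below is only a placeholder; consult
the ``ompi_info`` output for the values supported by your installation:

.. code-block:: sh

   shell$ ompi_info --param fs ufs --level 9
   # value and application name are examples only
   shell$ mpirun --mca fs_ufs_lock_algorithm 1 -n 8 ./my_parallel_io_app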

- What are the main parameters of the ``fs`` framework and components?
- --------------------------------------------------------------------
+ Setting stripe size and stripe width on parallel file systems
+ --------------------------------------------------------------

- The main parameters of the ``fs`` components allow you to manipulate
- the layout of a new file on a parallel file system.
+ Many ``fs`` components allow you to manipulate the layout of a new
+ file on a parallel file system. Note that many file systems only
+ allow changing these settings upon file creation, i.e. modifying these
+ values for an already existing file might not be possible.

#. ``fs_pvfs2_stripe_size``: Sets the number of storage servers for a
- new file on a PVFS2 file system. If not set, system default will be
+ new file on a PVFS2/OrangeFS file system. If not set, the system default will be
   used. Note that this parameter can also be set through the
   ``stripe_size`` MPI Info value.

@@ -175,78 +185,24 @@ the layout of a new file on a parallel file system.
   will be used. Note that this parameter can also be set through the
   ``stripe_width`` MPI Info value.
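
For example, a stripe-related ``fs`` parameter can be set for a single
run in the usual way; whether it has an effect depends on the file
system, and as noted above it typically only applies to newly created
files. The value and application name are placeholders:

.. code-block:: sh

   # request 8 storage servers for newly created PVFS2/OrangeFS files
   shell$ mpirun --mca fs_pvfs2_stripe_size 8 -n 16 ./my_parallel_io_app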

- ////////////////////////////////////////////////////////////////////////
-
- What are the main parameters of the ``fbtl`` framework and components?
- ----------------------------------------------------------------------
-
- No performance relevant parameters are currently available for the
- ``fbtl`` components.
-
- /////////////////////////////////////////////////////////////////////////
-
- What are the main parameters of the ``fcoll`` framework and components?
- -----------------------------------------------------------------------
-
- The design of the ``fcoll`` frameworks maximizes the utilization of
- parameters of the OMPIO component, in order to minimize the number of similar
- MCA parameters in each component.
-
- For example, the ``two_phase``, ``dynamic``, and ``static`` components
- all retrieve the ``io_ompio_bytes_per_agg`` parameter to define the
- collective buffer size and the ``io_ompio_num_aggregators`` parameter
- to force the utilization of a given number of aggregators.
-
- /////////////////////////////////////////////////////////////////////////
-
- What are the main parameters of the ``sharedfp`` framework and components?
- --------------------------------------------------------------------------
+ Using GPU device buffers in MPI File I/O operations
+ ----------------------------------------------------

- No performance relevant parameters are currently available for the
- ``sharedfp`` components.
+ OMPIO supports reading and writing directly to/from GPU buffers using
+ the MPI I/O interfaces. Using this feature simplifies managing buffers
+ that are exclusively used on GPU devices, and removes the necessity to
+ stage data through host memory for file I/O operations.

- /////////////////////////////////////////////////////////////////////////
+ Internally, OMPIO splits a user buffer into chunks when performing the
+ read/write operation. The chunk size used by OMPIO can have a
+ significant influence on the performance of file I/O operations
+ from device buffers, and can be controlled using the
+ ``io_ompio_pipeline_buffer_size`` MCA parameter.
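
The chunk size can be adjusted like any other MCA parameter; the value
below is only an example starting point, and the application name is a
placeholder. Check the ``ompi_info`` output for the default value and
unit of this parameter:

.. code-block:: sh

   shell$ ompi_info --param io ompio --level 9 | grep pipeline_buffer_size
   # example value only
   shell$ mpirun --mca io_ompio_pipeline_buffer_size 33554432 -n 4 ./my_gpu_io_app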

- How do I tune collective I/O operations?
- ----------------------------------------
+ .. _label-ompio-individual-sharedfp:

- The most influential parameter that can be tuned in advance is the
- ``io_ompio_bytes_per_agg`` parameter of the ``ompio`` component. This
- parameter is essential for the selection of the collective I/O
- component as well for determining the optimal number of aggregators
- for a collective I/O operation. It is a file system-specific value,
- independent of the application scenario. To determine the correct
- value on your system, take an I/O benchmark (e.g., the IMB or IOR
- benchmark) and run an individual, single process write test. E.g., for
- IMB:
-
- .. code-block:: sh
-
- shell$ mpirun -n 1 ./IMB-IO S_write_indv
-
- For IMB, use the values obtained for AGGREGATE test cases. Plot the
- bandwidth over the message length. The recommended value for
- ``io_ompio_bytes_per_agg`` is the smallest message length which
- achieves (close to) maximum bandwidth from that process's
- perspective.
-
- .. note:: Make sure that the ``io_ompio_cycle_buffer_size`` parameter
- is set to -1 when running this test, which is its default
- value). The value of ``io_ompio_bytes_per_agg`` could be
- set by system administrators in the system-wide Open MPI
- configuration file, or by users individually. See :ref:`this
- FAQ item <label-running-setting-mca-param-values>` on setting
- MCA parameters for details.
-
- For more exhaustive tuning of I/O parameters, we recommend the
- utilization of the `Open Tool for Parameter Optimization (OTPO)
- <https://www.open-mpi.org/projects/otpo>`_, a tool specifically
- designed to explore the MCA parameter space of Open MPI.
-
- /////////////////////////////////////////////////////////////////////////
-
- When should I use the ``individual`` ``sharedfp`` component, and what are its limitations?
- ------------------------------------------------------------------------------------------
+ Using the ``individual`` ``sharedfp`` component and its limitations
+ --------------------------------------------------------------------

The ``individual`` sharedfp component provides an approximation of
shared file pointer operations that can be used for *write operations
@@ -257,33 +213,25 @@ support locking.

Conceptually, each process writes the data of a write_shared operation
into a separate file along with a time stamp. In every collective
- operation (latest in file_close), data from all individual files are
- merged into the actual output file, using the time stamps as the main
- criteria.
+ operation (or during the file_close operation), data from all
+ individual files are merged into the actual output file, using the
+ time stamps as the main criterion.

The component has certain limitations and restrictions, such as its
- relience on the synchronization accuracy of the clock on the cluster
+ reliance on the synchronization of the clocks on the individual cluster nodes
to determine the order between entries in the final file, which might
lead to some deviations compared to the actual calling sequence.

- /////////////////////////////////////////////////////////////////////////
+ Furthermore, the component only supports ``write`` operations; read
+ operations are not supported.
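
If an application only uses shared file pointers for writing and
neither ``lockfile`` nor ``sm`` is usable on a given system, the
component can be requested explicitly (the application name is a
placeholder):

.. code-block:: sh

   shell$ mpirun --mca sharedfp individual -n 8 ./my_writer_app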

- What other features of OMPIO are available?
- -------------------------------------------
+ Other features of OMPIO
+ -----------------------

OMPIO has a number of additional features, mostly directed towards
developers, which could occasionally also be useful to interested
end-users. These can typically be controlled through MCA parameters.

- * ``io_ompio_sharedfp_lazy_open``: By default, ``ompio`` does not
- establish the necessary data structures required for shared file
- pointer operations during file_open. It delays generating these data
- structures until the first utilization of a shared file pointer
- routine. This is done mostly to minimize the memory footprint of
- ``ompio``, and due to the fact that shared file pointer operations
- are rarely used compared to the other functions. Setting this
- parameter to 0 disables this optimization.
-
* ``io_ompio_coll_timing_info``: Setting this parameter will lead to a
  short report upon closing a file indicating the amount of time spent
  in communication and I/O operations of collective I/O operations
@@ -312,19 +260,3 @@ end-users. These can typically be controlled through MCA parameters.
  all the column indexes. The fourth row lists all the values and the
  fifth row gives the row index. A row index represents the position
  in the value array where a new row starts.
-
- /////////////////////////////////////////////////////////////////////////
-
- Known limitations
- -----------------
-
- OMPIO implements most of the I/O functionality of the MPI
- specification. There are, however, two not very commonly used
- functions that are not implemented as of today:
-
- * Switching from the relaxed consistency semantics of MPI to stricter, sequential
- consistency through the MPI_File_set_atomicity functions
-
- * Using user defined data representations
-
- .. error:: TODO Are these still accurate?