Discussion:
Write ordering in Ext4
Arul Selvan
2013-06-03 05:33:39 UTC
Greetings. I am Arul Selvan and I work for Novell. I am exploring the Ext4 architecture; more specifically, I would like to understand write ordering: when the same block is modified more than once, how are the writes ordered? Could you point me to the documentation or the specific source file to look at?
Andreas Dilger
2013-06-03 14:47:38 UTC
Post by Arul Selvan
Greetings. I am Arul Selvan and I work for Novell. I am exploring the Ext4 architecture; more specifically, I would like to understand write ordering: when the same block is modified more than once, how are the writes ordered? Could you point me to the documentation or the specific source file to look at?
Writes in memory to the same file are serialized by i_mutex, but may
modify the same page in memory repeatedly.

When that page is being written to disk, it will be marked with the
page writeback flag, in order to stabilize the content, and allow consistent
checksums (e.g. for MD RAID or disks with T10-DIF). This may block
any further writes from modifying the same page as it is being
submitted to disk, depending on the kernel version and the
requirements of the underlying storage. Once the disk write has been
finished, the writeback bit is cleared and the page can be modified again.

In all cases, the writes to a single page are ordered, but there is no
_guarantee_ about writes to different data blocks being ordered.
The ext4 journal will in fact impose some order on data writes,
by ensuring that the data from all writes associated with a transaction
are flushed before the data for the next transaction.

Since fsync() of any file commits the current transaction, this has
the side-effect that any fsync causes all older writes to be committed. This is NOT required by POSIX, and applications that depend on this behavior are not portable to/safe on other filesystems.
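
A portable application would therefore call fsync() on every file whose contents it needs on disk, rather than counting on one fsync() to carry earlier writes to other files along with it. A minimal sketch in Python, where `write_durably` is a hypothetical helper (not part of any library) and the file names in the usage note are made up:

```python
import os


def write_durably(path, data):
    """Write data to path and fsync that specific file before returning.

    Returns the number of bytes written. Hypothetical helper for illustration.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush this file's data to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(data)
```

Calling `write_durably("a.dat", ...)` and then `write_durably("b.dat", ...)` syncs each file explicitly, so the code does not rely on the ext4 side effect described above.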

Cheers, Andreas
Arul Selvan
2013-06-04 17:17:38 UTC
Thanks, that answered my question. One more question: is it possible to disable delayed block allocation in ext4?
Eric Sandeen
2013-06-04 17:33:25 UTC
Post by Arul Selvan
Thanks, that answered my question. One more question: is it possible to disable delayed block allocation in ext4?
If you mean turning off delayed allocation, look no further than the mount
options documented in the kernel tree, Documentation/filesystems/ext4.txt:

nodelalloc      Disable delayed allocation. Blocks are allocated
                when the data is copied from userspace to the
                page cache, either via the write(2) system call
                or when an mmap'ed page which was previously
                unallocated is written for the first time.
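
To confirm whether a given mount is actually running with this option, one approach is to look at the option list in /proc/mounts. The sketch below parses text in that format; `delalloc_disabled` and the sample entry are hypothetical, and it assumes the kernel lists `nodelalloc` among the mount options when delayed allocation is off, which is worth verifying on your kernel version.

```python
# Sample line in /proc/mounts format (made-up device and mount point).
SAMPLE = "/dev/sda1 /data ext4 rw,relatime,nodelalloc,data=ordered 0 0\n"


def delalloc_disabled(mounts_text, mount_point):
    """Report whether an ext4 mount lists the nodelalloc option.

    Returns True or False for an ext4 entry at mount_point, or None if
    no such entry is found. Hypothetical helper for illustration.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        where, fstype, options = fields[1], fields[2], fields[3]
        if where == mount_point and fstype == "ext4":
            # A later entry for the same mount point shadows earlier ones.
            result = "nodelalloc" in options.split(",")
    return result
```

On a live system the text would come from reading /proc/mounts, for example `delalloc_disabled(open("/proc/mounts").read(), "/data")`.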

Out of curiosity, why do you want to turn off delalloc?

-Eric
_______________________________________________
Ext3-users mailing list
https://www.redhat.com/mailman/listinfo/ext3-users
Theodore Ts'o
2013-06-04 19:08:59 UTC
Post by Arul Selvan
Thanks, that answered my question. One more question: is it possible
to disable delayed block allocation in ext4?
The fact that you are asking all of these questions is making me very
nervous. Why do you care? Application programmers should ***not***
be depending on low-level file system behavior.

If you care about what might happen after a crash, you need to use
fsync().
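
For the common case of updating a file so that a crash leaves either the old or the new contents, a widely used pattern is: write a temporary file, fsync() it, rename it over the target, then fsync() the containing directory. The sketch below is one way to write that in Python on a POSIX system; `replace_durably` and the `.tmp` suffix are hypothetical choices, and it assumes the temporary file and the target live on the same filesystem.

```python
import os


def replace_durably(path, data):
    """Replace the file at path with data using write, fsync, rename.

    Hypothetical helper for illustration; assumes a POSIX system where a
    directory can be opened read-only and passed to fsync().
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temporary name

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            view = view[os.write(fd, view):]
        os.fsync(fd)  # new contents reach disk before the rename
    finally:
        os.close(fd)

    # Atomically swap the new file into place.
    os.replace(tmp_path, path)

    # Sync the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Each fsync() here targets something the application explicitly cares about (the file data, then the directory entry), so nothing depends on filesystem-specific ordering.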

- Ted
