Discussion:
2GB memory limit running fsck on a +6TB device
s***@usansolo.net
2008-06-09 17:33:48 UTC
Dear Sirs,

Here's the scenario: a +6TB device on a 3ware 9550SX RAID controller,
running 32-bit Debian Etch with kernel 2.6.25.4 and the default e2fsprogs
version, "1.39+1.40-WIP-2006.11.14+dfsg-2etch1".

Running "tune2fs" returns that filesystem is in EXT3_ERROR_FS state, "clean
with errors":

# tune2fs -l /dev/sda4
tune2fs 1.40.10 (21-May-2008)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 7701b70e-f776-417b-bf31-3693dba56f86
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal dir_index filetype needs_recovery
sparse_super large_file
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 792576000
Block count: 1585146848

It's a backup storage server with more than 113 million files; this is the
output of "df -i":

# df -i /backup/
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda4 792576000 113385959 679190041 15% /backup


Running fsck.ext3 or fsck.ext2, I get:

# fsck.ext3 /dev/sda4
e2fsck 1.40.10 (21-May-2008)
Adding dirhash hint to filesystem.

/dev/sda4 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error allocating directory block array: Memory allocation failed
e2fsck: aborted

Here are some strace excerpts:

================================================================================
gettimeofday({1213032482, 940738}, NULL) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 16001}, ...}) = 0
write(1, "Pass 1: Checking ", 17Pass 1: Checking ) = 17
write(1, "inode", 5inode) = 5
write(1, "s, ", 3s, ) = 3
write(1, "block", 5block) = 5
write(1, "s, and sizes\n", 13s, and sizes
) = 13
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x404fa000
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x46376000
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x4c1f2000
mmap2(NULL, 198148096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x5206e000
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x5dd66000
mmap2(NULL, 748892160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x63be2000
mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x77488000) = 0x80ab000
mmap2(NULL, 1866375168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,
-1, 0) = 0x90615000
munmap(0x90615000, 962560) = 0
munmap(0x90800000, 86016) = 0
mprotect(0x90700000, 135168, PROT_READ|PROT_WRITE) = 0
mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = -1 ENOMEM (Cannot allocate memory)
================================================================================

It appears that fsck is trying to use more than 2GB of memory to store the
inode table relationships. The system has 4GB of physical RAM and 4GB of
swap; is there any way to limit the memory used by fsck, or any other
solution to check this filesystem? Would running fsck from a 64-bit LiveCD
solve the problem?

I also tried the latest stable e2fsprogs release, 1.40.10, and got the same
error :-/

Regards,

--
Santi Saez
Theodore Tso
2008-06-09 21:33:20 UTC
Post by s***@usansolo.net
It's a backup storage server with more than 113 million files.
It appears that fsck is trying to use more than 2GB of memory to store the
inode table relationships. The system has 4GB of physical RAM and 4GB of
swap; is there any way to limit the memory used by fsck, or any other
solution to check this filesystem? Would running fsck from a 64-bit LiveCD
solve the problem?
Yes, running with a 64-bit Live CD is one way to solve the problem.

If you are using e2fsprogs 1.40.10, there is another solution that may
help. Create an /etc/e2fsck.conf file with the following contents:

[scratch_files]
directory = /var/cache/e2fsck

...and then make sure /var/cache/e2fsck exists by running the command
"mkdir /var/cache/e2fsck".

This will cause e2fsck to store certain data structures (the ones that
grow very large on backup servers with vast numbers of hard-linked files)
in /var/cache/e2fsck instead of in memory. This will slow down e2fsck
by approximately 25%, but for large filesystems where you couldn't
otherwise get e2fsck to complete because you're exhausting the 2GB
per-process VM limitation of 32-bit systems, it should allow the check
to run through to completion.

- Ted
s***@usansolo.net
2008-06-10 15:34:35 UTC
Post by Theodore Tso
If you are using e2fsprogs 1.40.10, there is another solution that may
help. Create an /etc/e2fsck.conf file with the following contents:
[scratch_files]
directory = /var/cache/e2fsck
(..)
Post by Theodore Tso
This will cause e2fsck to store certain data structures (the ones that
grow very large on backup servers with vast numbers of hard-linked files)
in /var/cache/e2fsck instead of in memory. This will slow down e2fsck
by approximately 25%, but for large filesystems where you couldn't
otherwise get e2fsck to complete because you're exhausting the 2GB
per-process VM limitation of 32-bit systems, it should allow the check
to run through to completion.
I'm trying fsck.ext3 v1.40.8, backported from Lenny's package to Etch,
instead of v1.40.10, because we have the same scenario on all our backup
servers running BackupPC and the package must be distributed to all of
them. If needed, we can run tests with the latest version ;-)

fsck.ext3 started 4 hours ago and is still in "Pass 1: Checking inodes,
blocks, and sizes"; is that normal given that the filesystem has +113
million inodes?

I will send more info as Ted requested in "Call for testers w/ using
BackupPC" [1], but for now this is the scenario:

- fsck.ext3 is using more than 2GB of memory and no swap; the server has
4GB of physical RAM + 2GB of swap. This is the output of "pmap -d" with
the memory map:

# pmap -d 7014
7014: fsck.ext3 -y /dev/sda4
Address Kbytes Mode Offset Device Mapping
(..)
242fd000 1834768 rw--- 00000000242fd000 000:00000 [ anon ]
942c2000 582604 rw--- 00000000942c2000 000:00000 [ anon ]
(..)

All the output is available at: http://pastebin.com/f67115de2


- Files in "/var/cache/e2fsck" appears that grow very slow, I think, 300Kb
per hour aprox, now that's the size:

# ls -lh /var/cache/e2fsck/
total 170M
-rw------- 1 root root 76M 2008-06-10 17:24
7701b70e-f776-417b-bf31-3693dba56f86-dirinfo-VkmFXP
-rw------- 1 root root 95M 2008-06-10 17:24
7701b70e-f776-417b-bf31-3693dba56f86-icount-YO08bu


- fsck is using 100% of one CPU (it's a dual-processor motherboard);
strace output is available at:

http://pastebin.com/f68389cce


- More info:
* Kernel 2.6.25.4, i686 arch on a Debian Etch box.
* Storage: 3ware 9550SXU-16ML, 5.91TB in a RAID-5 of 14 x 500GB SATA
disks (ST3500630AS), 64kB stripe size (array is in optimal state)


Thanks to everyone for the advice :-)

[1] http://www.redhat.com/archives/ext3-users/2007-April/msg00017.html

--
Santi Saez
Theodore Tso
2008-06-10 18:38:55 UTC
Post by s***@usansolo.net
fsck.ext3 started 4 hours ago and is still in "Pass 1: Checking inodes,
blocks, and sizes"; is that normal given that the filesystem has +113
million inodes?
It depends on a lot of things: how big your files are on average, the
speed of your hard drives, and whether /var/cache/e2fsck is on the same
disk as the partition you are checking or on a separate spindle
(guess which is better :-).

It's always a good idea, when running e2fsck (aka fsck.ext3) directly
and/or on a terminal/console, to include the command-line option "-C 0".
This will display a progress bar, so you can gauge how it is
doing. (0 through 70% is pass 1, which requires scanning the inode
table and following all of the indirect blocks.)
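
For example, invoked directly on the console:

# fsck.ext3 -C 0 /dev/sda4

(With file descriptor 0, the completion bar is drawn on the terminal
itself.)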

- Ted
Santi Saez
2008-06-10 22:24:27 UTC
Post by Theodore Tso
It's always a good idea, when running e2fsck (aka fsck.ext3) directly
and/or on a terminal/console, to include the command-line option "-C 0".
This will display a progress bar, so you can gauge how it is
doing. (0 through 70% is pass 1, which requires scanning the inode
table and following all of the indirect blocks.)
Thanks for the tip! :-)

'/var/cache/e2fsck' is on the _same_ disk; perhaps mounting this directory
via iSCSI, NFS, etc. will improve things. We will try that in another
test.

I have enabled the progress bar by sending the SIGUSR1 signal to the
process, and it's still at 2% ;-(
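
For reference, e2fsck turns the progress bar on when it receives SIGUSR1
(and off again on SIGUSR2), so with the PID from the "pmap -d" output
above this was just:

# kill -USR1 7014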

"scratch_files" directory size is now 251M, it has grown 81MB in the
last 7 hours:

# ls -lh /var/cache/e2fsck/
total 251M
-rw------- 1 root root 112M 2008-06-11 00:09
7701b70e-f776-417b-bf31-3693dba56f86-dirinfo-VkmFXP
-rw------- 1 root root 139M 2008-06-11 00:09
7701b70e-f776-417b-bf31-3693dba56f86-icount-YO08bu

The strace output is the same, and memory usage is also the same.

I will give the process more time, but I think it will take too long to
complete, at least to finish pass 1; perhaps more than 50 hours? Given
that it is only at 2% after 12 hours of running, and pass 1 spans 0%
through 70%, is there any other solution?

Will ext4 solve this problem? I have not tested ext4 yet, but I have
read that it will speed up filesystem checking...

Regards,

--
Santi Saez
Theodore Tso
2008-06-10 23:01:24 UTC
Post by Santi Saez
'/var/cache/e2fsck' is on the _same_ disk; perhaps mounting this directory
via iSCSI, NFS, etc. will improve things. We will try that in another test.
I have enabled the progress bar by sending the SIGUSR1 signal to the
process, and it's still at 2% ;-(
The "scratch_files" directory is now 251M; it has grown 81MB in the last 7
hours.
hmm..... can you send me the output of dumpe2fs /dev/sdXX? You can
run that command while e2fsck is running, since it's read-only. I'm
curious exactly how big the filesystem is, and how many directories
are in the first part of the filesystem.

How big is the filesystem(s) that you are backing up via BackupPC, in
terms of size (megabytes) and files (number of inodes)? And how many
days of incremental backups are you keeping? Also, how often do files
change? Can you give a rough estimate of how many files get modified
per backup cycle?

Thanks,

- Ted
Santi Saez
2008-06-10 23:48:35 UTC
Post by Theodore Tso
hmm..... can you send me the output of dumpe2fs /dev/sdXX? You can
run that command while e2fsck is running, since it's read-only. I'm
curious exactly how big the filesystem is, and how many directories
are in the first part of the filesystem.
Oops... dumpe2fs takes about 3 minutes to complete and generates about a
133MB output file:

dumpe2fs 1.40.8 (13-Mar-2008)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 7701b70e-f776-417b-bf31-3693dba56f86
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal dir_index filetype sparse_super
large_file
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 792576000
Block count: 1585146848
Reserved block count: 0
Free blocks: 913341561
Free inodes: 678201512
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16384
Inode blocks per group: 512
Filesystem created: Mon Nov 13 10:12:49 2006
Last mount time: Mon Jun 9 19:37:12 2008
Last write time: Tue Jun 10 12:18:25 2008
Mount count: 37
Maximum mount count: -1
Last checked: Mon Nov 13 10:12:49 2006
Check interval: 0 (<none>)
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: afabe3f6-4405-44f4-934b-76c23945db7b
Journal backup: inode blocks
Journal size: 32M

Some example output, from groups 0 through 5, is available at:

http://pastebin.com/f5341d121
Post by Theodore Tso
How big is the filesystem(s) that you are backing up via BackupPC, in
terms of size (megabytes) and files (number of inodes)? And how many
days of incremental backups are you keeping? Also, how often do files
change? Can you give a rough estimate of how many files get modified
per backup cycle?
We are backing up several servers, about 15 in this case, each with
60-80GB of data to back up and +2-3 million inodes, keeping 15 days of
incrementals. I think about 2-3% of the files change each day, but I
will ask the backup administrator for more info.

I have found an old doc with some build info for this server; the
partition was formatted with:

# mkfs.ext3 -b 4096 -j -m 0 -O dir_index /dev/sda4
# tune2fs -c 0 -i 0 /dev/sda4
# mount -o data=writeback,noatime,nodiratime,commit=60 /dev/sda4 /backup

I'm going to fetch more info about BackupPC and the backup cycles;
thanks, Ted!!

Regards,

--
Santi Saez
Theodore Tso
2008-06-11 02:18:00 UTC
Post by Santi Saez
Post by Theodore Tso
hmm..... can you send me the output of dumpe2fs /dev/sdXX? You can
run that command while e2fsck is running, since it's read-only. I'm
curious exactly how big the filesystem is, and how many directories
are in the first part of the filesystem.
Oops... dumpe2fs takes about 3 minutes to complete and generates about a
133MB output file.
True, but it compresses well. :-) And aside from the first part
of the dumpe2fs output, the part I was most interested in could have been
summarized by simply doing a "grep directories dumpe2fs.out".

But simply looking at your dumpe2fs output, and taking an average over the
first 6 block groups which you included in the pastebin, I can
extrapolate and guess that you have about 63 million directories, out
of approximately 114 million total inodes (so about 51 million regular
files, nearly all of which have hard link counts > 1). Unfortunately,
BackupPC blows all of our memory-reduction heuristics out of the
water. I estimate you need something like 2.6GB to 3GB of memory
just for these data structures alone (not to mention 94MB for each
inode bitmap, and 188MB for each block bitmap). The good news is
that 4GB of memory should do you --- just. (I'd probably put in a bit
more physical memory just to be on the safe side, or enable swap
before running e2fsck.) The bad news is you really, REALLY need a
64-bit kernel on your system.

Because /var/cache/e2fsck is on the same disk spindle as the
filesystem you are checking, you're probably getting killed on seeks.
Moving /var/cache/e2fsck to another disk partition will help (or,
better yet, a battery-backed memory device), but the best thing you can
do is get a 64-bit kernel and not need to use the auxiliary storage in
the first place.

As far as what advice to give you: why are you running e2fsck? Was
this an advisory thing triggered by the mount count and/or the length of
time between filesystem checks? Or do you have real reason to believe the
filesystem may be corrupt?

- Ted
s***@usansolo.net
2008-06-11 08:14:45 UTC
Post by Theodore Tso
True, but it compresses well. :-) And aside from the first part
of the dumpe2fs output, the part I was most interested in could have been
summarized by simply doing a "grep directories dumpe2fs.out".
:D

"grep directories" is available at:

http://santi.usansolo.net/tmp/dumpe2fs_directories.txt.gz (317K)

Full "dumpe2fs" output compressed is 34M and available at:

http://santi.usansolo.net/tmp/dumpe2fs.txt.gz
Post by Theodore Tso
But simply looking at your dumpe2fs output, and taking an average over the
first 6 block groups which you included in the pastebin, I can
extrapolate and guess that you have about 63 million directories, out
of approximately 114 million total inodes (so about 51 million regular
files, nearly all of which have hard link counts > 1).
# grep directories dumpe2fs.txt | awk '{sum += $7} END {print sum}'
78283294
Post by Theodore Tso
BackupPC blows all of our memory-reduction heuristics out of the
water. I estimate you need something like 2.6GB to 3GB of memory
just for these data structures alone (not to mention 94MB for each
inode bitmap, and 188MB for each block bitmap). The good news is
that 4GB of memory should do you --- just. (I'd probably put in a bit
more physical memory just to be on the safe side, or enable swap
before running e2fsck.) The bad news is you really, REALLY need a
64-bit kernel on your system.
Unfortunately, I have killed the process; in 21 hours only 2.5% of the
fsck was completed ;-(

The 'scratch_files' directory has grown to 311M.

===================================================================
# time fsck -y /dev/sda4
fsck 1.40.8 (13-Mar-2008)
e2fsck 1.40.8 (13-Mar-2008)
Adding dirhash hint to filesystem.

/dev/sda4 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes

/dev/sda4: e2fsck canceled.

/dev/sda4: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda4: ********** WARNING: Filesystem still has errors **********

real 1303m19.306s
user 1079m58.898s
sys 217m10.130s
===================================================================
Post by Theodore Tso
Because /var/cache/e2fsck is on the same disk spindle as the
filesystem you are checking, you're probably getting killed on seeks.
Moving /var/cache/e2fsck to another disk partition will help (or,
better yet, a battery-backed memory device), but the best thing you can
do is get a 64-bit kernel and not need to use the auxiliary storage in
the first place.
I'm trying a quick test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o
size=2048M", but it appears that will take a long time to complete too,
so the next test will be with a 64-bit LiveCD :)
Post by Theodore Tso
As far as what advice to give you: why are you running e2fsck? Was
this an advisory thing triggered by the mount count and/or the length of
time between filesystem checks? Or do you have real reason to believe the
filesystem may be corrupt?
No, it's not related to the mount count and/or the length of time between
filesystem checks. When booting we get this error/warning:

EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
EXT3 FS on sda4, internal journal
EXT3-fs: mounted filesystem with writeback data mode.

And "tune2fs" returns that ext3 is in "clean with errors" state.. so, we
think that completing a full fsck process is a good idea; what means in
this case "clean with errors" state, running a fsck is not needed?

Thanks again for all the help and advice!!

--
Santi Saez
s***@usansolo.net
2008-06-11 11:51:17 UTC
Post by s***@usansolo.net
Post by Theodore Tso
Because /var/cache/e2fsck is on the same disk spindle as the
filesystem you are checking, you're probably getting killed on seeks.
Moving /var/cache/e2fsck to another disk partition will help (or,
better yet, a battery-backed memory device), but the best thing you can
do is get a 64-bit kernel and not need to use the auxiliary storage in
the first place.
I'm trying a quick test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o
size=2048M", but it appears that will take a long time to complete too,
so the next test will be with a 64-bit LiveCD :)
Note that putting '/var/cache/e2fsck' on a memory filesystem is
approximately 3 times faster ;-)

Some quick tests suggest that e2fsck v1.40.10 is a bit faster than
v1.40.8; does the latest version improve on this? In any case, I finally
had to cancel the process...

# ./e2fsck -nfvttC0 /dev/sda4
e2fsck 1.40.10 (21-May-2008)
Pass 1: Checking inodes, blocks, and sizes
/dev/sda4: e2fsck canceled.


/dev/sda4: ********** WARNING: Filesystem still has errors **********

Memory used: 260k/581088k (183k/78k)

Regards,

--
Santi Saez
Andreas Dilger
2008-06-11 14:59:08 UTC
Post by s***@usansolo.net
Post by s***@usansolo.net
Post by Theodore Tso
Because /var/cache/e2fsck is on the same disk spindle as the
filesystem you are checking, you're probably getting killed on seeks.
Moving /var/cache/e2fsck to another disk partition will help (or,
better yet, a battery-backed memory device), but the best thing you can
do is get a 64-bit kernel and not need to use the auxiliary storage in
the first place.
I'm trying a quick test with "mount tmpfs /var/cache/e2fsck -t tmpfs -o
size=2048M", but it appears that will take a long time to complete too,
so the next test will be with a 64-bit LiveCD :)
Note that putting '/var/cache/e2fsck' on a memory filesystem is
approximately 3 times faster ;-)
...but, isn't the problem that you don't have enough RAM? Using tdb+ramfs
isn't going to be faster than using the RAM directly.

I suspect that the only way you are going to check this filesystem efficiently
is to boot a 64-bit kernel (even just from a rescue disk), set up some swap
just in case, and run e2fsck from there.
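
For example, from the rescue environment, with /dev/sdb1 standing in for
whatever spare device is actually available for swap:

# mkswap /dev/sdb1
# swapon /dev/sdb1
# fsck.ext3 -C 0 /dev/sda4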

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Bryan Kadzban
2008-06-11 16:49:04 UTC
Post by Andreas Dilger
Post by s***@usansolo.net
Post by s***@usansolo.net
Post by Theodore Tso
Moving /var/cache/e2fsck to another disk partition will help (or,
better yet, a battery-backed memory device), but the best thing you
can do is get a 64-bit kernel and not need to use the auxiliary
storage in the first place.
I'm trying a quick test with "mount tmpfs /var/cache/e2fsck -t tmpfs
-o size=2048M", but it appears that will take a long time to complete
too, so the next test will be with a 64-bit LiveCD :)
Note that putting '/var/cache/e2fsck' on a memory filesystem is
approximately 3 times faster ;-)
...but, isn't the problem that you don't have enough RAM? Using
tdb+ramfs isn't going to be faster than using the RAM directly.
It won't be faster, no, but it will be faster than tdb-on-disk, and much
faster than tdb on the same disk as the one that's being checked.

And it *might* allow e2fsck to allocate all the virtual memory that it
needs, depending on how the tmpfs driver works. If tmpfs uses the same
VA space as e2fsck and the rest of the kernel, then it probably won't
help. But if tmpfs can use a different pool somehow (whether that's
because the kernel uses a different set of pagetables, or whatever),
then it might.
Post by Andreas Dilger
I suspect that the only way you are going to check this filesystem
efficiently is to boot a 64-bit kernel (even just from a rescue disk),
set up some swap just in case, and run e2fsck from there.
And try to run a 64-bit e2fsck binary, too. The virtual address space
usage estimate that someone (Ted?) came up with earlier in this thread
was close to 4G, which means that even with a 64-bit kernel, a 32-bit
e2fsck binary might still run out of virtual address space. (It will
need to map lots of disk, plus any real RAM usage, plus itself and any
libraries. That last bit *might* push it over 4G, depending on how
accurate the estimate of 4G turns out to be.)

The easiest way to do this is probably to run the e2fsck from the LiveCD
itself; don't try to run the 32-bit version that the system has
installed. That version *might* work, but it'll be tight; a 64-bit
version that can use 40-odd bits in its virtual addresses (44? 48? I
think it depends on the exact CPU model -- and the kernel, of course)
will have a *lot* more headroom.
Theodore Tso
2008-06-12 05:24:29 UTC
Post by Andreas Dilger
Post by s***@usansolo.net
Note that putting '/var/cache/e2fsck' on a memory filesystem is
approximately 3 times faster ;-)
...but, isn't the problem that you don't have enough RAM? Using tdb+ramfs
isn't going to be faster than using the RAM directly.
Tmpfs is swap backed, if swap has been configured. So it can help.

Another possibility is to use a statically linked e2fsck, since the
shared libraries chew up a lot of VM address space. But in this
particular case, it probably wouldn't be enough.
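
For instance, the shared libraries that get mapped into e2fsck's address
space can be listed with (assuming the usual Debian path):

# ldd /sbin/e2fsck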

I think the best thing to do in this case is to use a 64-bit kernel and a
64-bit compiled e2fsck binary.

- Ted

Andreas Dilger
2008-06-09 21:50:32 UTC
Post by s***@usansolo.net
Here's the scenario: a +6TB device on a 3ware 9550SX RAID controller,
running 32-bit Debian Etch with kernel 2.6.25.4 and the default e2fsprogs
version, "1.39+1.40-WIP-2006.11.14+dfsg-2etch1".
Running "tune2fs" shows that the filesystem is in the EXT3_ERROR_FS state,
"clean with errors":
# tune2fs -l /dev/sda4
tune2fs 1.40.10 (21-May-2008)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 7701b70e-f776-417b-bf31-3693dba56f86
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal dir_index filetype needs_recovery
sparse_super large_file
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 792576000
Block count: 1585146848
It's a backup storage server with more than 113 million files; this is the
output of "df -i":
# df -i /backup/
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda4 792576000 113385959 679190041 15% /backup
# fsck.ext3 /dev/sda4
e2fsck 1.40.10 (21-May-2008)
Adding dirhash hint to filesystem.
/dev/sda4 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
I recall that e2fsck allocates on the order of 3 * block_count / 8 bytes,
and 5 * inode_count / 8 bytes, so in your case this is about:

(3 * 1585146848 + 5 * 792576000) / 8 = 1089790068 bytes = 1.0GB

at a minimum, but my estimates might be incorrect.
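
The same arithmetic, redone on the box itself:

# echo '(3 * 1585146848 + 5 * 792576000) / 8' | bc
1089790068
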
Post by s***@usansolo.net
mmap2(NULL, 99074048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x404fa000
Judging by the return values of these functions, this is a 32-bit system,
and it is entirely possible that you are exceeding the per-process memory
allocation limit.
Post by s***@usansolo.net
mmap2(NULL, 748892160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x63be2000
mmap2(NULL, 1866240000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = -1 ENOMEM (Cannot allocate memory)
Hmm, it seems a bit excessive to allocate 1.8GB in a single chunk.
Post by s***@usansolo.net
Error allocating directory block array: Memory allocation failed
e2fsck: aborted
This message is a bit tricky to nail down because it doesn't exist anywhere
in the code directly. It is encoded into "e2fsck abbreviations", and
the expansion that is normally in the corresponding comment is different.
It is PR_1_ALLOCATE_DBCOUNT returned from the call chain:
ext2fs_init_dblist->
make_dblist->
ext2fs_get_num_dirs()

which is counting the number of directories in the filesystem and allocating
two 12-byte array elements for each one. This implies you have 77M directories
in your filesystem, or an average of only 10 files per directory?
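
That figure can be read straight off the failed allocation in the strace,
assuming the whole 1866240000-byte mmap is those two 12-byte arrays:

# echo '1866240000 / (2 * 12)' | bc
77760000
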
Post by s***@usansolo.net
It appears that fsck is trying to use more than 2GB of memory to store the
inode table relationships. The system has 4GB of physical RAM and 4GB of
swap; is there any way to limit the memory used by fsck, or any other
solution to check this filesystem?
I don't know offhand how important the dblist structure is, so I'm not
sure if there is a way to reduce the memory usage for it. I believe
that in low-memory situations it is possible to use tdb in newer versions
of e2fsck for the dblist, but I don't know much of the details.
Post by s***@usansolo.net
Would running fsck from a 64-bit LiveCD solve the problem?
Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
for e2fsck and be able to check the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Carlo Wood
2008-06-09 22:08:56 UTC
Post by Andreas Dilger
Post by s***@usansolo.net
Would running fsck from a 64-bit LiveCD solve the problem?
Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
for e2fsck and be able to check the filesystem.
We had a similar problem with ext3grep. You have to realize that every
mmap uses memory address space, even if it's a map to disk. Therefore,
on a 32-bit machine, if the total of all normal allocations plus all
simultaneous mmaps exceeds 4GB, then you "run out of memory", even if,
say, only 1GB is really allocated and >3GB of the disk is mmap-ed.

In that case a 64-bit machine would solve the problem, because then all
RAM (2GB, I read in the Subject) can be used for normal allocations
while any disk mmap has a cazillion bytes of address space left for
itself.
--
Carlo Wood <***@alinoe.com>
Theodore Tso
2008-06-09 22:37:36 UTC
Post by Andreas Dilger
This message is a bit tricky to nail down because it doesn't exist anywhere
in the code directly. It is encoded into "e2fsck abbreviations", and
the expansion that is normally in the corresponding comment is different.
ext2fs_init_dblist->
make_dblist->
ext2fs_get_num_dirs()
which is counting the number of directories in the filesystem and allocating
two 12-byte array elements for each one. This implies you have 77M directories
in your filesystem, or an average of only 10 files per directory?
There are a number of backup solutions that use hardlinks to conserve
space between incremental snapshots. So yeah, with these workloads
you'll see something like 80-85M inodes, of which 77M-odd will be
directories. When you combine the vast number of directories used by
these filesystems with the fact that e2fsck tries to optimize memory
use by assuming that on most normal filesystems most files have a link
count of 1 (which is NOT true on these filesystems used for backups),
e2fsck's tricks to optimize for speed by caching information to avoid
re-reading it from disk end up costing a large amount of memory.
Post by Andreas Dilger
I don't know offhand how important the dblist structure is, so I'm not
sure if there is a way to reduce the memory usage for it. I believe
that in low-memory situations it is possible to use tdb in newer versions
of e2fsck for the dblist, but I don't know much of the details.
Yep, please see the [scratch_files] section in e2fsck.conf. It is
described in the e2fsck.conf(5) man page.

- Ted
Andreas Dilger
2008-06-09 22:57:59 UTC
Post by Theodore Tso
Post by Andreas Dilger
I don't know offhand how important the dblist structure is, so I'm not
sure if there is a way to reduce the memory usage for it. I believe
that in low-memory situations it is possible to use tdb in newer versions
of e2fsck for the dblist, but I don't know much of the details.
Yep, please see the [scratch_files] section in e2fsck.conf. It is
described in the e2fsck.conf(5) man page.
Hmm, maybe if the ext2fs_init_dblist() function returns PR_1_ALLOCATE_DBCOUNT,
this should be a user-fixable problem that asks whether the user wants to
use an on-disk tdb file in /var/tmp, and, if the answer is "no", points
them at the right section in /etc/e2fsck.conf?

I don't think it is reasonable to default to using /tmp, because it might
be a RAM-backed filesystem, and I suspect in most cases the root filesystem
will not run out of memory in this way... Even if it fails because /var/tmp
is read-only, or too small, it is no worse off than it is today.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Greg Trounson
2008-06-10 03:36:52 UTC
...
Post by Andreas Dilger
Post by s***@usansolo.net
Would running fsck from a 64-bit LiveCD solve the problem?
Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
for e2fsck and be able to check the filesystem.
Couldn't you achieve the same thing just by enabling PAE on your 32-bit kernel?

Greg
Theodore Tso
2008-06-10 13:18:28 UTC
Post by Greg Trounson
...
Post by Andreas Dilger
Post by s***@usansolo.net
Would running fsck from a 64-bit LiveCD solve the problem?
Yes, I suspect with a 64-bit kernel you could allocate the full 4GB of RAM
for e2fsck and be able to check the filesystem.
Couldn't you achieve the same thing just by enabling PAE on your 32-bit kernel?
No, that doesn't increase the amount of address space available to the
user process, which is the limitation here. You can have 16GB of
physical memory, but 2**32 is still 4GB, and the kernel needs address
space too, so userspace will have at most 3GB of space to itself.

- Ted