[NILFS users] NILFS hanging SLES 11 - advice on diagnosis needed
Barham, David
david.barham at siemens.com
Fri Oct 2 19:11:14 JST 2009
Hi
I'm running SLES 11 (kernel 2.6.27.19-5-default) with NILFS2 (nilfs-2.0.16). I have a 1.5 TB NILFS2 partition which I am setting up as a target for Robocopy runs from various PCs via Samba. The Robocopy scripts run nightly and a checkpoint is taken once a night. A script stops Samba, unmounts the previous week's checkpoint, deletes that checkpoint, creates a new one, mounts it and restarts Samba. This should mean that at any time the user can go back to 'snapshot_{DAY}' to get their files back.
So far so good.
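For reference, the nightly rotation described above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual script: the device `/dev/sdb1`, the mount point and the Samba init script path are hypothetical, and it assumes the standard nilfs-utils tools (`chcp`, `rmcp`, `mkcp`, `lscp`) are available. It defaults to a dry run that only prints the commands.

```python
import subprocess

DEVICE = "/dev/sdb1"                    # hypothetical NILFS2 partition
MOUNT_POINT = "/backup/snapshot_Fri"    # hypothetical mount for 'snapshot_{DAY}'
SAMBA_INIT = "/etc/init.d/smb"          # assumed location of the Samba init script


def teardown_commands(device, mount_point, old_cno):
    """Commands that retire last week's snapshot and take a new one."""
    return [
        [SAMBA_INIT, "stop"],
        ["umount", mount_point],
        ["chcp", "cp", device, str(old_cno)],  # turn the snapshot back into a plain checkpoint
        ["rmcp", device, str(old_cno)],        # delete that checkpoint
        ["mkcp", "-s", device],                # create a new checkpoint as a snapshot
    ]


def latest_snapshot(lscp_output):
    """Return the highest checkpoint number found in `lscp -s` output."""
    numbers = []
    for line in lscp_output.splitlines():
        fields = line.split()
        # Data rows start with the checkpoint number; the header row does not.
        if fields and fields[0].isdigit():
            numbers.append(int(fields[0]))
    if not numbers:
        raise ValueError("no snapshots listed")
    return max(numbers)


def bringup_commands(device, mount_point, cno):
    """Commands that mount the given snapshot read-only and restart Samba."""
    return [
        ["mount", "-t", "nilfs2", "-r", "-o", f"cp={cno}", device, mount_point],
        [SAMBA_INIT, "start"],
    ]


def rotate(device, mount_point, old_cno, dry_run=True):
    """Run (or, in a dry run, just print) the whole rotation."""
    def run(cmd):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

    for cmd in teardown_commands(device, mount_point, old_cno):
        run(cmd)
    if dry_run:
        new_cno = "<new-cno>"  # not known until mkcp has really run
    else:
        listing = subprocess.run(["lscp", "-s", device], check=True,
                                 capture_output=True, text=True).stdout
        new_cno = latest_snapshot(listing)
    for cmd in bringup_commands(device, mount_point, new_cno):
        run(cmd)
```

Calling `rotate(DEVICE, MOUNT_POINT, old_cno=12)` prints the command sequence without touching the system; the checkpoint number 12 is made up for the example.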
However, while copying the previously backed-up files over from the old Linux machine where I used to do this (which only gave a 'current' copy, on reiserfs), I'm finding that the new machine occasionally hangs. The OS just locks up: the console screen is frozen, but the host still responds to ping.
I'm trying to work out what is causing the hang. I'm getting various messages in the log from smartd relating to the disk which houses the NILFS partition, along the lines of:
Oct 2 09:56:59 cpli6008 syslog-ng[1933]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=947', processed='center(received)=478', processed='destination(newsnotice)=0', processed='destination(acpid)=0', processed='destination(firewall)=0', processed='destination(mail)=12', processed='destination(mailinfo)=12', processed='destination(console)=151', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=466', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=0', processed='destination(mailerr)=0', processed='destination(xconsole)=151', processed='destination(warn)=155', processed='source(src)=478'
Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 110 to 112
Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 117
Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 189 High_Fly_Writes changed from 88 to 87
Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 61
Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 39
Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 51
{machine stops responding and gets power cycled}
Oct 2 10:10:58 cpli6008 syslog-ng[1948]: syslog-ng starting up; version='2.0.9'
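To make the smartd entries easier to compare across several hangs, they could be pulled out of the log with something like the sketch below. The regular expression is an assumption based only on the lines quoted above, and the function name is made up for illustration. As far as I understand, the numbers smartd logs here are the normalized attribute values rather than raw counts, so they show a trend rather than an absolute figure.

```python
import re

# Pattern inferred from the smartd lines quoted above (an assumption, not a spec).
SMARTD_CHANGE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+?)(?: \[[^\]]+\])?, "
    r"SMART (?P<kind>Usage|Prefailure) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_smartd_changes(lines):
    """Return one dict per smartd 'changed from X to Y' log line."""
    changes = []
    for line in lines:
        match = SMARTD_CHANGE.search(line)
        if match is None:
            continue  # not a smartd attribute-change line
        changes.append({
            "device": match.group("dev"),
            "kind": match.group("kind"),
            "id": int(match.group("id")),
            "name": match.group("name"),
            "old": int(match.group("old")),
            "new": int(match.group("new")),
        })
    return changes


if __name__ == "__main__":
    sample = [
        "Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], "
        "SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 117",
        "Oct 2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], "
        "SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 51",
    ]
    for change in parse_smartd_changes(sample):
        if change["kind"] == "Prefailure":
            print(change["device"], change["name"], change["old"], "->", change["new"])
```

Filtering on the Prefailure attributes first seems a reasonable starting point, since those are the ones smartd treats as indicators of impending failure.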
Do folks think that the hang is caused by NILFS, or by dodgy hardware (or dodgy reporting from smartd)? Is there any advice on getting some debug or status information from NILFS to help show it isn't the cause of the problem? I would have expected that if it went bang I'd have seen something 'worrying' in the log.
For information the hardware is a Dell Precision 380.
Many thanks
David Barham
Siemens PLM Software