[NILFS users] NILFS hanging SLES 11 - advice on diagnosis needed

Barham, David david.barham at siemens.com
Fri Oct 2 19:11:14 JST 2009


Hi
I'm running SLES 11 (kernel 2.6.27.19-5-default) with NILFS2 (nilfs-2.0.16). I have a 1.5 TB NILFS2 partition, which I am setting up as the target for Robocopy runs from various PCs via Samba. The Robocopy scripts run nightly and a checkpoint is taken once a night. A script stops Samba, unmounts the previous week's checkpoint, deletes that checkpoint, creates a new one, mounts it and then restarts Samba. This should mean that at any time the user can go back to 'snapshot_{DAY}' to get their files back.
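
To make the rotation concrete, here is a rough sketch of the command sequence for one night, written as a dry run that only prints the commands. It is not the actual script: the device path, mount point, checkpoint numbers and the Samba init script path are all assumptions for illustration, and in a real run the new checkpoint number would have to be read back (for example from lscp) after mkcp has created it. The sketch also assumes the old snapshot is demoted to a plain checkpoint with chcp before rmcp removes it.

```python
from typing import List


def rotation_commands(device: str, mount_point: str,
                      old_cno: int, new_cno: int) -> List[List[str]]:
    """Build (but do not run) the nightly snapshot rotation commands.

    All arguments are hypothetical example values; new_cno stands in for
    the checkpoint number that mkcp would actually create.
    """
    return [
        ["/etc/init.d/smb", "stop"],              # stop Samba (path assumed)
        ["umount", mount_point],                  # unmount last week's snapshot
        ["chcp", "cp", device, str(old_cno)],     # demote snapshot to checkpoint
        ["rmcp", device, str(old_cno)],           # delete the old checkpoint
        ["mkcp", "-s", device],                   # create a new snapshot
        ["mount", "-t", "nilfs2", "-r", "-o", f"cp={new_cno}",
         device, mount_point],                    # mount it read-only
        ["/etc/init.d/smb", "start"],             # restart Samba
    ]


if __name__ == "__main__":
    # Dry run with made-up values: print each command instead of executing it.
    for cmd in rotation_commands("/dev/sdb1", "/backup/snapshot_Fri", 120, 187):
        print(" ".join(cmd))
```

Printing the plan rather than executing it keeps the sketch safe to try on a machine without NILFS; each list could be passed to subprocess.run once the values are known to be right.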

So far so good.

However, as I copy the previously backed-up files over from the old Linux machine where I was doing this (which only gave a 'current' copy, on reiserfs), I'm finding that the new machine occasionally hangs. The OS just locks up: the console screen is frozen, but the host still responds to ping.

I'm trying to work out what is causing the hang. I'm getting various messages in the log from smartd relating to the disk which houses the NILFS partition, along the lines of:

 Oct  2 09:56:59 cpli6008 syslog-ng[1933]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=947', processed='center(received)=478', processed='destination(newsnotice)=0', processed='destination(acpid)=0', processed='destination(firewall)=0', processed='destination(mail)=12', processed='destination(mailinfo)=12', processed='destination(console)=151', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=466', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=0', processed='destination(mailerr)=0', processed='destination(xconsole)=151', processed='destination(warn)=155', processed='source(src)=478'
Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 110 to 112
Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 117
Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 189 High_Fly_Writes changed from 88 to 87
Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 61
Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 39
Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 51

{machine stops responding and gets power cycled}

Oct  2 10:10:58 cpli6008 syslog-ng[1948]: syslog-ng starting up; version='2.0.9'

Do folks think that the hang is NILFS or dodgy hardware/reporting from smartd? Is there any advice on getting some debug or status information from NILFS to help show that it isn't the cause of the problem? I would have expected that if it went bang I'd have seen something 'worrying' in the log.
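
One way to separate the hardware question from the NILFS one is to pull only the smartd attribute changes out of the log and look at the Prefailure ones on their own. The following is a minimal sketch, assuming the smartd line format shown in the excerpt above; the function name and the regular expression are my own illustration, not part of smartd or NILFS.

```python
import re
from typing import Iterable, List, Tuple

# Matches smartd "attribute changed" lines in the format quoted above.
SMARTD_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+) .*?"
    r"SMART (?P<kind>Prefailure|Usage) Attribute: (?P<id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)


def prefailure_changes(lines: Iterable[str]) -> List[Tuple[str, str, int, int]]:
    """Return (device, attribute, old, new) for each Prefailure change."""
    found = []
    for line in lines:
        m = SMARTD_RE.search(line)
        if m and m.group("kind") == "Prefailure":
            found.append((m.group("dev"), m.group("name"),
                          int(m.group("old")), int(m.group("new"))))
    return found


if __name__ == "__main__":
    # Two lines copied from the log excerpt in this message.
    sample = [
        "Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], "
        "SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 117",
        "Oct  2 09:57:25 cpli6008 smartd[3473]: Device: /dev/sdb [SAT], "
        "SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 39",
    ]
    for change in prefailure_changes(sample):
        print(change)
```

This only filters what smartd already reported; it says nothing by itself about whether a given change is serious.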

For information, the hardware is a Dell Precision 380.

Many thanks
David Barham
Siemens PLM Software



