Friday, May 11, 2012

Scratch Pad for Understanding how to enforce fsck.

Each filesystem tracks two values, Mount Count and Maximum Mount Count:
If Mount Count >= Maximum Mount Count, fsck runs a full check on the next bootup.
To change the maximum mount count, use the tune2fs utility:
$ tune2fs -c COUNT /dev/sdaX
Different devices can have different maximum mount count values.
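
A quick way to see where a filesystem stands is to pull both counters out of the tune2fs -l listing. This is only a sketch: the helper name mount_count_check_due is made up for this note, and it assumes the usual "Mount count:" and "Maximum mount count:" lines in the listing.

```shell
#!/bin/sh
# Hypothetical helper: reads `tune2fs -l` style output on stdin and reports
# whether the count-based check would be due on the next boot.
mount_count_check_due() {
    awk -F': *' '
        /^Mount count:/         { count = $2 + 0 }
        /^Maximum mount count:/ { max = $2 + 0 }
        END {
            # A maximum of 0 or -1 means count-based checking is disabled.
            if (max > 0 && count >= max) print "due"
            else print "not due"
        }'
}

# Usage (needs root):
#   tune2fs -l /dev/sdaX | mount_count_check_due
```

Reading the listing instead of the raw superblock keeps the sketch independent of the on-disk layout.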

The boot-time check only happens for filesystems whose last field in /etc/fstab (the fsck pass number) is non-zero. For the rootfs entry this value should be 1, like:

/dev/hda2  /  ext2  defaults  1 1
For all other devices this last field should be 2 if you want fsck to consider them at boot, else 0. A non-zero value alone does not force a check: fsck will still wait for the mount count to reach the maximum mount count on each of these devices.
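
To see which entries fsck will look at during boot, the last field can be listed straight from the fstab. A minimal sketch, assuming a whitespace-separated fstab; the function name fsck_pass_numbers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads fstab-formatted text on stdin and prints
# "device mountpoint passno" for every entry.
fsck_pass_numbers() {
    awk '
        /^[[:space:]]*#/ || NF == 0 { next }   # skip comments and blank lines
        {
            pass = (NF >= 6) ? $6 : 0           # a missing sixth field counts as 0
            print $1, $2, pass
        }'
}

# Usage:
#   fsck_pass_numbers < /etc/fstab
```

Entries printed with 0 are never checked at boot, whatever their mount count says.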

Problem :

Damage the file system just enough that:
On next bootup - fsck runs.
And fixes the problem.

The first problem is to damage the filesystem. Every filesystem partition has a superblock. This superblock contains metadata about the filesystem, such as the block size, the inode table size and location, the free block information and the size of the block groups. The filesystem is smart enough to keep multiple copies of this superblock, so that it can still recover if one gets corrupted.

To find the locations of the superblock copies, we use the dumpe2fs utility:
$ dumpe2fs /dev/sdaX | grep superblock


  Primary superblock at 1, Group descriptors at 2-9
  Backup superblock at 8193, Group descriptors at 8194-8201
  Backup superblock at 24577, Group descriptors at 24578-24585
  Backup superblock at 40961, Group descriptors at 40962-40969
  Backup superblock at 57345, Group descriptors at 57346-57353
  Backup superblock at 73729, Group descriptors at 73730-73737
  Backup superblock at 204801, Group descriptors at 204802-204809
  Backup superblock at 221185, Group descriptors at 221186-221193
  Backup superblock at 401409, Group descriptors at 401410-401417
  Backup superblock at 663553, Group descriptors at 663554-663561
  Backup superblock at 1024001, Group descriptors at 1024002-1024009
  Backup superblock at 1990657, Group descriptors at 1990658-1990665
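
For scripting, the backup locations can be cut out of that listing. A small sketch, assuming the "Backup superblock at N," line format shown above; backup_superblocks is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: reads dumpe2fs output on stdin and prints the block
# number of every backup superblock, one per line.
backup_superblocks() {
    awk '/Backup superblock at/ { sub(/,$/, "", $4); print $4 }'
}

# Usage (needs root), printing only the first backup:
#   dumpe2fs /dev/sdaX 2>/dev/null | backup_superblocks | head -n 1
```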

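Now it's easy to corrupt the primary superblock (block 1) of the filesystem on /dev/sdaX.
Run:
dd if=/dev/zero of=/dev/sdaX bs=1024 count=1 seek=1
Essentially, we are overwriting one 1k block with zeros after seeking to the location of block 1 on the output device. It has to be seek= (which positions the output) and not skip= (which positions the input). The listing above comes from a filesystem with 1k blocks, where the primary superblock is block 1; it always sits 1024 bytes into the partition, so check your filesystem block size before you adapt this to another block.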
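
Since a wrong offset here destroys the wrong data, it helps to print the dd command and the byte range it would touch before running anything. This is a dry-run sketch only: zero_block_cmd is a hypothetical name, the block size must be given in plain bytes, and it relies on dd positioning its output with seek=.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints (does not run) the dd command that
# would zero filesystem block BLK of size BS bytes on DEV.
zero_block_cmd() {
    dev=$1
    bs=$2      # block size in bytes, e.g. 1024 (not "1k")
    blk=$3     # block number to overwrite
    # seek= positions the OUTPUT; conv=notrunc keeps dd from truncating
    # the target when it is an image file rather than a block device.
    echo "dd if=/dev/zero of=$dev bs=$bs count=1 seek=$blk conv=notrunc"
    echo "# overwrites bytes $((bs * blk)) to $((bs * blk + bs - 1)) of $dev"
}

# Usage:
#   zero_block_cmd /dev/sdaX 1024 1
```

Trying the printed command on a scratch image file first is a lot cheaper than trying it on a real partition.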

Now, if we rebooted the system and tried to mount /dev/sdaX, the mount would fail unless fsck had already corrected the damage.
If we had to correct the issue manually, we could replace the superblock with a backup copy. fsck will do it for us:

fsck -b 8193 /dev/sdaX
and now we can mount it.

If we instead just ran fsck -f /dev/sdaX, forcing a check of /dev/sdaX, it would do the same thing, though it asks a lot of questions. To answer yes to all of them,
run fsck -f -y /dev/sdaX and the filesystem should be recovered.

The problem we still face is:
what exactly should be done to force this check during bootup? The check should only happen if the mount fails. fsck should not run at every bootup, trying to fix a problem that has not occurred.
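
One possible shape for "check only if the mount failed" is a wrapper that tries the mount first and falls back to fsck. This is a sketch of the idea, not a finished boot script: mount_or_repair and the MOUNT/FSCK variables are made up here (the variables only exist so the commands can be swapped out), and where such a wrapper would hook into the boot sequence is left open.

```shell
#!/bin/sh
# Hypothetical wrapper: mount DEV on MNT, and run fsck only if that fails.
MOUNT=${MOUNT:-mount}
FSCK=${FSCK:-fsck}

mount_or_repair() {
    dev=$1
    mnt=$2
    # First attempt: a healthy filesystem mounts and fsck never runs.
    if "$MOUNT" "$dev" "$mnt"; then
        return 0
    fi
    # Mount failed: repair non-interactively, then retry the mount once.
    rc=0
    "$FSCK" -f -y "$dev" || rc=$?
    # fsck exit codes 0 and 1 mean "no errors" and "errors corrected".
    if [ "$rc" -gt 1 ]; then
        echo "fsck could not fully repair $dev (exit code $rc)" >&2
        return "$rc"
    fi
    "$MOUNT" "$dev" "$mnt"
}

# Usage (needs root):
#   mount_or_repair /dev/sdaX /mnt
```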
.....



Thursday, May 10, 2012

NTP Errors.

I had a hard time fixing an NTP bug that I came across.

The setup had the hardware clock and the software clock in sync.
We configured the NTP servers for the machine, and within a few minutes the system time would jump back by around 7 hours.
Running ntpq -p showed:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp-1****.      .GPS.            1 u   28   64    3   49.992  2611013 2611013
 ntp-2****.      IP Address       2 u   51   64    3  570.113  2611038 2611039

Both the offset and the jitter are high. I was not sure whether one is caused by the other, what the issues might be if one of them is high and the other is not, and whether I could run into the same problems in that case.
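
To keep an eye on this while experimenting, the peers with a large offset can be filtered out of the ntpq -p listing. A rough sketch: high_offset_peers is a hypothetical name, the 128 ms default is ntpd's usual step threshold, and the columns are read from the right-hand side because the refid column may contain spaces.

```shell
#!/bin/sh
# Hypothetical helper: reads `ntpq -p` output on stdin and prints
# "remote offset jitter" for peers whose |offset| exceeds LIMIT ms.
high_offset_peers() {
    limit=${1:-128}
    awk -v limit="$limit" '
        NR > 2 && NF >= 10 {           # skip the two header lines
            off = $(NF - 1) + 0        # offset is the second-to-last column
            if (off < 0) off = -off
            if (off > limit) print $1, $(NF - 1), $NF
        }'
}

# Usage:
#   ntpq -p | high_offset_peers 128
```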

I then tried a fresh machine: I set the hardware clock back by 7 hours, synced the system clock to the same time, and tried the same NTP servers (I could have changed only the system time). I saw a large offset value and no jitter. Within 5 minutes the clock was synced to NTP.

If jitter is the problem, we need to find out why it might get high.
There could be several reasons for high jitter:
1. I have yet to find out whether jitter can be high because the NTP servers are reporting different offset values.
2. It can be because the system tick is too low or too high. Pasky's Blog gives a good explanation of this.

Related info:
The log files are in /var/log/daemon.log
It ended in this condition: time correction of 26108 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time
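
To put that log line in perspective, the correction can be converted to hours and compared with the limit. A small sketch: describe_offset is a made-up helper, and the 1000 second limit is simply the value from the message above.

```shell
#!/bin/sh
# Hypothetical helper: pretty-prints a time correction given in seconds and
# compares it with the sanity limit quoted in the ntpd log message.
SANITY_LIMIT=1000

describe_offset() {
    secs=$1
    h=$((secs / 3600))
    m=$((secs % 3600 / 60))
    s=$((secs % 60))
    if [ "$secs" -gt "$SANITY_LIMIT" ]; then
        verdict="exceeds"
    else
        verdict="is within"
    fi
    echo "${h}h ${m}m ${s}s $verdict the ${SANITY_LIMIT}s sanity limit"
}

# Usage:
#   describe_offset 26108
```

26108 seconds comes to a bit over 7 hours, which lines up with the jump seen in the system time. ntpd's -g option lets it make one correction larger than this limit at startup instead of giving up.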