A few imaginative titles:
- strangled for space by a ghost
- something that does not exist but occupies space
- Ghost files
I can do this all day long... :D
The story:
Ghost story: the beginning !!
(like Baahubali: The Beginning...)
RabbitMQ started doing this every day !!
2021-04-19 10:03:03.361 [info] <0.438.0> Free disk space is insufficient. Free bytes: 47263744. Limit: 50000000
2021-04-19 10:03:03.362 [warning] <0.434.0> disk resource limit alarm set on node 'rabbit@ip-xx-xx-xx-xxx'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
2021-04-19 10:10:33.505 [info] <0.438.0> Free disk space is sufficient. Free bytes: 887422976. Limit: 50000000
2021-04-19 10:10:33.505 [warning] <0.434.0> disk resource limit alarm cleared on node 'rabbit@ip-xx-xx-xx-xxx'
2021-04-19 10:10:33.505 [warning] <0.434.0> disk resource limit alarm cleared across the cluster
Result:
hush, (sigh) !
All producers of the queues crumbled under this embargo !
Mass service failure.
Much like the second wave of COVID-19 in India, around the time this was written (29 Apr 2021)
First Investigations:
grep --> "Free disk space is insufficient"
in the RabbitMQ logs
Coincidentally, the timings were very close:
2021-04-19 10:03:03.361 [info] <0.438.0> Free disk space is insufficient. Free bytes: 47263744. Limit: 50000000
2021-04-19 10:30:51.071 [info] <0.438.0> Free disk space is insufficient. Free bytes: 49147904. Limit: 50000000
2021-04-20 12:35:05.771 [info] <0.440.0> Free disk space is insufficient. Free bytes: 48472064. Limit: 50000000
2021-04-22 04:51:43.720 [info] <0.440.0> Free disk space is insufficient. Free bytes: 978644992. Limit: 1000000000
2021-04-23 10:29:39.702 [info] <0.434.0> Free disk space is insufficient. Free bytes: 3999858688. Limit: 4000000000
2021-04-26 04:43:31.479 [info] <0.434.0> Free disk space is insufficient. Free bytes: 3999997952. Limit: 4000000000
- A cron queuing up so many messages that it was taking up too much space on the disk !!
(The inherent assumption here was that RabbitMQ was the one hogging the space, since it was the one making all the noise.)
Also, our understanding of the disk resource limit alarm, based on a quick skim of the documentation, was flawed.
All our crons were scattered, so it was time-consuming to bring them all into one place and correlate them with the spikes in disk consumption. So we needed intermediate relief while we were scouring all the crons..
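(An aside: the Limit in the grep output above climbs from 50000000 to 1000000000 to 4000000000 bytes. That threshold is RabbitMQ's disk_free_limit; for reference, a minimal sketch of how it can be set, with purely illustrative values:)
# transient change, reverts on broker restart
sudo rabbitmqctl set_disk_free_limit 4GB
# or permanently, in rabbitmq.conf:
#   disk_free_limit.absolute = 4GB
#   disk_free_limit.relative = 1.5   # alternative: a multiple of the machine's RAM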
So well ?!
What did we do ?
We put rabbitmq under house arrest.
How you ask ?
http://rabbitmq-host:port/api/nodes/rabbit@ip-xx-xx-xx-xxx/memory
- monitoring API called at a 10-second interval and collected:
- mnesia disk consumption for 1.5 days...
To our surprise..
It never went above a few MBs
and
Still the free space param on the node health monitoring api :
http://127.0.0.1:45672/api/nodes/rabbit@ip-xx-xx-xx-xxx
kept on shrinking !! Someone was bullying our Rabbit (MQ)
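For the record, a minimal sketch of the kind of polling involved, assuming the management plugin is enabled and jq is installed; the host, port, node name, credentials and JSON field names below are placeholders and best recollection, not gospel:
#!/usr/bin/env bash
# poll the RabbitMQ management API every 10 seconds and append
# the mnesia memory footprint and the node's free disk space to a log
HOST="rabbitmq-host:15672"        # placeholder
NODE="rabbit@ip-xx-xx-xx-xxx"
CREDS="user:password"             # placeholder
while true; do
  ts=$(date '+%F %T')
  mnesia=$(curl -s -u "$CREDS" "http://$HOST/api/nodes/$NODE/memory" | jq '.memory.mnesia')
  disk_free=$(curl -s -u "$CREDS" "http://$HOST/api/nodes/$NODE" | jq '.disk_free')
  echo "$ts mnesia_bytes=$mnesia disk_free_bytes=$disk_free" >> rabbitmq-watch.log
  sleep 10
done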
So further down the rabbit hole we went..
df -TH
- agreed with the RabbitMQ node health api and the RabbitMQ monitoring dashboard
("df" and and "the api" did get physical over one minor difference:
in converting bytes to GB with 10^9 or (1024)^3; still they agreed
such immature people I tell you... poh-té-to poh-ta-to)
("df" and and "the api" did get physical over one minor difference:
in converting bytes to GB with 10^9 or (1024)^3; still they agreed
such immature people I tell you... poh-té-to poh-ta-to)
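(For the curious, the gap is small but real; a quick check, assuming nothing fancier than bash and bc:)
echo $(( 500 * 1000**3 ))                # 500 "GB" the 10^9 way = 500000000000 bytes
echo "scale=2; 500*10^9 / 1024^3" | bc   # the same bytes read as ~465.66 GiB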
But I digress.
The space was indeed shrinking..
the walls were indeed closing in on our little Rabbit(MQ)
But,
sudo du -sh /* --exclude some-other-partition-mounted-on-root
- had a different story to tell.. Much like government reported figures...
"du" told us that nothing was amiss.. nobody was amassing huge amounts of disk space
("No damaad was taking up inordinate amount of prime disk real estate")
("Nobody is dying of corona due to scarcity of medicines and oxygen")
("Nobody is dying of corona due to scarcity of medicines and oxygen")
We decided to believe the little guy who was getting squeezed.
So who is writing to disk anyway ?
sudo iotop -ao
comes to the rescue (-a shows accumulated I/O since iotop started, -o shows only processes actually doing I/O).
Like an honest police officer,
It showed us that one of our own was writing to disk too fast..
Let's call the process "Birju"
(why Birju ? So that the writer can segue into Mother India references, duh !)
and like Mother India we killed the poor bastard !!
Lo and behold !
All the disk space he had confiscated from "Sukhilala" was returned to the free disk world.
But that did not solve the problem ! We cannot kill our "Birju" every day; we are not that kind of Mother India :D
And technically speaking, we could restore the process to its complete form after killing it, but having to do that means we did not build the process well, and we need to fix that.
Still...
where was this bugger writing to ?
we instinctively knew this was file i/o not handled well...
But which file was this guy writing to was not known..
So like Neo in the Matrix, we went back to the Source !!
But like Neo it turned out to be a bust the first time around...
All the poor guy was doing in the name of file I/O was logging. And not ad-hoc logging to a file, but using a proper logger with nice file transports..
So what now ?
Well, we have the process id ! And this is Linaaaaaaaaux !!!
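From here the trail is essentially what the references below walk through; a minimal sketch of the idea, with <pid> and <fd> as placeholders for the actual numbers:
# list the process's open file descriptors; deleted-but-still-open files show up as "(deleted)"
sudo ls -l /proc/<pid>/fd | grep deleted
# the same view through lsof, with sizes
sudo lsof -p <pid> | grep deleted
# truncate the ghost file through the still-open descriptor,
# freeing the space without killing the process
sudo sh -c ': > /proc/<pid>/fd/<fd>'
The lasting fix, of course, lies on the logging side, hence the logrotate entry in the references below.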
References in this investigation:
- freeing up space from a deleted file still kept alive by a process
https://unix.stackexchange.com/a/68532
> "/proc/$pid/fd/$fd"
WARNING: such stunts to be performed by experts only
- find all procs still dealing with deleted files
same article as above
find /proc/*/fd -ls | grep '(deleted)'
- Red Hat chiming in and explaining what the issue is:
-- Why is space not being freed from disk after deleting a file in Red Hat Enterprise Linux? https://access.redhat.com/solutions/2316
-- Why does df show greater disk usage than du in Red Hat Enterprise Linux? https://access.redhat.com/solutions/2794
- some useful commands:
lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
lsof -s -p <pid> | sort -k7 -r | awk '{$7=$7/1048576 "MB"}1'
lsof | grep deleted
find /proc/*/fd -ls | grep '(deleted)'
- various ways to truncate a file: https://linuxize.com/post/truncate-files-in-linux/
- Logrotate (a minimal config sketch follows after this list):
https://linux.die.net/man/8/logrotate
- open files by size: https://unix.stackexchange.com/a/332340
sudo lsof -F sn0 | tr -d '\000' | grep deleted | sed 's/^[a-z]*\([0-9]*\)n/\1 /' | sort -n
- Why does lsof show the same file descriptor for different processes?
https://askubuntu.com/questions/158988/why-does-lsof-show-the-same-file-descriptor-for-different-processes
- how to display human readable sizes when option is not available in the command
https://unix.stackexchange.com/questions/253321/how-to-display-size-human-readable-in-lsof-grep-var
- lsof uses:
https://unix.stackexchange.com/questions/235979/how-do-i-find-out-more-about-socket-files-in-proc-fd
- who is using a file in Linux: https://www.tecmint.com/find-out-who-is-using-a-file-in-linux/
https://www.cyberciti.biz/faq/howto-linux-get-list-of-open-files/
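As promised next to the Logrotate entry above, a minimal sketch of a rotation config; the path and file name are hypothetical, and copytruncate is the directive that matters for the ghost-file situation, since it truncates the log in place instead of renaming it out from under a process that keeps the descriptor open:
sudo tee /etc/logrotate.d/birju >/dev/null <<'EOF'
# hypothetical log path for our "Birju"
/var/log/birju/*.log {
    daily
    rotate 7
    missingok
    notifempty
    compress
    # truncate in place so a writer holding the file open does not
    # keep filling a renamed (and later deleted) ghost file
    copytruncate
}
EOF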