Skip to main content

The open case of closed files

 A few imaginative titles:
- strangled for space by a ghost

- something that does not exist but occupies space

- Ghost files

I can do this all day long... :D



The story:

Ghost story the beginning !!
(like bahubali:  the begin...)


RabbitMQ starts doing this everyday !!

2021-04-19 10:03:03.361 [info] <0.438.0> Free disk space is insufficient. Free bytes: 47263744. Limit: 50000000
2021-04-19 10:03:03.362 [warning] <0.434.0> disk resource limit alarm set on node 'rabbit@ip-xx-xx-xx-xxx'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
2021-04-19 10:10:33.505 [info] <0.438.0> Free disk space is sufficient. Free bytes: 887422976. Limit: 50000000
2021-04-19 10:10:33.505 [warning] <0.434.0> disk resource limit alarm cleared on node 'rabbit@ip-xx-xx-xx-xxx'
2021-04-19 10:10:33.505 [warning] <0.434.0> disk resource limit alarm cleared across the cluster

Result:

hush, (sigh) ! 
All producers of the queues crumbled under this embargo !
Mass service failure.

Much like the second wave of COVID-19 in India, around the time this was written (29 Apr 2021)


First Investigations:

grep --> disk space insufficient
in rabbitmq logs 

Co-incidentally timings were very close:
2021-04-19 10:03:03.361 [info] <0.438.0> Free disk space is insufficient. Free bytes: 47263744. Limit: 50000000
2021-04-19 10:30:51.071 [info] <0.438.0> Free disk space is insufficient. Free bytes: 49147904. Limit: 50000000
2021-04-20 12:35:05.771 [info] <0.440.0> Free disk space is insufficient. Free bytes: 48472064. Limit: 50000000
2021-04-22 04:51:43.720 [info] <0.440.0> Free disk space is insufficient. Free bytes: 978644992. Limit: 1000000000
2021-04-23 10:29:39.702 [info] <0.434.0> Free disk space is insufficient. Free bytes: 3999858688. Limit: 4000000000
2021-04-26 04:43:31.479 [info] <0.434.0> Free disk space is insufficient. Free bytes: 3999997952. Limit: 4000000000



The usual suspect !!

- A cron queuing up so many messages that it was taking up much space on the disk !!
(here the inherent assumption was that rabbitMQ was the one hogging space, since it was the making all the noise)
also, our understanding of the disk resource limit alarm, due to a skim through of the documentation, was flawed.

All our crons were scattered, so it was time taking to bring them all in one place and correlate them with the spikes in memory consumption. So we needed intermediate relief when we were scouring all the crons.. 

So well ?!
What did we do ?

We put rabbitmq under house arrest.



How you ask ?

http://rabbitmq-host:port/api/nodes/rabbit@ip-xx-xx-xx-xxx/memory

- monitoring api called at a 10 second interval and collected:
- menesia disk consumption for 1.5 days...

To our surprise..
It never went above a few MBs
and 
Still the free space param on the node health monitoring api :
http://127.0.0.1:45672/api/nodes/rabbit@ip-xx-xx-xx-xxx
kept on shrinking !!

Someone was bullying our Rabbit (MQ)



So further down the rabbit hole we went..

df -TH
- agreed with the RabbitMQ node health api and the RabbitMQ monitoring dashboard
("df" and and "the api" did get physical over one minor difference:
in converting bytes to GB with 10^9 or (1024)^3; still they agreed
such immature people I tell you... poh-té-to  poh-ta-to)

But I digress.
The space was indeed shrinking.. 
the walls were indeed closing in on our little Rabbit(MQ)


But,
sudo du -sh /* --exclude some-other-partition-mounted-on-root
- had a different story to tell.. 

Much like government reported figures...
"du" told us that nothing was amiss.. nobody was amassing huge amounts of disk space 
("No damaad was taking up inordinate amount of prime disk real estate")  
("Nobody is dying of corona due to scarcity of medicines and oxygen")

We decided to believe the little guy who was getting squeezed.


So who is writing to disk anyway ?

sudo iotop -ao
comes to the rescue.

Like a honest police officer, 
It showed us that one of our own was writing to disk too fast..
Let's call the process  "Birju"
(why Birju ?, so that the writer can segway into using Mother India references, duh !)

and like Mother India we killed the poor bastard !!
Lo and behold !

All the disk space he had confiscated from "Sukhilala" was returned to the free disk world.

But that did not solve the problem ! We cannot kill our "Birju" everyday, we are not that kind of mother india :D
and technically speaking we can restore the process again to its complete form after killing it, but having to do that means we did not build the process well and we need to fix that.


Still...
where was this bugger writing to ?


we instinctively knew this was file i/o not handled well... 
But which file was this guy writing to was not known.. 

So like Neo in the Matrix, we went back to the Source !! 
But like Neo it turned out to be a bust the first time around...

All the poor guy way doing in the name of file-io was logging. And not adhoc logging to file, but using a proper logger with nice file transports..

So what now ?



Well we have the process id ! and this is Linaaaaaaaaux !!!






References in this investigation:

- freeing up space from a deleted file still kept alive by a process
https://unix.stackexchange.com/a/68532

> "/proc/$pid/fd/$fd"
truncate the file.
WARNING:  such stunts to be performed by experts only

- find all procs still dealing with deleted files
same article as above
find /proc/*/fd -ls | grep  '(deleted)'

- Red hat chiming in and explaining what the issue is:
-- Why is space not being freed from disk after deleting a file in Red Hat Enterprise Linux? 
https://access.redhat.com/solutions/2316

-- Why does df show greater disk usage than du in Red Hat Enterprise Linux? 
https://access.redhat.com/solutions/2794

- some useful commands:

lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head

lsof -s -p <pid> | sort -k7 -r | awk '{$7=$7/1048576 "MB"}1'
lsof | grep deleted
find /proc/*/fd -ls | grep  '(deleted)' | grep deleted

- various ways to truncate a file:
https://linuxize.com/post/truncate-files-in-linux/
Logrotate
https://linux.die.net/man/8/logrotate


- open files by size:
https://unix.stackexchange.com/a/332340
sudo lsof -F sn0 | tr -d '\000' | grep deleted | sed 's/^[a-z]*\([0-9]*\)n/\1 /' | sort -n

- Why does lsof show the same file descriptor for different processes?
https://askubuntu.com/questions/158988/why-does-lsof-show-the-same-file-descriptor-for-different-processes


- how to display human readable sizes when option is not available in the command
https://unix.stackexchange.com/questions/253321/how-to-display-size-human-readable-in-lsof-grep-var

- lsof uses:
https://unix.stackexchange.com/questions/235979/how-do-i-find-out-more-about-socket-files-in-proc-fd


- who is using s a file in linux:
https://www.tecmint.com/find-out-who-is-using-a-file-in-linux/
https://www.cyberciti.biz/faq/howto-linux-get-list-of-open-files/

Comments

Popular posts from this blog

Humour collection

0. - lol - oh ! my god she has her wires crossed https://www.youtube.com/shorts/0jxoP4qtNZ4 1. https://www.youtube.com/shorts/iYPxu6JaVhA - north vs south - they both need to reconcile differences and realise each other's importance https://www.youtube.com/watch?v=49D1TLCDXzM - ####### Thursday 02 March 2023 08:30:55 PM IST ####### in the time when it is claimed that India has been "radicalised", "hindutva" has become dominant Pathaan is the highest grossing film - starring a Khan. Hindu religion is inherently tolerant https://www.youtube.com/shorts/xSM5xAXhcdM - celebration of fat people ? - I agree - we got too much food - have you seen a fat 90 year old ?! - eye opener  https://www.youtube.com/shorts/Y9dDyGLKjQM - the peaceful minorities were irrelevant - most of germans were peaceful - people who drive the agenda are important https://www.youtube.com/shorts/aRNx3qftIzE - eye witness testimony - unreliable - unless backed by  - judge got what he said wrong - ...

The Anecdote collection - page 1

Anecdote Collection - page 1 -x-x-x- ####### Tuesday 19 March 2024 04:37:02 AM IST ####### 40. Success is not bought, it is rented. and the rent is due everyday. 41. FrshNFit - all women are gold diggers, some are better at hiding the shovel ####### Wednesday 03 May 2023 12:50:38 PM IST ####### 42. Why touching feet is behavioral engineering  https://www.youtube.com/watch?v=eFjzPFC3ILw&lc=UgzfxWqMRn2wt-D9K594AaABAg my 2 paise on this: Most people in India don't understand this but touching the feet of elders when you meet them -- invokes this same "programming" of the brain. People equate it to putting them on a pedestal... it is just behavioural engineering. So a person who is doing it has more likelihood of getting a better response. ####### Wednesday 03 May 2023 12:02:51 PM IST ####### 43. Your freedom ends, where someone's rights begin ####### Wednesday 19 April 2023 07:53:20 PM IST ####### 44. Thomas Sowell  - cultural issues -  - racism or victimhood  #...