Troubleshooting SSH and LDAP Desync on a Research Cluster
Tags: Blog Post, System Administration, HPC
Note
This is the first in a series of blog posts documenting my experience fixing, maintaining, upgrading (and occasionally downgrading), and eventually improving my department's research cluster.
The Legend of dijkstra
Within my department, we have a piece of "old reliable" infrastructure: a
high-performance cluster (honestly, more medium-performance these days) lovingly
named dijkstra. Since at least 2019, dijkstra has survived its fair
share of hardware failures, software rot, and networking troubles.
It has served as the ultimate sandbox for our department, touched by everyone from cybersecurity students and AI researchers to database engineers. It is one of those classic systems that everyone depends on, yet is simultaneously the primary reason most research groups eventually give up and purchase their own private clusters.
The Fledgling Phoenix
Today, dijkstra is attempting a "fledgling phoenix" act: it is trying to
be reborn. The problem is that the core system functionality has deteriorated to
the point where even the simplest task, like logging in, has become a multi-hour
troubleshooting ordeal. It has reached a level of technical debt that almost no
one in the department is willing to touch.
This, of course, made it the perfect time for me to jump in. My goal isn't just to fix the machine, but to document the process for the sake of the administrators who will eventually inherit this system after me.
Enter The New Guy
While I'm not entirely new to cluster management, my background has mostly been
in controlled or "lab" environments. In high school, I was fortunate enough to
work with physical systems managing remote storage, VoIP protocols, and
district-wide databases. Later, I built and deployed several Raspberry Pi and
NVIDIA Jetson Nano clusters for computer vision research. I've even had the
privilege of interning at Argonne National Lab, working on supercomputers
including polaris, sophia, and crux.
However, there is a massive difference between using a supercomputer and rescuing a production system. Aside from small side projects, I've never been the one responsible for an in-production environment with active users.
dijkstra is in dire need of maintenance, and it's being overhauled by
someone just starting to "cut their teeth" on production-grade sysadmin work.
What could possibly go wrong?
Challenge 1: Let Me In!
Logging into dijkstra was the first challenge. In a typical cluster
architecture, users first ssh into a Head Node (the internet-facing
gateway). From there, they should be able to hop onto any of the internal
Compute Nodes to run their workloads.
On dijkstra, however, that second hop was broken. While the head node
allowed users to log in, every other node on the system was inaccessible.
Well, all but two: the two newest GPU nodes installed on the system
did allow users to ssh into them from the head node, but those machines are
already exceptions within the cluster. Their hardware and software
configurations differ so significantly from the legacy nodes that simply
"copy-pasting" their configurations onto the rest would have been an unreliable,
and likely dangerous, solution.
Thankfully, I was able to log in to all nodes as root via ssh. A security
oversight for sure, but I was grateful to at least have access to the compute
nodes for testing solutions.
So this was my first task: I needed to be able to authenticate at the head node and then move freely to any other node in the system using my standard user account, without being met by a brick wall of password prompts or "Permission denied" errors.
It Wasn't DNS
The very first step I took was to simply check that the /etc/hosts file
on the head node correctly referenced the node aliases. dijkstra uses this
file to provide aliases for all nodes in the system so that users do not need to
remember specific device hostnames or IP addresses when jumping between nodes.
If this file had been tampered with, it would make sense as to why I wasn't able
to ssh out of the login node to any of the compute nodes. Thankfully, the
file only had a single typo, which pointed two different IP addresses at the
same node, and that bug was easy to squash.
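For the curious, this class of typo is easy to spot mechanically. The snippet below reproduces the bug on a sample hosts file (the node names and addresses are illustrative, not dijkstra's real ones) and flags any alias that appears on more than one line:

```shell
# Hypothetical /etc/hosts excerpt with the duplicate-entry typo:
hosts_sample='10.0.0.10 head-node
10.0.0.11 compute-node1
10.0.0.12 compute-node1'   # typo: this line should read compute-node2

# Print any alias mapped by more than one address
printf '%s\n' "$hosts_sample" | awk '{print $2}' | sort | uniq -d
```

Running the same pipeline against the real `/etc/hosts` (swap the sample for the file) makes duplicate aliases jump out immediately.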
It Has To Be Slurm... Right?
In a previous attempt to modernize dijkstra, the department installed Slurm (https://slurm.schedmd.com/overview.html),
a workload manager designed to queue and schedule high-performance computing
jobs. To ensure the cluster remained stable, Slurm had been configured with a
strict "no-direct-entry" policy: users could only execute code on compute nodes
via a scheduled job, effectively blocking direct ssh access to prevent unmanaged
resource consumption.
At some point, Slurm was disabled for reasons unknown to me. However, because the cluster's previous state was so restrictive, I assumed a "ghost" configuration was still haunting the system. I suspected a lingering PAM (Pluggable Authentication Module, https://en.wikipedia.org/wiki/Pluggable_Authentication_Module) rule or a configuration flag was still enforcing that "Slurm-only" access rule.
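For context, Slurm's "no-direct-entry" policy is usually enforced by the pam_slurm_adopt module wired into the ssh server's PAM stack. This is a sketch of the kind of leftover line I was hunting for; the exact file path and surrounding stack vary by distribution:

```
# /etc/pam.d/sshd (sketch; exact stack varies by distro)
# If a line like this lingers after Slurm is retired, ssh logins are
# denied to anyone without a running job on the node.
account    required     pam_slurm_adopt.so
```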
After a quick consultation with Google Gemini and a deep dive into the Slurm
documentation, I identified several candidate files that might be causing the
lockout. I took the aggressive step of stopping all remaining Slurm daemons and
temporarily moving the configuration files out of the system directories and
into my home folder.
The result? Nothing. Even with Slurm effectively erased from the active environment, the compute nodes remained a black box.
This troubleshooting was complicated by a fundamental architectural hurdle:
Node Isolation. While the users' /home directories are shared across the
cluster via nfs (Network File System), the system configurations are not.
dijkstra does not share a common system image or a unified configuration
directory. If I changed a setting on the head node, it had zero effect on the
compute nodes. If I fixed a config on compute-node1, compute-node2
remained broken. Each node was its own island, requiring me to diagnose and
repair them one by one.
Maybe It's ssh?
Assuming that I had disabled Slurm correctly on the head node and my testing
compute node, I then decided to check whether the user-level or system-level
config for the ssh client was incorrect. To cut to the chase, it was not.
Additionally, on the compute node the sshd_config file (the one used to
configure the ssh server, not the client) was also correct.
This is where I truly got stuck for a minute. If it wasn't DNS, PAM, or an
ssh config, I didn't really know where to go next. It wasn't until I decided
to ssh back into the compute node with my standard user account with verbose
output enabled (ssh -v) that I began to have some clarity as to what the
actual problem was.
To be frank, this is one hundred percent where I should've started. Actually
reviewing the logs and the connection status as the client and server
communicate would have told me that DNS was functioning correctly (for all but
one node), that PAM wasn't blocking logins, and that my ssh configs on the
client and server were correct enough not to be causing this issue. But
hindsight is 20/20, so I'll just remember this in the future.
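A trimmed sketch of the kind of client-side verbose output that matters here (the hostname and username are illustrative; `-vv` and `-vvv` stack for even more detail): the client keeps offering a perfectly good key, and the server refuses every method anyway.

```
$ ssh -v myuser@compute-node1
...
debug1: Offering public key: /home/myuser/.ssh/id_rsa RSA SHA256:...
debug1: Authentications that can continue: publickey,password
...
myuser@compute-node1: Permission denied (publickey,password).
```

When the client side looks this healthy, the next place to look is the server's own auth log on the target node.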
It's an Identity Issue
That verbose output shifted the entire investigation. The compute node wasn't
rejecting my user key; it was rejecting my existence. In a cluster like
dijkstra, nodes don't keep local lists of users; they consult a central
"phonebook" using LDAP (Lightweight Directory Access Protocol).
The LDAP Bouncer Analogy
To understand why this failed, think of the login process as a club:
The ssh key is your valid ticket to the show.
LDAP is the bouncer at the door with the guest list.
The Issue: If the bouncer can't open the guest list because he has the wrong password to the safe, he doesn't care how valid your ticket is. He doesn't even know you're supposed to be on the list.
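You can interrogate the "guest list" directly with getent, which queries the same name-service stack (local files plus LDAP, per /etc/nsswitch.conf) that sshd relies on. If an account resolves on the head node but not on a compute node, the directory client on that node is broken. The username below is illustrative:

```shell
# root is a local account, so it should resolve on any node:
getent passwd root

# A directory-backed account (name illustrative) that resolves on the
# head node but NOT on a compute node points at that node's LDAP client.
getent passwd myuser || echo "this node cannot resolve the user"
```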
Configuration Drift As The Root Of All Evils
Upon comparing the configuration files across the cluster, I discovered that the compute nodes' LDAP configuration had drifted away from what was expected. While the head node used the current "bind password" to talk to the LDAP server, my testing compute node was still configured with an old, incorrect password from a previous era of the system.
Because of this mismatch, the compute node couldn't verify my identity, deemed
me an "invalid user," and ignored my ssh keys entirely.
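The comparison itself is just a diff. This simulates the drift I found using two sample copies of an LDAP client config (the file names, keys, and passwords are illustrative, in the nslcd/ldap.conf style; dijkstra's actual client file may differ):

```shell
# Sample copies of the LDAP client config from two nodes (values illustrative)
printf 'binddn cn=reader,dc=dept,dc=edu\nbindpw current-secret\n' > /tmp/head-node.conf
printf 'binddn cn=reader,dc=dept,dc=edu\nbindpw ancient-secret\n' > /tmp/compute-node.conf

# diff pinpoints exactly which setting drifted: here, the bind password
if ! diff /tmp/head-node.conf /tmp/compute-node.conf; then
    echo "configs have drifted"
fi
```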
Manual Syncing To Keep dijkstra Afloat
Since we were not yet using automated configuration management like Ansible, I had to resolve this manually:
1. Logged into the root account to bypass user-level LDAP issues.
2. Updated the LDAP configuration file on the faulty nodes with the correct bind password.
3. Restarted the LDAP daemon.
4. Verified the fix by logging in with a standard user account. Success!
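In concrete terms, the middle steps amounted to something like the following on each broken node. Everything here is illustrative: the config path, key names, and passwords are stand-ins, and the actual LDAP client daemon (nslcd, sssd, etc.) depends on the node's vintage:

```shell
# Stand-in for the node's LDAP client config (path and values illustrative)
conf=/tmp/ldap.conf.example
printf 'binddn cn=reader,dc=dept,dc=edu\nbindpw ancient-secret\n' > "$conf"

# Swap the stale bind password for the current one
sed -i 's/^bindpw .*/bindpw current-secret/' "$conf"
grep '^bindpw' "$conf"

# On the real node, follow up by restarting the LDAP client daemon, e.g.:
#   systemctl restart nslcd
# and re-test identity resolution with: getent passwd USERNAME
```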
Takeaways
Log Veracity is King: Client-side logs (ssh -v) are helpful, but the target node's logs are the final authority.
Think in Tiers: Start troubleshooting at the local level (your account/config), scale up to the system level (SSH config), and finally look at global identity management (LDAP/AD).
Automate to Avoid Drift: Manual configurations inevitably drift. We are now moving toward an Ansible-based "Infrastructure as Code" approach to ensure every node in the cluster has the exact same software and configurations.