Every time we press a key in an interactive SSH session, the SSH client sends that keystroke to the SSH server in a TCP segment. Here is a Wireshark capture of me SSH-ing into my own machine and pressing a single key. I have left out the packets that set up and tear down the TCP and SSH connections.
You can tell the client and the server apart by the source and destination port numbers. SSH servers usually listen on port 22.
After the client has transmitted a character over a TCP segment, the server acknowledges that it has received it. Acknowledgements of data enable TCP to provide a reliable transport service to higher level protocols like HTTP, SMTP and SSH.
After this, the server actually processes the character by sending it to the program, which is typically bash, but could be anything – sh, zsh or even emacs. The shell will interpret the character and send back the result over another TCP segment. This enables the client to echo the character on the screen and send back an acknowledgement to the server.
I have depicted this in a timeline –
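The same exchange can also be written down as a small model. This is only a sketch: the `Segment` type, the `keystroke_exchange` function and the starting sequence numbers are made up for illustration, nothing touches the network, and real SSH payloads are larger than one byte because of encryption and padding.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    sender: str       # "client" or "server"
    seq: int          # sequence number of the first payload byte
    ack: int          # next byte the sender expects from its peer
    payload_len: int  # bytes of data carried (0 for a pure ACK)


def keystroke_exchange(client_seq=1000, server_seq=5000, payload_len=1):
    """Model the four segments exchanged for one keystroke."""
    segments = []
    # 1. Client sends the keystroke.
    segments.append(Segment("client", client_seq, server_seq, payload_len))
    client_seq += payload_len
    # 2. Server acknowledges the keystroke with a pure ACK.
    segments.append(Segment("server", server_seq, client_seq, 0))
    # 3. Server sends the echo produced by the remote program.
    segments.append(Segment("server", server_seq, client_seq, payload_len))
    server_seq += payload_len
    # 4. Client acknowledges the echo.
    segments.append(Segment("client", client_seq, server_seq, 0))
    return segments


if __name__ == "__main__":
    for s in keystroke_exchange():
        kind = "data" if s.payload_len else "ACK"
        print(f"{s.sender:>6} {kind:<4} seq={s.seq} ack={s.ack} len={s.payload_len}")
```

Each acknowledgement number is simply the peer's sequence number advanced by the bytes received so far, which is how either side knows the keystroke or the echo arrived.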
I tried this out and I can see only three segments per character I type
To keep the traffic caused by acknowledgements to a minimum, TCP piggybacks ACKs on data that it has to send anyway. The receiver holds back its ACK for a short while (a delayed acknowledgement) in case it has data of its own to send in the same direction. The exact behaviour is implementation specific, though if you were using the socket API, you’d play around with the TCP_NODELAY option (which disables Nagle’s algorithm) and the Linux-specific TCP_QUICKACK option (which disables delayed acknowledgements).
Here is an example of this happening when I SSH-ed into an AWS EC2 server.
Note that the server is sending the ACK for the character it received and the response together in one TCP segment.
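For anyone who wants to experiment with the socket options mentioned above, here is a minimal Python sketch using the standard `socket` module. `TCP_NODELAY` is widely available, while `TCP_QUICKACK` is Linux-specific, so the code only sets it when the constant exists. The `tune_socket` helper is a made-up name, and this is not meant to show what any particular SSH implementation does.

```python
import socket


def tune_socket(sock):
    """Disable Nagle's algorithm and, where supported, delayed ACKs."""
    # TCP_NODELAY = 1 turns Nagle's algorithm off, so small writes are sent
    # immediately instead of being coalesced.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # TCP_QUICKACK only exists on Linux, and the setting is not permanent:
    # the kernel may return to delayed ACKs later.
    quickack = getattr(socket, "TCP_QUICKACK", None)
    if quickack is not None:
        sock.setsockopt(socket.IPPROTO_TCP, quickack, 1)
    return sock


if __name__ == "__main__":
    s = tune_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("TCP_NODELAY:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```

With quick ACKs enabled on the receiving side you would expect to see the separate ACK segment again, since the receiver no longer waits for outgoing data to carry it.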
Why not send whole commands?
One may wonder why SSH does not transmit whole commands or whole lines rather than individual characters. The answer lies in the fact that the program on the server may have commands that are only a character long and do not require a newline character. Think ESC in vi, or M-x in emacs, or SPACE in more. Even while sending commands to bash, there could be readline-specific keystrokes like C-e or C-a or even TAB that need to be sent as they are pressed instead of waiting for a newline.
So, SSH chooses not to try to understand which sequences of characters constitute a command and simply sends characters across as they are typed. In fact, it doesn’t even assume whether or how the pressed character will echo on the screen; it finds that out from the server program.
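To make the difference concrete, here is a toy Python comparison of the two strategies. Both functions and their names are invented for this sketch and do no networking; each one just returns the chunks a client would hand to the transport for a given sequence of keystrokes.

```python
def send_per_keystroke(keys):
    """Forward every keystroke immediately, as an interactive session does."""
    return list(keys)


def send_per_line(keys):
    """Hypothetical alternative: buffer keystrokes until a newline arrives."""
    chunks, buffer = [], ""
    for k in keys:
        buffer += k
        if k == "\n":
            chunks.append(buffer)
            buffer = ""
    # Anything still sitting in `buffer` never reaches the server.
    return chunks


if __name__ == "__main__":
    # Typing "ls" followed by TAB for completion, with no newline yet.
    keys = ["l", "s", "\t"]
    print(send_per_keystroke(keys))  # the TAB reaches the remote shell
    print(send_per_line(keys))       # nothing is sent, so completion never runs
```

In the line-buffered version the TAB stays on the client, so the remote shell never gets a chance to complete anything.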
What happens when I use SSH with other software like git?
One of the most frequent ways that I use SSH is for remote git operations like pull, push, or clone. When actual data is being sent, the SSH software understands that it isn’t an interactive invocation and TCP utilizes all the available capacity of each segment to help SSH send all that data.
Deciding the sizes of segments is left to TCP and would warrant a blog post by itself, but here is a quick primer: TCP picks a Maximum Segment Size (MSS) such that segments will not be fragmented at the IP layer. In other words, the MSS plus the TCP and IP headers must be less than or equal to the Path MTU (Maximum Transmission Unit). The Don’t Fragment (DF) flag is also typically set on the IP packets carrying the segments, so that they don’t get fragmented along the way.
Here is a Wireshark capture from when I cloned a repository from GitHub over SSH.
Note how the data in each segment plus the TCP and IP headers adds up to exactly the MTU of my Ethernet interface – 1500 bytes.
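The arithmetic behind these sizes is simple enough to sketch. Assuming an IPv4 header without options (20 bytes) and a 20-byte base TCP header, which are simplifying assumptions rather than values read from the capture, the largest payload per segment follows directly from the MTU. The `mss_for_mtu` helper is a made-up name for this example.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options


def mss_for_mtu(mtu, ip_header=IPV4_HEADER, tcp_header=TCP_HEADER):
    """Largest TCP payload that fits in one unfragmented IP packet."""
    return mtu - ip_header - tcp_header


if __name__ == "__main__":
    print(mss_for_mtu(1500))
    # The TCP timestamps option takes 12 bytes of header space once padded,
    # leaving correspondingly less room for data.
    print(mss_for_mtu(1500, tcp_header=TCP_HEADER + 12))
```

Either way, payload plus headers still comes to 1500 bytes on an Ethernet interface; TCP options only change how much of that is data.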
Update
There is also a discussion about this on Hacker News
Nikhil Mungel writes blogs on networking, ruby and GNU/Linux. If you’d like to see more, follow him on twitter.com/hyfather