Deploying to Private AWS EC2 Without SSH or Bastion Hosts
This piece describes how to deploy to EC2 instances without SSH or public network access by using AWS SSM and CircleCI. It highlights the async behaviour of send-command, the operational risk of CI false positives, and the pattern used to ensure deterministic deployments through blocking and log inspection.
Modern DevOps security prioritizes the Principle of Least Privilege. Often, this means moving away from long-lived SSH keys and closing port 22 entirely. But when your deployment target is an Amazon EC2 instance sitting in a private subnet, how do you trigger a Docker Compose update without a direct tunnel?
The answer lies in AWS Systems Manager (SSM).
The Challenge: The “Invisible” Instance
In a locked-down production environment, the following hurdles commonly come up:
- No SSH Access: The pipeline doesn’t (and shouldn’t) store sensitive .pem files.
- Private Networking: The instance has no public IP and is only reachable via VPN.
- Local Context: Docker Compose must run on the machine to access the local daemon, volumes, and internal network.
SSM solves this by using an outbound agent: the instance “checks in” with AWS, pulls the command, and reports back the results.
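Because the connection is initiated from the instance, you can confirm from a CI runner or workstation whether a private box is reachable by SSM at all, with no network path to it required. A minimal check (the --query expression is just one convenient way to slice the response):
aws ssm describe-instance-information \
  --filters "Key=InstanceIds,Values=<instance-id>" \
  --query "InstanceInformationList[0].PingStatus" \
  --output text \
  --region <region>
# Prints "Online" when the agent is checking in; "None" means the instance is
# not registered as a managed node (agent stopped, missing instance role, etc.).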
Infrastructure Requirements
The components used in this demonstration are an AWS EC2 instance behind a VPN, Amazon ECR, Docker, and CircleCI.
- SSM Agent: Must be active on the Amazon EC2 instance. Verify agent status before the first deployment (a quick check is sketched after this list).
- IAM Instance Profile: The instance role must include the AmazonSSMManagedInstanceCore policy.
- CI Permissions: The IAM user/role needs ssm:SendCommand, ssm:GetCommandInvocation, and any other privileges the use case requires. (The wait helper used later polls GetCommandInvocation under the hood, so it needs no separate action.)
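As referenced in the first bullet, a quick way to confirm the agent is actually running, assuming an Amazon Linux instance (other distributions package the agent differently):
# Run on the EC2 instance itself:
sudo systemctl status amazon-ssm-agent
# If it is installed but stopped, bring it back up:
sudo systemctl restart amazon-ssm-agent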
Phase 1: Initial Trigger
The first step replaces the usual ssh ec2-user@ip ... workflow: the AWS CLI pushes the deployment instructions to the instance with aws ssm send-command. Shell traps are used so that, when something breaks, the output labels exactly which part of the Docker lifecycle failed.
command_id=$(aws ssm send-command \
--instance-ids "<< parameters.instance-id >>" \
--document-name "AWS-RunShellScript" \
--comment "Executing Docker operations" \
--parameters commands='[
"set -e",
"cd /home/ec2-user",
"trap '\''echo DOCKER_LOGIN_FAIL'\'' ERR; aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin <id>.dkr.ecr.region.amazonaws.com",
"trap '\''echo DOCKER_PULL_FAIL'\'' ERR; docker-compose -f << parameters.compose-file >> pull",
"trap '\''echo DOCKER_UP_FAIL'\'' ERR; docker-compose -f << parameters.compose-file >> up -d"
]' \
--region <region> --query "Command.CommandId" --output text)
echo "export command_id=$command_id" >> $BASH_ENV
Major Caveat: The “False Success” Problem
The send-command call above does the work, but on its own it is not enough for CI/CD because it is asynchronous: it returns a CommandId as soon as AWS accepts the request, not when the script finishes.
I learnt this the hard way after running only the send-command step in production a couple of times and struggling to understand why the pipeline misbehaved with heavier Docker images. I blamed the agent; it turned out I simply had not set things up the right way in the first place.
In a CircleCI environment, the runner fires the command and marks the job as Success instantly, because as far as the CLI is concerned the work is done. Meanwhile, the EC2 instance might still be pulling images in the background. If the deployment fails later, CircleCI never knows.
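The asynchrony is easy to observe: query the invocation right after the trigger step and it will usually still be in a non-terminal state. A quick sanity check, reusing the command_id captured earlier:
aws ssm get-command-invocation \
  --command-id $command_id \
  --instance-id "<< parameters.instance-id >>" \
  --region <region> \
  --query "Status" --output text
# Typically prints "Pending" or "InProgress" here, even though the CLI call
# that triggered the deployment has already returned successfully.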
Phase 2: Wait and Verify
To bridge this gap, the pipeline must poll the instance status and retrieve the actual execution logs. In this example, jq, a lightweight command-line JSON processor, is used.
1. Blocking with wait
The aws ssm wait command-executed blocks the runner until the instance reports a terminal state (Success, Failed, or TimedOut). If the remote execution fails, the wait utility exits with a non-zero code. Bypassing this immediate exit with || true allows the script to proceed to the log-retrieval phase, ensuring the failure details are printed to the CI console.
aws ssm wait command-executed \
--command-id $command_id \
--instance-id "<< parameters.instance-id >>" \
--region <region> || true
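One caveat: like every AWS CLI waiter, command-executed gives up after a fixed number of polling attempts. If a deployment pulls very large images, the waiter can time out while the command is still running, and the verification step below will then see a non-terminal status and fail the job even though the deployment may still finish. For that situation, a hand-rolled polling loop is a drop-in replacement; a sketch, reusing the same variables (the 10-minute cap is arbitrary):
# Poll every 5 seconds, for up to 10 minutes, until a terminal state is reached.
for _ in $(seq 1 120); do
  status=$(aws ssm get-command-invocation \
    --command-id $command_id \
    --instance-id "<< parameters.instance-id >>" \
    --region <region> \
    --query "Status" --output text)
  case "$status" in
    Success|Failed|Cancelled|TimedOut) break ;;
  esac
  sleep 5
done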
2. Parsing the Final Outcome with jq
Once the wait is over, use get-command-invocation to fetch the JSON result, parse it, and fail the pipeline if the remote execution did not succeed.
result=$(aws ssm get-command-invocation \
--command-id $command_id \
--instance-id "<< parameters.instance-id >>" \
--region <region>)
# Extract values cleanly from JSON using jq
status=$(echo "$result" | jq -r '.Status')
echo "Remote Stdout $(echo "$result" | jq -r '.StandardOutputContent')"
echo "Remote Stderr $(echo "$result" | jq -r '.StandardErrorContent')"
if [ "$status" != "Success" ]; then
echo "Deployment failed on the instance. Check the Standard Error above."
exit 1
fi
By following this “trigger-wait-verify” pattern, the CI/CD pipeline correctly orchestrates the update without requiring direct network access. More broadly, the approach also brings the following benefits:
- Eliminates Bastion Hosts: No longer need to maintain jump boxes just for deployments.
- Centralized Auditing: All command execution history and output can be logged in the AWS console and streamed to CloudWatch (see the sketch after this list).
- Role-Based Access: Permissions are managed via IAM, meaning one can precisely control who (or what) can run specific commands on specific instances.
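On the auditing point, Run Command can also ship each invocation's output to CloudWatch Logs at send time, which avoids relying on the (truncated) output returned by get-command-invocation. A minimal sketch, where the log group name /deploy/ssm-run-command is purely illustrative and the instance profile must be allowed to write to CloudWatch Logs:
aws ssm send-command \
  --instance-ids "<instance-id>" \
  --document-name "AWS-RunShellScript" \
  --parameters commands='["docker ps"]' \
  --cloud-watch-output-config '{"CloudWatchOutputEnabled": true, "CloudWatchLogGroupName": "/deploy/ssm-run-command"}' \
  --region <region>
# Each invocation's stdout/stderr is then streamed to the named log group,
# giving a durable audit trail alongside the console's command history.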
Therefore, treating private EC2 instances as SSM managed nodes yields a stronger security posture while maintaining the visibility and reliability required for professional-grade production environments.