- I have overall 4 years of experience in DevOps and cloud.
- I have worked with various tools for build and release engineering, as well as automation and orchestration.
- My primary responsibility is building CI/CD pipelines; in my project we used Jenkins and GitHub as the CI tools.
- I deal with many teams, orchestrating CI/CD pipelines for applications. Our pipelines include multiple stages such as unit testing, static code analysis, integration testing, security scanning, and deployment.
- We create Docker images, scan these images for vulnerabilities using tools like Trivy, and then publish them to Docker Hub.
- We automate the deployment of these images to Amazon EC2 instances and Amazon EKS using Argo CD for continuous delivery.
Roles and responsibilities
- We create infrastructure for environments such as dev, QA, staging, and production.
- Developers write the code; we write the Dockerfile for the application and deploy it.
- We used Docker Hub for storing the images.
- We use Ansible to download and update dependencies on multiple host machines at the same time.
How to back up and restore a Kubernetes cluster in a disaster?
- Install etcdctl first.
- Take a backup of etcd: etcd is the primary data store for Kubernetes, storing all cluster state and configuration. Backing it up is crucial.
- `ETCDCTL_API=3 etcdctl --endpoints=<etcd-server> snapshot save <backup-file>`
- Restore: start a new etcd instance and restore the snapshot with the `snapshot restore` command: `ETCDCTL_API=3 etcdctl snapshot restore <backup-file>`
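A fuller backup/restore sketch is below; the endpoint address and certificate paths are assumptions (typical for a kubeadm-managed cluster) and must be adjusted for your environment:

```shell
# Assumed endpoint and certificate paths for kubeadm-managed etcd -- adjust as needed
ENDPOINT=https://127.0.0.1:2379
CERT_DIR=/etc/kubernetes/pki/etcd

# Take a snapshot of etcd
ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINT" \
  --cacert="$CERT_DIR/ca.crt" \
  --cert="$CERT_DIR/server.crt" \
  --key="$CERT_DIR/server.key" \
  snapshot save /backup/etcd-snapshot.db

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db

# Restore into a fresh data directory, then reconfigure etcd to use it
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restore
```

After the restore, point the etcd static pod (or systemd unit) at the new data directory and restart it. These commands require a live etcd cluster to run.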
How do pods communicate with each other?
By default, pods in Kubernetes can communicate with each other using their IP addresses. Kubernetes assigns a unique IP to each pod, and they can communicate directly by referring to these IPs.
However, since pods are ephemeral in nature, the IP address of a pod can change over time (e.g., when a pod is rescheduled or recreated). This makes relying solely on IP addresses impractical for long-term communication.
To address this, Services are used in Kubernetes. A service provides a stable endpoint (IP address) to which pods can communicate. The service uses a selector to match specific pods and forward traffic to them. This allows pods to communicate with each other using the service’s stable IP address, even if individual pod IPs change.
Simple pod-to-pod communication
- Pods can communicate directly by IP.
- Since pod IPs are ephemeral, they can change frequently.
- Services are used to provide a stable communication endpoint.
- Services use selectors to forward traffic to the appropriate pods.
How does a frontend pod communicate with a backend pod?
In a typical web application architecture, we might have a frontend application which talks to a backend. The backend could be something like an API or a database. In Kubernetes, we would realize this as two Pods, one for the frontend, one for the backend.
We could configure the front-end to talk to the backend directly by its IP address. But the frontend would need to keep track of the backend’s IP address, which can be tricky, because the IP address changes as the Pod is restarted or moved onto a different Node. So, to make our solution less brittle, we link the two applications using a Service.
- Create the Pod and give it a label: the Pod is usually created through another object like a Deployment, a StatefulSet, or, in OpenShift, a DeploymentConfig. The Pod is assigned a label in its JSON or YAML, like `app: api`.
- Create a Service which selects the backend Pod: a Service is created which selects all Pods that match the given label. This is done by specifying a `selector` in the Service definition. In this example, the Service is called `my-api` and it selects Pods with the label `app=api`.
- The app communicates with the backend using the Service name as the hostname: the frontend can now address the API using the Service name as the hostname. So if it talks to the backend over HTTP, the URL would look like `http://my-api:8080`.
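The steps above can be sketched in YAML; the image name and port numbers are assumptions:

```yaml
# Backend Pod (in practice created via a Deployment) carrying the label
apiVersion: v1
kind: Pod
metadata:
  name: api
  labels:
    app: api
spec:
  containers:
    - name: api
      image: my-api-image:latest   # hypothetical image
      ports:
        - containerPort: 8080
---
# Service "my-api" selecting Pods with label app=api
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: api
  ports:
    - port: 8080
      targetPort: 8080
```

The frontend can then reach the backend at http://my-api:8080 regardless of which Pod IPs are currently backing the Service.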
How can we do a rolling update without downtime?
- Run `kubectl edit deployment <app-name>`
- Set maxSurge and maxUnavailable to the values you need, and update the image to the new version
- Or edit the manifest and apply it: `kubectl apply -f my-pod-update.yaml`
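The same settings expressed declaratively; the replica count and image tag are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # extra pods allowed above the desired count during the update
      maxUnavailable: 0   # never drop below the desired count, so no downtime
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2   # the new image version
```

Alternatively, `kubectl set image deployment/my-app my-app=my-app:v2` triggers the same rolling update without editing the manifest.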
How do Kubernetes Services work?
In Kubernetes, Services are a way to expose and enable communication between groups of Pods. They act as an abstraction that connects a set of Pods (usually defined by a label selector) to other components inside or outside the Kubernetes cluster. Services ensure reliable communication by providing a stable endpoint (IP address and port) that doesn’t change, even if the underlying Pods are replaced.
Label Selectors: Services use label selectors to identify which Pods should receive traffic. When a Pod’s labels match the selector in the Service definition, it becomes part of the service’s pool.
Endpoints: Kubernetes automatically creates an Endpoints object associated with the Service, which lists the IP addresses of the selected Pods. This ensures the Service routes traffic to the correct Pods.
What if we want to give a user access to only a particular resource in Kubernetes?
Create a ServiceAccount:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: user1
  namespace: webapps
Create a Role or ClusterRole: use a Role for granting access to resources within a specific namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: pod-reader
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources: ["pods"]
    verbs: ["get", "list"]
Bind the Role/ClusterRole to a User: Assigns the Role to a user within a namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-binding
  namespace: development
subjects:
  - kind: User
    name: johndoe # This should match the user's name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
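Once the Role and RoleBinding are applied, the grant can be checked with `kubectl auth can-i` (this requires a live cluster; the user name matches the binding above):

```shell
# Should print "yes": the role allows get/list on pods in this namespace
kubectl auth can-i list pods --as johndoe -n development

# Should print "no": the role grants no delete permission
kubectl auth can-i delete pods --as johndoe -n development
```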
What if we want to restrict a user's access to the Kubernetes cluster, so that whenever they try to connect they get an error?
Using IAM: we can remove the EKS access from that IAM role, so that whenever the user tries to connect to EKS they will not be able to connect.
Revoke RBAC Permissions: If the user has role-based access control (RBAC) permissions, you can remove their role bindings or cluster role bindings. Example: Remove a RoleBinding
Remove User Credentials: If the user is using a certificate, token, or other credentials to access the cluster, you should invalidate those credentials.
Network Policies: You can also apply network policies to restrict access at the network level, ensuring that the user cannot communicate with the cluster.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EFK Stack for logs
- E= Elasticsearch
- F= Fluentd
- K= Kibana
Fluentd (DaemonSet): an open-source data collector and log forwarder. Fluentd collects logs from various sources and sends them to Elasticsearch for indexing. (Log collector)
Elasticsearch: a distributed search and analytics engine used to store and index logs. It allows you to search and analyze large volumes of log data quickly. (Stores the logs from Fluentd, backed by a volume like EBS)
Kibana: A visualization tool that provides a user-friendly web interface for searching, analyzing, and visualizing logs stored in Elasticsearch. (will visualize the logs, UI)
What is immutable infrastructure ?
- Immutable infrastructure is a software management approach that replaces components instead of changing them. Once an application or service is deployed, it is not modified in place; when a change is required, a new instance is created to replace the existing one.
- Immutable infrastructure is often used in cloud-based systems, containerization, and continuous integration and continuous deployment (CI/CD) pipelines.
Have you ever faced a service outage in your application?
Recently we had a critical incident where the whole application crashed after the latest release. We immediately executed our rollback plan to the previous stable version to restore service. We then investigated and found the root cause: a database misconfiguration. We fixed the issue, tested it in the staging environment multiple times, and pushed it to production, where it worked fine.
How would you handle a situation where a deployment script fails mid-way? What rollback mechanisms would you implement?
- Monitor and Detect: use tools like `kubectl rollout status` or monitoring solutions (e.g., Prometheus, Grafana) to identify deployment issues early.
- Pause the Rollout: use `kubectl rollout pause` to halt the deployment and prevent further application of faulty updates.
- Roll Back: use `kubectl rollout undo deployment <deployment-name>` to revert to the last working state.
- If using a blue-green strategy, switch traffic back to the "blue" (stable) version while debugging the "green" (new) deployment.
- If deploying incrementally, halt the rollout and revert to the stable version while directing traffic away from the faulty deployment.
What are some common challenges in CI/CD, and how have you solved them in your projects?
- Challenge: Slow build and deployment pipelines impacted productivity.
- Solution: Optimized pipelines by caching dependencies, parallelizing jobs, and leveraging incremental builds for large projects.
Explain how you would troubleshoot a pipeline error in Jenkins related to dependency mismatches ?
- Ensure required dependencies are installed on the system or managed through package managers within the pipeline. Verify plugin compatibility and version requirements. For example, if a plugin expects a later Java version than installed, the build will fail.
- Depending on the technology stack (e.g., Maven, Gradle, npm), check the corresponding dependency manager files (pom.xml, build.gradle, package.json) to confirm that the correct versions and dependencies are specified.
- Ensure there are no version conflicts or unresolvable dependencies between libraries.
What are the differences between Jenkins and GitHub Actions, and when would you choose one over the other?
Jenkins: Jenkins is a self-hosted CI/CD tool that requires setup and management on your own infrastructure (either on-premise or cloud). It is highly customizable and has a vast ecosystem of plugins. However, it requires regular maintenance, such as handling plugins, updates, and scaling.
GitHub Actions: GitHub Actions is a cloud-native CI/CD solution that is fully integrated into GitHub. You don’t need to worry about setting up or managing servers as GitHub takes care of the infrastructure. It is tightly coupled with GitHub repositories and uses YAML-based configuration files for workflow definitions.
Explain the role of playbooks and inventory files in Ansible for deployment automation ?
Playbook: Playbooks are YAML files that define the automation tasks and workflows in Ansible. They specify what actions need to be performed on the target systems.
Inventory files: Inventory files define the list of managed nodes (target systems) where Ansible will execute the tasks. They group hosts and associate them with variables.
What kind of vulnerabilities can SonarQube find in a CI/CD pipeline?
- SonarQube's static analysis flags security vulnerabilities and security hotspots such as SQL injection, cross-site scripting (XSS), hardcoded credentials, and weak cryptography, alongside bugs and code smells.
What security measures can we implement in AWS?
To secure an AWS environment, a multi-layered approach is essential:
- Identity and Access Management (IAM): enforce the principle of least privilege, enable multi-factor authentication (MFA), and use roles for applications instead of embedding credentials.
- Data protection: encrypt data at rest using AWS KMS and in transit with TLS/SSL; manage secrets securely through AWS Secrets Manager or Parameter Store.
- Network security: configure security groups and NACLs, restrict public access to sensitive resources, and use private connectivity with VPC Peering or AWS PrivateLink.
- Monitoring and logging: enable CloudTrail for auditing, CloudWatch for metrics and alarms, and AWS Config for compliance tracking.
- Application protection: use AWS WAF to prevent web exploits and Amazon Shield for DDoS protection.
- Storage and database security: implement S3 bucket policies, enable Block Public Access, and secure RDS with encryption and backups.
- Instance-level security: harden EC2 instances, use bastion hosts for access, and restrict access to the instance metadata service.
- Disaster recovery: regular backups, cross-region replication, and failover strategies with Route 53 ensure resilience.
- Security assessments: vulnerability scans with Amazon Inspector and centralized findings in Security Hub help maintain compliance.
- Automation: enforce security configurations with CloudFormation or Terraform, and use Lambda for auto-remediation of security events.
Combining these practices ensures a secure and compliant AWS infrastructure.
How do you handle security in AWS?
To ensure security in AWS, implement IAM best practices like least privilege access, MFA, and using roles instead of access keys. Encrypt data at rest with AWS KMS and in transit using TLS/SSL. Use AWS Secrets Manager to manage sensitive credentials securely. Strengthen network security with security groups, private subnets, and VPC endpoints. Protect applications with AWS WAF and Amazon Shield against web-based and DDoS attacks. Enable CloudTrail, AWS Config, and CloudWatch for monitoring and auditing. Secure EC2 instances with minimal open ports, apply patches regularly, and use Amazon Inspector for vulnerability assessments. Regular backups, cross-region replication, and automated failover ensure disaster recovery. Automating security tasks with AWS Lambda and educating the team on best practices further strengthens the security posture.
How can we copy the builds and other info from one Jenkins server to another?
- Jenkins stores all its configuration, build history, and plugin data in the Jenkins home directory (commonly `/var/lib/jenkins`).
- To migrate, copy the entire Jenkins home directory from the source server to the target server using tools like `rsync` or `scp`: `rsync -avz /var/lib/jenkins/ target-server:/var/lib/jenkins/`
- Job Configurations: copy the job directories from `$JENKINS_HOME/jobs/`.
- Build Data: include subfolders like `builds/` under each job directory.
How to back up and restore Jenkins data?
Backup
- Plugins like ThinBackup or Backup Plugin allow you to create backups through the Jenkins UI. These plugins also provide scheduling options for automated backups.
- You can back up the Jenkins home directory to an external storage location (e.g., AWS S3, Google Drive) for added safety:
aws s3 cp jenkins_backup.tar.gz s3://your-backup-bucket/
Restore
- Extract the backup and replace the contents of the Jenkins home directory on the target server:
tar -xzvf jenkins_backup.tar.gz -C /var/lib/jenkins/
- Update ownership and permissions to match the Jenkins user:
sudo chown -R jenkins:jenkins /var/lib/jenkins/
- Restart Jenkins: `sudo systemctl restart jenkins`
How to manage firewalls in AWS?
- Security Groups (Instance-Level Firewall)
- Network ACLs (Subnet-Level Firewall)
- AWS WAF (Web Application Firewall)
- AWS Firewall Manager
- AWS Network Firewall
- Route 53 Resolver DNS Firewall
Difference between service and microservice ?
Service: A service is a standalone unit of functionality in a software application. It can be part of a monolithic or service-oriented architecture (SOA). Services may handle multiple functionalities or modules.
Microservice: A microservice is a specific type of service designed to perform one single function or small set of tightly related functions. It is an independently deployable unit within a microservices architecture.
If you create a 5GB Docker image but need to deploy it on an EC2 instance with only 2GB of RAM, what solutions or suggestions would you have?
- Replace the base image with a smaller alternative, such as `alpine`, to reduce the image size. Example: switch from `ubuntu` to `alpine`.
- Consolidate or clean up commands in the Dockerfile to minimize layers.
- Use multi-stage builds to exclude tools and dependencies not required in the final image.
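A sketch of a multi-stage build for a hypothetical Node.js app; the image names and paths are assumptions:

```dockerfile
# Build stage: full toolchain, discarded from the final image
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: small alpine-based image with only the build output
FROM node:20-alpine
WORKDIR /app
COPY --from=build /app/build ./build
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "build/index.js"]
```

Only the final stage ships; the build toolchain never reaches the deployed image.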
What challenge did you face in your project?
- Once our new version of the application went down: debugging in Kubernetes.
- The pods went into a CrashLoopBackOff error.
- When I described the pod, it showed an OOMKilled error.
- I had already set up the resource quota and an 8GB RAM limit as per the development team's performance benchmark, but was still getting the error.
- Since our application is a Java application,
- I logged in to the application and captured a thread dump and a heap dump,
- then shared them with the development team.
What is the use of a thread dump and a heap dump?
Thread Dump: used for investigating issues like high CPU usage or application freezes by examining what the threads are doing.
Heap Dump: used for investigating memory-related problems such as OutOfMemoryErrors by examining the state of the heap.
What is dark launching?
Dark launching is a software development technique that allows developers to release new features to a small group of users while keeping them hidden from the majority of users.
What are the key components of a DevOps workflow?
- Planning and Collaboration
- Source Code Management
- Continuous Integration (CI)
- Continuous Delivery (CD)
- Continuous Monitoring
What if we accidentally exit the container? Will we lose the files?
Data written to a container's writable layer persists while the container exists, even if it is stopped, but it is lost when the container is removed. Since containers are designed to be ephemeral and are routinely deleted and recreated, you should not rely on the container filesystem; persist important data explicitly.
Solutions to persist data:
Docker Volumes:
- Volumes are managed by Docker and are the preferred way to persist data.
- You can create a volume and mount it to a container directory, so the data persists even if the container is stopped or removed.
- Example:
docker run -v myvolume:/path/in/container myimage
Bind Mounts:
- Bind mounts allow you to mount a host directory into the container. Any changes inside the container will also affect the host.
- Example:
docker run -v /host/path:/container/path myimage
Create a volume: docker volume create my-volume
An application is hosted in S3 with a frontend and a backend. How do we secure the application?
Use HTTPS with CloudFront (Frontend and Backend)
- Frontend:
- Set up an Amazon CloudFront distribution in front of the S3 bucket.
- Use an SSL/TLS certificate via AWS Certificate Manager (ACM) to enable HTTPS.
- Configure CloudFront to serve content securely and cache public files for performance.
- Backend:
- If the backend is hosted elsewhere (e.g., on EC2 or Lambda):
- Secure the API with HTTPS (use AWS API Gateway if needed).
- Add an origin policy in CloudFront to allow only specific backend requests.
How can I secure user permissions in Jenkins?
- Enable security with an authentication realm (Jenkins' own user database, LDAP, or SSO).
- Use Matrix-based security or the Role-based Authorization Strategy plugin to grant each user or group only the permissions they need.
- Disable anonymous access and restrict who can manage jobs and credentials.
What is metadata in terraform ?
Metadata in Terraform refers to information associated with resources, modules, or configurations that describes their properties or characteristics. This data is not directly part of the resource’s functionality but provides contextual or descriptive information to facilitate infrastructure management.
What is the role of kube-proxy in Kubernetes ?
`kube-proxy` is a network proxy and load balancer that runs on each node in a Kubernetes cluster. Its primary role is to manage network rules and enable communication between Services and pods, both within and outside the cluster.
- `kube-proxy` ensures that requests to a Kubernetes Service are routed to the appropriate backend pods.
- It performs simple round-robin load balancing by distributing traffic among all healthy pods backing a Service.
IPTables and IPVS Management
- `kube-proxy` sets up the network rules using either:
- iptables: traditional Linux firewall rules to manage routing and forwarding.
- IPVS (IP Virtual Server): a more efficient and scalable alternative to iptables for handling large-scale traffic.
Pod-to-Pod Communication: Facilitates communication between pods across nodes by forwarding traffic to the appropriate pod’s IP and port.
External Traffic Management: Handles external requests to Services by routing them to appropriate pods when using NodePort or LoadBalancer types.
What is an init container?
An Init Container in Kubernetes is a special type of container that runs and completes its task before the main application containers start. If the init container executes successfully, the main containers start. Init containers are primarily used to perform setup operations or pre-conditions that the application containers require to function correctly.
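A minimal sketch; the Service name `my-db` and the images are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-init
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      # Block startup until the database Service resolves in DNS
      command: ['sh', '-c', 'until nslookup my-db; do echo waiting for my-db; sleep 2; done']
  containers:
    - name: app
      image: my-app:latest   # the main container starts only after the init container succeeds
```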
What are Network Policies in Kubernetes?
A Network Policy in Kubernetes is a resource used to control and secure network traffic within a cluster. It defines rules for ingress (incoming) and egress (outgoing) traffic between pods, namespaces, and external networks. Network policies enable you to enforce fine-grained communication restrictions in a cluster.
Key Features
- Traffic Control: Specify which pods can communicate with each other and which external networks they can reach.
- Pod Selector: Rules are applied based on labels assigned to pods.
- Namespace Selector: Policies can restrict traffic across namespaces.
- Ingress and Egress Rules: Policies can control both inbound and outbound traffic.
You are running multiple containers on a host, and one of them is consuming excessive CPU and memory, affecting others. How would you identify and limit its resource usage?
- To identify and limit the resource usage of a container that is consuming excessive CPU and memory, first check the real-time statistics of all Docker containers using: `docker stats`
- Limit the Container’s Resource Usage: Once you’ve identified the container, you can limit its CPU and memory usage using Docker’s resource limitation options.
- Limit CPU usage: `docker run --cpus=".5" <image-name>`
- Limit memory usage: `docker run -m 512m <image-name>`
- Update a running container: `docker update --cpus 1 --memory 1g <container-id>`
Write a script to clean up unused Docker images, containers, and volumes.
#!/bin/bash
echo "Starting Docker cleanup..."
# 1. Remove stopped containers
echo "Removing stopped containers..."
docker container prune -f
# 2. Remove dangling images (untagged images)
echo "Removing dangling images..."
docker image prune -f
# 3. Remove unused images (not associated with any containers)
echo "Removing unused images..."
docker image prune -a -f
# 4. Remove unused volumes
echo "Removing unused volumes..."
docker volume prune -f
# 5. Remove unused networks
echo "Removing unused networks..."
docker network prune -f
echo "Docker cleanup completed!"
What do you monitor in Grafana in your project?
- CPU metrics (CPU usage, CPU load, CPU saturation)
- Memory metrics (memory usage, swap usage, memory pressure)
- Disk metrics (disk usage, disk I/O, disk utilization)
- Network metrics (network traffic, errors, latency, utilization)
- Database metrics (query performance, throughput, query errors)
- Node metrics (CPU, memory, and disk usage)
- Pod metrics (pod status, resource usage, pod age)
We created a user on an EC2 instance. How do we log in with that user without a password?
- Generate a new SSH key: `ssh-keygen -t rsa -b 2048 -f newuser-key`
- This will create `newuser-key` (private key) and `newuser-key.pub` (public key).
- Copy the public key to the EC2 instance: `ssh-copy-id -i newuser-key.pub newuser@your-ec2-instance-ip`
- Use the new private key to log in: `ssh -i newuser-key newuser@your-ec2-instance-ip`
I have a log file containing all logs, but I want to see only the error logs. How do I do it?
- If your error logs contain specific keywords such as "ERROR", you can filter them using `grep`: `grep "ERROR" logfile.log`
- If you want more flexibility, use `awk` to filter logs containing the word "ERROR": `awk '/ERROR/' logfile.log`
What happens if kube-proxy is deleted in Kubernetes?
- `kube-proxy` is responsible for managing the rules that route traffic to the appropriate Pods based on Service definitions.
- Without `kube-proxy`, Kubernetes cannot forward traffic from Services to the Pods backing those Services.
- If external clients or users access the cluster through a Service (like a LoadBalancer or NodePort Service), that traffic will no longer reach the intended Pods.
- Pods using Services (via cluster IPs) to communicate with other Pods will experience network failures.
- Kubernetes DNS (via `kube-dns` or `CoreDNS`) relies on Service networking. Without `kube-proxy`, DNS queries may fail because the DNS Service cannot route traffic.
Is kube-proxy a Service?
No, `kube-proxy` is not a Service in Kubernetes. It is a networking component that runs as a process (typically as a DaemonSet) on every node in the cluster. Its primary role is to manage network rules and facilitate communication between Pods and Services.
kubelet belongs to which namespace?
The `kubelet` does not belong to any Kubernetes namespace. It is not a Kubernetes object or resource that resides within a namespace. Instead, the `kubelet` is a node-level agent that runs on each worker node in the cluster, responsible for managing Pods and container runtimes on that node.
How to declare an env variable in Linux?
- Use the `export` command to declare an environment variable:
export VARIABLE_NAME=value
export MY_VAR="Hello, World!"
How to print an env variable in Linux?
echo $VARIABLE_NAME
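Putting the two together in one session:

```shell
# Declare a variable for the current shell session
export MY_VAR="Hello, World!"

# Print it
echo "$MY_VAR"

# Other ways to inspect environment variables
printenv MY_VAR
env | grep '^MY_VAR='

# To persist it across sessions, append the export line to ~/.bashrc
# (assumption: an interactive bash shell)
```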
What might be the reason if a node is not getting Ready?
- The `kubelet` service on the node is not running or is encountering errors.
- The node cannot communicate with the control plane or other nodes.
- Insufficient CPU, memory, or disk space.
- The CNI plugin (e.g., Calico, Flannel) is not installed or configured correctly.
- The node cannot authenticate with the control plane due to expired or missing certificates.
- The node is tainted and cannot schedule pods.
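Typical first commands when a node shows NotReady (these require a live cluster; `<node-name>` is a placeholder):

```shell
# Conditions (MemoryPressure, DiskPressure, Ready) and recent events
kubectl describe node <node-name>

# On the node itself: is the kubelet running, and what is it logging?
systemctl status kubelet
journalctl -u kubelet --since "10 min ago"

# Taints that might block scheduling
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
```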
S3 lifecycle policy
Amazon S3 (Simple Storage Service) lifecycle policies allow you to automatically manage your objects’ lifecycle within a bucket. These policies can help you optimize storage costs and manage data by automatically transitioning objects between storage classes or deleting them after a certain period.
- Lifecycle Rules: Define the criteria for the objects affected and the actions to take (e.g., transitioning objects to a cheaper storage class or deleting them).
- Transitions: moving an object from one storage class to another (e.g., from `S3 Standard` to `S3 Infrequent Access`).
- Expiration: automatically deleting objects after a specified period.
- Transitions and expirations can be based on object age and tags.
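A sketch of a lifecycle configuration; the prefix and day counts are assumptions:

```json
{
  "Rules": [
    {
      "ID": "archive-then-expire-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

This can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json`.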
What is the difference between taints/tolerations and node affinity?
Taints and Tolerations: Use a taint to ensure that only critical pods can be scheduled on a node that requires special access, and use a toleration in those pods to allow scheduling on that node.
Node Affinity: Use node affinity to ensure that a pod runs only in specific zones or on nodes with certain hardware capabilities (e.g., GPU nodes for ML workloads).
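A sketch combining the two; the node name, label key, and images are assumptions:

```yaml
# First taint the node: kubectl taint nodes gpu-node-1 dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: ml-job
spec:
  # Toleration lets this pod land on the tainted GPU node
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  # Node affinity requires a node labeled as having a GPU
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hardware
                operator: In
                values: ["gpu"]
  containers:
    - name: trainer
      image: my-ml-image:latest   # hypothetical image
```

The taint keeps other pods off the node; the affinity keeps this pod on the right nodes. The two are complementary.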
I'm getting an authentication failure error while logging in to the application. How do I find out which microservice is responsible, and how do I troubleshoot it?
Review Logs and Monitoring Tools: Check Logs: Review the logs for each microservice to identify any errors related to authentication. Look for keywords such as “auth”, “login”, “token”, “unauthorized”, or “failed authentication”.
Verify Authentication Microservice: Authentication Service: Identify the microservice responsible for handling authentication and user login (e.g., an Identity Provider (IDP) service, an OAuth server, or a custom authentication microservice).
Examine Service Communication: API Gateway or Ingress Controller Logs: If you are using an API Gateway or an Ingress controller, check its logs for any failed or denied requests that could indicate an issue with routing or service-to-service communication.
Configuration Files: Service Configuration: review the configuration files (e.g., `application.yaml`, `config.json`, or `env` variables) of the microservices to confirm that authentication endpoints and keys are correctly configured.
My machine's storage is showing 95% full. How do I troubleshoot this issue without increasing the storage capacity?
- Find Large Files: use the `find` command to locate large files.
- Clean Temporary Files: remove unnecessary temporary files.
- Clear Browser Cache: clear cache and cookies from your web browsers.
- Uninstall Unnecessary Software: remove software that you no longer use.
- Check for Duplicates: look for and remove duplicate files.
- Use Disk Cleanup Tools: use built-in tools or third-party applications to clean up disk space.
- Only as a last resort, increase the storage capacity.
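For example, to demonstrate locating large files (the 5 MB file below is created only for illustration):

```shell
# Overall usage per filesystem
df -h

# Create a 5 MB demo file so find has something to report
dd if=/dev/zero of=/tmp/bigfile.bin bs=1M count=5 2>/dev/null

# Files larger than 1 MB under /tmp, with sizes
find /tmp -type f -size +1M -exec ls -lh {} \; 2>/dev/null

# Largest entries under /tmp, biggest first
du -ah /tmp 2>/dev/null | sort -rh | head -5
```

In practice you would point `find` and `du` at `/` (or the filesystem that is full) rather than `/tmp`.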
You are unable to SSH into an EC2 instance in a public subnet. What could be the issue?
- Security Group Settings: Ensure the security group associated with the EC2 instance allows inbound traffic on port 22 (SSH) from your IP address or the range of IPs you intend to connect from. Double-check the inbound rules for the security group.
- Route Table Configuration: Ensure the route table associated with the public subnet has a route to an internet gateway (IGW). The instance will need this to communicate with the outside world.
- Instance Configuration: Verify that the EC2 instance has the correct SSH key pair attached. Ensure that the private key you’re using matches the public key added when launching the instance.
How will you troubleshoot a pod which is running out of memory, and how will you mitigate it?
- Check `kubectl describe pod` for OOMKilled events and `kubectl top pod` for actual usage; then raise the memory request/limit or fix the leak (e.g., by analyzing a heap dump).
How to troubleshoot a CrashLoopBackOff error?
- Check Resource Utilisation: run `kubectl top pods` and `kubectl top nodes` to ensure the resource limits are not exceeded.
- Check Pod Logs and describe the pod: inspect logs for immediate issues and describe the pod.
- Verify ConfigMaps and Secrets: Check if ConfigMaps and Secrets are correctly mounted and accessible
- Validate Persistent Volumes: Confirm the volume is correctly bound and accessible
- Examine Monitoring Tools: Kibana: Check for logs aggregated from the application. Prometheus and Grafana: Look for metrics to identify resource bottlenecks or unusual behaviour, such as high CPU/memory usage.
- Check Resource requests and limits and Network policies and port configurations
- Test Node Health: Ensure the node where the pod is scheduled is healthy
What is an API gateway?
In simple terms, it accepts client requests and routes them to the correct backend service based on the API endpoint. It handles tasks such as request routing, authentication, rate limiting, and load balancing, simplifying the development and management of APIs.
What if I want to deploy my application on an instance in a private subnet? How can the customer access it?
Deploying an application in a private subnet improves security by restricting direct internet access. To enable customer access while maintaining security, you can utilize several methods:
Using a Load Balancer
Public-facing Load Balancer: Deploy an Application Load Balancer (ALB) or Network Load Balancer (NLB) in a public subnet.
Routing Traffic: The Load Balancer routes incoming traffic to the instances in the private subnet.
Security Groups: Configure security groups to allow traffic from the Load Balancer to the private instances.
What if I want to deploy my application to an instance in a private subnet using Jenkins? How can the customer access it?
- The Jenkins server needs to be in the same VPC as the EC2 instance.
- Generate an SSH key pair on the Jenkins server using `ssh-keygen`.
- Add the public SSH key to the authorized_keys file on the private EC2 instance. This allows the Jenkins server to SSH into the EC2 instance.
- In Jenkins, go to Manage Jenkins > Manage Credentials and add the SSH private key. Use an ID you can reference in the Jenkinsfile, such as `private-ec2-ssh-key`.
pipeline {
    agent any
    environment {
        EC2_IP   = 'your-private-ec2-ip'
        EC2_USER = 'ec2-user'
        SSH_KEY  = credentials('private-ec2-ssh-key') // resolves to the private key file path
    }
    stages {
        stage('Clone Repository') {
            steps {
                git 'https://github.com/your-repository.git'
            }
        }
        stage('Install Dependencies') {
            steps {
                sh 'npm install'
            }
        }
        stage('Run Tests') {
            steps {
                sh 'npm test'
            }
        }
        stage('Build Application') {
            steps {
                sh 'npm run build'
            }
        }
        stage('Deploy to Private EC2') {
            steps {
                script {
                    // Create a tarball of the build
                    sh 'tar -czvf build.tar.gz build/'
                    // Copy the tarball to the EC2 instance (skip host-key prompt for non-interactive use)
                    sh "scp -o StrictHostKeyChecking=no -i ${SSH_KEY} build.tar.gz ${EC2_USER}@${EC2_IP}:/home/${EC2_USER}/"
                    // Extract the build into the app directory and install production dependencies there
                    sh "ssh -o StrictHostKeyChecking=no -i ${SSH_KEY} ${EC2_USER}@${EC2_IP} 'tar -xzvf /home/${EC2_USER}/build.tar.gz -C /var/www/your-app && cd /var/www/your-app && npm install --production'"
                }
            }
        }
    }
    post {
        success {
            echo 'Deployment successful!'
        }
        failure {
            echo 'Deployment failed.'
        }
    }
}
Security group vs NACL
Security Groups:
- Stateful: If you allow an incoming request, the response is automatically allowed. Similarly, if you allow an outbound request, the response is allowed back in.
- Instance-Level: Security groups are applied directly to individual EC2 instances.
- Allow Rules Only: Security groups allow you to specify only allow rules. There are no explicit deny rules.
- Default Behavior: By default, all inbound traffic is denied and all outbound traffic is allowed.
- Dynamic Changes: Any changes to security groups are automatically applied to the associated instances.
Network ACLs (NACLs):
- Stateless: NACLs require explicit rules for both inbound and outbound traffic. If you allow an incoming request, you must also allow the corresponding outbound response.
- Subnet-Level: NACLs are applied at the subnet level, affecting all instances within that subnet.
- Allow and Deny Rules: NACLs allow you to specify both allow and deny rules.
- Default Behavior: The default NACL allows all inbound and outbound traffic; a newly created custom NACL denies all traffic until rules are added.
- Sequential Evaluation: NACLs evaluate rules in numerical order, starting with the lowest numbered rule.
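For example, because security groups are allow-only and stateful, exposing HTTPS on an instance takes a single ingress rule (the group ID and CIDR below are placeholders):

```shell
# Allow inbound HTTPS to the instance's security group; the response
# traffic is permitted automatically because security groups are stateful
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 443 \
  --cidr 0.0.0.0/0
```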
What are common errors in K8s ?
Common Kubernetes Errors
- CrashLoopBackOff
  - Causes:
    - Incorrect environment variable values.
    - Application code has runtime errors.
    - OOMKilled (Out of Memory) due to insufficient memory allocation.
    - Missing dependencies or incorrect configurations.
  - Solution:
    - Check logs with `kubectl logs <pod-name>`.
    - Use `kubectl describe pod <pod-name>` to look for events or misconfigurations.
- ImagePullBackOff
  - Causes:
    - The image does not exist.
    - Incorrect image name or tag.
    - Incorrect Docker registry credentials.
  - Solution:
    - Verify the image name and tag in the YAML file.
    - Ensure the image is present in the specified registry.
    - Use `kubectl describe pod <pod-name>` to check events for authentication issues.
- Node Not Ready
  - Causes:
    - Worker nodes can’t connect to the control plane (master) node.
    - Kubelet is not running or has errors.
    - Insufficient disk space on the node.
    - Network or DNS configuration issues.
  - Solution:
    - Use `kubectl get nodes` to check node status.
    - Check connectivity between nodes with `ping`.
    - Restart the kubelet with `systemctl restart kubelet`.
    - Clear disk space or check the file system for errors.
- Pending Pods
  - Causes:
    - Insufficient resources (CPU/memory) on nodes.
    - Pod’s `nodeSelector` or taints/tolerations mismatch.
    - No matching storage volume claims.
  - Solution:
    - Use `kubectl describe pod <pod-name>` to check events.
    - Ensure sufficient resources are available.
    - Check the `nodeSelector` and update it if necessary.
- Forbidden Errors
  - Causes:
    - RBAC (Role-Based Access Control) permissions are not properly set for the user or service account.
  - Solution:
    - Check permissions with `kubectl auth can-i`.
    - Update roles or role bindings to grant the required permissions.
- OOMKilled
  - Causes:
    - The application exceeds the memory limit set in the container.
  - Solution:
    - Increase the resource limits (`resources.limits.memory`) in the Pod specification.
    - Optimize the application’s memory usage.
- Evicted Pods
  - Causes:
    - Insufficient disk space.
    - Node under memory or CPU pressure.
  - Solution:
    - Free up disk space or allocate more resources to the node.
    - Check node conditions with `kubectl describe node <node-name>`.
- PVC Binding Errors
  - Causes:
    - Persistent Volume Claims (PVCs) can’t find matching Persistent Volumes (PVs).
  - Solution:
    - Verify the PVC and PV configurations.
    - Ensure the storage class and access modes match.
- ContainerCreating Timeout
  - Causes:
    - Network issues preventing image downloads.
    - Storage volumes not attached.
  - Solution:
    - Check container logs and events for errors.
    - Validate network connectivity and volume configurations.
- Unauthorized Errors
  - Causes:
    - Authentication or token issues with the Kubernetes API server.
  - Solution:
    - Check the kubeconfig file and credentials.
    - Update the service account or authentication token.
- DNS Resolution Errors
  - Causes:
    - The CoreDNS pod is not running or is misconfigured.
  - Solution:
    - Check the CoreDNS pod status with `kubectl get pods -n kube-system`.
    - Inspect CoreDNS logs for errors.
- Terminating Pods Not Deleted
  - Causes:
    - Finalizers on the Pod are not properly handled.
  - Solution:
    - Use `kubectl delete pod <pod-name> --force --grace-period=0` to force deletion.
Common Errors in Jenkins and Their Solutions
1. Slowness Issue
- Causes:
- Insufficient memory allocated to the Jenkins master or agents.
- Heavy load due to too many jobs running simultaneously.
- High CPU usage or disk I/O issues.
- Solution:
- Monitor memory and CPU usage using Grafana or other monitoring tools.
- Increase memory allocation by modifying the `-Xms` and `-Xmx` parameters in `JAVA_OPTS`.
- Distribute jobs across multiple agents or optimize job scheduling.
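As a sketch of raising the heap settings, assuming a systemd-managed Jenkins on Linux (paths and unit names vary by installation method):

```shell
# Create a systemd override for the Jenkins service (path is an assumption;
# adjust for your install) and set larger -Xms/-Xmx heap values
sudo mkdir -p /etc/systemd/system/jenkins.service.d
sudo tee /etc/systemd/system/jenkins.service.d/override.conf <<'EOF'
[Service]
Environment="JAVA_OPTS=-Xms1g -Xmx4g"
EOF

# Reload unit files and restart Jenkins so the new heap settings take effect
sudo systemctl daemon-reload
sudo systemctl restart jenkins
```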
2. Authentication Issues (401, 403)
- Causes:
- Invalid or expired credentials for accessing Jenkins or integrated services.
- Insufficient permissions for the user or service account.
- Solution:
- Verify and update credentials in the Jenkins credentials store.
- Check user permissions in Manage Jenkins > Manage Users or relevant access control settings.
3. Jenkins CrashLoopBackOff Error
- Causes:
- Out of memory (OOM) issues with the Jenkins master.
- Corrupted Jenkins configuration or plugins.
- Disk space exhaustion.
- Solution:
- Check logs for errors using `kubectl logs <pod-name>` (if running in Kubernetes).
- Allocate more memory or storage to the Jenkins instance.
- Restore from a backup if configurations are corrupted.
4. Server Errors (500, 502, 503)
- Causes:
- Downstream services like SonarQube, Artifactory, or SCM servers are unavailable.
- Jenkins master or agents are overloaded.
- Solution:
- Verify the status of external services (e.g., SonarQube, Nexus).
- Restart affected services or Jenkins.
- Check network connectivity and dependencies.
5. Compilation Errors
- Causes:
- Errors in the source code during build steps (e.g., `mvn compile` or `npm run build`).
- Solution:
- Review the error logs in the Jenkins console output.
- Fix issues in the source code and retry the pipeline.
6. Tools Not Found
- Causes:
- Missing or improperly configured tools like Maven, Java, or Node.js.
- Incorrect PATH variable in the environment.
- Solution:
- Install required tools on the Jenkins agent.
- Verify tool paths in Jenkins under Manage Jenkins > Global Tool Configuration.
7. Pipeline Timeout
- Causes:
- A stage in the pipeline exceeds its timeout.
- Solution:
- Increase the timeout for stages using `timeout(time: <value>, unit: 'MINUTES')`.
- Debug and optimize the pipeline step causing the delay.
8. Permission Denied
- Causes:
- Jenkins agent user lacks the necessary permissions for file or directory access.
- Solution:
- Update file or directory permissions for the Jenkins user.
- Use `sudo` or adjust ownership with `chown` as required.
9. Git Clone or SCM Errors
- Causes:
- Incorrect repository URL.
- Invalid credentials for accessing the repository.
- Network issues or SCM downtime.
- Solution:
- Verify the repository URL and credentials.
- Use the correct branch or tag in the pipeline configuration.
- Ensure the SCM server is up and accessible.
10. Artifacts Upload/Download Failure
- Causes:
- Jenkins cannot upload artifacts to storage or download dependencies.
- Network issues or incorrect storage configuration.
- Solution:
- Check artifact storage configurations.
- Verify the credentials for storage services (e.g., S3, Nexus).
- Retry after fixing network issues.
11. Job Fails on Node
- Causes:
- The agent node is offline or lacks required tools.
- Insufficient resources (memory, CPU) on the agent.
- Solution:
- Ensure the agent node is online and properly configured.
- Allocate more resources or use a more capable node.
12. Workspace Errors
- Causes:
- Workspace directory is locked or has corrupted files.
- Solution:
- Clean the workspace with the “Wipe out workspace” option.
- Use `rm -rf <workspace-dir>` on the agent node if manual cleanup is needed.
Explain the concept of shift-left in DevOps and how it enables teams to detect and fix issues earlier in the development cycle.
Shift-left is a practice of moving tasks like testing, security checks, and code quality assessments earlier in the development lifecycle. It helps teams identify and fix issues sooner, reducing costs and time-to-market. By integrating automated testing, static analysis, and security scans into CI/CD pipelines, developers receive immediate feedback, ensuring better collaboration and higher-quality software. This proactive approach minimizes late-stage bugs and streamlines delivery.
What are the components of AWS DevOps ?
- AWS CodePipeline
- AWS CodeCommit
- AWS CodeBuild
- AWS CodeDeploy
- AWS CloudFormation
- AWS Elastic Beanstalk
- Amazon EC2 and Auto Scaling
- AWS Lambda
- Amazon Elastic Container Service (ECS) and EKS
- AWS Systems Manager
- Amazon CloudWatch
- AWS X-Ray
- AWS IAM (Identity and Access Management)
- AWS Artifact and AWS Key Management Service (KMS)
Difference between webserver and application server ?
- Webserver: Serves static content such as HTML, CSS, JavaScript, and images to the client’s browser.
- Application Server: Serves dynamic content by running applications and processing logic on the server side.
I updated a config map, how to run our application with latest configmap ?
After updating a ConfigMap in Kubernetes, you need to make sure that your application (which uses the ConfigMap) picks up the latest changes. Kubernetes does not automatically restart running pods when a ConfigMap is updated, but there are a few approaches you can use to ensure the application uses the latest ConfigMap:
- Restart the workload with `kubectl rollout restart deployment <deployment-name>` so the new pods mount the updated ConfigMap.
- Add a checksum/hash of the ConfigMap as a pod template annotation (common with Helm), so any change to the ConfigMap triggers a rolling update.
- If the ConfigMap is mounted as a volume (not as environment variables), the mounted files are refreshed automatically after a short delay, provided the application re-reads them.
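A minimal sketch of the rollout-restart approach (the manifest file name, Deployment name `my-app`, and namespace are placeholders for illustration):

```shell
# Apply the updated ConfigMap
kubectl apply -f my-configmap.yaml

# Trigger a rolling restart so new pods pick up the updated ConfigMap
kubectl rollout restart deployment my-app -n default

# Watch the rollout until the new pods are ready
kubectl rollout status deployment my-app -n default
```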
Prometheus Architecture
- Prometheus Server: Scrapes metrics, stores them as time-series data, and supports querying via PromQL.
- Exporters: Expose metrics from systems/apps (e.g., Node Exporter for host metrics).
- Pushgateway: Allows batch jobs to push metrics (optional).
- Alertmanager: Manages alerts and routes them to channels like Slack, email, etc.
Grafana Architecture
- Data Source: Connects to Prometheus (or other sources) to fetch data using queries.
- Dashboard and Visualization: Provides an interactive interface to create custom dashboards.
- Visualizes Prometheus metrics in graphs, tables, and other formats.
- Alerting: Grafana can set up alerts based on Prometheus data and notify users.
Can I auto-scale the EKS nodes at a particular time ?
Yes, you can autoscale EKS nodes at a particular time using scheduled scaling in Amazon EC2 Auto Scaling Groups (ASGs). EKS nodes are typically managed as part of an ASG, and you can use scheduled actions to scale them up or down based on specific times.
- Identify the Node Group ASG: In the AWS Management Console, navigate to Auto Scaling Groups.
- Create a Scheduled Scaling Action: In the ASG settings, choose Scheduled Actions.
- Click Create Scheduled Action and provide the details: a name, the recurrence (a cron expression such as `0 8 * * *`), the start time, and the minimum, maximum, and desired capacity to apply at that time.
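The same scheduled actions can be created from the CLI; the ASG name, times, and capacities below are placeholders:

```shell
# Scale the EKS node group's ASG up to 5 nodes every weekday at 08:00 UTC
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-eks-nodegroup-asg \
  --scheduled-action-name scale-up-morning \
  --recurrence "0 8 * * MON-FRI" \
  --min-size 2 --max-size 6 --desired-capacity 5

# Scale back down in the evening
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-eks-nodegroup-asg \
  --scheduled-action-name scale-down-evening \
  --recurrence "0 20 * * MON-FRI" \
  --min-size 1 --max-size 6 --desired-capacity 1
```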
What is active-active and active-passive deployment ?
- 1. Active-Active Deployment: All systems are active and operational. Multiple nodes or instances are online at the same time, sharing the workload simultaneously.
- 2. Active-Passive Deployment: One system is active and the other is passive (standby). Only the active node serves traffic, while the passive node remains idle or in sync as a backup, ready to take over on failure.
You need to expose an application securely over HTTPS using an ingress controller. How would you configure an ingress resource with SSL termination? What considerations would you make for managing SSL certificates in Kubernetes?
- Ingress Controller: Ensure an ingress controller like NGINX is installed in your cluster.
- SSL Certificate: Obtain an SSL certificate (self-signed or from a CA like Let’s Encrypt).
- Create a Kubernetes Secret for the SSL Certificate
kubectl create secret tls tls-secret --cert=path/to/tls.crt --key=path/to/tls.key
- Configure the ingress resource to use the TLS secret
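A minimal sketch of such an ingress (the host name, service name, and port are placeholders; it assumes the NGINX ingress controller and the `tls-secret` created above):

```shell
# Apply an ingress that terminates TLS using tls-secret
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  annotations:
    # Redirect plain HTTP to HTTPS at the controller (NGINX-specific annotation)
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: tls-secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-service
                port:
                  number: 80
EOF
```

For certificate management, consider cert-manager to automate issuance and renewal from a CA such as Let's Encrypt instead of creating the secret manually, and keep certificates in Secrets rather than baked into images.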
You have a production database hosted on RDS, and it is experiencing high latency that is impacting application performance. How do you troubleshoot this issue ?
I would first check the RDS CloudWatch metrics and dashboards to identify CPU, memory, and I/O bottlenecks. Then I would analyse the slow query logs to identify inefficient queries and try to optimize them. Based on resource utilization and the timing of the latency, we can consider scaling the RDS instance to a higher tier or changing the storage type, and we can add RDS read replicas to distribute the read load and improve overall performance. For the future, we can set up CloudWatch alarms and policies to catch such issues early.
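A sketch of the first two steps from the CLI (the instance identifier, parameter group name, and time window are placeholders; the slow query log parameter applies to MySQL-family engines):

```shell
# Pull recent CPU utilization for the instance from CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-prod-db \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T06:00:00Z \
  --period 300 --statistics Average

# Enable the slow query log via the instance's parameter group (MySQL engines)
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-prod-db-params \
  --parameters "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate"
```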