Job Title: Site Reliability Engineer (SRE)
Location: Austin, TX
Job Type: Full Time
Job Description:
Job Summary
Seasoned Site Reliability Engineer (SRE) with 7+ years of experience supporting complex, large-scale distributed systems. Highly skilled in managing production failures, conducting root cause analysis, and driving effective remediation. Strong communicator with expertise in alerting, monitoring, and release management, complemented by automation proficiency and a keen ability to learn quickly.
This role involves providing 24/7 support as part of the SRE team, ensuring the reliability and performance of mission-critical Java, .NET, and Batch applications deployed across GCP, PCF, and on-premises environments.
Years of experience needed
Candidate experience 7+ Years
Technical Skills:
Expertise in understanding large scale production systems and technologies, for example load balancing, monitoring, distributed systems, microservices, and configuration management.
Should have solid hands-on experience in troubleshooting and fixing application failures, application performance degradation, code issues, cloud platform issues, batch failures, infrastructure failures, DB failures, and network failures.
Hands-on experience in performing Production deployments using CI/CD and exposure to deployment strategies.
Experience in troubleshooting Linux/Unix systems.
Monitor application, service, and batch availability.
Act quickly on application alerts (Performance, Availability) and Batch Job failures.
Perform the required analysis (Code/Log) and escalate to the Engineering team as required.
Initiate and drive the Techlines in case of outages, major incidents, or Batch abends, and ensure Service Restoration in the shortest time possible.
Effectively handle Incident, Problem, Release, and Change management.
Own and deliver the user stories assigned as part of the sprint.
o The user stories range from application code debugging, issue analysis, and code fixes to knowledge base creation, documentation of SOPs, Production Deployments, Pre & Post Patching/Maintenance activities, and Service Requests.
o Build monitoring solutions using APM tools like Splunk, AppDynamics, ThousandEyes, ITRS, AppMetrics, Moogsoft, Kafka, etc.
o Automate day-to-day operational tasks.
o Be part of the Exit reviews to ensure best practices are followed and the right code is deployed to Production systems.
o Provide feedback and recommend improvements that would make the systems more stable.
Strong understanding of Networking Concepts (TCP/IP, SSL/TLS, IPSec, VPN, etc.), Firewalls, and Load Balancers.
Experience in scripting with Shell, PowerShell, or Python.
Strong experience working with any Cloud-based infrastructure (PCF, GCP, AWS, Azure Cloud, or others).