Position Summary...
What you'll do...
An individual in this position will be expected to perform additional job-related responsibilities and duties as assigned and/or necessary.
Performance and Optimization: Requires knowledge of: Unix/Linux performance tuning and optimization; Java/Node.js/Tomcat/Apache tuning and optimization; eCommerce reliability tuning and optimization; open-source chaos engineering and monitoring/alerting tools (for example, Chaos Monkey, Chaos Mesh, Prometheus, Grafana). To evaluate appropriate reliability models and estimate complex reliability parameters. Designs and develops a reliability program plan for complex retail environments.
Solution Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Minimum Viable Product (MVP); Non-Functional Requirements; Telemetry. To create simple, modular, extensible, and functional designs in adherence to the requirements for multiple products/solutions within a domain.
Understand customer requirements and analyze the gaps between the existing architecture and those requirements. Analyze system performance across the complete product against non-functional requirements such as reliability, operability, performance efficiency, and security.
Create detailed designs using mock screens, pseudocode, and detailed functional logic of the modules for an entire product. Finalize the tech stack for products/systems based on business needs. Review the MVP to uncover risks and check for performance and usability; guide the team during MVP creation. Drive the design of software, production and preproduction environments, and the deployment pipeline to continuously generate telemetry records.
Infrastructure Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Non-Functional Requirements; Security standards, frameworks, and methodologies, such as System Security Plan (SSP) and Security Risk and Compliance Review (SRCR).
To assist in the creation of a simple, modular, extensible, and functional design for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a system based on the business requirements. Convert the high-level design (HLD) into detailed designs for specific modules/components of a product/system. Understand the nuances of designing for disaster recovery. Undertake infrastructure coding automation.
Coding: Requires knowledge of: Coding standards and guidelines; Coding languages (e.g., Java, JavaScript, Python); frameworks (e.g., Spring Boot, Cocoa, Android application framework); platforms (e.g., Microsoft Azure, GCP, Apple iOS); Quality, Safety, and Security (PCI) standards; Emerging tools and technologies; Telemetry.
To create/configure minimalistic code for an entire component/application and ensure the components meet business/technical requirements, non-functional requirements, and low-maintenance, high-availability, and high-scalability needs.
Assist in the selection of appropriate languages (e.g., Java, JavaScript, Python). Take initiative to learn the fundamentals of different coding languages and frameworks that would be useful for the future scope of work. Build scripts to automate repetitive and routine tasks in CI/CD (Continuous Integration/Continuous Delivery), testing, or any other process (as applicable).
Independently implement telemetry features as required. Ensure security policy requirements are properly applied to components/applications during code development/configuration.
Triaging and Troubleshooting: Requires knowledge of: Regression testing; Root cause analysis (RCA); Root cause corrective action (RCCA). To guide team members in RCA and RCCA to identify the origins of, and prevent, defects/performance gaps. Analyzes complex problems involving multiple parties, networks, hardware, software, and cloud computing technologies.
Assesses immediate restoration versus root cause based on consequences and resource requirements. Analyzes the issues and plans a series of steps to enhance an application's availability and reliability, potentially including reconfiguration, integration, removal, or the addition of application components. Analyzes trends to proactively prevent incidents and provide historical summary reports.
Disaster Recovery Planning: Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. To coordinate partial and full tests of contingency and disaster recovery plans. Creates and maintains data center contingency documents and action plans. Defines and documents contingency and disaster recovery procedures. Leads the identification of critical functions for assigned area of responsibility. Creates and tests plans for operating in a remote back-up environment. Coordinates the day-to-day activities of control measures used in recovery plans.
Monitoring and Alerting: Requires knowledge of: Monitoring and alerting tools (Splunk, Prometheus, Grafana); Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic.
To establish metrics to monitor network, software, or system performance. Establishes SLOs/SLAs to determine availability goals of systems/services. Sets alerting priorities by identifying the most important systems based on criticality. Oversees daily system monitoring, including verifying the integrity and availability of all hardware and services, reviewing system and application logs, and verifying the completion of scheduled jobs.
Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Provides proactive updates to executive leadership on potential customer-impacting issues. Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex and company-wide systems.
Writes advanced Splunk queries that join multiple indexes to stitch data together.
Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning.
Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and ensuring diversity awareness.
Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy.
Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost effectiveness; and participating in and supporting community outreach events.