Senior Site Reliability Engineer / Infrastructure Architect (Principal DevOps Engineer)

97EX

$2.6K-3K (Monthly)

Remote | 3-5 Yrs Exp | Bachelor | Full-time

Remote Details

Open Country: Worldwide

Language Requirements: Chinese

Job Description

Benefits

Employee Recognition and Rewards: Distributed team, No Monitoring System, No Politics at Work
Time Off & Leave: Paid Time Off, Unlimited or Flexible PTO, Government Mandated Leave

Job Responsibilities

1. Cloud Native Architecture Design and Governance:
- Design highly available architectures on AWS and Cloudflare, extending beyond CDN configuration to implement edge logic with Cloudflare Workers and secure access layers using Argo Tunnel/Zero Trust.
- Manage AWS multi-account structures via Organizations, and architect cross-Region networking (Transit Gateway, VPC Peering, VPN) to resolve complex connectivity and latency challenges.
- Enforce Infrastructure as Code (Terraform/Pulumi) across edge rules and underlying resources to minimize manual console operations.

2. Deep Kubernetes Engineering:
- Maintain large-scale EKS or self-managed clusters, including performance tuning and troubleshooting of core components such as etcd, CNI plugins (Cilium/Calico), and CoreDNS.
- Develop Kubernetes Operators/Controllers or kubectl plugins to enhance platform automation based on business requirements.
- Bridge local development and production environments (Docker Compose to Helm/Kustomize) to ensure consistency.

3. Engineering Productivity and Observability:
- Design and maintain complex CI/CD pipelines, integrating code quality analysis (SonarQube), container image security scanning, and automated testing.
- Implement GitOps workflows using ArgoCD or Flux.
- Build a Prometheus-based monitoring system with in-depth runtime (Go/Java) and system-level (eBPF) performance analysis.

4. System-Level Support and Reliability:
- Maintain middleware such as Nginx, Redis, and Kafka, with the ability to debug at the source level and tune parameters.
- Address system bottlenecks under high concurrency (TCP queues, file handles, memory management).

Job Requirements

- Linux Systems Expert: Deep understanding of Linux kernel internals and proficient use of perf, strace, tcpdump, eBPF, and other tools to diagnose CPU, I/O, and network issues in production.
- Cloud and Networking Proficiency: Familiarity with AWS infrastructure limits (API rate limits, EBS IOPS) and Cloudflare fundamentals (Anycast, SSL handshake), with a deep understanding of the TCP/IP stack and HTTP/2/3 protocols.
- Kubernetes Hands-On Experience: In-depth knowledge of cgroups and namespaces, service meshes (Istio/Linkerd), and rapid diagnosis of pod scheduling failures or crashes.
- Development Skills: Proficient in Go or Python, capable of reading open-source code, fixing bugs, and developing backend tools.

Preferred Qualifications

- Contributor to CNCF open source projects.
- Experience maintaining systems handling hundreds of millions of daily requests.
- Hands-on experience implementing chaos engineering in production environments.