I have over 10 years of experience in Information Technology. I currently work as the Principal Software Reliability Engineer at CLDF, where I lead several projects focused on improving the reliability and cost-effectiveness of digital services. Across all my work, I focus on creating reliable digital products and services that make a positive impact on people's lives.
At CLDF, I led the migration of the existing IT architecture to a cloud-native environment, optimizing resource utilization and achieving a 20% cost reduction while improving scalability. The same project also included the implementation of Site Reliability Engineering (SRE) practices, automated monitoring and alerting systems, and the evolution of incident management processes. This improved compliance with security and governance standards and delivered a 5% increase in system availability. The project was made possible by great tools such as Kubernetes, Jenkins CI, Git, GitOps, Prometheus, Grafana, Thanos, Alertmanager, Terraform and Ansible.
Also at CLDF, I teamed up with DevOps Engineers to establish a reliable corporate logging architecture based on OpenTelemetry, Apache Kafka and Elasticsearch, reducing log loss by 100%.
As a strong believer in vendor-agnostic solutions, I've always prioritized open standards while still using resources from public cloud vendors, balancing flexibility and cost-effectiveness when making architectural decisions.
I also mentor and coach startup founders and entrepreneurs on digital transformation, customer experience, service design, lean startup, and agile methodologies, with the aim of helping them achieve their goals and grow their businesses.
Competencies: Site Reliability Engineering (SRE), Kubernetes, Amazon Web Services, Google Cloud Platform, Golang, Security, CI/CD
My journey in IT has led me to develop a passion for changing people's lives through innovation and technology, and I am eager to keep growing in tech to generate a bigger impact each day.
Skill | Associated Project |
---|---|
Site Reliability Engineering, Observability, Monitoring, Data Engineering, Prometheus, OpenTelemetry, Apache Kafka | https://github.com/pedrocrc/observability-architecture/ |
Observability, Monitoring, Prometheus, Golang, Kubernetes, Containers | https://github.com/pedrocrc/unity2promgo |
Title |
---|
Companies shall no longer ignore SRE |
Reliability management is about embracing the infinite game |