AI Alignment Technical Research

Welcome to my AI Alignment Technical Research repository! This repository represents my active learning and technical research in the field of AI alignment. It covers a broad range of topics that provide essential background before tackling the deeper challenge of aligning AI systems with human values and ethical standards.

Overview

AI alignment is essential for ensuring that AI systems behave in ways that are beneficial to humans and aligned with our goals. This repository explores key technical areas such as adversarial AI, explainable AI, and interpretable machine learning, forming the foundation of my research into human-aligned AI systems.

Structure

The repository is divided into several key areas:

1. Adversarial AI

  • Techniques: FGSM, PGD, C&W, DeepFool, Few Pixel, Patch
  • A detailed exploration of adversarial attack methods and defenses, which are critical for ensuring the robustness and reliability of AI systems; see the FGSM sketch below.
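
For concreteness, here is a minimal FGSM sketch in PyTorch. The pretrained classifier `model`, the labeled batch `(x, y)`, and the epsilon value are illustrative assumptions, not code from this repository:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Fast Gradient Sign Method: one step along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Move each pixel in the direction that increases the loss, then clamp
    # back to the valid [0, 1] image range.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```

PGD can be viewed as this same step applied iteratively, with a projection back into an epsilon-ball around the original input after each step.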

2. Explainable Deep Learning & Human Alignment

  • Techniques: Integrated Gradients, Attention, BERT
  • Investigates methods that help make deep learning models more transparent and interpretable, allowing for better alignment with human reasoning and goals; see the Integrated Gradients sketch below.
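
A minimal Integrated Gradients sketch, assuming a PyTorch classifier `model` that returns logits; the all-zeros (black-image) baseline used here is a common but not universal choice:

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Riemann approximation of the path integral of gradients from baseline to input."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    total_grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between the baseline and the input, then take the
        # gradient of the target logit at that point.
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = model(point)[:, target].sum()
        total_grads += torch.autograd.grad(score, point)[0]
    # Scale the averaged gradients by the input-baseline difference.
    return (x - baseline) * total_grads / steps
```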

3. Explainable NLP

  • Techniques: BERT & LIME
  • Focuses on explainability techniques applied to natural language processing, ensuring that model outputs are understandable to humans; see the LIME sketch below.
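
A sketch of LIME on text using the `lime` package; the `predict_proba` function below is a toy stand-in so the example runs, in place of a real model such as a fine-tuned BERT classifier:

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Toy sentiment "classifier": scores texts by the presence of "good".
    # In practice this would wrap a real model's softmax probabilities.
    return np.array([[0.2, 0.8] if "good" in t.lower() else [0.8, 0.2] for t in texts])

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "The movie was surprisingly good",
    predict_proba,      # callable: list of strings -> (n_samples, n_classes) array
    num_features=6,     # report the top contributing words
)
print(explanation.as_list())  # [(word, local weight), ...]
```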

4. Explainable Techniques

  • Techniques: Partial Dependence Plots (PDP), Individual Conditional Expectation (ICE), Accumulated Local Effects (ALE)
  • A compilation of model-agnostic techniques that aid in understanding and explaining machine learning models; a minimal PDP sketch follows below.
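
Partial dependence, for example, can be computed model-agnostically by sweeping one feature over a grid and averaging predictions. A minimal sketch, assuming a fitted model with a scikit-learn-style `.predict` and a NumPy feature matrix `X`:

```python
import numpy as np

def partial_dependence(model, X, feature, grid_points=20):
    """Average prediction as one feature is forced across a grid of values."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_points)
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                     # fix the feature for every row
        averaged.append(model.predict(X_mod).mean())  # marginalize over the rest
    return grid, np.array(averaged)
```

Keeping the per-row prediction curves instead of averaging them yields ICE plots; ALE instead accumulates local prediction differences, which avoids PDP's tendency to evaluate the model on unrealistic feature combinations.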

5. Interpretable Machine Learning

  • Algorithms: C4.5, Ruleset, TAO, Linear Regression, Logistic Regression, Generalized Additive Models (GAM)
  • A focus on interpretable models, crucial for ensuring that AI systems can be trusted and understood in high-stakes applications; see the decision-tree example below.
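
As a small illustration, a shallow decision tree (scikit-learn's CART here rather than C4.5, though the spirit is the same) can be printed as rules a human can audit directly:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
# export_text renders the fitted tree as a human-readable ruleset.
print(export_text(tree, feature_names=list(iris.feature_names)))
```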

6. Measuring Shared Interest

  • Technique: Grad-CAM
  • This section explores the use of Grad-CAM for aligning deep learning models with human expectations by visualizing model focus areas; a compact sketch follows below.
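
A compact Grad-CAM sketch in PyTorch using hooks; `model` and `target_layer` (typically the last convolutional layer) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class, target_layer):
    """Heatmap of where the model 'looks' when scoring `target_class`."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)[:, target_class].sum()
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    # Weight each activation map by its spatially averaged gradient, sum, ReLU.
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    # Upsample to input resolution so the heatmap can be overlaid on the image.
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
```

Shared-interest measures then compare such a heatmap against human-annotated regions (e.g., via IoU-style overlap) to quantify agreement between model focus and human expectations.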

Contributing

This repository is part of my continuous journey into AI alignment research, and I welcome contributions, feedback, and collaborations from anyone interested in creating human-aligned AI.
