AI Alignment Technical Research

Welcome to my AI Alignment Technical Research repository! This repository represents my active learning and technical research in the field of AI alignment. It covers a broad range of topics that provide essential background before tackling the deeper challenge of aligning AI systems with human values and ethical standards.

Overview

AI alignment is essential for ensuring that AI systems behave in ways that are beneficial to humans and aligned with our goals. This repository explores key technical areas such as adversarial AI, explainable AI, and interpretable machine learning, forming the foundation of my research into human-aligned AI systems.

Structure

The repository is divided into several key areas:

1. Adversarial AI

  • Techniques: FGSM, PGD, C&W, DeepFool, Few Pixel, Patch
  • A detailed exploration of adversarial attack methods and defenses, which are critical for ensuring the robustness and reliability of AI systems; see the FGSM sketch below.
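
For concreteness, here is a minimal FGSM sketch in PyTorch. The pretrained classifier `model`, the labeled batch `(x, y)`, and the epsilon value are illustrative assumptions, not code from this repository:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Fast Gradient Sign Method: one step along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Move each pixel in the direction that increases the loss, then clamp
    # back to the valid [0, 1] image range.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```

PGD can be viewed as this same step applied iteratively, with a projection back into an epsilon-ball around the original input after each step.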

2. Explainable Deep Learning & Human Alignment

  • Techniques: Integrated Gradients, Attention, BERT
  • Investigates methods that help make deep learning models more transparent and interpretable, allowing for better alignment with human reasoning and goals; see the Integrated Gradients sketch below.
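
A minimal Integrated Gradients sketch, assuming a PyTorch classifier `model` that returns logits; the all-zeros (black-image) baseline used here is a common but not universal choice:

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Riemann approximation of the path integral of gradients from baseline to input."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    total_grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between the baseline and the input, then take the
        # gradient of the target logit at that point.
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = model(point)[:, target].sum()
        total_grads += torch.autograd.grad(score, point)[0]
    # Scale the averaged gradients by the input-baseline difference.
    return (x - baseline) * total_grads / steps
```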

3. Explainable NLP

  • Techniques: BERT & LIME
  • Focuses on explainability techniques applied to natural language processing, ensuring that model outputs are understandable to humans; see the LIME sketch below.
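
A sketch of LIME on text using the `lime` package; the `predict_proba` function below is a toy stand-in so the example runs, in place of a real model such as a fine-tuned BERT classifier:

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Toy sentiment "classifier": scores texts by the presence of "good".
    # In practice this would wrap a real model's softmax probabilities.
    return np.array([[0.2, 0.8] if "good" in t.lower() else [0.8, 0.2] for t in texts])

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "The movie was surprisingly good",
    predict_proba,      # callable: list of strings -> (n_samples, n_classes) array
    num_features=6,     # report the top contributing words
)
print(explanation.as_list())  # [(word, local weight), ...]
```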

4. Explainable Techniques

  • Techniques: Partial Dependence Plots (PDP), Individual Conditional Expectation (ICE), Accumulated Local Effects (ALE)
  • A compilation of model-agnostic techniques that aid in understanding and explaining machine learning models; a minimal PDP sketch follows below.
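
Partial dependence, for example, can be computed model-agnostically by sweeping one feature over a grid and averaging predictions. A minimal sketch, assuming a fitted model with a scikit-learn-style `.predict` and a NumPy feature matrix `X`:

```python
import numpy as np

def partial_dependence(model, X, feature, grid_points=20):
    """Average prediction as one feature is forced across a grid of values."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_points)
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                     # fix the feature for every row
        averaged.append(model.predict(X_mod).mean())  # marginalize over the rest
    return grid, np.array(averaged)
```

Keeping the per-row prediction curves instead of averaging them yields ICE plots; ALE instead accumulates local prediction differences, which avoids PDP's tendency to evaluate the model on unrealistic feature combinations.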

5. Interpretable Machine Learning

  • Algorithms: C4.5, Ruleset, TAO, Linear Regression, Logistic Regression, Generalized Additive Models (GAM)
  • A focus on interpretable models, crucial for ensuring that AI systems can be trusted and understood in high-stakes applications; see the decision-tree example below.
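
As a small illustration, a shallow decision tree (scikit-learn's CART here rather than C4.5, though the spirit is the same) can be printed as rules a human can audit directly:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
# export_text renders the fitted tree as a human-readable ruleset.
print(export_text(tree, feature_names=list(iris.feature_names)))
```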

6. Measuring Shared Interest

  • Technique: Grad-CAM
  • This section explores the use of Grad-CAM for aligning deep learning models with human expectations by visualizing model focus areas; a compact sketch follows below.
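
A compact Grad-CAM sketch in PyTorch using hooks; `model` and `target_layer` (typically the last convolutional layer) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class, target_layer):
    """Heatmap of where the model 'looks' when scoring `target_class`."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)[:, target_class].sum()
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    # Weight each activation map by its spatially averaged gradient, sum, ReLU.
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    # Upsample to input resolution so the heatmap can be overlaid on the image.
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
```

Shared-interest measures then compare such a heatmap against human-annotated regions (e.g., via IoU-style overlap) to quantify agreement between model focus and human expectations.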

Contributing

This repository is part of my continuous journey into AI alignment research, and I welcome contributions, feedback, and collaborations from anyone interested in creating human-aligned AI.
