TinyModel addition #31804

noanabeshima · 2024-07-05T05:50:56Z

Model description

https://github.com/noanabeshima/tiny_model

It's a small language model trained on TinyStories for interpretability with sparse autoencoders and transcoders added. It has no layernorms (this helps with interpretability) which makes it not fit with any existing model architecture in the transformers library. Its architecture is essentially GPT-2's except that it doesn't have layernorms and it has untied embed/deembed.

Open source status

The model implementation is available
The model weights are available

Provide useful links for the implementation

The implementation is here:
https://github.com/noanabeshima/tiny_model/blob/main/tiny_model/lm.py

The weights are here:
https://huggingface.co/noanabeshima/tiny_model/blob/main/tiny_model.pt

The default config corresponding to the weights is:

    d_model=768,
    n_layers=4,
    n_heads=16,
    max_seq_len=256,
    vocab_size=10_000

I am the author.

The text was updated successfully, but these errors were encountered:

LysandreJik · 2024-07-05T05:57:26Z

It would be quite nice to add this using the new model adder that @ArthurZucker has contributed; @ArthurZucker, when back from leave (next week), do you mind sharing with @noanabeshima how to get this done the best way?

ArthurZucker · 2024-07-16T05:40:58Z

Hey! sorry for the delay! Yep, my recommendation is to use the #30868 tool to isolate the changes as much as possible 🤗

vishwas-sharma2480 · 2024-07-18T10:26:14Z

hi @ArthurZucker I am new to open-source contribution and I would like to contribute to add this new model to transformer library can you please provide to any reference or previous PRs that were similar to this

ArthurZucker · 2024-07-26T16:00:34Z

#29622 or #31659 are quite similar, there is also https://huggingface.co/docs/transformers/en/add_new_model which should help!

geetu040 · 2024-10-04T03:56:19Z

FYI @LysandreJik @ArthurZucker

Hi, I have been trying to work on this issue and have created this model architecture from the source code: noanabeshima/tinymodel

Before proceeding further, I would like to clarify if this is the way forward and we want to add this model

Because the model implementation looks really simple and straighforward. Not sure if there are models as simple as this in the library.
Although the dataset this model has been experimented on comes from TinyStories, yet there is no literature backing for the model architechture specifically. @noanabeshima can confirm this.

See the implementation below

class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.attention_head_size
        self.all_head_size = config.num_attention_heads * config.attention_head_size

        self.Q = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.K = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.V = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.O = nn.Linear(self.all_head_size, self.hidden_size, bias=False) # TODO: Remove bias


    def forward(self, hidden_states):
        # hidden_states.shape (batch_size, seq_len, hidden_size)

        q, k, v = self.Q(hidden_states), self.K(hidden_states), self.V(hidden_states)
        # q.shape (batch_size, seq_len, all_head_size)

        q = q.reshape(*q.shape[:-1], self.num_attention_heads, self.attention_head_size)
        k = k.reshape(*k.shape[:-1], self.num_attention_heads, self.attention_head_size)
        v = v.reshape(*v.shape[:-1], self.num_attention_heads, self.attention_head_size)
        # q.shape (batch_size, seq_len, num_attention_heads, num_attention_heads)

        q = q.transpose(-2, -3)
        k = k.transpose(-2, -3)
        v = v.transpose(-2, -3)
        # q.shape (batch_size, num_attention_heads, seq_len, num_attention_heads)

        head_writeouts = F.scaled_dot_product_attention(
            q, k, v,
            is_causal=True,
        )
        # head_writeouts.shape (batch_size, num_attention_heads, seq_len, num_attention_heads)

        head_writeouts = head_writeouts.transpose(-2, -3)
        # head_writeouts.shape (batch_size, seq_len, num_attention_heads, num_attention_heads)

        head_writeouts = head_writeouts.reshape(*head_writeouts.shape[:-2], self.all_head_size)
        # head_writeouts.shape (batch_size, seq_len, all_head_size)

        attn_out = self.O(head_writeouts)
        # attn_out.shape (batch_size, seq_len, hidden_size)

        return attn_out


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size

        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.intermediate_size),
            nn.ReLU(),
            nn.Linear(self.intermediate_size, self.hidden_size),
        )

    def forward(self, hidden_states):
        # hidden_states.shape (batch_size, seq_len, hidden_size)
        return self.mlp(hidden_states)


class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.attention = Attention(config)
        self.mlp = MLP(config)

    def forward(self, hidden_states):
        attention_out = self.attention(hidden_states)
        mlp_out = self.mlp(attention_out)
        return mlp_out

ArthurZucker · 2024-10-06T08:26:39Z

Hey!

Indeed this is the "simplest" model there is! Would be a nice addition IMO!
Not really a problem we can link to a github repo or simply a blog post!

One thing that is however fairly important is to follow the transformers API.
Most important is from_pretrained.

Having a minimal implementation would be nice! I can help you by reviewing the PR for sure!

You need to take a little bit of inspiration from modeling_llama.py but removing the extras.

If your model is super small for example, it makes sense not to have past key values.
I would ask also if the model supports padding or not!
It's comes down to skimming to the essential, which I am very much pro!

🤗 hope we can merge this and have a great example of a TinyModel to set good standards! 🤗

noanabeshima added the New model label Jul 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TinyModel addition #31804

TinyModel addition #31804

noanabeshima commented Jul 5, 2024 •

edited

Loading

LysandreJik commented Jul 5, 2024

ArthurZucker commented Jul 16, 2024

vishwas-sharma2480 commented Jul 18, 2024

ArthurZucker commented Jul 26, 2024

geetu040 commented Oct 4, 2024 •

edited

Loading

ArthurZucker commented Oct 6, 2024

TinyModel addition #31804

TinyModel addition #31804

Comments

noanabeshima commented Jul 5, 2024 • edited Loading

Model description

Open source status

Provide useful links for the implementation

LysandreJik commented Jul 5, 2024

ArthurZucker commented Jul 16, 2024

vishwas-sharma2480 commented Jul 18, 2024

ArthurZucker commented Jul 26, 2024

geetu040 commented Oct 4, 2024 • edited Loading

ArthurZucker commented Oct 6, 2024

noanabeshima commented Jul 5, 2024 •

edited

Loading

geetu040 commented Oct 4, 2024 •

edited

Loading