edit-sequence data-format lever at sub-50M params? (a small honest data point) #1

@gHashTag

Hi -- thanks for LintSeq / TinyCodeLM (arXiv:2410.02749). Your 150M result is, as far as I can find, the smallest published code LM with a non-zero HumanEval pass@1 (6.1 pretrain, 12.8 with edit-sequence fine-tune), and the edit-sequence reframing is the part that interests me most: a data-FORMAT lever, not just more tokens.

Context for one honest question. I have been probing a deliberately tiny CPU code model (a transformer at 148K and 493K params, VOCAB 263, max-seq 512) on a small C-function corpus. Measured result: compile@1 = 0/4 at BOTH 148K and 493K params, so width is not the lever at this scale; a 4x larger corpus cut val-bpb by 2.8x, but compile@1 stayed at 0. I built a small deterministic calculator that places this config against published floors:

  • param gap to TinyCodeLM-150M: ~304x; token gap: ~1e6x (72K tokens vs 72e9)
  • Chinchilla-optimal (20 tok/param) deficit at 493K params: ~137x
  • Wilson 95% 0-success ceiling: n=4 -> 0.49, n=16 -> 0.19, n=32 -> 0.11
    (so a 0/4 cannot even rule out a true rate up to ~49%)
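
The arithmetic behind these bullets can be reproduced with a minimal Python sketch. This is not the calculator itself: the constant names and the `wilson_upper` helper are illustrative assumptions, the inputs are the figures quoted above, and z = 1.96 is assumed for the 95% interval.

```python
import math

# Figures quoted in the bullets above (names are illustrative, not from the calculator).
PARAMS_PROBE = 493_000            # larger of the two probe models
PARAMS_REF = 150_000_000          # TinyCodeLM-150M
TOKENS_PROBE = 72_000             # probe training tokens
TOKENS_REF = 72_000_000_000       # reference training tokens (72e9)
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb ratio used in the bullet


def wilson_upper(successes: int, trials: int, z: float = 1.96) -> float:
    """Upper bound of the Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre + half) / denom


param_gap = PARAMS_REF / PARAMS_PROBE
token_gap = TOKENS_REF / TOKENS_PROBE
chinchilla_deficit = CHINCHILLA_TOKENS_PER_PARAM * PARAMS_PROBE / TOKENS_PROBE

if __name__ == "__main__":
    print(f"param gap:          ~{param_gap:.0f}x")
    print(f"token gap:          ~{token_gap:.0e}x")
    print(f"Chinchilla deficit: ~{chinchilla_deficit:.0f}x")
    for n in (4, 16, 32):
        print(f"0/{n} Wilson 95% upper bound: {wilson_upper(0, n):.2f}")
```

With zero successes the Wilson upper bound reduces to z^2 / (n + z^2), which is why the ceiling falls slowly as n grows.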

So my non-result is the expected outcome: I am orders of magnitude below your floor on BOTH params and tokens.

The one question I could not answer from the paper: does the edit-sequence data-format advantage (the relative pass@1 gain you see at 150M from re-expressing the same source as error-free incremental diffs) appear to transfer DOWN-scale, or does a token-quantity floor dominate first below, say, ~50M params? I am not asking for support or review -- just whether you have any read on where the format lever stops helping.

No reply needed if this is already covered in an appendix I missed. Thanks for the open weights and the clear write-up.

(Note: your TinyCodeLM is Python/HumanEval; my probe is a separate tiny C model -- I am asking about the down-scale transfer of the edit-sequence idea, not comparing the two directly.)
