Optimize dataloaders for better GPU utilization #1

Open
ComDec wants to merge 3 commits into dev from codex/improve-code-execution-efficiency
@ComDec ComDec commented Sep 11, 2025

Owner

Summary

  • Pre-tokenize prompts in dataset classes to avoid repeated CPU work
  • Enable DataLoader persistent workers and prefetching for smoother GPU feeding
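
The diff itself is not shown on this page, so the following is only a minimal sketch of how the second bullet might be wired up, assuming the loaders are built with `torch.utils.data.DataLoader`. The helper name `build_loader_kwargs` and its default values are hypothetical. It sets `persistent_workers` and `prefetch_factor` only when worker processes are in use, because `DataLoader` raises `ValueError` for `persistent_workers=True` with `num_workers=0`, and recent PyTorch versions do the same for a non-default `prefetch_factor`.

```python
def build_loader_kwargs(batch_size, num_workers, prefetch_factor=2):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper for illustration; not taken from the PR.
    """
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # Keep worker processes alive between epochs instead of respawning them.
        kwargs["persistent_workers"] = True
        # Number of batches each worker loads ahead of the training loop.
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# Single-process loading: neither worker-only option is passed.
debug_kwargs = build_loader_kwargs(batch_size=8, num_workers=0)

# Multi-process loading: both options are enabled.
train_kwargs = build_loader_kwargs(batch_size=8, num_workers=4)
```

The resulting dictionary would be passed as `DataLoader(dataset, **train_kwargs)`. Guarding on `num_workers` keeps a debugging configuration with no workers from failing at loader construction.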

Testing

  • pytest -q (fails: ModuleNotFoundError: No module named 'chemgfn.data.ChemGFN_datamodule')

https://chatgpt.com/codex/tasks/task_e_68c2de9672d483288c6490407cf805ac

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Codex Review: Here are some suggestions.

Reply with @codex fix comments to fix any unresolved comments.

About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".

Comment on lines 238 to 239
def __len__(self):
    return len(self.prompts)

[P0] Buffer dataset length uses undefined attribute

BufferDataPipe.__len__ now returns len(self.prompts) even though __init__ no longer assigns self.prompts after replacing it with pre-tokenized self.encoded_prompts. Any call to len(dataset) (for example when initializing DataLoader) will raise AttributeError: 'BufferDataPipe' object has no attribute 'prompts', breaking training and evaluation regardless of inputs. Consider basing the length on self.encoded_prompts or self.total_size.
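
One way the suggested fix could look is sketched below. This is a simplified stand-in, not the repository's class: the real `BufferDataPipe` constructor is not shown here, and the `tokenize` argument and `toy_tokenize` function are hypothetical. The point is only that `__len__` reads an attribute that `__init__` actually assigns.

```python
class BufferDataPipe:
    """Simplified stand-in for the dataset class discussed in the review."""

    def __init__(self, prompts, tokenize):
        # Pre-tokenize once up front; the raw prompts are not stored.
        self.encoded_prompts = [tokenize(p) for p in prompts]

    def __len__(self):
        # Derive the length from the attribute set in __init__.
        return len(self.encoded_prompts)

    def __getitem__(self, idx):
        return self.encoded_prompts[idx]


def toy_tokenize(text):
    # Hypothetical whitespace tokenizer, used only for this illustration.
    return text.split()


dataset = BufferDataPipe(["C C O", "c1ccccc1"], toy_tokenize)
```

With this change `len(dataset)` no longer touches `self.prompts`, so constructing a loader over the dataset should not hit the `AttributeError` described above.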

Useful? React with 👍 / 👎.

Comment on lines 456 to 457
def __len__(self):
    return len(self.prompts)

[P0] Molecule dataset length uses undefined attribute

MolOptDataPipe.__init__ also drops the self.prompts assignment when pre-tokenizing, but __len__ still returns len(self.prompts). Instantiating the dataset and querying its length will immediately raise AttributeError, so neither training nor validation loaders can function. The length should be derived from self.encoded_prompts/self.molecules instead.
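
A small smoke check along these lines could catch this class of regression before a loader is ever built. It is a sketch only: `check_len_is_usable`, `BrokenPipe`, and `FixedPipe` are hypothetical names, and the two toy classes merely imitate the reported pattern rather than reproduce `MolOptDataPipe`.

```python
def check_len_is_usable(dataset):
    """Return len(dataset), or fail with a clear message if __len__ is broken."""
    try:
        return len(dataset)
    except AttributeError as exc:
        name = type(dataset).__name__
        raise AssertionError(f"{name}.__len__ is broken: {exc}") from exc


class BrokenPipe:
    """Imitates the reported bug: __len__ reads an attribute that is never set."""

    def __init__(self, molecules):
        self.encoded_prompts = list(molecules)

    def __len__(self):
        return len(self.prompts)  # AttributeError at call time


class FixedPipe(BrokenPipe):
    """Same data, but the length comes from an attribute that exists."""

    def __len__(self):
        return len(self.encoded_prompts)


fixed_length = check_len_is_usable(FixedPipe(["CCO", "CC"]))
```

Calling `check_len_is_usable(BrokenPipe(["CCO"]))` raises `AssertionError` naming the class, which is easier to act on in a test report than an `AttributeError` surfacing from inside loader setup.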

Useful? React with 👍 / 👎.
