Optimize dataloaders for better GPU utilization #1
Codex Review: Here are some suggestions.
Reply with @codex fix comments to fix any unresolved comments.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
    def __len__(self):
        return len(self.prompts)
[P0] Buffer dataset length uses undefined attribute
BufferDataPipe.__len__ now returns len(self.prompts) even though __init__ no longer assigns self.prompts after replacing it with pre-tokenized self.encoded_prompts. Any call to len(dataset) (for example when initializing DataLoader) will raise AttributeError: 'BufferDataPipe' object has no attribute 'prompts', breaking training and evaluation regardless of inputs. Consider basing the length on self.encoded_prompts or self.total_size.
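A minimal sketch of the suggested fix, assuming `encoded_prompts` holds one entry per prompt. The constructor signature and the whitespace tokenizer are stand-ins for illustration, not the PR's actual code:

```python
class BufferDataPipe:
    """Hypothetical stand-in for the PR's dataset; only the length logic matters."""

    def __init__(self, prompts, tokenize=str.split):
        # Pre-tokenize once up front; the raw prompts are intentionally not stored.
        self.encoded_prompts = [tokenize(p) for p in prompts]

    def __len__(self):
        # Derive the length from an attribute that __init__ actually assigns.
        return len(self.encoded_prompts)

    def __getitem__(self, idx):
        return self.encoded_prompts[idx]


dataset = BufferDataPipe(["C C O", "C N"])
print(len(dataset))  # 2
```

Keeping `__len__` tied to the same container that `__getitem__` reads from means the two cannot drift apart the next time an attribute is renamed.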
Useful? React with 👍 / 👎.
    def __len__(self):
        return len(self.prompts)
[P0] Molecule dataset length uses undefined attribute
MolOptDataPipe.__init__ also drops the self.prompts assignment when pre-tokenizing, but __len__ still returns len(self.prompts). Instantiating the dataset and querying its length will immediately raise AttributeError, so neither training nor validation loaders can function. The length should be derived from self.encoded_prompts/self.molecules instead.
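Both P0 findings share one failure mode, so a small smoke check could catch this class of regression before a loader is ever built. The sketch below is hypothetical: `check_len_matches_items` and `BrokenMolPipe` are illustrative names, and the class only mimics the bug described in this comment rather than reproducing the PR's code.

```python
def check_len_matches_items(dataset):
    """Hypothetical smoke check: len() must work and agree with indexing."""
    n = len(dataset)  # raises AttributeError if __len__ reads a missing attribute
    for i in range(n):
        dataset[i]  # every index below len() must be retrievable
    return n


class BrokenMolPipe:
    """Mimics the reported bug: __init__ never assigns self.prompts."""

    def __init__(self, prompts):
        self.encoded_prompts = [p.split() for p in prompts]

    def __len__(self):
        return len(self.prompts)  # the stale attribute flagged in this review

    def __getitem__(self, idx):
        return self.encoded_prompts[idx]


try:
    check_len_matches_items(BrokenMolPipe(["C C O"]))
except AttributeError as err:
    print(f"caught: {err}")
```

Running a check like this against each dataset class in a unit test would surface the error without needing a GPU or a full training run.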
Summary
Testing
pytest -q (fails: ModuleNotFoundError: No module named 'chemgfn.data.ChemGFN_datamodule')
https://chatgpt.com/codex/tasks/task_e_68c2de9672d483288c6490407cf805ac