Selectively include developer.arm.com content #95
Open
apickard wants to merge 7 commits into
…results as embeddings
…tors.py with common functions in generate_common.py
…te-chunks.py into generate_common.py.
Pull request overview
Adds a new discovery pipeline for selectively including developer.arm.com content by querying Arm’s search API, filtering results, and integrating the discovered sources into the existing embedding-generation workflow by factoring shared logic into a common module.
Changes:
- Introduces generate_common.py to share source tracking, a retryable HTTP session, and chunk save/tracking utilities between scripts.
- Adds generate-vectors.py to discover/filter developer.arm.com search results and register them into the sources CSV.
- Expands dependencies and data inputs (adds playwright; appends new Arm Developer entries to vector-db-sources.csv) and updates tests/fixtures accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| embedding-generation/vector-db-sources.csv | Adds new Arm Developer SME-related sources to the ingestion list. |
| embedding-generation/tests/test_generate_chunks.py | Updates tests to target functionality moved into generate_common.py. |
| embedding-generation/tests/conftest.py | Adds a fixture/module loader for generate_common.py with state reset. |
| embedding-generation/requirements.txt | Adds playwright dependency for browser-based capture of search requests. |
| embedding-generation/generate-vectors.py | New script to capture/replay Arm search API results and register relevant sources. |
| embedding-generation/generate-chunks.py | Refactors to import shared utilities from generate_common.py. |
| embedding-generation/generate_common.py | New shared module containing retry session, source tracking, and chunk persistence logic. |
| embedding-generation/Dockerfile | Copies generate_common.py into the build context. |
Comment on lines +253 to +264

    except Exception as err:
        print(f"Other error occurred: {err}")
        with open('info/errors.csv', 'a', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow([url, str(err)])
        return None
    except Exception as err:
        print(f"Other error occurred: {err}")
        with open('info/errors.csv', 'a', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow([url,str(err)])
        return False
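
Both handlers in the excerpt above append a row to info/errors.csv using identical code. A minimal sketch of factoring that into one helper follows. The name `log_error` and its `errors_file` parameter are hypothetical and not part of the PR, and creating the parent directory is an added assumption so the append does not fail when info/ is missing.

```python
import csv
import os


def log_error(url, err, errors_file="info/errors.csv"):
    """Append one (url, error message) row to the errors CSV.

    Hypothetical helper: the name and signature are illustrative only.
    """
    errors_dir = os.path.dirname(errors_file)
    if errors_dir:
        # Assumption: create the parent directory if it is missing.
        os.makedirs(errors_dir, exist_ok=True)
    with open(errors_file, "a", newline="") as csvfile:
        csv.writer(csvfile).writerow([url, str(err)])
```

Each handler could then call `log_error(url, err)` before returning its own value.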
Comment on lines +353 to +356

    # Overwrite csv with new info
    with open(details_file, mode='w', newline='') as file:
        csv_writer = csv.writer(file, delimiter=',')
        csv_writer.writerows(new_rows)
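
Opening details_file with mode='w' truncates it before the new rows are written, so an interruption mid-write could leave a partial file. One possible hardening, sketched here under assumptions, is to write to a temporary file in the same directory and swap it in with `os.replace`. The helper name `write_rows_atomically` is hypothetical and not part of the PR.

```python
import csv
import os
import tempfile


def write_rows_atomically(details_file, new_rows):
    """Write rows to a temp file, then swap it in for details_file.

    Hypothetical helper. os.replace makes the final swap atomic when the
    temp file lives on the same filesystem as the target.
    """
    target_dir = os.path.dirname(details_file) or "."
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, mode="w", newline="") as file:
            csv.writer(file, delimiter=",").writerows(new_rows)
        os.replace(tmp_path, details_file)
    except BaseException:
        # Leave the original file untouched and clean up the partial copy.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```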
Comment on lines +202 to +204

    response = http_session.get(url, timeout=60)
    soup = BeautifulSoup(response.text, 'html.parser')
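
The excerpt above parses `response.text` whatever the HTTP status was, so an error page would be parsed like real content. A hedged sketch of checking the status first follows. `fetch_html` is a hypothetical name, and `http_session` is assumed to behave like a `requests.Session`, whose responses provide `raise_for_status()`. Parsing is left to the caller so the sketch needs no third-party imports.

```python
def fetch_html(http_session, url, timeout=60):
    """Return the page body, raising if the server answered with 4xx/5xx.

    Hypothetical helper: http_session is assumed to be requests.Session-like.
    """
    response = http_session.get(url, timeout=timeout)
    # requests.Response.raise_for_status() raises HTTPError on 4xx/5xx.
    response.raise_for_status()
    return response.text
```

The caller would then pass the returned text to `BeautifulSoup(..., 'html.parser')` as before.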
Comment on lines +280 to +282

    keywords = list(set( [searchterm] +
        [key for key_list in (page["keywords"] or []) for key in key_list.split(sep="|")] +
        [key for key_list in (page["products"] or []) for key in key_list.split(sep="|")[2:]]))
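
The expression above merges the search term, every "|"-separated keyword, and every product path segment after the first two into one de-duplicated list. The same logic can be sketched as a small function that is easier to test. `build_keywords` is a hypothetical name, and the use of `.get()` (tolerating a missing key) and `sorted()` (stable output order) are assumptions that go slightly beyond the original behavior.

```python
def build_keywords(searchterm, page):
    """Collect de-duplicated keywords for one search result.

    Hypothetical helper; assumes page["keywords"] and page["products"] are
    lists of "|"-separated strings, or None, or absent.
    """
    keywords = {searchterm}
    for key_list in page.get("keywords") or []:
        keywords.update(key_list.split("|"))
    for key_list in page.get("products") or []:
        # Skip the first two path segments, as the original slice [2:] does.
        keywords.update(key_list.split("|")[2:])
    # Sort so the order is stable between runs, unlike list(set(...)).
    return sorted(keywords)
```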
Comment on lines +311 to +318

    # 0) Initialize files
    os.makedirs(yaml_dir, exist_ok=True) # create if doesn't exist
    details_dir = os.path.dirname(details_file)
    if details_dir:
        os.makedirs(details_dir, exist_ok=True)
    for filename in os.listdir(yaml_dir):
        if filename.startswith('chunk_') and filename.endswith('.yaml'):
            os.remove(os.path.join(yaml_dir, filename))
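
The initialization step above creates the output directories and then deletes any leftover chunk_*.yaml files. An equivalent sketch using `pathlib` follows. The function name `reset_output_dirs` and its return value (the names of the removed files) are hypothetical additions for illustration.

```python
from pathlib import Path


def reset_output_dirs(yaml_dir, details_file):
    """Create the output directories and delete stale chunk_*.yaml files.

    Hypothetical helper; returns the names of the files it removed.
    """
    yaml_path = Path(yaml_dir)
    yaml_path.mkdir(parents=True, exist_ok=True)
    # For a bare filename the parent is ".", where mkdir is a harmless no-op.
    Path(details_file).parent.mkdir(parents=True, exist_ok=True)

    removed = []
    for chunk_file in sorted(yaml_path.glob("chunk_*.yaml")):
        if chunk_file.is_file():
            chunk_file.unlink()
            removed.append(chunk_file.name)
    return removed
```

Returning the removed names makes the destructive step visible to the caller, for example for logging before a long discovery run.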
Comment on lines +323 to +328

    # 0) Obtain full database information:
    # a) Learning Paths & Install Guides
    if not skip_discovery:
        # Developer.Arm.Com
        createDeveloperArmComChunks(emit_chunks=False)

    sentence-transformers>=5.4
    pypdf
    rank-bm25
    playwright
Comment on lines +218 to +222

    def item_is_relevant(item) -> bool:
        if not item.get("url"):
            return False
        match item["type"]:
            case "Guide":

        print("Found "+str(len(all_rows))+" results")
        return all_rows


    def processDeveloperArmCom(url, title, type, keywords, emit_chunks=True):
Uses the developer.arm.com search API to retrieve search results, filters them down to the relevant ones, and then generates embeddings from them. Implemented as a separate script, generate-vectors.py, which takes the same command line as generate-chunks.py (one argument: the vector CSV file). Functions common to both generate-chunks.py and generate-vectors.py have been moved into generate_common.py.
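
The single-argument command line described above could be parsed along these lines. This is only a sketch: the argument name `sources_csv`, the help text, and the use of `argparse` are assumptions, not taken from the PR.

```python
import argparse


def parse_args(argv=None):
    """Parse the one positional argument: the vector sources CSV file.

    The argument name `sources_csv` is an assumption for illustration.
    """
    parser = argparse.ArgumentParser(
        description="Discover developer.arm.com content and register it as sources."
    )
    parser.add_argument("sources_csv", help="path to the vector sources CSV file")
    return parser.parse_args(argv)
```

Under these assumptions an invocation would look like `python generate-vectors.py vector-db-sources.csv`.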