#409 Migrate BHR collection to Glean data #413
skylarkning wants to merge 4 commits into mozilla:main
Force-pushed from 5b89902 to 76fb64c.
Update bhr_collection.py to read from firefox_desktop_stable.hang_report_v1 instead of telemetry_stable.bhr_v4. Map the Glean client_info and metrics fields into the existing processing shape, and parse the Glean hang report and module object metrics before processing. Refs mozilla#409.
Force-pushed from 76fb64c to 914dde2.
Codecov Report
❌ Patch coverage is
@@            Coverage Diff             @@
##             main     #413      +/-   ##
==========================================
- Coverage   32.63%   32.58%   -0.06%
==========================================
  Files          36       36
  Lines        3858     3864       +6
==========================================
  Hits         1259     1259
- Misses       2599     2605       +6
@BenWu would you be able to review this? Sky is a student worker who just started and he's going to be making some improvements to BHR, so we had him look at migrating the bhr_collection job to use Glean data as a first task. Regarding the coverage, I'd like to get some tests in, but from what we have so far it looks like getting even a basic test of bhr_collection up and running might be a little involved. Does that sound right? In any case, do you have any thoughts on the best route forward, or salient examples for getting some BHR tests up and running, either in this PR or in a follow-up?
For testing, I think the simplest thing to do would be to mock the Spark load() call that gets data from BigQuery and have it return a dataframe with a few sample pings. Then just run that through the rest of the job and verify the output. That might be feasible, but I'm not sure whether Spark complicates it.
Another note: this job currently uses Spark 2.4.8 via Dataproc image 1.5, which is being discontinued in August this year, so it will need to be updated soon. This is out of scope for this PR, but I recommend doing a bit of research to get an idea of what would be required. I'm bringing this up because I'm not sure we (on the data platform side) will be able to get to it before then. The update itself might be an easy AI job, but the testing and validation is something I don't want to commit to right now.
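To make the testing suggestion above concrete, here is a minimal sketch of mocking the BigQuery read. It rests on several assumptions: `load_pings`, `count_hangs_by_thread`, and the sample ping shape are hypothetical stand-ins rather than the actual functions or schema in bhr_collection.py, and plain Python lists replace a real Spark dataframe so that the example runs without a Spark installation.

```python
from unittest.mock import MagicMock


# Hypothetical stand-in for the job's BigQuery read; the real function
# name and connector options in bhr_collection.py may differ.
def load_pings(spark, table, billing_project):
    return (
        spark.read.format("bigquery")
        .option("table", table)
        .option("parentProject", billing_project)
        .load()
    )


# Hypothetical stand-in for "the rest of the job".
def count_hangs_by_thread(pings):
    counts = {}
    for ping in pings:
        for hang in ping["hangs"]:
            counts[hang["thread"]] = counts.get(hang["thread"], 0) + 1
    return counts


def make_fake_spark(sample_pings):
    # The reader returns itself from format() and option() so the
    # chained calls work, and load() returns the canned sample data.
    reader = MagicMock()
    reader.format.return_value = reader
    reader.option.return_value = reader
    reader.load.return_value = sample_pings
    spark = MagicMock()
    spark.read = reader
    return spark


# Made-up sample pings; the real Glean hang report schema is richer.
SAMPLE_PINGS = [
    {"hangs": [{"thread": "Gecko", "duration": 200.0}]},
    {
        "hangs": [
            {"thread": "Gecko", "duration": 150.0},
            {"thread": "Compositor", "duration": 300.0},
        ]
    },
]

spark = make_fake_spark(SAMPLE_PINGS)
pings = load_pings(spark, "firefox_desktop_stable.hang_report_v1", "mozdata")
result = count_hangs_by_thread(pings)
```

A real test would more likely build a small dataframe with a local Spark session, which is where the Spark version question may complicate things.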
GitHub is having issues right now, so I can't comment in the code. Can you make the billing project an input arg for the job? We'll still want to run in certain projects for the scheduled production jobs.
Add a --billing-project CLI option and pass it through the job config to the BigQuery connector. Default to mozdata so local runs keep the current behavior, while scheduled jobs can override the billing project.
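As a rough illustration of that commit message, the option could be wired through like this. The sketch uses `argparse` purely to stay dependency-free; the job's actual CLI library, the `build_reader_options` helper, and the exact way the value reaches the connector are assumptions, not taken from the PR.

```python
import argparse


def parse_args(argv=None):
    # Only the options relevant to this sketch; the real job has more.
    parser = argparse.ArgumentParser(description="BHR collection (sketch)")
    parser.add_argument("--date", required=True)
    parser.add_argument(
        "--billing-project",
        default="mozdata",
        help="BigQuery project billed for the read (default: mozdata)",
    )
    return parser.parse_args(argv)


def build_reader_options(args):
    # Hypothetical helper: collects options handed to the BigQuery
    # connector. "parentProject" is assumed to be the billing option.
    return {"parentProject": args.billing_project}


local_args = parse_args(["--date", "2025-04-25"])
prod_args = parse_args(
    ["--date", "2025-04-25", "--billing-project", "some-prod-project"]
)
local_options = build_reader_options(local_args)
prod_options = build_reader_options(prod_args)
```

Local runs that omit the flag keep billing to mozdata, while a scheduled job can pass its own project (`some-prod-project` here is a placeholder).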
Hi @BenWu, thanks for the feedback and review! I added a --billing-project option. Next I will look into the tests. Thank you!
Summary
This PR refers to Issue #409.
This migrates bhr_collection.py to read BHR hang report data from the Glean table:
moz-fx-data-shared-prod.firefox_desktop_stable.hang_report_v1
instead of the legacy telemetry table:
moz-fx-data-shared-prod.telemetry_stable.bhr_v4.
Changes
- mozdata as the BigQuery billing project
- client_info, ping_info, and metrics from the Glean hang report table
- {frame, module}
- payload/time_since_last_ping normalization path
Local Validation
Ran locally with:
JAVA_HOME=$(/usr/libexec/java_home -v 17) python3 ./mozetl/bhr_collection/bhr_collection.py \
  --date 2025-04-25 \
  --bq-connector-jar=spark-bigquery-latest_2.12.jar \
  --sample-size 0.0002
Results and Comparison
The job completed successfully and wrote:
- output/hangs_main_20250421.json
- output/hangs_main_current.json
The output files are compared against legacy output:
- sampleTable keys match
- The output is non-empty for the same thread/date combinations.
Note: Exact numeric similarity is no longer expected as the data source has changed to Glean, and the new output no longer normalizes by usage hours.
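A key-level comparison like the one described above could be sketched as follows. The PR does not show how the check was actually done, so `compare_keys` is a hypothetical helper, and the two small tables are made-up placeholders rather than real job output. Where `sampleTable` sits inside the JSON files is also left open here.

```python
def compare_keys(new_table, legacy_table):
    # Compare key sets only; values are expected to differ because the
    # data source changed and usage-hours normalization was removed.
    new_keys = set(new_table)
    legacy_keys = set(legacy_table)
    return {
        "only_new": sorted(new_keys - legacy_keys),
        "only_legacy": sorted(legacy_keys - new_keys),
    }


# Made-up placeholder tables, not real output from the job.
new_sample_table = {"stack": [0, 1], "runnable": [2, 3], "thread": [0, 0]}
legacy_sample_table = {"stack": [5], "runnable": [7], "thread": [1]}

diff = compare_keys(new_sample_table, legacy_sample_table)
keys_match = not diff["only_new"] and not diff["only_legacy"]
```

Checking keys instead of values fits the note above: the structure should be stable even though the numbers are not.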