Import domains list #216
    ],
    options={
        'constraints': [models.UniqueConstraint(fields=('domain_source', 'domain'), name='unique_domain_per_source')],
    },
I considered creating two separate tables, but decided to store all domains in one table with a foreign key to the source. That should make lookups easier, and adding a new source only requires updating the config in settings.py.
        assert normalize_domain("example.com") == "example.com"

    def test_normalize_domain_strips_www(self):
        assert normalize_domain("www.youtube.com") == "youtube.com"
All bucket domains currently contain www., so at labeling time we can reuse the same function to strip the www. / m. prefix from them and find a match in the domain list.
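For illustration, here is a minimal sketch of what a prefix-stripping `normalize_domain` could look like. It is not the implementation from this PR: the lowercasing step and the guard that leaves a bare two-label host such as `m.com` alone are assumptions added for the example.

```python
def normalize_domain(domain: str) -> str:
    """Strip one leading 'www.' or 'm.' label from a hostname.

    Sketch only: lowercasing and the "keep at least two labels" guard
    are assumptions, not behavior confirmed by the PR.
    """
    domain = domain.strip().lower()
    for prefix in ("www.", "m."):
        if domain.startswith(prefix):
            remainder = domain[len(prefix):]
            # Only strip when something that still looks like a domain remains,
            # so a host like "m.com" is not reduced to "com".
            if "." in remainder:
                return remainder
    return domain
```

The same function could then be applied to a bucket domain before looking it up in the imported list, so both sides of the comparison are in the same form.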
    "name": "nsfw",
    "bq_table": "oisd.nsfw",
    "bq_source_field": "domain",
    "normalize": False,
I wasn't sure whether we need to run normalization on these, as they are already stripped of www. and there are maybe 5 domains starting with m.
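As a rough sketch of how the per-source `normalize` flag might be honored at import time: the function name `prepare_domains` and the injected `normalizer` argument are hypothetical, and the in-memory de-duplication is an assumption that simply mirrors the `unique_domain_per_source` constraint.

```python
from typing import Callable, Dict, Iterable, List


def prepare_domains(
    raw_domains: Iterable[str],
    source: Dict[str, object],
    normalizer: Callable[[str], str],
) -> List[str]:
    """Return the domains to store for one source, in first-seen order.

    Hypothetical helper: applies `normalizer` only when the source config
    sets "normalize" to True, then drops duplicates so each
    (source, domain) pair is produced once.
    """
    seen = set()
    result = []
    for domain in raw_domains:
        value = normalizer(domain) if source["normalize"] else domain
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result
```

With `"normalize": False` the rows pass through unchanged apart from de-duplication, so flipping the flag later would only require a re-import.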
    params = {"project": settings.BIGQUERY_PROJECT}
    if svc_acct := getattr(settings, "BIGQUERY_SERVICE_ACCOUNT", None):
        params["credentials"] = (
            service_account.Credentials.from_service_account_info(svc_acct)
        )

    client = bigquery.Client(**params)
This feels like it must be shared across lots of commands now and might be a good candidate for a helper function.
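One possible shape for such a helper is sketched below. The names `bigquery_client_params` and `get_bigquery_client` are hypothetical, and passing the settings object and the credentials factory as arguments is an assumption made so the parameter logic can be exercised without the Google libraries installed.

```python
from typing import Any, Callable, Dict


def bigquery_client_params(
    settings_obj: Any,
    credentials_from_info: Callable[[dict], Any],
) -> Dict[str, Any]:
    """Build keyword arguments for bigquery.Client from a settings object.

    Mirrors the snippet under review: always sets the project, and adds
    credentials only when BIGQUERY_SERVICE_ACCOUNT is set and non-empty.
    """
    params = {"project": settings_obj.BIGQUERY_PROJECT}
    if svc_acct := getattr(settings_obj, "BIGQUERY_SERVICE_ACCOUNT", None):
        params["credentials"] = credentials_from_info(svc_acct)
    return params


def get_bigquery_client(settings_obj: Any):
    """Return a BigQuery client configured from the given settings object."""
    # Imported lazily so this module can be loaded (and the function above
    # tested) in environments without google-cloud-bigquery.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    return bigquery.Client(
        **bigquery_client_params(
            settings_obj,
            service_account.Credentials.from_service_account_info,
        )
    )
```

Each management command could then call `get_bigquery_client(settings)` instead of repeating the credential handling.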
Adds the ability to import specified domain lists from BigQuery (BQ).