Import domains list #216

Open
ksy36 wants to merge 1 commit into main from import_domains

Collaborator

ksy36 commented May 21, 2026

Adds the ability to import the specified domain lists from BigQuery (BQ).

ksy36 force-pushed the import_domains branch from 723f199 to 564c27d on May 21, 2026 20:57
],
options={
'constraints': [models.UniqueConstraint(fields=('domain_source', 'domain'), name='unique_domain_per_source')],
},
Collaborator Author

I considered creating two separate tables, but decided to store all domains in a single table with a foreign key to the source. This should make lookups easier, and when we need a new source we can just update the config in settings.py.
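
For illustration, here is a minimal sketch of the single-table layout, using plain SQL through the standard-library sqlite3 module instead of the PR's Django models. Only the constraint name unique_domain_per_source and the field pair (domain_source, domain) come from the diff; the table names, the other columns, the helper functions and the "other_list" source are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# One table for sources, one table for all domains (table names are assumed).
conn.executescript(
    """
    CREATE TABLE domain_source (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE domain (
        id INTEGER PRIMARY KEY,
        domain_source_id INTEGER NOT NULL REFERENCES domain_source(id),
        domain TEXT NOT NULL,
        CONSTRAINT unique_domain_per_source UNIQUE (domain_source_id, domain)
    );
    """
)


def add_source(name):
    """Insert a source row and return its id (hypothetical helper)."""
    return conn.execute(
        "INSERT INTO domain_source (name) VALUES (?)", (name,)
    ).lastrowid


def add_domain(source_id, domain):
    """Insert one domain for one source (hypothetical helper)."""
    conn.execute(
        "INSERT INTO domain (domain_source_id, domain) VALUES (?, ?)",
        (source_id, domain),
    )


def sources_for(domain):
    """Return the names of every source that lists the given domain."""
    rows = conn.execute(
        "SELECT s.name FROM domain d "
        "JOIN domain_source s ON s.id = d.domain_source_id "
        "WHERE d.domain = ? ORDER BY s.name",
        (domain,),
    )
    return [row[0] for row in rows]


nsfw_id = add_source("nsfw")
other_id = add_source("other_list")  # placeholder second source
add_domain(nsfw_id, "example.com")
add_domain(other_id, "example.com")  # same domain, different source: allowed

# Importing the same domain twice for the same source is rejected.
try:
    add_domain(nsfw_id, "example.com")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With this shape, a lookup is a single query on one table regardless of how many sources exist, and adding a source only adds a row to the source table.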

Comment thread tests/test_utils.py
        assert normalize_domain("example.com") == "example.com"

    def test_normalize_domain_strips_www(self):
        assert normalize_domain("www.youtube.com") == "youtube.com"
Collaborator Author

All bucket domains currently contain www., so at labeling time we can reuse the same function to strip the www. / m. prefix from them and find a match in the domain list.
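
A rough sketch of that reuse follows. The body of normalize_domain is not shown in this thread, so the version below is an assumption that only mirrors the two test assertions and the www. / m. prefixes mentioned here; the guard that leaves a bare name such as m.com untouched and the find_label_match helper are also hypothetical.

```python
# Hypothetical sketch: the PR's real normalize_domain may differ.
PREFIXES = ("www.", "m.")


def normalize_domain(domain: str) -> str:
    """Strip one leading "www." or "m." label, if present."""
    for prefix in PREFIXES:
        rest = domain[len(prefix):]
        # Assumed guard: only strip when a dotted name remains,
        # so a registrable domain like "m.com" is kept as-is.
        if domain.startswith(prefix) and "." in rest:
            return rest
    return domain


def find_label_match(bucket_domain: str, domain_list: set[str]) -> bool:
    """Normalize a bucket domain, then check it against the imported list."""
    return normalize_domain(bucket_domain) in domain_list


imported = {"youtube.com", "example.com"}  # stand-in for the imported domains
matches = find_label_match("www.youtube.com", imported)
```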

Comment thread server/server/settings.py
"name": "nsfw",
"bq_table": "oisd.nsfw",
"bq_source_field": "domain",
"normalize": False,
Collaborator Author

I wasn't sure whether we need to run normalization on these: they are already stripped of www., and there are maybe 5 domains starting with m.
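
To make the effect of the flag concrete, here is a small sketch of how a per-source "normalize" switch could be applied during import. The four keys of the "nsfw" entry are taken from the diff; the setting name DOMAIN_SOURCES, the second entry, the prepare_domains helper and the www.-only normalizer used in the example are assumptions.

```python
# Hypothetical setting name; only the "nsfw" entry's keys come from the diff.
DOMAIN_SOURCES = [
    {
        "name": "nsfw",
        "bq_table": "oisd.nsfw",
        "bq_source_field": "domain",
        "normalize": False,
    },
    {
        "name": "example_source",  # placeholder entry
        "bq_table": "example_dataset.example_table",
        "bq_source_field": "host",
        "normalize": True,
    },
]


def prepare_domains(source, raw_values, normalizer):
    """Return sorted, de-duplicated domains, normalizing only if configured."""
    if source["normalize"]:
        raw_values = (normalizer(value) for value in raw_values)
    return sorted(set(raw_values))


def strip_www(domain):
    """Stand-in normalizer for this example only."""
    return domain.removeprefix("www.")


rows = ["a.example", "www.b.example", "a.example"]
as_is = prepare_domains(DOMAIN_SOURCES[0], rows, strip_www)
normalized = prepare_domains(DOMAIN_SOURCES[1], rows, strip_www)
```

Leaving "normalize" off keeps the list exactly as published, which seems reasonable if only a handful of entries start with m.; turning it on later would be a config-only change.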

ksy36 marked this pull request as ready for review May 21, 2026 21:20
ksy36 requested a review from jgraham May 21, 2026 21:20
Comment on lines +60 to +66
params = {"project": settings.BIGQUERY_PROJECT}
if svc_acct := getattr(settings, "BIGQUERY_SERVICE_ACCOUNT", None):
    params["credentials"] = (
        service_account.Credentials.from_service_account_info(svc_acct)
    )

client = bigquery.Client(**params)
Collaborator

This feels like it must be shared across lots of commands now and might be a good candidate for a helper function.
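
One possible shape for such a helper is sketched below. The function names bigquery_client_params and get_bigquery_client are hypothetical, as is splitting out the parameter building so it can be tested without Google credentials; the settings attributes and the bigquery / service_account calls are the ones in the snippet under review.

```python
from types import SimpleNamespace


def bigquery_client_params(settings, credentials_from_info):
    """Build the keyword arguments for bigquery.Client from Django settings."""
    params = {"project": settings.BIGQUERY_PROJECT}
    if svc_acct := getattr(settings, "BIGQUERY_SERVICE_ACCOUNT", None):
        params["credentials"] = credentials_from_info(svc_acct)
    return params


def get_bigquery_client(settings):
    """Return a BigQuery client configured from settings (hypothetical helper)."""
    # Imported lazily so modules using this helper import without the
    # Google libraries installed (for example in unit tests).
    from google.cloud import bigquery
    from google.oauth2 import service_account

    params = bigquery_client_params(
        settings, service_account.Credentials.from_service_account_info
    )
    return bigquery.Client(**params)


# Example with stand-in settings and a fake credentials factory.
fake_settings = SimpleNamespace(
    BIGQUERY_PROJECT="example-project",
    BIGQUERY_SERVICE_ACCOUNT={"client_email": "svc@example.test"},
)
with_creds = bigquery_client_params(fake_settings, lambda info: ("creds", info))
without_creds = bigquery_client_params(
    SimpleNamespace(BIGQUERY_PROJECT="example-project"), lambda info: info
)
```

Each management command could then call get_bigquery_client(settings) instead of repeating the credential handling.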
