Closed
84 commits
c465418
KPMP-5863: raji bulk upload yaml generation script
HaneenT Aug 29, 2025
92fc09b
Merge pull request #137 from KPMP/KPMP-5863_yaml_generation_script
rlreamy Sep 2, 2025
f74b9e4
Update changelog.md
rlreamy Sep 3, 2025
cacb403
Update rebuild.sh
rlreamy Sep 3, 2025
756b923
Update build-libra.yml
rlreamy Sep 3, 2025
a06b99a
Update changelog.md
rlreamy Sep 4, 2025
11ff2a6
Merge pull request #138 from KPMP/KPMP-6188_PostRelease
Dert1129 Sep 4, 2025
cdba63e
KPMP-6216: skip calc checksums for package recall
rlreamy Sep 15, 2025
6b6dac6
KPMP-6216: additional fixes
rlreamy Sep 16, 2025
cd6afb4
Merge pull request #139 from KPMP/KPMP-6216_RecallFix
HaneenT Sep 16, 2025
9fba062
KPMP-6197: Fixes to the bulk uploader to be up to speed on changes to…
rlreamy Sep 22, 2025
8c31a1a
KPMP-5806: add new slide name
rlreamy Sep 22, 2025
86e0579
KPMP-5806: Get the slide names
rlreamy Sep 22, 2025
6bfdf1d
Merge pull request #140 from KPMP/KPMP-5806_GenerateSlideName
Dert1129 Sep 23, 2025
66ad3c4
KPMP-6197: Updated the scripts to work with changes to the app
rlreamy Sep 25, 2025
b267ecd
KPMP-6223: new cols and rename
zwright Sep 25, 2025
b03b3d1
KPMP-6223: typo
zwright Sep 25, 2025
c8649b5
KPMP-6195: calculate the filename and foldername
rlreamy Sep 26, 2025
1f58f58
KPMP-6195: handle issues with file_location and file name determination
rlreamy Sep 26, 2025
44e81e5
KPMP-6223: fix args and extra comma
zwright Sep 26, 2025
e86d558
Merge pull request #141 from KPMP/KPMP-6197_UpdateBulkUpload
zwright Sep 26, 2025
180cb25
Merge pull request #142 from KPMP/KPMP-6223_New_Cols
rlreamy Sep 26, 2025
6b6c1a1
KPMP-6233: fix log errors
HaneenT Sep 29, 2025
5290bd9
"value" keyError
HaneenT Sep 29, 2025
8e7996f
Merge pull request #144 from KPMP/KPMP-6233_fix_redcap_logs
Dert1129 Sep 29, 2025
84f8243
KPMP-6195: change col for sample id
rlreamy Oct 1, 2025
fba4ed4
Delete data_management/services/tests/test_slide_management.py
rlreamy Oct 2, 2025
b6b39f3
Merge pull request #145 from KPMP/KPMP-6195_FileNameAndFolder
Dert1129 Oct 2, 2025
6071ec6
Update data_manager_data.sql
zwright Oct 6, 2025
a3744c0
Update data_manager_data.sql
zwright Oct 6, 2025
8f4904b
Merge pull request #146 from KPMP/update_data_manager_data
Dert1129 Oct 6, 2025
18807e4
alides found missing slides table are marked as error
Dert1129 Oct 8, 2025
57c8322
Merge pull request #147 from KPMP/KPMP-6196_determine-missing-slides
rlreamy Oct 8, 2025
e949c2a
K:PMP-5807: renamed methods to be more accurate
rlreamy Oct 8, 2025
e744917
KPMP-5807: Add some of the error handling
rlreamy Oct 8, 2025
e5ed421
make a more useful error message
Dert1129 Oct 9, 2025
4e1dc9c
KPMP-5807: Finish error handling before rename
rlreamy Oct 9, 2025
63874ba
fill in package_ids that are null in slide_scan_curation
Dert1129 Oct 9, 2025
30e35dd
KPMP-5807: rename files
rlreamy Oct 9, 2025
7446a82
Merge pull request #148 from KPMP/KPMP-6199_fill-in-package-id
rlreamy Oct 9, 2025
e39ecb9
fix update and select statements
Dert1129 Oct 9, 2025
e641d94
Merge remote-tracking branch 'origin/develop' into KPMP-6196_update-w…
Dert1129 Oct 9, 2025
0e78f96
KPMP-5807: Rename some vars to be snake case
rlreamy Oct 10, 2025
00af8cb
Merge branch 'develop' into KPMP-5807_RenameWSIFileOnMove
rlreamy Oct 10, 2025
c91eda1
Merge pull request #149 from KPMP/KPMP-5807_RenameWSIFileOnMove
Dert1129 Oct 10, 2025
0bddcba
swap around the if statement
Dert1129 Oct 10, 2025
373ab7e
Merge pull request #150 from KPMP/KPMP-6196_update-when-a-slide-is-mi…
rlreamy Oct 13, 2025
29a0ac6
run the package_id filler
Dert1129 Oct 13, 2025
f299769
Merge pull request #151 from KPMP/KPMP-6199_execute-package-id-filler
rlreamy Oct 13, 2025
463fe8a
skip redcap ids that are null
Dert1129 Oct 13, 2025
e300caf
Merge pull request #152 from KPMP/KPMP-6199_skip-null-redcap-ids
zwright Oct 13, 2025
8dbd849
do not fetch null redcap_ids
Dert1129 Oct 14, 2025
01a0657
Merge pull request #153 from KPMP/KPMP-6199_do-not-fetch-null-redcap-ids
rlreamy Oct 14, 2025
1ad7cf8
insert slides into the curation table anyways, but mark them as missi…
Dert1129 Oct 15, 2025
3f9824d
shift variable to be stored after insertion
Dert1129 Oct 15, 2025
3eb9954
Merge pull request #154 from KPMP/KPMP-6196_mark-slides-missing
HaneenT Oct 15, 2025
c6a94d9
log the error message to the console
Dert1129 Oct 16, 2025
c348acd
remove where condition
Dert1129 Oct 16, 2025
47613e2
Merge pull request #155 from KPMP/KPMP-6196_stop-inserting-slides
rlreamy Oct 20, 2025
4e31e60
use all()
Dert1129 Oct 23, 2025
51c6309
Merge pull request #156 from KPMP/KPMP-6196_check-for-empty-list
HaneenT Oct 24, 2025
f37534c
invert if statement
Dert1129 Oct 27, 2025
d8c87c5
added a logger to tell the dev when imports are done
Dert1129 Oct 27, 2025
483fe2f
Merge pull request #157 from KPMP/KPMP-6196_invert-if-statement
HaneenT Oct 27, 2025
2aa4f2b
Update dlu_management.py
rlreamy Oct 30, 2025
b81f227
Merge pull request #158 from KPMP/KPMP-6199
Dert1129 Oct 30, 2025
f5c842c
KPMP-6260: Check for missing slides after inserting to ensure we are …
rlreamy Nov 12, 2025
06e0d4b
Merge branch 'develop' into KPMP-6260_UpdateMissingSlides
rlreamy Nov 17, 2025
9d83c38
Merge pull request #159 from KPMP/KPMP-6260_UpdateMissingSlides
Dert1129 Nov 17, 2025
d2d2078
KPMP-5807: Bunch of fixes to get happy path working
rlreamy Dec 4, 2025
a481a8c
Merge branch 'develop' into KPMP-5807_RenameWSIFileOnMove
rlreamy Dec 4, 2025
6a3fa97
Merge pull request #160 from KPMP/KPMP-5807_RenameWSIFileOnMove
Dert1129 Dec 4, 2025
e812218
KPMP-6545: load biopsy tracking long table
zwright Jan 30, 2026
5f350d7
KPMP-6545: biopsy tracking long table SQL
zwright Jan 30, 2026
480b66f
KPMP-6545: truncate correct table
zwright Jan 30, 2026
31fc4ed
KPMP-6545: remove unused results
zwright Jan 30, 2026
d81f086
Merge pull request #161 from KPMP/KPMP-6545_Biopsy_Tracking_Tableau
rlreamy Feb 2, 2026
a5eb85f
KPMP-6545: remove unused results
zwright Feb 3, 2026
d84acc4
KPMP-6545: add tis
zwright Feb 4, 2026
2e48864
KPMP-6545: forgot format string
zwright Feb 4, 2026
6d84115
Merge pull request #162 from KPMP/KPMP-6545_Biopsy_Tracking_Tableau
zwright Feb 4, 2026
7507692
KPMP-6566: try chunks
zwright Mar 23, 2026
a4cef73
KPMP-6566: increase timeout
zwright Mar 25, 2026
e931c91
Merge pull request #164 from KPMP/KPMP-6566_Recall_Issues
Dert1129 Mar 26, 2026
2 changes: 1 addition & 1 deletion .github/workflows/build-libra.yml
@@ -6,7 +6,7 @@ on:
jobs:
docker:
env:
IMAGE_TAG: "1.8.1"
IMAGE_TAG: "1.10"
runs-on: ubuntu-latest
steps:
- name: Get branch names
1 change: 1 addition & 0 deletions .java-version
@@ -0,0 +1 @@
11
15 changes: 14 additions & 1 deletion changelog.md
@@ -1,14 +1,27 @@
# Changelog

## Release 1.10 [Unreleased]
Brief summary:

## Release 1.9 [unreleased]
### Breaking changes

### Other changes

---

## Release 1.9 [Released 9/2/2025]
Brief summary
- Load Tableau database with biopsy_tracker data
- Insert dlu_upload_type to dlu_package_inventory table
- Create recalled packages endpoint
- Update Multi Modal package name
- Tweak bulk upload fields

### Breaking changes
- changed column names in tables
- added new columns to dlu_package_inventory table

---

## Release 1.8.1 [Released 11/8/2024]
Brief summary of what's in this release:
Binary file added data_management/.BulkUploader.swp
Binary file not shown.
25 changes: 25 additions & 0 deletions data_management/BulkUploader
@@ -0,0 +1,25 @@
FROM python:3.10-slim-bullseye

WORKDIR /usr/src/app

ENV FLASK_APP=app.py
ENV FLASK_RUN_HOST=0.0.0.0

RUN apt-get update \
&& apt-get install -y curl

COPY requirements.txt ./

RUN pip3 config --user set global.progress_bar off
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 install -U flask-cors

COPY lib/ ./lib
COPY main.py ./
COPY app.py ./
COPY process_bulk_uploads.py ./
COPY services/ ./services
COPY model/ ./model
COPY .env ./.env

ENTRYPOINT []
3 changes: 3 additions & 0 deletions data_management/DluWatcher
@@ -1,5 +1,7 @@
FROM python:3.10-slim-bullseye

USER root

COPY requirements.txt ./

RUN pip3 install --progress-bar off --no-cache-dir -r requirements.txt
@@ -10,6 +12,7 @@ COPY ./services/dlu_filesystem.py ./services/dlu_filesystem.py
COPY ./services/dlu_package_inventory.py ./services/dlu_package_inventory.py
COPY ./services/dlu_state.py ./services/dlu_state.py
COPY ./services/dlu_management.py ./services/dlu_management.py
COPY ./services/slide_management.py ./services/slide_management.py
COPY ./services/dlu_mongo.py ./services/dlu_mongo.py
COPY ./model ./model
COPY ./watch_files.py ./
2 changes: 1 addition & 1 deletion data_management/Dockerfile
@@ -31,5 +31,5 @@ COPY app.py ./
COPY process_bulk_uploads.py ./
COPY services/ ./services

ENTRYPOINT ["gunicorn", "-b", ":5000", "app:app"]
ENTRYPOINT ["gunicorn", "-b", ":5000", "app:app", "-t", "1200"]

8 changes: 4 additions & 4 deletions data_management/app.py
@@ -82,7 +82,7 @@ def recall_dlu_package(package_id):
return error_msg

dlu_data_directory = '/data/package_' + package_id
directory_info = DirectoryInfo(dlu_data_directory)
directory_info = DirectoryInfo(dlu_data_directory, calculate_checksums = False)
file_list = None
if directory_info.file_count == 0 and directory_info.subdir_count == 0:
error_msg = "Error: package " + package_id + " has no files or top level subdirectory"
@@ -92,9 +92,9 @@ def recall_dlu_package(package_id):
if directory_info.file_count == 0 and directory_info.subdir_count == 1:
contents = "".join(directory_info.dir_contents)
top_level_subdir = package_id + "/" + contents
file_list = dlu_file_handler.match_files(top_level_subdir)
file_list = dlu_file_handler.match_files(top_level_subdir,False)
else:
file_list = dlu_file_handler.match_files(package_id)
file_list = dlu_file_handler.match_files(package_id,False)

dlu_files = []
for file in directory_info.file_details:
@@ -117,4 +117,4 @@ def get_package_status(package_id):
dlu_package_inventory = DLUPackageInventory()
dlu_package_inventory.reconnect()
status = dlu_package_inventory.get_package_status(package_id)
return status[0]["globus_dlu_status"] if len(status) > 0 and status[0]["globus_dlu_status"] is not None else ""
return status[0]["globus_dlu_status"] if len(status) > 0 and status[0]["globus_dlu_status"] is not None else ""
38 changes: 38 additions & 0 deletions data_management/generate_sc_rnaseq_yaml.py
@@ -0,0 +1,38 @@
import os
import yaml
import sys

yamlData = {
    "package_type": "Single-cell RNA-Seq",
    "tis": "Michigan/Broad/Princeton",
    "data_generators": "Rajasree Menon",
    "dataset_description": ""
}
experiments = []

if len(sys.argv) == 1:
    print("Error. Please specify directory: python3 generate_sc_rnaseq_yaml.py /path/to/bulk/upload")
    exit(1)

dir = sys.argv[1]
for root, dirs, files in os.walk(dir):
    if root == dir:
        continue
    sample_id = os.path.split(root)[1]
    experiment = {
        "internal_experiment_id": sample_id,
        "files": []
    }
    for file in files:
        experiment['files'].append({
            'redcap_id': sample_id,
            'spectrack_sample_id': sample_id,
            'relative_file_path_and_name': sample_id + '/' + file,
            'file_metadata': ""
        })
    experiments.append({
        "experiment": experiment
    })
yamlData["experiments"] = experiments
with open(os.path.join(dir, 'bulk-manifest.yaml'), 'w') as file:
    yaml.dump(yamlData, file)
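
Review note: the script above emits one experiment per sample directory. A minimal sketch of the same logic as an importable function is shown here, so it can be exercised without command-line arguments. `build_manifest` is a hypothetical name that does not appear in this PR, and the sketch assumes sample directories sit exactly one level below the upload directory (unlike `os.walk`, it does not descend into nested folders). Writing the result with `yaml.dump`, as the script does, is left out to keep the sketch free of third-party imports.

```python
import os


def build_manifest(upload_dir: str) -> dict:
    """Build a bulk-upload manifest dict from one-level sample directories.

    Hypothetical helper for illustration; not part of the PR.
    """
    manifest = {
        "package_type": "Single-cell RNA-Seq",
        "tis": "Michigan/Broad/Princeton",
        "data_generators": "Rajasree Menon",
        "dataset_description": "",
        "experiments": [],
    }
    # Sort entries so the output is deterministic across filesystems.
    for sample_id in sorted(os.listdir(upload_dir)):
        sample_dir = os.path.join(upload_dir, sample_id)
        if not os.path.isdir(sample_dir):
            continue  # skip loose files such as an existing manifest
        files = [
            {
                "redcap_id": sample_id,
                "spectrack_sample_id": sample_id,
                "relative_file_path_and_name": f"{sample_id}/{name}",
                "file_metadata": "",
            }
            for name in sorted(os.listdir(sample_dir))
            if os.path.isfile(os.path.join(sample_dir, name))
        ]
        manifest["experiments"].append(
            {"experiment": {"internal_experiment_id": sample_id, "files": files}}
        )
    return manifest
```

Calling `build_manifest("/path/to/bulk/upload")` returns a dictionary with the same shape as the `yamlData` structure in the script; sorting the entries means repeated runs over the same directory produce the same manifest.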
25 changes: 19 additions & 6 deletions data_management/lib/mysql_connection.py
@@ -86,7 +86,7 @@ def get_db_connection(self):
self.database.get_warnings = True
return self.database
except Exception as error:
logger.error("Can't connect to MySQL: ", exec_info=error)
logger.exception("Can't connect to MySQL: %s", error)
os.sys.exit()

def get_tableau_db_connection(self):
@@ -102,7 +102,7 @@ def get_tableau_db_connection(self):
self.database.get_warnings = True
return self.database
except Exception as error:
logger.error("Can't connect to MySQL: ", exc_info=error)
logger.exception("Can't connect to MySQL: %s", error)
os.sys.exit()

def insert_data(self, sql, data):
@@ -123,6 +123,20 @@ def insert_data(self, sql, data):
finally:
self.database.commit()
self.cursor.close()

def insert_data_no_alert(self, sql, data):
try:
self.get_db_cursor()
self.cursor.execute(sql, data)
warning = self.cursor.fetchwarnings()
if warning is not None:
print(warning)
except:
message = f"Error: Cannot insert with query: {sql}; and the data: {data}"
logger.error(message)
finally:
self.database.commit()
self.cursor.close()

def get_data(self, sql, query_data=None):
try:
@@ -132,13 +146,12 @@ def get_data(self, sql, query_data=None):
for row in self.cursor:
data.append(row)
return data
except:
message = "Error: Can't get data_management data."
logger.error(message)
except Exception as error:
logger.error(str(error))
requests.post(
slack_url,
headers={'Content-type': 'application/json', },
data='{"text":"' + message + '"}'
data='{"text":"' + "Error: " + str(error) + '"}'
)
finally:
self.cursor.close()
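
Review note: two details in this file's changes are easy to get wrong. Python's `logging` methods treat extra positional arguments as `%`-format arguments, so the message needs a `%s` placeholder for them, and a Slack payload assembled by string concatenation stops being valid JSON as soon as the error text contains a double quote or a newline. The sketch below shows one way to handle both; `build_slack_payload` and `log_connection_failure` are hypothetical helpers, not functions from this repository.

```python
import json
import logging

logger = logging.getLogger("mysql_connection_sketch")


def build_slack_payload(error: Exception) -> str:
    """Serialize the alert text with json.dumps so quotes and newlines are escaped."""
    return json.dumps({"text": "Error: " + str(error)})


def log_connection_failure(error: Exception) -> None:
    """Log the failure with the error interpolated into the message.

    exc_info accepts an exception instance, so the traceback is attached
    even when this is called outside an except block.
    """
    logger.error("Can't connect to MySQL: %s", error, exc_info=error)
```

The string returned by `build_slack_payload` could be passed as the `data=` argument of the existing `requests.post` call in place of the concatenated string.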
4 changes: 4 additions & 0 deletions data_management/main.py
@@ -43,6 +43,9 @@ def upsert_new_spectrack_specimens(self):
def load_biopsy_tracking(self):
return self.tableau.load_biopsy_tracking()

def load_biopsy_tracking_long(self):
return self.tableau.load_biopsy_tracking_long()

def load_data_manager_data(self):
return self.tableau.load_data_manager_data()

@@ -88,6 +91,7 @@ def update_biomarker_tracking_redcap_ids(self):
if args.action == "insert" or args.action == "update":
records_modified = main.load_biopsy_tracking()
records_modified = records_modified + main.load_data_manager_data()
records_modified = records_modified + main.load_biopsy_tracking_long()

if "records_modified" in locals():
logger.info(f"{records_modified} records modified")
Empty file removed data_management/model/__init__.py
Empty file.
2 changes: 2 additions & 0 deletions data_management/model/dlu_package.py
@@ -19,6 +19,7 @@ def __init__(self):
self.dlu_protocol = None
self.dlu_data_generators = None
self.dlu_files = []
self.dlu_upload_type = None
self.submitter_name = None
self.known_specimen = None
self.redcap_id = None
@@ -61,6 +62,7 @@ def get_dmd_dpi_tuple(self):
self.dlu_subject_id,
self.dlu_error,
self.dlu_lfu,
self.dlu_upload_type,
self.globus_dlu_status
)

44 changes: 36 additions & 8 deletions data_management/process_bulk_uploads.py
@@ -27,8 +27,8 @@ class ProcessBulkUploads:
def __init__(self, data_directory: str, globus_only: bool = False, globus_root: str = None, preserve_path: bool = False, bypass_dup_check: bool = False):
try:
self.dlu_management = DluManagement()
except:
logger.error("There was a problem loading the Data Management library.")
except Exception as e:
logger.exception("There was a problem loading the Data Management library: %s", e)
try:
self.submitter = os.environ["mongo_submitter_id"]
self.submitter_name = os.environ["submitter_name"]
@@ -68,20 +68,42 @@ def process_files(self, manifest_files_arr: list) -> list:
logger.info(file_full_path)
size = os.path.getsize(file_full_path)
file_info = self.dlu_file_handler.split_path(file_path, self.preserve_path)
if file["file_metadata"] and "md5_hash" in file["file_metadata"]:
if "file_metadata" in file and "md5_hash" in file["file_metadata"]:
checksum = file["file_metadata"]["md5_hash"]
del file["file_metadata"]["md5_hash"]
else:
checksum = calculate_checksum(file_full_path)
if file["file_metadata"]:
if "file_metadata" in file:
metadata = file["file_metadata"]
else:
metadata = {}
dlu_file = DLUFile(file_info["file_name"], file_info["file_path"], checksum, size, metadata)
dlu_files.append(dlu_file)
return dlu_files

def process_globus_only_files(self, manifest_files_arr: list) -> list:
logger.info("globus only file processing")
files = []
for file in manifest_files_arr:
file_path = file["relative_file_path_and_name"]
file_full_path = os.path.join(self.data_directory, file_path)
file_info = self.dlu_file_handler.split_path(file_path, self.preserve_path)
if "file_metadata" in file and "md5_hash" in file["file_metadata"]:
checksum = file["file_metadata"]["md5_hash"]
del file["file_metadata"]["md5_hash"]
if "file_metadata" in file:
metadata = file["file_metadata"]
else:
metadata = {}

# Since this is going directly to globus, we don't need to calc checksum or filesize, and we need
# the path to the file on disk to actually copy it
dlu_file = DLUFile(file_info["file_name"], file_full_path, '', 0, metadata)
files.append(dlu_file)
return files

def process_bulk_uploads(self):
logger.info("in process bulk uploads")
for manifest_name in MANIFEST_FILE_NAMES:
manifest_file_path = os.path.join(self.data_directory, manifest_name)
if os.path.isfile(manifest_file_path):
@@ -93,13 +115,14 @@ def process_bulk_uploads(self):
manifest_data = yaml.safe_load(stream)
if manifest_data["package_type"] == "EM Images":
package_type = PackageType.ELECTRON_MICROSCOPY
elif manifest_data["package_type"] == "Segmentation Masks":
elif manifest_data["package_type"] == "Segmentation Masks & Pathomics Vectors":
package_type = PackageType.SEGMENTATION
elif manifest_data["package_type"] == "Multimodal Images":
package_type = PackageType.MULTI_MODAL
elif manifest_data["package_type"] == "Single-cell RNA-Seq":
package_type = PackageType.SINGLE_CELL
else:
logger.info("package type is: %s", manifest_data["package_type"])
package_type = PackageType.OTHER
if "tis" in manifest_data:
tis = manifest_data["tis"]
@@ -111,20 +134,24 @@ def process_bulk_uploads(self):
redcap_id = experiment["files"][0]["redcap_id"]
sample_id = experiment["files"][0]["spectrack_sample_id"]
if redcap_id and redcap_id.startswith("S-"):
logger.info("found redcap id starting with S-")
sample_id = redcap_id
redcap_results = self.dlu_management.get_redcapid_by_subjectid(sample_id)
if redcap_results is not None and len(redcap_results) == 1:
if redcap_results is not None and len(redcap_results) > 1:
redcap_id = redcap_results
else:
redcap_id = ""
redcap_id = None

if not sample_id:
sample_id = redcap_id

if (sample_id and len(self.dlu_management.get_participant_by_redcap_id(redcap_id)) > 0) or \
(self.globus_only and sample_id):
logger.info(f"Trying to add package for {redcap_id} / {sample_id}")
dlu_file_list = self.process_files(experiment["files"])
if self.globus_only:
dlu_file_list = self.process_globus_only_files(experiment["files"])
else:
dlu_file_list = self.process_files(experiment["files"])
if package_type == PackageType.SEGMENTATION:
dlu_file_list.append(self.get_single_file(SEGMENTATION_README))
tis = "UFL"
@@ -150,6 +177,7 @@ def process_bulk_uploads(self):
package.dlu_version = 4
package.dlu_dataset_information_version = 1
package.dlu_error = 0
package.dlu_upload_type = 'KPMP Biopsy'
if self.globus_only:
package.globus_dlu_status = None
else:
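
Review note: `process_files` and `process_globus_only_files` both test `"file_metadata" in file` and then `"md5_hash" in file["file_metadata"]`. That guards against a missing key, but a manifest entry whose `file_metadata` is present and empty in YAML loads as `None`, and `"md5_hash" in None` raises `TypeError`. An entry with `file_metadata: ""` (which is what `generate_sc_rnaseq_yaml.py` writes) also leaves `metadata` as an empty string rather than a dict. A minimal sketch of a shared helper that normalises all three cases follows; `split_file_metadata` is a hypothetical name, and unlike the PR code it works on a copy instead of deleting `md5_hash` from the manifest entry in place.

```python
from typing import Optional, Tuple


def split_file_metadata(file_entry: dict) -> Tuple[Optional[str], dict]:
    """Return (md5 checksum or None, remaining metadata) for one manifest file entry.

    Hypothetical helper for illustration; not part of the PR.
    """
    raw = file_entry.get("file_metadata")
    # A missing key, None, or "" all normalise to an empty dict.
    metadata = dict(raw) if isinstance(raw, dict) else {}
    # Remove the checksum from the copy so it is not stored twice.
    checksum = metadata.pop("md5_hash", None)
    return checksum, metadata
```

With a helper along these lines, the caller can decide what to do when the checksum is `None`: compute it (as `process_files` does) or skip it (as the Globus-only path does).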
2 changes: 1 addition & 1 deletion data_management/rebuild.sh
@@ -1,2 +1,2 @@
python3 setup.py install --user
docker build -t kingstonduo/data-management:1.8.1 .
docker build -t kingstonduo/data-management:1.10 .