The Steps of Our Approach

Because of the file size limit, we use a single example video, Java Tutorial For Beginners 26 - Polymorphism in Java, to show how to run our approach, psc2code.

For the whole dataset, please refer to the following link on OneDrive.

import os

video_hash = 'GnLtvmeGAWA'

# look up the video name and its playlist's hash from the video hash
video_hash, video_name, video_playlist = get_video_info(video_hash)
print(video_hash, video_name, video_playlist)

video = video_name + '_' + video_hash # the video's name has the format title + '_' + hash
video_mp4_path = os.path.join(video_dir, video_playlist, video+".mp4") # the path of the raw video
  1. Reducing Non-Informative Frames (Related functions are in preprocess.py)
extract_frames(video_mp4_path, os.path.join(images_dir, video))
diff_frames(os.path.join(images_dir, video), thre=0.05, metric="NRMSE")

This step uses ffmpeg to extract frames, then removes non-informative frames based on the dissimilarity between frames (here NRMSE with a threshold of 0.05).

The outputs are stored in the folder Images.
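
The dissimilarity filter can be illustrated with a minimal sketch. It is not the code in preprocess.py: the helper names `nrmse` and `keep_informative`, the normalization by the maximum pixel value, and the choice to compare each frame against the last kept frame are all assumptions made for illustration. Frames are represented as flat lists of grayscale pixel values.

```python
import math


def nrmse(frame_a, frame_b, max_value=255.0):
    """Root-mean-square error between two frames, divided by max_value.

    Normalizing by the maximum pixel value is an assumption; other NRMSE
    definitions normalize by the mean or the Euclidean norm instead.
    """
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    return math.sqrt(mse) / max_value


def keep_informative(frames, thre=0.05):
    """Return the indices of frames that differ enough from the last kept frame."""
    kept = []
    last = None
    for index, frame in enumerate(frames):
        # The first frame is always kept; later frames must exceed the threshold.
        if last is None or nrmse(last, frame) > thre:
            kept.append(index)
            last = frame
    return kept


# Four tiny 2x2 "frames": frames 1 and 3 are near-duplicates of their predecessors.
frames = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [255, 255, 255, 255],
    [255, 255, 255, 250],
]
print(keep_informative(frames, thre=0.05))  # prints [0, 2]
```

With a threshold of 0.05, only frames whose pixel content changes noticeably survive, which is the behavior this step relies on to shrink the number of frames passed to later steps.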

  2. Removing Non-Code and Noisy-Code Frames (Related source code and files are in video_tagging)
predict_video(os.path.join(images_dir, video), model_file="video_tagging/weights.h5")

Because of GitHub's file size limit, we uploaded our trained model weights.h5 to Dropbox.

This step uses a trained model to classify frames as valid or invalid; the results are stored in a file named "predict.txt".

  3. Distinguishing Code versus Non-Code Regions (Related functions are in video.py)
cvideo = CVideo(video)
# detect bounding boxes and store the line and rectangle information in the folder 'Lines'
cvideo.cluster_lines()
cvideo.adjust_lines()
cvideo.detect_rects()
# crop the bounding boxes from the frames into the folder 'Crops'
cvideo.crop_rects()

The information about the detected bounding boxes is stored in the folder Lines, and the cropped frames are stored in the folder Crops.
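
As a rough illustration of the cropping part of this step, the sketch below cuts a rectangular region out of a frame. It does not reproduce `CVideo.crop_rects()`: the function name `crop_rect`, the frame representation (a list of pixel rows), and the `(x, y, width, height)` rectangle layout are assumptions.

```python
def crop_rect(frame, rect):
    """Return the sub-image of `frame` covered by `rect`.

    `frame` is a list of pixel rows; `rect` is (x, y, width, height) with
    (x, y) as the top-left corner. This layout is an assumption.
    """
    x, y, w, h = rect
    if x < 0 or y < 0 or w <= 0 or h <= 0:
        raise ValueError("rect must have a non-negative origin and a positive size")
    # Slicing clips automatically at the right and bottom image borders.
    return [row[x:x + w] for row in frame[y:y + h]]


# A 4x5 frame whose non-zero pixels stand in for the code editor region.
frame = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 3, 0],
    [0, 4, 5, 6, 0],
    [0, 0, 0, 0, 0],
]
code_region = crop_rect(frame, (1, 1, 3, 2))
print(code_region)  # prints [[1, 2, 3], [4, 5, 6]]
```

Keeping only the detected code region means the OCR step that follows does not have to process menus, file trees, or other parts of the screen.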

  4. Correcting Errors in OCRed Source Code (Related source code and files are in OCR)
  • Get OCRed source code from cropped frames.
google_ocr(video_name, video_hash)
  • Correct errors in the OCRed source code.
srt_file = os.path.join(video_dir, video_playlist, video+".srt") # caption file, if it exists
parser = GoogleOCRParser(video, srt_file)
parser.correct_words()

The results are stored in the folder GoogleOCR.
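
To give a feel for what correcting an OCRed token can look like, here is a generic sketch that snaps a token to its closest match in a known vocabulary using Python's standard `difflib`. This is a swapped-in technique for illustration, not the logic of `GoogleOCRParser.correct_words()`; the function name `correct_token`, the vocabulary, and the 0.8 similarity cutoff are hypothetical.

```python
import difflib


def correct_token(token, vocabulary, cutoff=0.8):
    """Replace `token` with its closest vocabulary word, if one is similar enough.

    The vocabulary could come from other frames or from captions; that choice,
    like the cutoff value, is an assumption of this sketch.
    """
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    # Leave the token unchanged when nothing in the vocabulary is close enough.
    return matches[0] if matches else token


vocabulary = ["public", "static", "void", "System", "String"]
ocr_tokens = ["pubIic", "static", "Sytem", "foo"]
corrected = [correct_token(token, vocabulary) for token in ocr_tokens]
print(corrected)  # prints ['public', 'static', 'System', 'foo']
```

A high cutoff keeps the correction conservative: tokens such as identifiers that do not resemble any known word are left as they are.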