Technical specs for importing files.
These are the assumptions and preconditions that the rest of the document is based on.
The rest of this document will use the following sample tables:
batch: batch_id, email_notification (bool), added
media: media_id, name, type, created, modified, added
batch_media: batch_id, media_id, completed (bool)
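As a rough sketch, the sample tables could be created as follows. SQLite is used only so the example is self-contained; the column types, the keys, and the create_schema helper are assumptions, not part of this spec.

```python
import sqlite3

# Hypothetical DDL for the three sample tables. Column types and keys are
# assumptions; booleans are stored as 0/1 integers because SQLite has no
# separate boolean type.
SCHEMA = """
CREATE TABLE batch (
    batch_id           INTEGER PRIMARY KEY,
    email_notification INTEGER NOT NULL DEFAULT 0,
    added              TEXT NOT NULL
);
CREATE TABLE media (
    media_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    type     TEXT NOT NULL,
    created  TEXT,
    modified TEXT,
    added    TEXT NOT NULL
);
CREATE TABLE batch_media (
    batch_id  INTEGER NOT NULL REFERENCES batch(batch_id),
    media_id  INTEGER NOT NULL REFERENCES media(media_id),
    completed INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (batch_id, media_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the batch, media, and batch_media tables on the connection."""
    conn.executescript(SCHEMA)
```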
The web interface should be the user's only interface to the file system. (In the future, Xythos might be another interface to the Media Database, but it too shouldn't be able to access the internal structure of the files. Rather, the user should be able to upload to a drop box, which a script later puts into the appropriate directories.)
This is important, because it lets the Media Database more strongly tie the files into the database. It reduces the risk of files disappearing and broken links. Having the files and their metadata inherently linked to the database makes it easier to keep track of and search through data and metadata.
When a user uploads a series of files, the files should be kept together in a batch, which is recorded in the batch_media table. This allows the user to view his or her images by batch, in the order they were uploaded. A user should be able to delete a batch and all its related images.
The multi-file Flash uploader allows us to check for duplicates. A file is a duplicate if it has the same name, creation date, modification date, and file size as another file. This means that a user can't upload duplicate images in any given batch.
This also means that while users can't upload duplicate images in one session, they technically can upload duplicate images across batches. E.g., they could upload one image one day, and the same image the next day. Generally, this is the accepted behavior of existing media databases (like Flickr). That said, the most common cause of duplicate files is uploading one batch twice (hitting the upload button twice, or reuploading a set of images because the user didn't think the first upload went through). This problem is solved by the use of batches. Because users can see their individual batches, they can simply delete one of the batches to remove all the duplicates.
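The duplicate rule above can be sketched as follows. This is a minimal illustration, not the uploader's actual mechanism: the FileInfo type and the drop_duplicates helper are hypothetical names, and the four fields are assumed to be available for each selected file.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileInfo:
    """The four attributes that define a duplicate in this spec."""
    name: str
    created: datetime
    modified: datetime
    size: int


def drop_duplicates(files):
    """Return the files in their original order, skipping any file whose
    name, creation date, modification date, and size were already seen."""
    seen = set()
    unique = []
    for info in files:
        if info not in seen:  # frozen dataclasses compare and hash by field
            seen.add(info)
            unique.append(info)
    return unique
```

Because the check only looks at the files selected for one batch, the same file uploaded in a later batch is not rejected, which matches the behavior described above.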
In the file system, files should be named with a database id. The filenames can be kept in the database. This means we can have a folder structure similar to this:
/username/thumbs/15
/username/standard/15
The file extension technically isn't necessary, since we have the MIME type in the database.
The benefits of this approach are that we don't have to worry about filename collisions (Image 1.jpg) and that we have a consistent file structure we can rely on.
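A small helper can build these paths from the database id. This is only a sketch: the media_path name and the fixed set of variants (thumbs, standard) are assumptions taken from the example structure above.

```python
from pathlib import PurePosixPath

# Assumed set of size variants, taken from the example folder structure.
VARIANTS = {"thumbs", "standard"}


def media_path(username: str, variant: str, media_id: int) -> PurePosixPath:
    """Build /<username>/<variant>/<media_id>, with no file extension."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return PurePosixPath("/") / username / variant / str(media_id)
```

For example, media_path("username", "thumbs", 15) gives /username/thumbs/15.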
This is where the user begins. They should be provided a sentence or two detailing what kind of files they can upload and any other restrictions. There should be a simple "Browse" button which lets them choose multiple files. After they select the files, an "Upload" button should appear. They should be able to remove files from the list after adding them, and they should be able to add more. Duplicate files shouldn't be added. Once they have chosen their files, they should be able to hit the Upload button.
Before any files are uploaded, an entry should be added to the batch table. It should hold when the batch was created and whether the user wants notification when the batch is finished processing. We will use the batch_id that is created to associate the uploads in the batch_media table. Then, each file is uploaded to a specified script. A working example is available. The only feature it doesn't implement is duplicate checking, but it is possible under the current framework.
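Creating the batch row might look like the sketch below. It assumes a SQLite connection and a batch table with the columns listed at the top of this document; the create_batch name and the ISO timestamp format are assumptions.

```python
import sqlite3
from datetime import datetime, timezone


def create_batch(conn: sqlite3.Connection, email_notification: bool) -> int:
    """Insert a new batch row and return its batch_id.

    Assumes batch_id is an auto-assigned integer primary key.
    """
    cur = conn.execute(
        "INSERT INTO batch (email_notification, added) VALUES (?, ?)",
        (int(email_notification), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid
```

The returned batch_id is what each subsequent file upload would carry so that it can be linked in batch_media.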
The script that is sent the file data is the one that copies the file over to the file system and adds it to the database. It should first add a row to the media table with all the appropriate information, and grab the media_id. This media_id should be added with the appropriate batch_id to the batch_media table. It should then copy the file to the appropriate directory using the media_id as the filename.
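The three steps above (insert into media, link in batch_media, copy the file) can be sketched as one function. The store_upload name, its parameters, and the use of SQLite are assumptions for illustration; the real script may be written in Perl or PHP.

```python
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def store_upload(conn: sqlite3.Connection, batch_id: int, upload_path: str,
                 name: str, mime_type: str, created: str, modified: str,
                 dest_dir: str) -> int:
    """Record one uploaded file and copy it into dest_dir under its media_id."""
    added = datetime.now(timezone.utc).isoformat()
    try:
        # 1. Add the media row and grab the new media_id.
        cur = conn.execute(
            "INSERT INTO media (name, type, created, modified, added) "
            "VALUES (?, ?, ?, ?, ?)",
            (name, mime_type, created, modified, added),
        )
        media_id = cur.lastrowid
        # 2. Link the file to its batch; it has not been processed yet.
        conn.execute(
            "INSERT INTO batch_media (batch_id, media_id, completed) "
            "VALUES (?, ?, 0)",
            (batch_id, media_id),
        )
        # 3. Copy the file, using the media_id as the filename.
        dest = Path(dest_dir) / str(media_id)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(upload_path, dest)
    except Exception:
        conn.rollback()  # leave no rows behind if any step fails
        raise
    conn.commit()
    return media_id
```

Committing only after the copy succeeds is one possible design choice: it avoids database rows that point at a file that was never written.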
At this stage, using this framework, it should not be too difficult to read the EXIF data from the file and put it in the database.
The last step differs from the original implementation. Because we are now dealing with the files one at a time (rather than one zip file), we can process the files immediately, as soon as they're uploaded. The problem, however, is that we don't want the page to halt as we process the images. The solution is to use background processes. Essentially, we call a secondary script and pass it the media_id. Both candidate languages (Perl and PHP) offer reasonably simple ways of calling the second script as a background process that doesn't halt the main script.
The benefit of this approach is that the files can be processed immediately and in parallel, resulting in a much faster response time for the user. It also means that the processing can be load balanced across multiple machines for speed, if that were ever necessary.
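To illustrate the idea of handing the media_id to a background process, here is a minimal sketch in Python (the real implementation would use the Perl or PHP equivalent). The start_processing name and the process_media.py script name are hypothetical, and start_new_session is a POSIX-only option.

```python
import subprocess
import sys


def start_processing(media_id: int,
                     command=(sys.executable, "process_media.py")) -> subprocess.Popen:
    """Launch the second script in the background and return immediately.

    The media_id is passed as the last command-line argument. Output is
    discarded and the child gets its own session so that it does not hold
    up, or die with, the script that handled the upload.
    """
    return subprocess.Popen(
        [*command, str(media_id)],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only
    )
```

The caller does not wait on the returned process, so the upload script can respond to the browser right away.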
Since we don't have to wait for the second script to finish processing all the files, we can display a "thank you" page to the user and give them a rough estimate of how long the files will take to process.
The second script should take the file and resize it and place the files in the appropriate directories. Once it is finished, it should update the database and set the "completed" flag in the batch_media table to "true."
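The ordering in the second script matters: the flag should only be set once the resized files exist. A minimal sketch, assuming a SQLite connection and a hypothetical resize callable that writes the resized copies for a given media_id:

```python
import sqlite3
from typing import Callable


def finish_media(conn: sqlite3.Connection, media_id: int,
                 resize: Callable[[int], None]) -> None:
    """Resize one file, then mark it as completed in batch_media.

    resize is a stand-in for the real image processing step. If it raises,
    the completed flag is left untouched.
    """
    resize(media_id)
    conn.execute(
        "UPDATE batch_media SET completed = 1 WHERE media_id = ?",
        (media_id,),
    )
    conn.commit()
```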
Whenever the user tries to view the batch in the web interface, a check similar to the following should take place:
select * from batch_media where batch_id = ? and completed = false
If the result set isn't empty, then we know not all the files have been processed, and the user should be told to wait. Ideally, they should be shown a progress message such as "6 of 10 files have been processed, or 60%." If we know how long it generally takes to process the files, we can also give a rough time estimate.
Each time the second script finishes processing a file, it should check whether all the other files in the batch have been processed. If they have, and the user has requested an e-mail, it can send one to let them know that the batch is ready.
Once all of the files have been uploaded, added to the database, and processed, the user has a batch that they can view and manipulate in the web interface. At this point they should be able to delete the batch, or put the individual files in different collections. The details will come in a later document.
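The progress figures can come from a single query over batch_media. This is a sketch assuming SQLite, with completed stored as 0 or 1; the batch_progress name is hypothetical.

```python
import sqlite3


def batch_progress(conn: sqlite3.Connection, batch_id: int):
    """Return (processed, total, percent) for one batch."""
    total, done = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(completed), 0) "
        "FROM batch_media WHERE batch_id = ?",
        (batch_id,),
    ).fetchone()
    # An empty or unknown batch reports 0% rather than dividing by zero.
    percent = round(100 * done / total) if total else 0
    return done, total, percent
```

A batch of ten files with six processed yields (6, 10, 60), which maps directly onto the message shown to the user.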
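The end-of-batch check could be sketched like this. The notify_if_batch_done name and the send_email callable are hypothetical, and SQLite is assumed. Note that if two files finish at nearly the same moment, both processes could pass the check, so a real implementation may need a guard against sending the e-mail twice.

```python
import sqlite3
from typing import Callable


def notify_if_batch_done(conn: sqlite3.Connection, batch_id: int,
                         send_email: Callable[[int], None]) -> bool:
    """Send the notification if the batch is finished and one was requested.

    Returns True if send_email was called.
    """
    remaining = conn.execute(
        "SELECT COUNT(*) FROM batch_media WHERE batch_id = ? AND completed = 0",
        (batch_id,),
    ).fetchone()[0]
    if remaining:
        return False  # other files are still being processed
    row = conn.execute(
        "SELECT email_notification FROM batch WHERE batch_id = ?",
        (batch_id,),
    ).fetchone()
    if row is None or not row[0]:
        return False  # unknown batch, or the user did not ask for an e-mail
    send_email(batch_id)
    return True
```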