@SeanMurray_59b6, @StewartBernard_550c thanks very much for this thread. This seems like a very complex and mature workflow and science case, which a distributed infrastructure is very good for.
I would suggest first of all that we all do our homework - the idea is to design a relevant infrastructure for the community.
Design your own e-Infrastructure
I would say please check out the recent Design your own e-Infrastructure event held around the Digital Infrastructures for Research conference in Krakow. The "pitches" of the EUDAT, EGI, OpenAIRE, and Geant infrastructures are clearly laid out there. We can offer almost any of those services in some form or another, thanks to the interoperability between the Africa-Arabia ROC and EGI.
Identify your platform
Also, please read the service catalogue of the Indigo DataCloud - we recently ran a few hackfests in which we pulled whatever components were needed into a final "product". Another place to look is the Sci-GaIA service catalogue.
Hack it together
If you could initially try to identify whatever components in these lists you think you need, we can define a time and place to get together and hack a platform together to exploit the underlying infrastructure. This is pretty standard in our communities, and helps a lot to get prototypes out in the field for user testing.
I would also like to suggest that you consider becoming one of the Sci-GaIA Champions.
OK, let me get to the specific items here:
I think this DAQ problem needs some coordination with the network providers. If the idea is to have a store at a central DIRISA node, then we need to know what endpoints and transport protocols are required for the data transfer. It's a secondary issue to the "grid" side of things, but it will help in determining the components of the platform.
A realtime processing system obviously needs to be set up in a semi-static way, and fully distributed realtime data processing is not such a great idea. I would like to hear more about this, though - personally I'm very interested in realtime processing, as I'm sure @SeanMurray_59b6 is, given our work on the ALICE HLT. Anyway, on to the reprocessing:
This can likely be done very efficiently using the sites on the grid. We have several options for staging data, using metadata catalogues, and accessing repositories via API. See the options from EUDAT, Indigo, and EGI - all of them are in principle available, but the easiest to get off the ground is the EGI stack, UMD, because it's already deployed at all the sites.
Do you guys have a metadata schema and a preferred repository? We suggest using Invenio, but there are several options here.
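If no schema exists yet, a minimal Dublin Core-style record is a common starting point. A quick sketch - the field names and values below are illustrative placeholders, not a community-agreed schema:

```python
# Minimal Dublin Core-style metadata record - illustrative only,
# not a community-agreed schema.
REQUIRED_FIELDS = {"title", "creator", "date", "identifier", "rights"}

def missing_fields(record):
    """Return the set of required fields absent from a metadata record."""
    return REQUIRED_FIELDS - set(record)

record = {
    "title": "Ocean colour time series, station X",  # hypothetical dataset
    "creator": "Bernard, S.",
    "date": "2016-10-01",
    "identifier": "hdl:XXXX/ocean-colour-x",         # placeholder PID
    "rights": "CC-BY-4.0",
}

print("missing fields:", missing_fields(record) or "none")
```

Agreeing on the required fields up front makes the later repository and PID choices much easier.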
Do you need persistent identifiers? (Yes, you do - but the question is: do you want DataCite DOIs, or do you have your own system internal to the community?)
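For reference, DataCite DOIs follow the generic DOI syntax: a "10."-prefixed registrant code assigned to the minting organisation, then a suffix the community controls. A quick sanity check (the 10.5281 prefix is Zenodo's real one; the suffix is made up):

```python
import re

# A DOI is "10.<registrant prefix>/<suffix>"; the suffix is free-form.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def is_valid_doi(doi):
    """Loose syntactic check that a string looks like a DOI."""
    return bool(DOI_PATTERN.match(doi))

print(is_valid_doi("10.5281/zenodo.12345"))  # Zenodo-style DataCite DOI
print(is_valid_doi("not-a-doi"))
```

A community-internal handle scheme would look different, but the same "stable prefix + community suffix" split applies.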
So, we're talking Open Access here. Again, we suggest Invenio for this. Publishing data from the grid to the repository can be done via API; we have several examples of how to do that. Invenio is OAI-PMH compliant - just look at Zenodo if you want to see it in action. @roberto_barbera can comment some more here...
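To illustrate what OAI-PMH compliance buys you: any harvester can pull records from a compliant repository with a plain HTTP GET and parse the XML. A sketch below uses a canned, abridged response so it runs offline; a real harvest would hit an endpoint such as Zenodo's https://zenodo.org/oai2d:

```python
import xml.etree.ElementTree as ET

# Canned, abridged OAI-PMH ListRecords response, standing in for
# the XML a live endpoint would return.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:zenodo.org:12345</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest_titles(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.findall(".//oai:record", NS):
        ident = rec.find(".//oai:identifier", NS).text
        title = rec.find(".//dc:title", NS).text
        out.append((ident, title))
    return out

print(harvest_titles(SAMPLE))
```

That standard interface is exactly why OpenAIRE and other aggregators can index an Invenio or Zenodo repository without any custom integration work.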
As for the code, I am quite happy to work with you to get that into CODE-RADE, so that you can process data everywhere. See http://www.africa-grid.org/applications/ for what we have in the repo and what state it's in, and then help us understand your stack. If it's just Python modules, we can easily put them on top of the 2.7 or 3.4 dependency tree.
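If the code is plain Python, the main question is just which interpreter tree it lands on; a module written to run unchanged on both 2.7 and 3.4 keeps both options open. A sketch (the module name and processing function are made-up placeholders for your actual code):

```python
# ocean_proc.py - hypothetical processing module, written to run
# unchanged under both Python 2.7 and 3.4.
from __future__ import division, print_function

def mean_radiance(samples):
    """Average a list of radiance samples (placeholder for real processing)."""
    if not samples:
        raise ValueError("no samples")
    return sum(samples) / len(samples)  # true division on both 2.x and 3.x

if __name__ == "__main__":
    print(mean_radiance([1.0, 2.0, 3.0]))  # prints 2.0
```

The `__future__` imports are the main portability lever; anything beyond that (C extensions, version-specific dependencies) is where pinning to one dependency tree starts to matter.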
This project is going to rule so hard!