Online services have begun getting serious about data portability, in large part due to GDPR's requirement that they provide it under penalty of large fines.

However, the archives that users can download from these services contain files that are not well documented, consistent, or interoperable.

Apache Streams now provides software that re-processes these archives into datasets that are better documented, more consistent, and more interoperable. This aspect of the project has considerable potential to make an impact and provide a public good.

Tickets are already open for new sources and for identified improvements to data normalization and depth.

Additionally, there are numerous platform improvements that would make it easier to build, test, maintain, and iterate on these components.

