Faculty Summit 2008 Day 1: Google Data


Attention! API Garbage (might be unreadable):

data Google people

I might as well start. I'm at Google; I actually was a graduate of Carnegie Mellon [...]. I'm on the Google Data APIs team, so we build a lot of the APIs for Google products. Google has a lot of services that have a lot of data, and one of our problems is making it universally accessible. So one of the missions of the Data APIs team is to make that data universally accessible and also easily accessible. My co-presenter is Jeff, [...] from developer relations. In developer relations we work with developers [...] and actually help a lot of people use the products.

How many people are familiar with the Google Data APIs already? How many people have used them? [...] Okay, probably more people have used RSS or Atom feeds as a way of receiving data? Okay, let's go to something simpler: XML. Do people use XML? [...] XML is the technology base we're building everything on. This is an informal presentation, so if you want to ask questions, feel free to jump in at any time. [...]

What we're talking about here is that Google has data. We have calendar data, we have spreadsheet data, we have Blogger [...]. It's not just the web data that we're talking about; it's more the personal data sets that everybody maintains at Google. It is made up of many little tiny sets of data that people have. If you've used Google Apps, which is one of the primary deployment areas right now for the Google Data APIs, you keep a lot of your
data at Google. If you're the administrator of a domain, you have even more data to manage at Google: all your users, [...], and a lot of calendar events if you're managing a set of calendars [...]. So you have a lot of data at Google. What Google is trying to do in making data accessible is not just writing really cool web apps. The applications are all about your personal data, sharing it and making it available, but UIs are great for humans and really bad for programmers. So one of the things we want [to support] is sharing your data sets [...] on the web.

So we set out the properties here [...]. I'm just trying to outline a little bit of the problem faced by the Google Data APIs, which are built on a protocol called the Atom Publishing Protocol, which we use a lot. One of the properties we're trying to encourage is structured data. [...]
Basically, you can just [toss flat files] on the web: CSV files, all sorts of simple ways to structure data. [Then there is] discoverability: how do you find this data on the web and access it? This is one of the things Google deals with all the time. [Then] scale: when a lot of people are accessing your data, being able to cache it is extremely important, being able to take advantage of [those] ways of improving performance. And then, of course, standards-based protocols. One of the things we're obsessed with is universal accessibility: if we build our APIs on a very common, well-accepted standard, then anybody can access it and anybody can build programs with it.

Another part of this, since the data sets we're talking about are application data, is that you want to be able to edit it on the web. If you have your calendar, you want to be able to modify your calendar and actually synchronize it with calendar applications. You want to be able to modify your photos [...], upload videos and search videos, [...] write blog posts, all sorts of things. And finally, one of the other goals we have is that we're trying to make our APIs extremely user-friendly. Google is about trying to make stuff universally accessible, which means not just to programmers who have a company that will pay them for a year to read every specification on some [...] access technology, but [through something] that people can pick up overnight: very familiar, simple technologies.

So I looked at [...] some other types of technologies, to give a little bit of context [...] for how we built the Google Data APIs. So, HTML. Think about HTML as a way to share data. [It is usually thought of as] a presentation protocol for [pages], but think of it as a way to share data with tables. Anybody in the world can get at them. They're extremely discoverable: there are search engines that are
simple. Everybody can [use it]; it's extremely usable. I could probably teach my mother to write HTML tables; it's not that hard [...]. So we really win on user-friendliness and we win on standards-based, but it has almost no structure, and HTML has no editability. [...] There are some things [...] which again are completely nonstandard [...]. But if you want to share your data, HTML is actually quite good [...] compared with other ways of doing things on the web.

At the other extreme would be something like SOAP. SOAP is a standards-based XML model for doing RPC. You can exchange arbitrarily complex structured data, and it does any arbitrary RPC. As for discoverability, [there is] metadata you make available to other people: the definition language for SOAP is WSDL, so you publish a WSDL, and anybody can read that WSDL, construct an interface, and start accessing that data. They can do any operation, not just basic CRUD operations on your data but any other arbitrary transformation. However, because it's basically just arbitrary RPC, there's a certain [cost in] user-friendliness. There are a lot of problems with interoperability among SOAP implementations, and you're definitely dependent on the toolchain [...]. So your weekend programmer isn't necessarily going to pick this up; it's not going to be something you pick up in an afternoon. [As for] standards-based [...], someone said the nice thing about standards is that there are so many [of them], and SOAP is known for having a large [number].

So in the last five years there has been a trend towards a style of architecture called REST. Has anybody here heard of the REST architecture? REST is a whole philosophy of building APIs, so it's a little bit difficult to
summarize really quickly, but essentially the idea is that you think of your APIs as resources. REST stands for Representational State Transfer; [what is transferred is] the representation of the resource [...]. The REST model is basically based on HTTP. The REST model is like the HTML model, for data. The idea is that HTML pages are resources that, in fact, you view; REST objects are resources that, in fact, you manipulate. They're all addressed by URLs: every resource has a URL, and every resource is a document containing a certain amount of information. And the protocol for manipulating your resources is the same protocol as the web, HTTP: GET, PUT, POST, DELETE. Effectively you've got your basic CRUD operations on these data objects, and that's really all REST is. It's very simple.

What you get from this: the format of a REST document can be anything, and we typically use XML, so already [you have] structure. It's relatively simple, as simple as XML can be for the average user. [...] As for discoverability, all your resources exist as URLs, and you can include in your resources links to other resources, so they have the ability to be discoverable. Because you're built on the HTTP infrastructure [...], you get this great benefit of cacheability: if two people fetch the same resource [through a] proxy, suddenly this resource is cacheable. There are [...] protocols in HTTP that are already worked out, such as using entity tags as a way of tagging a resource with a strong version, so you can validate that the resource hasn't changed without actually transferring it. So these are the types of things that REST [gives] you. [...] Atom is an implementation, a certain embodiment, of this REST philosophy [...]; I'll talk a little more about that later [...] and show you
the idea. It is a REST-based protocol, [...] an implementation of a RESTful service, which gets back to the core philosophy behind [the web].

So what are the Google Data APIs? The Google Data APIs are, at their core, the Atom Publishing Protocol. The Atom Publishing Protocol is a REST-style protocol that defines two things. One is the wire format for transferring data, and that wire format is based on the Atom Syndication Format. I asked whether people know RSS and Atom; essentially, RSS and Atom are XML formats for transferring collections of information, and [they are] fairly trivial [...]. The other is the protocol: the protocol operations for the Atom Publishing Protocol are all HTTP [...].

What GData does is start with the Atom Publishing Protocol, and we've added several extensions to how you're going to use it, essentially taking it from this theoretical construct that's good for reading and writing blogs to something that's much better suited for reading and writing data. So we've extended the data model [to] be applicable to [more] design patterns and to model more data than was [previously possible]. We've extended it with querying, [...] concurrency, authentication, and batch processing: basic things that we find the consumers of these [...] data sources really [need].

So, the Atom Publishing Protocol in one slide. There are only two basic data types [...], two main concepts. One is the concept of the feed. [You're used to] RSS feeds and Atom feeds, and we use the word feed a lot, and it confuses people, because everybody thinks [they know] what a feed is: this thing that pushes [data at you]. In our case, in the Atom Publishing Protocol, a feed isn't a conveyor belt of data; it's just a container that you store data in on the web. It exists on the web, is addressed by a URL, and is itself a resource. Within the feed there's another type of
resource, which is the entry, and the entry is just any data item resource. That data item is itself structured. If you go read the Atom Syndication Format and the Atom Publishing Protocol [specifications], you'll see that the feed and the entry are essentially simple documents. When you transfer the contents of a feed over the web, you're transferring an XML document, and the entries within it are subdocuments. It's very straightforward; most people can look at one and understand what's in it. The actual protocol is limited; CRUD is about it. You insert (add an entry to a container), get the contents of the container (which is the feed document containing entries), and you can perform the basic CRUD operations on individual [entries]. That's pretty much the whole protocol.

So the first thing is to talk a little bit about what one of these data entry items looks like. This is XML; as I told you, it's extremely easy to read and write, and everybody can read this, so I'll break it down a little. A data item is an entry [...]. Essentially there is some content; this entry happens to represent a calendar event. [Atom] provides a standard set of XML tags representing content: a title, the content itself (which actually says what the nature of the event is), an author, and several other standard features [...]. One of the other nice properties of using the Atom Syndication Format is that every entry is required to contain a certain amount of metadata that makes the entry very self-describing and very useful. There's extremely useful metadata like date and time stamps: when it was created and when it was last updated [...], which makes it very easy to do synchronization and to start doing deltas, getting the most recent changes to the information on the web. There's a universal ID: all items have to be
tagged with a unique identifier. Again, in true web fashion, it's always a URL, [of] which [...] there seems to be an infinite number. And then of course you'll have things like links. I mentioned that you could always embed links, and this is all HTTP; in this case the link represents a place where you operate on [...] and edit this particular [entry]. So this is what the Atom Publishing Protocol and the Atom format provide.

What GData does is take this basic format and provide our own extensions to it. When modeling calendar events, we define our own XML, extending the data that's in this entry with, essentially, start time, end time, reminders, and so on. We've also added a typing mechanism, which in this case is category tags, so consumers, when they see this entry, know that it's a calendar event and what schema it conforms to, and they can actually start processing it. [...] You can write code to process it: the processors are everywhere, and every language has one.

The protocol, again, is based on HTTP. Here's an example of how to insert an entry into a container; we'll say it's at example.org/feed. It's a very simple protocol: I [do an HTTP POST to] add an entry to the server, and what I get back is an HTTP result, 201 Created. These are HTTP return codes: there's no special content and no special code in the body of the message itself. [What comes back] is this entry, now modified by the server. The server can assign a unique ID, it can assign links, and it tells you where you're going to edit this thing. It conveys all that information [back]. Now that [the server has] added a link to this entry, I can edit this entry in the future. I can also do a GET operation on the feed to get the list of all the entries in it. So it's a fairly straightforward, simple protocol with a low barrier to entry
for programmers.

So, querying. [...] Querying is pretty simple. Essentially we've added a query model, and it's the simplest thing you can imagine: essentially you use URL query syntax. Here I'm doing a full-text query [...]. If you pass the query to the feed URL, it will return you a feed document with all entries that contain that text. It's not particularly difficult. We've defined our own query parameters, [...] whatever metadata query parameters we think are the most [useful to] consumers.

One of the things that we're very concerned about is authentication, because we're dealing with personal data. We have to deal with desktop applications and web applications [...], so there's a family, a suite of ways that you can authenticate and still maintain security [...], and access control [...].

Optimistic concurrency: we originally created our own optimistic concurrency protocol by embedding version numbers in the edit link to detect conflicts. We're actually moving towards a newer version which uses entity tags, which are a standard HTTP mechanism for conflict detection and also a standard HTTP mechanism for caching. So this move is an improvement: it's more standards-compliant, which again means we're building on common libraries, and it will improve our caching. Media entries: you can store binary data in the Atom Publishing Protocol [...]. Then batching: operations and changes on multiple items. These are some of the types of extensions that we've done. We're actually working on extending the protocol based on the feedback of our users. Partial update and partial get are things people have been wanting, to modify one individual field in an entry and so on. So we're actively working on developing the protocol and moving these things [forward].

So where do we actually have APIs? This is
the current list; it's growing every day. I mentioned that the Atom Publishing Protocol was originally designed for blogs, but we're using it for Calendar, Picasa Web Albums, YouTube (which is one of our most popular APIs), and other APIs. Our health product is all built on GData: [...] all the interconnection with the service providers [...] is built on top of GData [...].

The other thing that we provide: [...] we build the GData APIs within Google, which are Atom APIs, and we provide a few language toolkits for clients. It's easy enough that people can connect [directly], but we try to provide a higher level for people who want to interact with them [programmatically]. These are all open-source libraries; you can download and modify any one of them, and then you can connect to any of these services and get slightly higher-level semantics for inserting entries, [retrieving] them, [and so on]. Because it's an open protocol and the specification is simple enough, we've had people developing them in all sorts of languages.

I'll just mention that in the end it doesn't even require a client library. A lot of people subscribe with a feed reader. We have a JSON output format, so a lot of people write JavaScript that fetches data directly, say the latest five blog posts; they fetch it in JavaScript format and embed it in web pages. So almost half our usage, somewhere around half, is done without any of our client [libraries]. That's how easy it has been, [how] low the barrier to entry is for accessing [the data].

So that's a brief tour of the GData effort at Google. Jeff is going to present a little application to try to show some of the uses of the data. [...] We'll take questions again at the end. So the application I'm
about to show you is actually a game built by [...] Pamela Fox [...]. The idea of this game, to introduce it to you first, is that the user is presented with a set of images and is supposed to [click] the ones that represent the keyword they're given. So they go through them, and if they pick the wrong one they lose a life [...]. Then you can progress through levels and get high scores and that sort of thing. [It's] a time waster, but the real point of this is a simple observation that many computer scientists have made: computers aren't very good at recognizing images, at least not compared to people. You can design algorithms and you can train machines, but it takes a lot of time and effort and CPU power. [Or], through a simple game, people waste their boss's time giving you a nice index of relevancy [for image] search. What we do here is record every time someone clicks on an image, and then we decide that if they clicked on it sooner rather than later, it's probably a better result: it's more recognizable, [it] pops out to them. This is just an example of things we've collected. On the left is the image that we picked based upon the highest score; on the right is [the result of] just doing a tag search for that image. So [for] apple, [it's a] bit more relevant [...]. [...] It's really hard to find in an image search, but people can visualize it easily.

So the question [that might be] asked is, where does Google come into this? [The front end] uses Flash, which is not even Google's. [...] It's hosted on App Engine, [...] which is our application hosting environment: you upload Python [and it] runs on our infrastructure. But how do the APIs come into this? Well, the images are coming from Picasa Web Albums, so there's one API there, and I actually built a second piece into this. I think we have a slide to demonstrate all that. There we go. So we
are pulling these images in from Picasa Web [Albums]. There's an App Engine back end and a front end that uses the images to run the game and [sends back the] results, and all the data is stored in the App Engine back end. But now maybe you want to publish some of these results. A trivial implementation here is to export to Blogger. So I'll show you: I created a basic report generator. [...] It makes [me] log in and get an access token for a Blogger blog, [then generates the] report; in this case it's just the 25 top-scored images. It opens a nice draft [...], and I can edit this, and then I can publish it to my research blog and everyone can see what today's top images are. You could also imagine exporting this to something like Spreadsheets, which is very good at visualizing tabular data, setting up charts, and setting up gadgets to visualize them and track differences over time, [to see what] people are most interested in. So that's a very simple integration using our Python client library (we were mentioning our client libraries earlier) to use the GData [API] for Blogger [...].

The other component, which is shown at the bottom of the chart: we've actually republished our database using the Atom format, so an arbitrary consumer of Atom can look at our data. In this case, here is a feed representing results data for the tag apple. [...] Firefox makes this pretty, [so you don't] see all the metadata in here, but we do have XML describing, for example, the number of times a given image has shown up, its current score based upon how people clicked on it, and also the URL to the thumbnail [...] for the image. So this is a toy example. The point is that you can imagine that in your own research you will probably collect gigabytes of data stored in some database, and right now you could always, I guess, burn it to a DVD-R for colleagues [to load] into their own MySQL and run their own queries against. [But] also consider exposing it
as a RESTful web service like this, and people could fetch the slices of data they're interested in and create neat visualizations of what you've managed to gather. That's sort of the spirit of what we're trying to do here in terms of exposing data for collaboration and making it useful, not just to whoever owns that data [or is] storing the data.

Of course, [here is some] code to convince you that I'm not just making this up [...]. For example, to integrate with Blogger, this is all Python [...]. It is creating a Blogger service, getting this authentication token back [...], and a lot of it really is as simple as creating a blog post entry, setting some bits of metadata for it, and then sending it to the service. So it's really not too bad; I did this in like ten minutes, and I spent more time trying to format the date than I did actually exporting to Blogger. The nice thing about App Engine is that it uses the Django template engine, so you can create reports [...] based upon data structures in Python very easily. Similarly, I was even able to create the entire Atom feed just using a simple template engine: I pass it data sets as template variables, and it magically generates an entry for each [item] that we have [...].

So that's the really, really high-level view of what we're trying to do with this application, but I think you can appreciate the greater message: you can use our APIs to collect data, in the sense that you can grab images from Picasa Web Albums or videos from YouTube; you can build applications on top of them with App Engine; and you can republish to applications you may already be using, like Blogger for a research blog, or Spreadsheets to visualize data and keep track of it. I think that's everything [...].

[Just one more thought: Atom is a] publicly available format, an IETF
standard, so if you republish [your data in] it yourself, it becomes a way for other people to consume [it]. [...] Our GData format is open as well, so you can steal all of our semantics and query language and we won't go after you. So that's very cool :-)

[...] So we gave you an overview of what it is and then one sample. We were hoping to have a little bit of dialogue and get ideas from you: when you have this wealth of data [and] formats, what are the things you can think of that would be useful to do? How could you use it? What would you like to see us [do]? What can we do for you to make your life easier?

So I sat down and tried to think of the types of things that we think of when we think of GData. [...] It can be used to do lots of things. One, of course, is to build applications, and this is exactly what [Jeff] did: [...] semantic classification of images, structured as a game, [using] all these data sources exposed [through GData] to get data, and using the [GData] protocol to publish data to share the results. [Another] is a form of data mining: directly mining the data from Google and trying to do some analysis on it. There are limits on query responses [...], so [...], but there is the idea of being able to view [our] data sources on the web. There is also the boring, obvious stuff, which is administrative issues. A lot of people just use it as a way of managing, say, the resources in a lab: [with the APIs] available to you, you can build your own tools and your own web applications that will make your life simpler and make your system run smoother, because you can integrate with all the tools, whether you're using Google tools or other tools [...]. And also, as I said, there's sharing
your data and making it available to other people. [...] There was a very cool example where somebody came up with an idea [...]. We created this Calendar API, and our main [use case had] always been synchronization. One of the engineers at Google said: [...] all the conference rooms have a calendar, and they all have an API, so I can put on each conference room a little device that electronically retrieves the schedule and displays it. [He had] looked at it and said, you're printing how many thousands of sheets of paper every day to stick on those conference rooms to say this conference room is reserved [by whom], and they reprint those every single day. We've got, what, 800 conference rooms; count the paper. So someone built a little wireless device that uses ambient electric light to charge itself, [so it's] solar powered, and it has [one of those] electronic paper displays that requires no power [when it's not updating]. Every once in a while, every couple of hours or minutes, it wakes up, queries the GData API, figures out what that conference room's schedule is, displays it, and shuts itself down again to sit there and soak up ambient light. No [mains] power [...]. [The] engineer [could] not abide wasting so much [paper], and so he came up with a very novel way to exploit the fact that the data is available to solve a problem that nobody had ever even considered: saving paper. I really thought [that was cool].

So [...] what do you guys think? What are ways that you could possibly use our APIs, or [ways] that exposing data on the web could be better, or things that you'd like to see [from] us? [...] It could be a technical question about what we do or how we do it. [...]
linking to tools like Google Maps, for example, because that technology would be great for some [...] biological data: bringing our biological networks [together] with maps [...].

[...] the Google Maps API [...] for Google Maps and Google Earth now [...], letting you easily create [...] and then add [things] on top of [it]. [...] We have a format for geo data called KML, which I think [we will] talk about a little bit, and there will probably be more in the future [...]. So if you want to publish your [data] and [have it] be seen by anyone [...], you probably want to look into publishing the data as KML. [It] started at Google Earth and is now embraced by Microsoft and a lot [of others] [...].

[...] digital signatures, for knowing that the [data] is backed by someone, that you can go back and know it is a legitimate version of something [...]. [You are] making libraries available [that enable] data to be downloaded; with things like cache poisoning, if
you think you're downloading data from the map, are you really getting data from the map, rather than something [that sends you] off a bridge or [into] a restricted area? There are a number of issues like this: when you're passing a lot of things around as feeds and [through] libraries, there are all kinds of opportunities for [mischief]. So [in] this area, [which is] my research, having digital signatures that trace back to authentication measures would seem to be reasonable. I was wondering [whether you are] working on that [...] for the Data APIs [...].

[...] Basically, if you think of the Atom Publishing Protocol, and REST in general, it's a lot like working [with anything else on the web], essentially. We are not any different from anybody suffering from phishing and all the problems that come along with that. So realistically speaking, the [APIs] per se don't [introduce] those attack vectors; those are probably things that need to be solved at [that] level, or [...] at the level below that [...]. [There is] also request signing [...]. One of the things we do work on specifically is authentication, which is knowing that the person accessing the data is actually [entitled to it]. There is a standard out there called OAuth [...]. For those of you who aren't aware of it, OAuth is based on the idea that I'm working with a website that wants to get access to my calendar information, or, more commonly,
[...] my contact information [...], and it wants to [use] my APIs, but I don't necessarily want to give them [full] access rights; you [don't] want to give them a username and password. So there is a standard now emerging called OAuth, where that website would redirect you to Google, you would log in to Google and [confirm your] identity, [...] relying on basic web security mechanisms. Google would then redirect you back to that website, and the website would receive a token that it could then use to access your service without knowing anything about [your password]. So that's an issue we work on a lot: authentication and authorization issues.

[...] phishing attacks and pages you can digitally sign [...]. I guess really going back to the issue: if I'm going to request data from your [maps], or I'm going to send [...] data to [...], [spending] a little computing power and being able [to put a] digital signature on that data before it goes out, or calculating a message digest, [so] that I had some way of tracking back, would give me greater confidence that I'm getting the real thing [...].

[...] In the Atom publishing and syndication [formats], [...] there are actually signing standards for signing the data items themselves. I don't believe they have been heavily used on the web [...]; [they have not] been widely deployed. [...] We do have some APIs [where] we have signed requests [but not] signed responses [...]. I think one of the biggest problems with them was [that] a lot of the idea behind
the signature, the idea behind [...] being able to republish information, [is that] an entry would be a self-contained, self-describing object [that you could] republish. But then the signature tends to get broken, because [it is] sensitive [to things] like source attributes that change [...]. So signing actually became a very complicated problem. A lot of people [...] trust that [HTTPS] gets [them] to the source of a
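
The entry, query, and edit-link ideas described in the talk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speakers' code and not the Google client library: the function names are hypothetical, the `gd` namespace URI, the `gd:when` element, and the `#kind` category scheme are recalled from the public GData documentation rather than taken from this transcript, and `q` is assumed to be the name of the full-text query parameter. Nothing is sent over the network; the sketch only builds the documents and URLs that an HTTP POST or GET would carry.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"      # Atom Syndication Format namespace
GD_NS = "http://schemas.google.com/g/2005"   # assumed GData extension namespace

# Serialize Atom elements without a prefix and extension elements as gd:*.
ET.register_namespace("", ATOM_NS)
ET.register_namespace("gd", GD_NS)


def build_event_entry(title, content, start, end):
    """Build an Atom entry for a calendar event (hypothetical helper)."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    # The category tag plays the role of the "typing mechanism" from the talk.
    ET.SubElement(entry, f"{{{ATOM_NS}}}category",
                  {"scheme": GD_NS + "#kind", "term": GD_NS + "#event"})
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}content").text = content
    # Extension element carrying the start and end time of the event.
    ET.SubElement(entry, f"{{{GD_NS}}}when",
                  {"startTime": start, "endTime": end})
    return ET.tostring(entry, encoding="unicode")


def build_query_url(feed_url, text):
    """Append a full-text query to a feed URL using plain URL query syntax."""
    return feed_url + "?" + urllib.parse.urlencode({"q": text})


def find_edit_link(entry_xml):
    """Return the href of the rel="edit" link a server added, or None."""
    entry = ET.fromstring(entry_xml)
    for link in entry.findall(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "edit":
            return link.get("href")
    return None


if __name__ == "__main__":
    # This XML would be the body of the POST to the feed URL.
    print(build_event_entry("Tennis", "Meet for a quick lesson",
                            "2008-07-17T15:00:00Z", "2008-07-17T17:00:00Z"))
    # This URL would be fetched with a GET to run the query.
    print(build_query_url("http://example.org/feed", "tennis lesson"))
```

In a real exchange the entry would be POSTed to the feed URL, the server would answer 201 Created with the entry plus an ID and an edit link, and `find_edit_link` shows how a client might pull that link out of the response before issuing a later PUT or DELETE.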


