In this column, we discuss some of the tricky questions that you may be asked in a computer science interview.
1. Containers are becoming the de facto method for application deployment, moving away from virtual machines. Docker, a popular container platform that most of you would be familiar with, is an open source project built on top of Linux containers, which allows applications to be isolated from each other. Can you explain how Docker differs from virtual machines in providing isolation to applications? Can you compare the two methods of application deployment, namely, hosting an application on a VM vs hosting it on a container, with regard to the following criteria: (a) security, (b) isolation, (c) cost, and (d) fine-grained resource control and sharing.
2. When we input a query into a search engine such as Google, we get an ‘auto-complete’ suggestion for that query, based on past queries that have been tried on the search engine. Can you explain how you would design an auto-complete mechanism that would scale well across the millions of queries that can happen concurrently? What data structure would you use to store past queries and automatically provide suggestions? Would you store the complete query as is, or would you break it into its constituent words? Given that scalability and correctness are key to this problem, can you explain how your choice of data structure addresses both?
3. In information retrieval, you are given a set of documents and a query. You are asked to retrieve the set of documents that are most relevant to the query. Many of the common information retrieval techniques are built around the measure of TF-IDF, where TF stands for the term frequency, i.e., the number of times a query keyword appears in a document. For example, given the query ‘Mahendra Singh Dhoni’s test match appearances’, you are asked to find all the documents relevant to it. It is obvious that one can compute the term frequency for the query terms, and return a ranked list of documents in which the query terms appear most frequently. If this suffices to return the most relevant documents, why do we need the measure of Inverse Document Frequency (IDF)? Can you illustrate, with an example, why IDF is needed in information retrieval? If you ignore IDF for relevant document retrieval, would it impact the precision or recall, or both?
4. With elections coming around in five Indian states, poll predictions have become quite a hot topic. However, there have been mixed results for poll predictions. For instance, the past year has been quite unsuccessful for many of the data scientists who are into poll prediction. Neither Brexit nor Trump’s victory was anticipated by most pollsters. Given this context, you are asked to predict the party that will win the majority in each of the five states going to the polls in the coming months. For any prediction, you need to build models which can analyse the data and make the prediction. What are the various data inputs that you would be interested in getting? Or, in simplistic terms, what are some of the obvious features that you would like to extract from the data for winning party predictions in a state election? Is this a classification problem or a clustering problem? Why do you think poll prediction is a hard problem?
5. You are given a set of documents and asked to perform clustering of these documents based on document level similarity. You are familiar with different clustering algorithms such as ‘k-means’, so you are confident that you can do the task. However, the interviewer now adds a second part to the problem, wanting you to label the clusters. How would you generate labelled clusters? What are the means of generating meaningful labels for the clusters? Can the clusters be labelled after, prior to, or during the clustering process itself? Describe your choice of solution and why you made it.
6. You are given a set of articles and asked to identify the key phrases corresponding to each of the articles. Assume that each article is written in English and contains approximately 10,000 words. You are asked to find up to seven key phrases for each of the documents. How would you design a solution for this problem? Would you propose a supervised or unsupervised learning method for this problem? Justify your choice in terms of its advantages and disadvantages. Now you are told that each of the articles has a fixed structure with components such as an abstract, introduction, description, results and conclusion. How would you use this information in your solution?
7. Consider the above question. Now you are told that you are given a black box algorithm which can do topic modelling for a set of documents. How can you use this in your solution for identifying key phrases?
8. Word embeddings and sentence/paragraph embeddings have become popular of late with the arrival of Google’s word2vec package. Can you explain the intuition behind embeddings? Given a corpus of documents, can you explain briefly how one could generate word embeddings? Given a set of word embeddings that have been generated using a large corpus such as Wikipedia, can you explain how they can be used in different Natural Language Processing tasks? What is the major advantage of the embedding representation over other forms of representation such as the count-based co-occurrence matrix?
9. Geo-spatial applications are being developed in large numbers these days. Consider a taxi-fleet management company that needs to store the trajectories of its taxis in a database. Spatial applications should be able to query the database for a specific trajectory or segments of a trajectory. It should be possible to answer queries of the form: given a set of location points, retrieve all the taxis whose trajectories overlapped at those points. What kind of database would you use for storing trajectory data? To speed up query performance, what kind of indexing would you perform on trajectory data?
10. You are given an array A of N integers. You are also given a number X. You are asked to find the element in A that is closest to X. You are given the constraint that you should not sort the array A. (a) Can you write a C program to solve this problem? (b) If you are told that the array A consists of floating point values instead of integers, how would you modify your solution for (a)?
11. You are given two arrays and asked to find the common elements between them. (a) Assume that the two arrays only contain integer elements. (b) Assume that the two arrays contain floating point values. Would your solution for (a) work for (b) as well?
12. For dengue fever, a diagnostic test has been developed. The new test is 81 per cent reliable (i.e., it misses 19 per cent of cases where the patient actually has the disease). It can wrongly label a patient as having dengue fever in 10 per cent of cases when the patient actually does not have the disease. You are told nothing about the patient’s symptoms or any other clinical information, except that the patient has tested positive for the dengue diagnostic test. Can you figure out whether the patient actually has dengue from the information given above? If yes, how? If not, why is that so?
13. We are all familiar with different sorting and searching algorithms. However, these algorithms are typically designed for data that fits in physical memory. Consider that you are given two matrices A and B of dimensions M×N and N×R. The matrix dimensions are such that neither of the matrices will fit in the main memory. You are asked to write an algorithm for multiplying the two matrices and storing the result into a matrix C of dimensions M×R on disk. Can you outline the approach you would take for minimising the number of disk accesses that your matrix_multiply function would perform? Now consider the same problem again, but you are told that matrices A and B are such that most of their elements are zero. You are again asked to compute the matrix multiplication of these two matrices. Can you come up with an algorithm which minimises the number of disk accesses?
14. Consider that you are running your application on a desktop which has 16GB physical memory and has the Linux operating system installed on it. You want to determine the maximum memory footprint of your application. How will you find out how much physical memory is being used by your application?
15. You are running a server application which is latency sensitive. You want to ensure that the virtual address space of the server process is locked into physical memory (essentially, none of its pages should be paged out). How would you ensure this? Assume that during the course of its run, the server application can map further memory pages by dynamic memory allocations or through the mmap() call. You need to ensure that any future pages mapped by the server also remain locked in physical memory. Explain how this can be achieved.
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month with the answers to the above questions, I wish all our readers a wonderful and productive new year ahead!